How to create OMTD-compliant dockerized components

This document explains:

How an OMTD-compliant docker image that contains TDM component(s) should be built.
The external interface that the docker image must follow.

The above two are required in order to make the docker image usable within the OpenMinTeD platform; for example if a TDM component is dockerized according to the given specifications then the OMTD Workflow Service (e.g. based on Galaxy workflow execution engine) will be able to call it.

Metadata recommendations

The dockerized component must be described with an OMTD-SHARE descriptor (not contained in the Docker image). This descriptor contains all relevant information about the component such as its ID, how to obtain it via a Docker repository (i.e. the coordinates of the Docker image), the parameters of the component, etc. The OpenMinTeD platform uses this descriptor to automatically generate additional internal configuration files, e.g. to enable the Galaxy-based OpenMinTeD Workflow Service to invoke the component.

The metadata elements we will mention in the following specification are the ones required for a docker-based component to be identified, pulled, spawned and invoked into OpenMinTeD. The remaining metadata of the OMTD-SHARE descriptor must be encoded in the same way as for the non docker-based components (see details on how to describe a software). See this example of description with the OMTD schema of a https://github.com/openminted/alvis-docker/blob/master/documentation/simpleProjector_OMTD_Desc.xml[docker-based component].

<resourceIdentifier>: set the component id; for dockerized components, the name used to invoke the component.
<command>: what is used for invoking the component(s)
<inputContentResourceInfo>: set of elements with the specs for the input resource (i.e. the corpus or document) that will be processed by the component
<outputResourceInfo>: set of elements with the specs for the output resource (i.e. annotations) that will be produced by the component
<parameterInfos>: set of elements used for describing the parameters used when running a component
<distributionLocation>: the place where components can be accessed from; for dockerized components, these are the docker coordinates following the Docker conventions

Technical recommendations

A docker image for a TDM component must be self-contained, able to provide an execution of the component task into a container and provide the final outputs. If resources that change from one execution to another are used by the component, we recommend to make them available as values of parameters of the component. Otherwise, you must ensure that the appropriate and required resources are available to each component container. The docker image of the component must be available via a docker repository accessible to OMTD (e.g., Docker Hub). Most importantly, the dockerfile of the docker image must be valid according to a check list defined by the OpenMinTeD administrators.

The OMTD docker image and TDM component that it hosts must follow a set of specifications which are described below.

The docker image must contain at least one TDM component

There must be a unique name for each component. The name is responsible for invoking a TDM component or workflow.
The component name must be a parameter that invokes a unique component within the docker run command.

In the OMTD-SHARE descriptor, the component name is described using the resourceIdentifier element as in the following example(s).

Alvis Component

<resourceIdentifier resourceIdentifierSchemeName=”docker”>
  default.modules.simpleprojector
</resourceIdentifier>

UIMA/DKPro component

<resourceIdentifier resourceIdentifierSchemeName=”docker”>
  de.tudarmstadt.ukp.dkpro.core.mallet.lda.MalletLdaTopicModelInferencer
</resourceIdentifier>

The docker image must include a TDM component executable

There must be an executable (e.g. based on docker ENTRYPOINT) which is used as the entry point for the docker image.
This executable is responsible for executing TDM component(s) or workflow(s). The executable must be a parameter that invokes one of the components in the docker run command, since a docker image can contain more than one TDM components,

In the OMTD-SHARE descriptor, the executable is described using the command element as in the example(s) below. The value of the command element must contain the executable (e.g., alvisnlp). That value is a string, without any special character.

Alvis TDM executable

<command>alvisnlp</command>

DKPro TDM executable

<command>dkpro-core</command>

Reading input data and writing output data for each TDM component

The component must specify inputs and outputs as parameters to the docker run command in its OMTD-SHARE descriptor.
It must be possible to set (with a parameter named --input) the input in the docker run in the following format: --input <PATH-TO-THE-INPUT-DATA>
It must be possible to set (with a parameter named --output) the output in the docker run in the following format: --output <PATH-TO-THE-OUTPUT-DATA>
There must not be any additional parameters to indicate data input/output locations
Input and output data must be expected as files
The input and output files must be in one of the formats declared in the OMTD-SHARE descriptor
If a component declares multiple input/output formats in its OMTD-SHARE descriptor, there must be parameters called --input-format/--output-format which indicates which input/output format to use.

In the OMTD descriptor, the metadata for the input and the output are provided respectively into the inputContentResourceInfo and outputResourceInfo elements. At least values for the processingResourceTypes, dataFormats and characterEncodings sub-elements must be filled in.

<inputContentResourceInfo>
  <processingResourceTypes>
    <processingResourceType>corpus</processingResourceType>
  </processingResourceTypes>
  <dataFormats>
    <dataFormatInfo>
      <dataFormat>application/vnd.xmi+xml</dataFormat>
      <mimeType>application/vnd.xmi+xml</mimeType>
    <dataFormatInfo>
  </dataFormats>
  <characterEncodings>
    <characterEncoding>UTF8</characterEncoding>
  </characterEncodings>
  [...]
</inputContentResourceInfo>
<outputResourceInfo>
  <processingResourceTypes>
    <processingResourceType>corpus</processingResourceType>
  </processingResourceTypes>
  <dataFormats>
    <dataFormatInfo>
      <dataFormat>web annotation</dataFormat>
      <mimeType>application/json+ld</mimeType>
    <dataFormatInfo>
  </dataFormats>
  <characterEncodings>
    <characterEncoding>UTF8</characterEncoding>
  </characterEncodings>
  [...]
</outputResourceInfo>

Accepting parameters for each TDM component

If the component declares a parameter in its OMTD-SHARE descriptor, then it must be possible to specify this parameter in the docker run in the following format: --param:<PARAMETER-NAME>=<PARAMETER-VALUE>
If a parameter accepts multiple values, then these must be comma-separated
If a value contains a comma, it must be escaped using a backslash: \,
If a value contains a backslash, it must be escaped using a second backslash: \\
… any other special characters that need escaping? Can we use quoting?

In the OMTD description, the metadata for the parameters are filled in the parameterInfo element. Values for the name, parameterType, optional sub-elements are required. The following example describes a parameter.

<parameterInfo>
  <parameterName>targetlayerName</parameterName>
  <parameterLabel>target Name Layer</parameterLabel>
  <parameterDescription>
    Name of the layer that contains the match annotations.
  </parameterDescription>
  <parameterType>string</parameterType>
  <optional>true</optional>
  <multiValue>false</multiValue>
  <defaultValue>concepts</defaultValue>
</parameterInfo>

Fully identify the docker image

following the docker convention, each image must have a full tag composed of a repository name, a specific tag and a version of the image (e.g., {repository-name}/{specific-tag}:{version})

In the OMTD descriptor, the tag/coordinates of the docker image is encoded in the distributionLocation element, as in the example below where bibliome is the repository name, alvisengine is the specific tag used to name the alvis image and 1.0.0 is the version of the image.

<distributionLocation>bibliome/alvisengine:1.0.0</distributionLocation>

Package for the dockerized components

Up-to-date version of the dockerfile and the required resources for the build process : the build process must end up with a docker image containing the component(s)
The OMTD descriptor of the docker-based component respecting the above specifications, If there is more than one component in the docker image, each component must have its own OMTD descriptor. These can be uploaded to the registry by creating or editing an OMTD-SHARE metadata record. (This is still under developement at: https://docs.google.com/document/d/1bRzatRcebkAI0V2ORMy-ZVbezDrclmdnPvnMbnuCRmk/edit#heading=h.ns8v35e59ye3)