This document explains:

  1. How an OMTD-compliant docker image that contains TDM component(s) should be built.

  2. The external interface that the docker image must follow.

Both are required to make the docker image usable within the OpenMinTeD platform. For example, if a TDM component is dockerized according to these specifications, the OMTD Workflow Service (based on the Galaxy workflow execution engine) will be able to call it.

Metadata specifications

The dockerized component must be described with an OMTD-SHARE descriptor. This descriptor, which is a separate file and not contained in the Docker image, contains all relevant information about the component such as its ID, how to obtain it via a Docker repository (i.e. the location of the Docker image), the parameters of the component, etc. The OMTD-SHARE descriptor must be added by the user in the OpenMinTeD platform when registering a component (cf. guidelines on the registration of dockerised components) and it is subsequently used by the OpenMinTeD platform to automatically generate additional internal configuration files, e.g. to enable the Galaxy-based OpenMinTeD Workflow Service to invoke the component.

The metadata elements mentioned in the following paragraphs are the ones required for a docker-based component to be identified, pulled, spawned and invoked in OpenMinTeD. The remaining metadata of the OMTD-SHARE descriptor must be encoded in the same way as for non-docker-based components (see the examples of OMTD-SHARE metadata records for docker-based components).

  • <resourceIdentifier>: sets the component id; for dockerized components, this is the name used for the component.

  • <command>: the command used for invoking the component(s) (including the id of the component if the image contains more than one)

  • <inputContentResourceInfo>: set of elements with the specs for the input resource (i.e. the corpus or document) that will be processed by the component

  • <outputResourceInfo>: set of elements with the specs for the output resource (i.e. annotations) that will be produced by the component

  • <parameterInfos>: set of elements used for describing the parameters used when running a component

  • <distributionLocation>: the place where components can be accessed from; for dockerized components, this is the location of the Docker image following the Docker conventions

Technical specifications

A docker image for a TDM component must be self-contained: it must be able to execute the component's task inside a container and deliver the final outputs. If the component uses resources that change from one execution to another, we recommend exposing them as parameter values of the component. Otherwise, you must ensure that the required resources are available to each component container. The docker image of the component must be available via a docker repository accessible to OpenMinTeD (e.g., Docker Hub). Most importantly, the Dockerfile of the docker image must be valid according to a checklist defined by the OpenMinTeD administrators.

The OMTD docker image and TDM component that it hosts must follow a set of specifications which are described below.

The docker image must contain at least one TDM component

  • There must be a unique name for each component. The name is used to invoke a TDM component or workflow.

  • The component name must be a parameter that invokes a unique component within the docker run command.

In the OMTD-SHARE descriptor, the component name is described using the resourceIdentifier element as in the following example(s) and is also added in the command element, as detailed later.

Alvis Component
<resourceIdentifier resourceIdentifierSchemeName="OMTD-docker">
  default.modules.simpleprojector
</resourceIdentifier>
UIMA/DKPro component
<resourceIdentifier resourceIdentifierSchemeName="OMTD-docker">
  de.tudarmstadt.ukp.dkpro.core.mallet.lda.MalletLdaTopicModelInferencer
</resourceIdentifier>

The docker image must include a TDM component executable

  • There must be an executable which is used as the entry point for the docker image.

  • This executable is responsible for executing TDM component(s) or workflow(s). The executable must be a parameter that invokes one of the components in the docker run command, since a docker image can contain more than one TDM component.

In the OMTD-SHARE descriptor, the executable is described using the command element as in the example(s) below. The value of the command element must contain the executable (e.g., alvisnlp) and the component id (e.g., default.modules.simpleprojector) with a space between them. The value of the executable is a string, without any special character.

Alvis TDM executable
<command>alvisnlp default.modules.simpleprojector</command>
DKPro TDM executable
<command>dkpro-core de.tudarmstadt.ukp.dkpro.core.mallet.lda.MalletLdaTopicModelInferencer</command>
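The relationship between the command element and the docker run call can be sketched as follows. This is a minimal illustration and not the platform's actual implementation: the helper names split_command and build_docker_run are hypothetical, and appending the command value verbatim after the image name is an assumption about how the Workflow Service assembles the call.

```python
def split_command(command_value: str) -> tuple[str, str]:
    """Split a <command> value into (executable, component id).

    The descriptor requires exactly these two tokens, separated by a space.
    """
    parts = command_value.split()
    if len(parts) != 2:
        raise ValueError(
            f"expected '<executable> <component-id>', got {command_value!r}"
        )
    return parts[0], parts[1]


def build_docker_run(image: str, command_value: str, extra_args: list[str]) -> list[str]:
    """Assemble an argument vector for docker run.

    Assumption: the executable and component id follow the image name,
    and all remaining arguments are passed through to the component.
    """
    executable, component_id = split_command(command_value)
    return ["docker", "run", image, executable, component_id, *extra_args]
```

Returning a list rather than a single string avoids shell quoting problems when the result is handed to a process launcher such as subprocess.run.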

Reading input data and writing output data for each TDM component

  • The component must specify inputs and outputs as parameters to the docker run command in its OMTD-SHARE descriptor.

  • It must be possible to set (with a parameter named --input) the input in the docker run in the following format: --input <PATH-TO-THE-INPUT-DATA>

  • It must be possible to set (with a parameter named --output) the output in the docker run in the following format: --output <PATH-TO-THE-OUTPUT-DATA>

  • There must not be any additional parameters to indicate data input/output locations

  • Input and output data must be expected as files

  • The input and output files must be in one of the formats declared in the OMTD-SHARE descriptor

  • If a component declares multiple input/output formats in its OMTD-SHARE descriptor, there must be parameters called --input-format/--output-format which indicate which input/output format to use.
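As an illustration of the rules above, a component entry point written in Python could parse these options with the standard argparse module. This is only a sketch under stated assumptions: the function name build_parser is hypothetical, the format identifiers passed to it are placeholders, and making the format options mandatory when several formats are declared is an interpretation of the rule rather than something the specification states.

```python
import argparse


def build_parser(input_formats: list[str], output_formats: list[str]) -> argparse.ArgumentParser:
    """Build a parser exposing the data input/output options.

    input_formats / output_formats: the formats the component declares in
    its descriptor (placeholder values in this sketch).
    """
    # allow_abbrev=False so that only the exact option names are accepted.
    parser = argparse.ArgumentParser(prog="component", allow_abbrev=False)
    parser.add_argument("--input", required=True, help="path to the input data")
    parser.add_argument("--output", required=True, help="path to the output data")
    # The format options only exist when more than one format is declared.
    if len(input_formats) > 1:
        parser.add_argument("--input-format", choices=input_formats, required=True)
    if len(output_formats) > 1:
        parser.add_argument("--output-format", choices=output_formats, required=True)
    return parser


def parse_io_arguments(parser: argparse.ArgumentParser, argv: list[str]):
    """Return (namespace, remaining arguments).

    parse_known_args leaves unrecognised arguments, such as component
    parameters, in the second element for later processing.
    """
    return parser.parse_known_args(argv)
```

No other option in this parser points at a data location, in line with the rule that --input and --output are the only such parameters.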

In the OMTD-SHARE descriptor, the metadata for the input and the output are provided in the inputContentResourceInfo and outputResourceInfo elements, respectively. At a minimum, values for the processingResourceTypes, dataFormats and characterEncodings sub-elements must be filled in.

<inputContentResourceInfo>
  <processingResourceTypes>
    <processingResourceType>corpus</processingResourceType>
  </processingResourceTypes>
  <dataFormats>
    <dataFormatInfo>
      <dataFormat>http://w3id.org/meta-share/omtd-share/Xmi</dataFormat>
    </dataFormatInfo>
  </dataFormats>
  <characterEncodings>
    <characterEncoding>UTF8</characterEncoding>
  </characterEncodings>
  [...]
</inputContentResourceInfo>
<outputResourceInfo>
  <processingResourceTypes>
    <processingResourceType>corpus</processingResourceType>
  </processingResourceTypes>
  <dataFormats>
    <dataFormatInfo>
      <dataFormat>http://w3id.org/meta-share/omtd-share/WebAnnotationFormat</dataFormat>
    </dataFormatInfo>
  </dataFormats>
  <characterEncodings>
    <characterEncoding>UTF8</characterEncoding>
  </characterEncodings>
  [...]
</outputResourceInfo>
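A descriptor fragment of this shape can be checked mechanically before registration. The sketch below uses Python's standard xml.etree.ElementTree; the function names and the SAMPLE fragment are illustrative only, and the lookup paths assume elements without an XML namespace, as in the examples above (a real OMTD-SHARE record may declare one, in which case the paths would need to be qualified).

```python
import xml.etree.ElementTree as ET

# Sub-elements that must be filled in for input and output descriptions.
REQUIRED_SUB_ELEMENTS = ("processingResourceTypes", "dataFormats", "characterEncodings")

# Illustrative fragment, modelled on the examples in this document.
SAMPLE = """
<inputContentResourceInfo>
  <processingResourceTypes>
    <processingResourceType>corpus</processingResourceType>
  </processingResourceTypes>
  <dataFormats>
    <dataFormatInfo>
      <dataFormat>http://w3id.org/meta-share/omtd-share/Xmi</dataFormat>
    </dataFormatInfo>
  </dataFormats>
  <characterEncodings>
    <characterEncoding>UTF8</characterEncoding>
  </characterEncodings>
</inputContentResourceInfo>
"""


def missing_sub_elements(resource_info: ET.Element) -> list[str]:
    """Return the required sub-elements that are absent or have no children."""
    missing = []
    for name in REQUIRED_SUB_ELEMENTS:
        element = resource_info.find(name)
        if element is None or len(element) == 0:
            missing.append(name)
    return missing


def declared_formats(resource_info: ET.Element) -> list[str]:
    """Return the data formats declared in the fragment, in document order."""
    path = "dataFormats/dataFormatInfo/dataFormat"
    return [el.text.strip() for el in resource_info.iterfind(path) if el.text and el.text.strip()]


def needs_format_option(resource_info: ET.Element) -> bool:
    """A format option is only needed when several formats are declared."""
    return len(declared_formats(resource_info)) > 1
```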

Accepting parameters for each TDM component

  • If the component declares a parameter in its OMTD-SHARE descriptor, then it must be possible to specify this parameter in the docker run in the following format: --param:<PARAMETER-NAME>=<PARAMETER-VALUE>

  • If a parameter accepts multiple values, then these must be comma-separated

  • If a value contains a comma, it must be escaped using a backslash: \,

  • If a value contains a backslash, it must be escaped using a second backslash: \\
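The escaping rules above can be sketched as a small encoder and decoder. The function names are hypothetical, and rejecting any escape sequence other than \, and \\ is an assumption of this sketch, since the rules only define those two.

```python
def escape_value(value: str) -> str:
    """Escape backslashes first, then commas, as the rules require."""
    return value.replace("\\", "\\\\").replace(",", "\\,")


def split_values(raw: str) -> list[str]:
    """Split a raw parameter value on unescaped commas and undo the escaping."""
    values, current = [], []
    chars = iter(raw)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, None)
            # Assumption: only \, and \\ are valid escape sequences.
            if nxt not in (",", "\\"):
                raise ValueError(f"invalid escape sequence in {raw!r}")
            current.append(nxt)
        elif ch == ",":
            values.append("".join(current))
            current = []
        else:
            current.append(ch)
    values.append("".join(current))
    return values


def parse_param(argument: str) -> tuple[str, list[str]]:
    """Parse one --param:<PARAMETER-NAME>=<PARAMETER-VALUE> argument."""
    prefix = "--param:"
    if not argument.startswith(prefix):
        raise ValueError(f"not a parameter argument: {argument!r}")
    # Split on the first '=' only, so values may themselves contain '='.
    name, separator, raw = argument[len(prefix):].partition("=")
    if not name or not separator:
        raise ValueError(f"expected --param:<name>=<value>, got {argument!r}")
    return name, split_values(raw)
```

For a single-valued parameter the decoder returns a one-element list, so the caller can treat both cases uniformly.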

In the OMTD-SHARE descriptor, the metadata for each parameter is filled in a parameterInfo element. Values for the parameterName, parameterType and optional sub-elements are required. The following example describes a parameter.

<parameterInfo>
  <parameterName>targetlayerName</parameterName>
  <parameterLabel>target Name Layer</parameterLabel>
  <parameterDescription>
    Name of the layer that contains the match annotations.
  </parameterDescription>
  <parameterType>string</parameterType>
  <optional>true</optional>
  <multiValue>false</multiValue>
  <defaultValue>concepts</defaultValue>
</parameterInfo>
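To show how such a description relates to the command line, the sketch below turns a parameterInfo element into a --param: argument. It is illustrative only: the function name param_argument is hypothetical, the element is assumed to carry no XML namespace, and passing the default value explicitly when no value is supplied is a design choice of this sketch, not a requirement of the specification.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Sub-elements whose values are required in every parameterInfo.
REQUIRED_FIELDS = ("parameterName", "parameterType", "optional")


def param_argument(parameter_info_xml: str, value: Optional[str] = None) -> Optional[str]:
    """Render a --param:<name>=<value> argument for one parameterInfo element.

    Returns None when the parameter is optional and has neither a supplied
    value nor a default value.
    """
    element = ET.fromstring(parameter_info_xml)
    fields = {child.tag: (child.text or "").strip() for child in element}
    for required in REQUIRED_FIELDS:
        if not fields.get(required):
            raise ValueError(f"missing required sub-element: {required}")
    if value is None:
        # Assumption: fall back to the declared default, if any.
        value = fields.get("defaultValue") or None
    if value is None:
        if fields["optional"] == "true":
            return None
        raise ValueError(f"parameter {fields['parameterName']} is mandatory")
    # Escape backslashes and commas as required for parameter values.
    escaped = value.replace("\\", "\\\\").replace(",", "\\,")
    return f"--param:{fields['parameterName']}={escaped}"
```

With the example above and no explicit value, the function yields the argument --param:targetlayerName=concepts.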

Fully identify the docker image

  • Following the Docker convention, each image must have a full tag composed of a repository name, a specific tag and a version of the image (e.g., {repository-name}/{specific-tag}:{version})

In the OMTD-SHARE descriptor, the tag/location of the docker image is encoded in the distributionLocation element, as in the example below, where bibliome is the repository name, alvisengine is the specific tag used to name the alvis image and 1.0.0 is the version of the image.

<distributionLocation>bibliome/alvisengine:1.0.0</distributionLocation>
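A distributionLocation value can be checked against the three-part pattern with a short regular expression. This sketch covers only the {repository-name}/{specific-tag}:{version} form described here; Docker itself accepts other reference forms (for example with a registry host), which this hypothetical helper deliberately rejects.

```python
import re
from typing import NamedTuple


class ImageReference(NamedTuple):
    repository: str   # e.g. bibliome
    name: str         # the "specific tag" naming the image, e.g. alvisengine
    version: str      # e.g. 1.0.0


# Three non-empty parts that contain no '/', ':' or whitespace.
_IMAGE_PATTERN = re.compile(
    r"^(?P<repository>[^/:\s]+)/(?P<name>[^/:\s]+):(?P<version>[^/:\s]+)$"
)


def parse_distribution_location(location: str) -> ImageReference:
    """Split a full image tag into repository name, image name and version."""
    match = _IMAGE_PATTERN.match(location.strip())
    if match is None:
        raise ValueError(
            f"expected {{repository-name}}/{{specific-tag}}:{{version}}, got {location!r}"
        )
    return ImageReference(match["repository"], match["name"], match["version"])
```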

Package for the dockerized components

  • An up-to-date version of the Dockerfile and the resources required for the build process: the build process must end up with a docker image containing the component(s)

  • The OMTD-SHARE descriptor of the docker-based component, respecting the above specifications. If there is more than one component in the docker image, each component must have its own OMTD-SHARE descriptor. These can be uploaded to the registry by creating or editing an OMTD-SHARE metadata record at: https://services.openminted.eu/home