HOWTO: Build an OpenMinTeD runnable component based on AlvisNLP/ML docker image

This document describes how to set up an OpenMinTeD runnable component from AlvisNLP/ML modules.

We use the AlvisNLP/ML framework packaged as a Docker image (available on Docker Hub) that includes all AlvisNLP/ML modules and resources. These guidelines describe how AlvisNLP/ML “plans” are used to adapt modules as runnable components, and how the components are described to meet OpenMinTeD requirements.

Requirements

  • docker version 1.13.1
  • 4 GB of available disk space
  • Basic XML and Java knowledge

AlvisNLP/ML Basics

The AlvisNLP/ML framework is a corpus processing engine that features a library of processing modules. This library includes modules for tokenization, sentence splitting, POS-tagging, parsing, NER, relation extraction, etc.

A plan is a preconfigured recipe that uses the elementary Alvis components to define a specific runnable module. Such runnable modules are workflows, but in the OpenMinTeD context they are treated as OpenMinTeD-compatible modules. Thus, rather than composing several modules, a plan here simply adapts one Alvis module into an OpenMinTeD component by preparing the interface for its inputs, outputs and parameters.

Define a runnable component with an Alvis plan

A plan for a runnable component is an XML file (with the extension .plan) that contains three parts: a read part that configures the inputs, a write part that configures the outputs, and a process part that configures the task of the Alvis module being adapted as an OpenMinTeD runnable component.

The following plan adapts the Alvis module named WoSMig into a runnable component. WoSMig tokenizes text documents. The plan is composed of the Alvis module TextFileReader, which reads text files, the module TabularExport, which exports the results in tabular form, and the process module WoSMig, which performs the tokenization task. All runnable components are set up in this way; only the read and write parts change according to the needs. The process modules to use are available in Alvis. Alvis also provides several typical modules for the read and write parts (new modules can also be implemented, for example to convert new formats).

<alvisnlp-plan id="OMTD_WoSMig">
	<read class="TextFileReader"/>
	<annotation class="WoSMig"/>
	<write class="TabularExport"/>
</alvisnlp-plan>

You can set parameter values directly in the plan when they do not need to be exposed as input parameters of the component. This has two advantages: it records the optimal parameters and values, and it reduces the number of input parameters the end user has to consider. In the following modified plan, the parameters punctuations and balancedPunctuations of the module WoSMig are set.

<alvisnlp-plan id="OMTD_WoSMig">
	<read class="TextFileReader"/>
	<annotation class="WoSMig">
		<punctuations>?.!;,:-</punctuations>
		<balancedPunctuations>()[]{}""</balancedPunctuations>
	</annotation>
	<write class="TabularExport"/>
</alvisnlp-plan>
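
As a quick sanity check before running a plan, the three parts described above can be inspected programmatically. The following is a minimal sketch in Python using only the standard library; the helper name `summarize_plan` is hypothetical and is not part of AlvisNLP/ML. It only checks the layout used in this document (one `read`, one `annotation` and one `write` element) and does not validate the plan against the Alvis engine.

```python
import xml.etree.ElementTree as ET

# The plan from this document, embedded as a string for the example.
PLAN = '''\
<alvisnlp-plan id="OMTD_WoSMig">
    <read class="TextFileReader"/>
    <annotation class="WoSMig">
        <punctuations>?.!;,:-</punctuations>
        <balancedPunctuations>()[]{}""</balancedPunctuations>
    </annotation>
    <write class="TabularExport"/>
</alvisnlp-plan>
'''


def summarize_plan(plan_xml):
    """Return the plan id and, per part, the module class and preset parameters.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(plan_xml)
    steps = {}
    for step in root:
        # Child elements of a part are the parameter values set in the plan.
        params = {child.tag: child.text for child in step}
        steps[step.tag] = (step.get("class"), params)
    return root.get("id"), steps


plan_id, steps = summarize_plan(PLAN)

# The layout used in this document expects these three parts.
missing = {"read", "annotation", "write"} - set(steps)
if missing:
    raise ValueError(f"plan {plan_id} lacks parts: {sorted(missing)}")

for tag, (module_class, params) in steps.items():
    print(tag, module_class, params)
```

Listing the preset parameters this way makes it easy to see which values are fixed in the plan and which remain to be supplied on the command line.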

IMPORTANT: Our interest here is in using Alvis plans to make Alvis modules compatible with OpenMinTeD. More generally, plans are used to define complex workflows. A more complete presentation of how to write plans is available here.

The previous plan defines a self-contained runnable component that can be executed with the following command. The -v option mounts the directory through which the Docker container accesses the input and output data. mandiayba/alvisengine:1.0.0 identifies the Docker image, and alvisnlp runs the Alvis engine with the given parameters. The plan is passed to the Alvis engine as the last argument.

# `sourcePath` tells the `TextFileReader` module (`read`) where to find the input.
# `outDir` tells the `TabularExport` module (`write`) where to write the output.
# Parameters of `WoSMig` can be added with further -param options if needed.
docker run -i --rm -a stderr -v $PWD/workdir:/opt/alvisnlp/data mandiayba/alvisengine:1.0.0 \
           alvisnlp \
           -param read sourcePath /opt/alvisnlp/data[/path/to/text/files] \
           -param write outDir /opt/alvisnlp/data[/path/to/the/outdirectory/] \
           -param WoSMig ... \
           /path/to/the/plan.plan
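
When the command is launched from a script, assembling it as an argument list avoids quoting mistakes. The sketch below, in Python, only builds and prints the command; it does not call Docker. The function `build_command`, the host path `/home/user/workdir`, the sub-directories `corpus` and `output`, and the choice to store the plan inside the mounted directory are all assumptions made for this example.

```python
import shlex
from pathlib import PurePosixPath

IMAGE = "mandiayba/alvisengine:1.0.0"
# Mount point used inside the container in the command of this document.
CONTAINER_DATA = PurePosixPath("/opt/alvisnlp/data")


def build_command(host_workdir, input_subdir, output_subdir, plan_file):
    """Assemble the docker invocation as an argument list (hypothetical helper).

    Assumption: the plan file is placed in the mounted work directory so
    that the container can read it.
    """
    return [
        "docker", "run", "-i", "--rm", "-a", "stderr",
        "-v", f"{host_workdir}:{CONTAINER_DATA}",
        IMAGE, "alvisnlp",
        "-param", "read", "sourcePath", str(CONTAINER_DATA / input_subdir),
        "-param", "write", "outDir", str(CONTAINER_DATA / output_subdir),
        str(CONTAINER_DATA / plan_file),
    ]


argv = build_command("/home/user/workdir", "corpus", "output", "OMTD_WoSMig.plan")

# Print a shell-quoted version for inspection; the list itself could be
# passed to subprocess.run(argv, check=True) to actually start the container.
print(shlex.join(argv))
```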

Defining a plan requires you to know Alvis and its modules. However, most of the time you will be reusing existing plans created by the Alvis developers. To find out which modules to use, you can check from the command line in a Docker container using the following commands.

docker run mandiayba/alvisengine:1.0.0 alvisnlp -help # Alvis general help

docker run mandiayba/alvisengine:1.0.0 alvisnlp -supportedModules # list modules, including some typical readers and writers

docker run mandiayba/alvisengine:1.0.0 alvisnlp -supportedConversions # list more complex converters

docker run mandiayba/alvisengine:1.0.0 alvisnlp -moduleDoc WoSMig # user documentation of the module named `WoSMig`

Describe the runnable component for OpenMinTeD

Along with the self-contained runnable component, OpenMinTeD requires you to provide a description of the component based on the OpenMinTeD Metadata Schema. We therefore use that schema to describe the component. At a minimum, instances of the mandatory elements of the OpenMinTeD Schema are required. Alvis automatically generates some element instances of the schema (module name and presentation, input and output parameter descriptions, etc.); others currently need to be defined by hand (e.g., external resources, citation, etc.). Regardless of the method, what matters is to provide an XML description of the component that is valid against the schema.

Particular attention is required for the metadata directly related to the execution of the component, namely the command and the input and output parameters. The command metadata (see the command element) is similar to the command presented in the previous section, except that the parameter values are replaced by variables referencing the parameter names of the component. The plan is treated as an ancillary resource identified and located with the metadata element relatedResource.

The following command is a value of the metadata element command. It assumes that the component has two parameters named incorpus and outdir, declared as instances of parameterName elements. It also assumes that the plan of the component is described as an ancillary resource (see here for how to fully describe an ancillary resource).

# Additional parameters may exist for the `read` and `write` modules, depending on the component.
# Parameters of `WoSMig` can be added with further -param options if the usage requires them.
# The plan defined for the module is provided as a related resource.
docker run -i --rm -a stderr -v /path/to/OMTD_Workdir:/opt/alvisnlp/data mandiayba/alvisengine:1.0.0 \
           alvisnlp \
           -param read sourcePath ${incorpus} \
           -param write outDir ${outdir} \
           -param WoSMig ... \
           /path/to/the/relatedResource.plan
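
To illustrate how the variables in such a command relate to parameter values, here is a minimal Python sketch using the standard `string.Template` class, which happens to use the same `${name}` syntax. How OpenMinTeD itself resolves the variables is not described in this document, so this only mimics the idea; the two values given for incorpus and outdir are hypothetical.

```python
from string import Template

# The command metadata from this document, with its two variables.
COMMAND = Template(
    "docker run -i --rm -a stderr "
    "-v /path/to/OMTD_Workdir:/opt/alvisnlp/data mandiayba/alvisengine:1.0.0 "
    "alvisnlp "
    "-param read sourcePath ${incorpus} "
    "-param write outDir ${outdir} "
    "/path/to/the/relatedResource.plan"
)

# Hypothetical parameter values, expressed as paths inside the container.
values = {
    "incorpus": "/opt/alvisnlp/data/corpus",
    "outdir": "/opt/alvisnlp/data/output",
}

# substitute() raises KeyError if a variable has no value, which catches
# a mismatch between the command and the declared parameter names.
resolved = COMMAND.substitute(values)
print(resolved)
```

Using `substitute` rather than `safe_substitute` is deliberate here: a variable in the command that matches no declared parameter name should fail loudly instead of being left unresolved.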

IMPORTANT: We assume in the above command that OpenMinTeD will do the matching between the path to the mounted OMTD_Workdir and the paths to the input (and output) data.

Each component parameter must be described in the metadata, at least with a name, a description, and a type (or format). That description, generated by the Alvis engine, is required to feed the parameter values from OpenMinTeD forms.
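
The minimum requirement above (a name, a description, and a type for each parameter) can be checked before packaging. The following Python sketch uses plain dictionaries; the keys `name`, `description` and `type`, the helper `missing_fields`, and the sample values are illustrative assumptions and are not element names of the OpenMinTeD Metadata Schema.

```python
# Fields every parameter description needs, according to the text above.
REQUIRED_FIELDS = ("name", "description", "type")


def missing_fields(parameter):
    """Return the required fields that are absent or blank (hypothetical helper)."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = parameter.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Hypothetical descriptions of the two parameters used in the command.
parameters = [
    {"name": "incorpus", "description": "Directory of input text files", "type": "directory"},
    {"name": "outdir", "description": "", "type": "directory"},
]

for parameter in parameters:
    problems = missing_fields(parameter)
    status = "ok" if not problems else f"missing: {problems}"
    print(parameter["name"], status)
```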

The package for OpenMinTeD

The package will contain the two XML files representing the description and the plan of the component. Ancillary resources can be added according to the component.

IMPORTANT: We assume that the OpenMinTeD platform manages the docker images and containers. The deployment of the Alvis engine and its modules for the execution of the components will thus be implicit.