- Introduction
- Overviews
- Compliance
- Requirements
- 1. Components must be described by machine-readable metadata
- 2. Component metadata have to be embedded into the component source code
- 3. Component metadata must be separable from the component
- 4. URL to actual content must be discoverable
- 5. Components must detail all their environmental requirements for execution
- 6. Components should have a unique identifier and a version number
- 7. Components must have a fully qualified name that follows the Java class naming conventions
- 8. Components must associate themselves with categories defined by the OpenMinTeD project
- 9. Components must declare their annotation schema dependencies
- 10. Components should specify the types of the annotations that they input and output
- 11. Components must declare whether they can be scaled within a workflow
- 12. Components should provide documentation describing their functionality
- 13. Citation information for component should be included in the metadata
- 14. Components must maintain License information
- 15. Human readable information should be provided by each resource
- 16. Models/resources should be useable across different component collections/platforms
- 17. Components should be stateless
- 18. Workflows should be described using an uniform language
- 19. Components that use external knowledge resources should delegate access to a resource adapter instead of handling it themselves
- 20. Workflow engines should not require to see data
- 21. Configuration and parametrizable options of the components should be identified and documented
- 22. The Workflow Engine Should Permit Saving Experimental Conditions in a Workflow
- 23. The Workflow Engine should permit Licence Aggregation in Workflows
- 24. Using/treating workflows as components
- 25. Incorporation of multiple resources in parallel
- 26. It should be possible to determine the source of an annotation/assigned category
- 27. Components should handle failures gracefully
- 28. Processing components should be downloadable
- 29. The actual content of all content resources must be discoverable
- 30. Metrics for the confidence level of the TDM operation should be included in the metadata
- 31. Metrics for the performance of the TDM operation should be included in the metadata
- 32. Version must be included in the metadata description for all resources
- 33. Licensing information must be included in the metadata
- 34. Licensing information should be expressed in a machine-readable form
- 35. All resources must include a unique persistent identifier
- 36. Classification metadata should be included, where applicable, in the metadata record of the resource
- 37. Information on the structural annotation (layout) of resources should be included in the metadata of the resource
- 38. Access mode of resources must be included in the metadata
- 39. Content resources must include metadata on their format (e.g. XML, DOCX etc.)
- 40. Component metadata must include standardised categories/tags that make them easy to discover
- 41. Content resources must include metadata on their language(s)
- 42. The metadata can include the information on which projects/workflows involve the resource
- 43. S/W (tools, web services, workflows) must indicate whether they are language-independent or the language(s) of the resources they take as input and output
- 44. Statistical metadata that allow monitoring of resource versions may accompany resources
- 45. S/W (tools, web services, workflows) must indicate format of their output
- 46. Output resources of web services/workflows must be accompanied by provenance metadata
- 47. Information on funding of resources may be included in the metadata
- 48. All resource metadata records must include a reference to the metadata schema used for their description
- 49. Metadata of tools should contain information about the models available for them
- 50. Documentation references should be versioned
- 51. License should be attached
- 52. License information must be in metadata
- 53. Licensor must be entitled to grant license
- 54. Licensees should remain with a copy of the license
- 55. Standard licenses should be used
- 56. License should be machine readable
- 57. License should be understandable by non-lawyers
- 58. TDM must be explicitly allowed
- 59. Right for (temporary) reproduction must be granted
- 60. Boundary for derivative work must be clearly defined
- 61. No restrictions on TDM results which are not derived works
- 62. World-wide and irrevocable license grant
- 63. License must qualify for Open Access rights
- 64. License must qualify for Open Access uses
- 65. License must qualify for Open Access must not restrict use in any way
- 66. License must qualify for Open Access may include attribution requirements
- 67. Knowledge Resource Element Id
- 68. Data Category Linking Vocabulary
- 69. Interoperability between elements from different knowledge resource schemas should be expressed through RDF statements.
- 70. All KR content elements need to be added as text annotations within a TDM workflow.
- 71. The KR should be ingestible through a URI
- 72. The KR format should be in a standard format such as XML, JSON-LD or RDF/XML.
- 73. Stick to widely used data compression formats
- 74. Machine-readable metadata for UIMA components
- 75. Embedding UIMA component metadata into the source code
- 76. Separating UIMA metadata from the component
- 77. Unique identifiers and versions for UIMA components
- 78. Specifying input and output types of UIMA components
- 79. Documentation of UIMA components
- 80. Common elements to represent/describe an executable workflow
- 81. Embedding GATE component metadata into the source code
- 82. Unique identifiers and versions for GATE components
- 83. Documentation of GATE components
- 84. Separating GATE metadata from the component
- 85. Unique identifier and version for components in the OMTD-SHARE schema
- 86. Attaching format properties to the description of the inputs and outputs that use file
- 87. Embedding language capability in UIMA component metadata
- 88. Embedding output format in UIMA component metadata
- 89. Version documentation in parallel with component/resource
- 90. Components must be assigned at least one category from the OMTD-SHARE controlled vocabulary for component types
- 91. Encoding citable publications (for scholarly attribution) in resource metadata records
- 92. Including license text in resource packages
- 93. Provide identifiers for knowledge resource elements
- 94. Data Category Linking Vocabulary
- 95. The KR format should be in a standard format such as XML, JSON-LD or RDF/XML.
- 96. Unique identifiers and versions for components using Maven
- 97. Declaring scaleout capability in UIMA
- 98. Publishing components via software repositories (Maven, Docker)
- 99. Encoding in the metadata a direct access link for content resources
- 100. Providing access to content resources (sharing/exposing and transferring)
- 101. Making models and annotation resources accessible as entities distinct from the components they are compatible with
- 102. Adding version information in the metadata descriptions of all resources
- 103. Specifying access mode of resources and encoding it in the metadata descriptions
- 104. Encoding funding information in the metadata descriptions of all resources
- 105. Encoding of format in the metadata description of content resources
- 106. Encoding licensing terms in the metadata description of the resource
- 107. Encoding metadata on domain/subject/ classification for all resources when applicable
- 108. Encoding language information in the metadata of content resources
- 109. Encoding statistical information in the content resources
- 110. Assigning a unique persistent identifier for all resources
- 111. Annotation schema dependencies for UIMA components using Maven
- Bibliography
This document will host the interoperability specification for the OpenMinTeD platform. Currently, there is not much to see here. For the time being, we use this place to dump intermediate information that may or may not end up in the final specification.
Introduction
Requirement metadata
ID
Every requirement has an ID. We start counting from 1 and every new requirement increments the ID by 1. The ID is encoded in the requirement filename, e.g. 1.adoc.
Concreteness
The OpenMinTeD platform aims to be open and inclusive. As such, it aims to support multiple popular technologies and standards. As popularity is changing over time and as new standards and technologies are evolving, the platform will have to evolve as well. The distinction between abstract and concrete interoperability requirements that we make here allows us to answer two questions:
- How difficult is it for a new technology or standard to be incorporated into the platform?
- How difficult is it to integrate new components based on already supported technologies and standards into the platform?
Abstract requirements are agnostic to concrete technologies and standards; assessing a new technology or standard against them helps answer the first question. Concrete requirements refer to specific implementation details and help answer the second question.
- Abstract - the requirement specifies a need, but does not go into details how this need must be fulfilled. The requirement may provide examples of techniques or implementations that fulfill the requirement, but does not mandate their use.
- Concrete - the requirement specifies a need and prescribes the use of specific techniques, standards, implementations, etc.
Strength
- Mandatory - compliance with a mandatory requirement is obligatory. Non-compliance with any mandatory requirement entails non-compliance with the specification as a whole.
- Recommended - compliance with a recommended requirement is not obligatory but strongly desired.
- Optional - compliance with an optional requirement is not obligatory and not strongly desired, but considered beneficial.
Status
The requirement status indicates how far a requirement has proceeded in its lifecycle. Whether a requirement may be changed, and how, depends on this status.
- Draft - the requirement is a suggestion and can be changed substantially in any respect.
- Final - the requirement is ready for release. Changes to a final requirement are only allowed if they do not affect the compliance status of any product, component, format, etc. that has already been evaluated against the requirement specification. If a change would alter any compliance status, the existing requirement must not be changed; instead, a new requirement must be created under a new ID, and compliance must be evaluated against this new requirement specification in the next iteration. The previous requirement must be moved to deprecated status.
- Deprecated - the requirement is no longer to be used for compliance assessment. The requirement specification must not be changed, except for amendments that add pointers to potential new versions of the requirement and provide a rationale for the deprecation.
Category
The category of a requirement is used to anchor it in the document structure of the interoperability specification. The actual document structure is kept in a separate file to facilitate its refactoring. A requirement may be in multiple categories, which must be provided as a comma-separated list.
Compliance assessment
Requirement compliance levels
- Full - fully compliant
- Partial - partially compliant, e.g. some parts of a product are compliant but not all. This is typically the case if a product is in a state of transition from a non-compliant to a compliant state.
- No - not compliant yet, but could be made compliant in the future.
- N/A - not applicable. This is used when a requirement is considered out-of-scope for a given product. Typically this occurs when multiple products belonging to a broadly similar category are considered potentially relevant for a requirement, but on closer examination the requirement does not make sense for a particular product. An example is requirement 20, "Workflow engines should not require to see data", which describes a workflow-related requirement. This would be applicable to products including a workflow engine such as GATE, but not to pure component collections such as DKPro Core (note that GATE and DKPro Core typically occur together as relevant products in requirements).
If a product does not appear in a compliance table at all, the requirement was never considered applicable to that product and no justification is required.
Overall compliance levels
- None - not compliant at all
- Silver - fully compliant with all mandatory requirements
- Gold - fully compliant with all mandatory and recommended requirements
- Platin - fully compliant with all mandatory, recommended, and optional requirements
Overall compliance levels are maintained separately for abstract and concrete requirements. A product may attain a higher compliance level on abstract requirements than on concrete requirements. For example, an abstract mandatory requirement may be to provide metadata in a machine-readable format while a related concrete requirement may prescribe a certain metadata format. If a product offers metadata in a machine-readable format, it would be fully compliant with the abstract requirement, but if it does not use the metadata format prescribed by the concrete requirement, it is not compliant here. The product may thus attain silver compliance on the abstract level, but no overall compliance level on the concrete level.
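The rules above can be sketched in a few lines of Python. This is only an illustration, not part of the specification: the `Assessment` class and the `overall_level` function are hypothetical names, and skipping requirements assessed as N/A is an assumption, since the text does not state how N/A counts towards the overall level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assessment:
    """One requirement evaluated against one product (hypothetical structure)."""
    concreteness: str  # "abstract" or "concrete"
    strength: str      # "mandatory", "recommended" or "optional"
    compliance: str    # "Full", "Partial", "No" or "N/A"


def overall_level(assessments, concreteness):
    """Return "None", "Silver", "Gold" or "Platin" for one concreteness level."""
    # Assumption: requirements assessed as N/A do not count for or against a product.
    relevant = [a for a in assessments
                if a.concreteness == concreteness and a.compliance != "N/A"]

    def all_full(strength):
        return all(a.compliance == "Full" for a in relevant if a.strength == strength)

    if not all_full("mandatory"):
        return "None"
    if not all_full("recommended"):
        return "Silver"
    if not all_full("optional"):
        return "Gold"
    return "Platin"


# A product that is compliant on the abstract level but not on the concrete one.
assessments = [
    Assessment("abstract", "mandatory", "Full"),
    Assessment("abstract", "recommended", "Partial"),
    Assessment("concrete", "mandatory", "No"),
]
print(overall_level(assessments, "abstract"))  # Silver
print(overall_level(assessments, "concrete"))  # None
```

Note that in this sketch a level with no applicable requirements of a given strength counts as satisfied for that strength; a real assessment tool may want to treat that case differently.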
Adding new requirements
Files and IDs
Requirements are maintained in the folder req, one requirement per file, and files are numbered starting at 1. The number corresponds to the ID of the requirement.
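As an illustration of this numbering scheme, the following Python sketch lists the IDs already in use and proposes the next free one. The function names are hypothetical, and the sketch assumes that requirement files are named `<ID>.adoc` and that any other file in the folder (such as TEMPLATE.adoc) is ignored.

```python
import re
from pathlib import Path

# Matches requirement files such as "1.adoc" or "42.adoc", but not "TEMPLATE.adoc".
_REQ_FILE = re.compile(r"^(\d+)\.adoc$")


def requirement_ids(req_dir):
    """Return the sorted requirement IDs encoded in the file names of req_dir."""
    ids = []
    for path in Path(req_dir).iterdir():
        match = _REQ_FILE.match(path.name)
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)


def next_requirement_id(req_dir):
    """Return the ID for a new requirement: the highest existing ID plus one, or 1."""
    ids = requirement_ids(req_dir)
    return ids[-1] + 1 if ids else 1
```

Taking the highest existing ID plus one, rather than the first unused number, avoids reusing an ID if a file was ever removed.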
Template
This is the template we presently use for recording the requirements. It contains the requirement metadata as described above and a description. Links to other requirements can be embedded in the description.
== Here goes the requirement title
[%hardbreaks]
[small]#*_Concreteness:_* __abstract|concrete__#
[small]#*_Strength:_* __mandatory|recommended|optional__#
[small]#*_Category:_* __WG1__,__WG2__,__WG3__,__WG4__#
[small]#*_Status:_* __draft|final|deprecated__#
This is where the description of the requirement goes. If you wish to reference another requirement from here, you can do it like link:{include-dir}req/1.adoc[this].
// Below is an example of how a compliance evaluation table could look. This is presently optional
// and may be moved to a more structured/principled format later maintained in separate files.
[cols="2,1,1,4,1"]
|====
|Product|Version|Compliant|Justification|Status
| Alvis
|
| Unknown
| Unknown
| Draft
| ARGO
| 0.5
| Unknown
| Unknown
| Draft
| DKPro Core
| 1.8.0
| Unknown
| Unknown
| Draft
| GATE
| 8.2
| Unknown
| Unknown
| Draft
| ILSP
|
| Unknown
| Unknown
| Draft
|====
src/main/asciidoc/openminted-interoperability-spec/req/TEMPLATE.adoc
And this is how the requirement is rendered in this document (except the heading which doesn’t work in this particular example).
== Here goes the requirement title
Concreteness: abstract|concrete
Strength: mandatory|recommended|optional
Category: WG1,WG2,WG3,WG4
Status: draft|final|deprecated
This is where the description of the requirement goes. If you wish to reference another requirement from here, you can do it like this.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis | | Unknown | Unknown | Draft |
ARGO | 0.5 | Unknown | Unknown | Draft |
DKPro Core | 1.8.0 | Unknown | Unknown | Draft |
GATE | 8.2 | Unknown | Unknown | Draft |
ILSP | | Unknown | Unknown | Draft |
src/main/asciidoc/openminted-interoperability-spec/req/TEMPLATE.adoc
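The metadata lines of the template are regular enough to be read by a small script, for example to generate overview tables. The following Python sketch is an assumption about how this could be done, not a description of the project's actual tooling. The function name `parse_requirement_metadata` is hypothetical, and the sketch assumes that a filled-in requirement keeps the `[small]#*_Key:_* __value__#` layout of the template with one value per field (several comma-separated values for the category).

```python
import re

# One metadata line of the template, e.g. "[small]#*_Strength:_* __mandatory__#".
_META_LINE = re.compile(r"^\[small\]#\*_(?P<key>\w+):_\* (?P<value>.*)#$")


def parse_requirement_metadata(adoc_text):
    """Extract concreteness, strength, category and status from a requirement file."""
    meta = {}
    for line in adoc_text.splitlines():
        match = _META_LINE.match(line.strip())
        if not match:
            continue
        key = match.group("key").lower()
        # Drop the AsciiDoc emphasis markers around the values.
        value = match.group("value").replace("_", "")
        if key == "category":
            # A requirement may be in several categories (comma-separated list).
            meta[key] = [part.strip() for part in value.split(",") if part.strip()]
        else:
            meta[key] = value.strip()
    return meta


example = """== Example requirement
[%hardbreaks]
[small]#*_Concreteness:_* __abstract__#
[small]#*_Strength:_* __mandatory__#
[small]#*_Category:_* __WG1__,__WG4__#
[small]#*_Status:_* __final__#
"""
print(parse_requirement_metadata(example))
```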
Overviews
By WG
WG1 (40)
| ID | Requirement | Concreteness | Status | Strength | WG’s |
|---|---|---|---|---|---|
| | URL to actual content must be discoverable | abstract | final | mandatory | |
| | Citation information for component should be included in the metadata | abstract | draft | recommended | WG1, WG4 (44) |
| | Licensing information must be included in the metadata | abstract | final | mandatory | WG1, WG3 (25) |
| | Licensing information should be expressed in a machine-readable form | abstract | final | recommended | WG1, WG3 (25) |
| | Classification metadata should be included, where applicable, in the metadata record of the resource | abstract | final | recommended | WG1, WG2 (27) |
| | Information on the structural annotation (layout) of resources should be included in the metadata of the resource | abstract | final | recommended | WG1 |
| | Access mode of resources must be included in the metadata | abstract | final | mandatory | |
| | Content resources must include metadata on their format (e.g. XML, DOCX etc.) | abstract | final | mandatory | WG1 |
| | Content resources must include metadata on their language(s) | abstract | final | mandatory | WG1, WG2 (27) |
| | S/W (tools, web services, workflows) must indicate whether they are language-independent or the language(s) of the resources they take as input and output | abstract | final | mandatory | WG1, WG4 (44) |
| | Statistical metadata that allow monitoring of resource versions may accompany resources | abstract | final | optional | WG1, WG2 (27) |
| | S/W (tools, web services, workflows) must indicate format of their output | abstract | final | mandatory | WG1, WG4 (44) |
| | Information on funding of resources may be included in the metadata | abstract | final | optional | |
| | Documentation references should be versioned | abstract | final | recommended | |
| | Machine-readable metadata for UIMA components | concrete | draft | mandatory | WG1, WG4 (44) |
| | Embedding UIMA component metadata into the source code | concrete | draft | mandatory | WG1, WG4 (44) |
| | Separating UIMA metadata from the component | concrete | draft | mandatory | WG1, WG4 (44) |
| | Specifying input and output types of UIMA components | concrete | draft | mandatory | WG1, WG4 (44) |
| | Documentation of UIMA components | concrete | draft | mandatory | WG1, WG4 (44) |
| | Embedding GATE component metadata into the source code | concrete | draft | mandatory | WG1, WG4 (44) |
| | Documentation of GATE components | concrete | draft | mandatory | WG1, WG4 (44) |
| | Separating GATE metadata from the component | concrete | draft | mandatory | WG1, WG4 (44) |
| | Embedding output format in UIMA component metadata | concrete | draft | mandatory | WG1, WG4 (44) |
| | Version documentation in parallel with component/resource | concrete | draft | recommended | |
| | Components must be assigned at least one category from the OMTD-SHARE controlled vocabulary for component types | concrete | draft | mandatory | WG1, WG4 (44) |
| | Encoding citable publications (for scholarly attribution) in resource metadata records | concrete | draft | recommended | |
| | Including license text in resource packages | concrete | draft | recommended | |
| | Unique identifiers and versions for components using Maven | concrete | draft | mandatory | WG1, WG4 (44) |
| | Encoding in the metadata a direct access link for content resources | concrete | draft | mandatory | WG1 |
| | Providing access to content resources (sharing/exposing and transferring) | concrete | draft | mandatory | WG1 |
| | Making models and annotation resources accessible as entities distinct from the components they are compatible with | concrete | draft | recommended | |
| | Adding version information in the metadata descriptions of all resources | concrete | draft | mandatory | |
| | Specifying access mode of resources and encoding it in the metadata descriptions | concrete | draft | mandatory | |
| | Encoding funding information in the metadata descriptions of all resources | concrete | draft | recommended | |
| | Encoding of format in the metadata description of content resources | concrete | draft | mandatory | WG1 |
| | Encoding licensing terms in the metadata description of the resource | concrete | draft | mandatory | WG1, WG3 (25) |
| | Encoding metadata on domain/subject/ classification for all resources when applicable | concrete | draft | recommended | WG1, WG2 (27) |
| | Encoding language information in the metadata of content resources | concrete | draft | mandatory | WG1, WG2 (27) |
| | Encoding statistical information in the content resources | concrete | draft | mandatory | WG1, WG2 (27) |
| | Assigning a unique persistent identifier for all resources | concrete | draft | mandatory | |
WG2 (27)
| ID | Requirement | Concreteness | Status | Strength | WG’s |
|---|---|---|---|---|---|
| | URL to actual content must be discoverable | abstract | final | mandatory | |
| | Components should specify the types of the annotations that they input and output | abstract | draft | mandatory | WG4 (44), WG2 |
| | Classification metadata should be included, where applicable, in the metadata record of the resource | abstract | final | recommended | WG1 (40), WG2 |
| | Access mode of resources must be included in the metadata | abstract | final | mandatory | |
| | Content resources must include metadata on their language(s) | abstract | final | mandatory | WG1 (40), WG2 |
| | Statistical metadata that allow monitoring of resource versions may accompany resources | abstract | final | optional | WG1 (40), WG2 |
| | Information on funding of resources may be included in the metadata | abstract | final | optional | |
| | Documentation references should be versioned | abstract | final | recommended | |
| | Knowledge Resource Element Id | abstract | final | recommended | WG2 |
| | Data Category Linking Vocabulary | abstract | final | recommended | WG2 |
| | Interoperability between elements from different knowledge resource schemas should be expressed through RDF statements. | abstract | final | recommended | WG2 |
| | All KR content elements need to be added as text annotations within a TDM workflow. | abstract | final | mandatory | WG2 |
| | The KR should be ingestible through a URI | abstract | final | recommended | WG2 |
| | The KR format should be in a standard format such as XML, JSON-LD or RDF/XML. | abstract | final | recommended | WG2 |
| | Version documentation in parallel with component/resource | concrete | draft | recommended | |
| | Encoding citable publications (for scholarly attribution) in resource metadata records | concrete | draft | recommended | |
| | Provide identifiers for knowledge resource elements | concrete | draft | recommended | WG2 |
| | Data Category Linking Vocabulary | concrete | draft | recommended | WG2 |
| | The KR format should be in a standard format such as XML, JSON-LD or RDF/XML. | concrete | draft | recommended | WG2 |
| | Making models and annotation resources accessible as entities distinct from the components they are compatible with | concrete | draft | recommended | |
| | Adding version information in the metadata descriptions of all resources | concrete | draft | mandatory | |
| | Specifying access mode of resources and encoding it in the metadata descriptions | concrete | draft | mandatory | |
| | Encoding funding information in the metadata descriptions of all resources | concrete | draft | recommended | |
| | Encoding metadata on domain/subject/ classification for all resources when applicable | concrete | draft | recommended | WG1 (40), WG2 |
| | Encoding language information in the metadata of content resources | concrete | draft | mandatory | WG1 (40), WG2 |
| | Encoding statistical information in the content resources | concrete | draft | mandatory | WG1 (40), WG2 |
| | Assigning a unique persistent identifier for all resources | concrete | draft | mandatory | |
WG3 (25)
| ID | Requirement | Concreteness | Status | Strength | WG’s |
|---|---|---|---|---|---|
| | URL to actual content must be discoverable | abstract | final | mandatory | |
| | Licensing information must be included in the metadata | abstract | final | mandatory | WG1 (40), WG3 |
| | Licensing information should be expressed in a machine-readable form | abstract | final | recommended | WG1 (40), WG3 |
| | Information on funding of resources may be included in the metadata | abstract | final | optional | |
| | Documentation references should be versioned | abstract | final | recommended | |
| | License should be attached | abstract | draft | recommended | WG3 |
| | Licensor must be entitled to grant license | abstract | draft | recommended | WG3 |
| | Licensees should remain with a copy of the license | abstract | draft | recommended | WG3 |
| | Standard licenses should be used | abstract | draft | recommended | WG3 |
| | License should be machine readable | abstract | draft | recommended | WG3 |
| | License should be understandable by non-lawyers | abstract | draft | recommended | WG3 |
| | TDM must be explicitly allowed | abstract | draft | recommended | WG3 |
| | Right for (temporary) reproduction must be granted | abstract | draft | recommended | WG3 |
| | Boundary for derivative work must be clearly defined | abstract | draft | recommended | WG3 |
| | No restrictions on TDM results which are not derived works | abstract | draft | recommended | WG3 |
| | World-wide and irrevocable license grant | abstract | draft | recommended | WG3 |
| | License must qualify for Open Access rights | abstract | draft | recommended | WG3 |
| | License must qualify for Open Access uses | abstract | draft | recommended | WG3 |
| | License must qualify for Open Access must not restrict use in any way | abstract | draft | recommended | WG3 |
| | License must qualify for Open Access may include attribution requirements | abstract | draft | recommended | WG3 |
| | Version documentation in parallel with component/resource | concrete | draft | recommended | |
| | Including license text in resource packages | concrete | draft | recommended | |
| | Adding version information in the metadata descriptions of all resources | concrete | draft | mandatory | |
| | Encoding licensing terms in the metadata description of the resource | concrete | draft | mandatory | WG1 (40), WG3 |
| | Assigning a unique persistent identifier for all resources | concrete | draft | mandatory | |
WG4 (44)
| ID | Requirement | Concreteness | Status | Strength | WG’s |
|---|---|---|---|---|---|
| | Components must be described by machine-readable metadata | abstract | final | mandatory | WG4 |
| | Component metadata have to be embedded into the component source code | abstract | final | mandatory | WG4 |
| | Component metadata must be separable from the component | abstract | final | mandatory | WG4 |
| | Components must detail all their environmental requirements for execution | abstract | draft | mandatory | WG4 |
| | Components should have a unique identifier and a version number | abstract | draft | mandatory | WG4 |
| | Components must have a fully qualified name that follows the Java class naming conventions | concrete | final | mandatory | WG4 |
| | Components must associate themselves with categories defined by the OpenMinTeD project | abstract | final | mandatory | WG4 |
| | Components must declare their annotation schema dependencies | abstract | final | mandatory | WG4 |
| | Components should specify the types of the annotations that they input and output | abstract | draft | mandatory | WG4, WG2 (27) |
| | Components must declare whether they can be scaled within a workflow | abstract | draft | mandatory | WG4 |
| | Components should provide documentation describing their functionality | abstract | final | recommended | WG4 |
| | Citation information for component should be included in the metadata | abstract | draft | recommended | WG1 (40), WG4 |
| | Models/resources should be useable across different component collections/platforms | abstract | final | recommended | WG4 |
| | Components should be stateless | concrete | final | recommended | WG4 |
| | Configuration and parametrizable options of the components should be identified and documented | abstract | final | recommended | WG4 |
| | It should be possible to determine the source of an annotation/assigned category | abstract | final | recommended | WG4 |
| | Processing components should be downloadable | abstract | final | recommended | WG4 |
| | Access mode of resources must be included in the metadata | abstract | final | mandatory | |
| | S/W (tools, web services, workflows) must indicate whether they are language-independent or the language(s) of the resources they take as input and output | abstract | final | mandatory | WG1 (40), WG4 |
| | S/W (tools, web services, workflows) must indicate format of their output | abstract | final | mandatory | WG1 (40), WG4 |
| | Information on funding of resources may be included in the metadata | abstract | final | optional | |
| | Documentation references should be versioned | abstract | final | recommended | |
| | Stick to widely used data compression formats | concrete | draft | best practice | WG4 |
| | Machine-readable metadata for UIMA components | concrete | draft | mandatory | WG1 (40), WG4 |
| | Embedding UIMA component metadata into the source code | concrete | draft | mandatory | WG1 (40), WG4 |
| | Separating UIMA metadata from the component | concrete | draft | mandatory | WG1 (40), WG4 |
| | Specifying input and output types of UIMA components | concrete | draft | mandatory | WG1 (40), WG4 |
| | Documentation of UIMA components | concrete | draft | mandatory | WG1 (40), WG4 |
| | Embedding GATE component metadata into the source code | concrete | draft | mandatory | WG1 (40), WG4 |
| | Documentation of GATE components | concrete | draft | mandatory | WG1 (40), WG4 |
| | Separating GATE metadata from the component | concrete | draft | mandatory | WG1 (40), WG4 |
| | Embedding output format in UIMA component metadata | concrete | draft | mandatory | WG1 (40), WG4 |
| | Version documentation in parallel with component/resource | concrete | draft | recommended | |
| | Components must be assigned at least one category from the OMTD-SHARE controlled vocabulary for component types | concrete | draft | mandatory | WG1 (40), WG4 |
| | Encoding citable publications (for scholarly attribution) in resource metadata records | concrete | draft | recommended | |
| | Including license text in resource packages | concrete | draft | recommended | |
| | Unique identifiers and versions for components using Maven | concrete | draft | mandatory | WG1 (40), WG4 |
| | Declaring scaleout capability in UIMA | concrete | draft | mandatory | WG4 |
| | Publishing components via software repositories (Maven, Docker) | concrete | draft | mandatory | WG4 |
| | Making models and annotation resources accessible as entities distinct from the components they are compatible with | concrete | draft | recommended | |
| | Adding version information in the metadata descriptions of all resources | concrete | draft | mandatory | |
| | Specifying access mode of resources and encoding it in the metadata descriptions | concrete | draft | mandatory | |
| | Encoding funding information in the metadata descriptions of all resources | concrete | draft | recommended | |
| | Annotation schema dependencies for UIMA components using Maven | concrete | draft | mandatory | WG4 |
By Status
deprecated (27)
| ID | Requirement | Concreteness | Strength | WG’s |
|---|---|---|---|---|
| | Components must maintain License information | abstract | mandatory | |
| | Human readable information should be provided by each resource | abstract | recommended | |
| | Workflows should be described using an uniform language | abstract | recommended | |
| | Components that use external knowledge resources should delegate access to a resource adapter instead of handling it themselves | abstract | optional | |
| | Workflow engines should not require to see data | concrete | recommended | |
| | The Workflow Engine Should Permit Saving Experimental Conditions in a Workflow | abstract | recommended | |
| | The Workflow Engine should permit Licence Aggregation in Workflows | abstract | recommended | |
| | Using/treating workflows as components | abstract | mandatory | |
| | Incorporation of multiple resources in parallel | abstract | recommended | |
| | Components should handle failures gracefully | abstract | recommended | |
| | The actual content of all content resources must be discoverable | abstract | mandatory | |
| | Metrics for the confidence level of the TDM operation should be included in the metadata | abstract | optional | |
| | Metrics for the performance of the TDM operation should be included in the metadata | abstract | optional | |
| | Version must be included in the metadata description for all resources | abstract | mandatory | |
| | All resources must include a unique persistent identifier | abstract | mandatory | |
| | Component metadata must include standardised categories/tags that make them easy to discover | abstract | mandatory | |
| | The metadata can include the information on which projects/workflows involve the resource | abstract | optional | |
| | Output resources of web services/workflows must be accompanied by provenance metadata | abstract | mandatory | |
| | All resource metadata records must include a reference to the metadata schema used for their description | abstract | mandatory | |
| | Metadata of tools should contain information about the models available for them | abstract | recommended | |
| | License information must be in metadata | abstract | recommended | |
| | Unique identifiers and versions for UIMA components | concrete | mandatory | |
| | Common elements to represent/describe an executable workflow | concrete | recommended | |
| | Unique identifiers and versions for GATE components | concrete | mandatory | |
| | Unique identifier and version for components in the OMTD-SHARE schema | concrete | mandatory | |
| | Attaching format properties to the description of the inputs and outputs that use file | concrete | recommended | |
| | Embedding language capability in UIMA component metadata | concrete | mandatory | |
draft (53)
ID | Requirement | Concreteness | Strength | WG’s |
---|---|---|---|---|
| | Components must detail all their environmental requirements for execution | abstract | mandatory | |
| | Components should have a unique identifier and a version number | abstract | mandatory | |
| | Components should specify the types of the annotations that they input and output | abstract | mandatory | |
| | Components must declare whether they can be scaled within a workflow | abstract | mandatory | |
| | Citation information for component should be included in the metadata | abstract | recommended | |
| | License should be attached | abstract | recommended | |
| | Licensor must be entitled to grant license | abstract | recommended | |
| | Licensees should remain with a copy of the license | abstract | recommended | |
| | Standard licenses should be used | abstract | recommended | |
| | License should be machine readable | abstract | recommended | |
| | License should be understandable by non-lawyers | abstract | recommended | |
| | TDM must be explicitly allowed | abstract | recommended | |
| | Right for (temporary) reproduction must be granted | abstract | recommended | |
| | Boundary for derivative work must be clearly defined | abstract | recommended | |
| | No restrictions on TDM results which are not derived works | abstract | recommended | |
| | World-wide and irrevocable license grant | abstract | recommended | |
| | License must qualify for Open Access rights | abstract | recommended | |
| | License must qualify for Open Access uses | abstract | recommended | |
| | License must qualify for Open Access must not restrict use in any way | abstract | recommended | |
| | License must qualify for Open Access may include attribution requirements | abstract | recommended | |
| | Stick to widely used data compression formats | concrete | best practice | |
| | Machine-readable metadata for UIMA components | concrete | mandatory | |
| | Embedding UIMA component metadata into the source code | concrete | mandatory | |
| | Separating UIMA metadata from the component | concrete | mandatory | |
| | Specifying input and output types of UIMA components | concrete | mandatory | |
| | Documentation of UIMA components | concrete | mandatory | |
| | Embedding GATE component metadata into the source code | concrete | mandatory | |
| | Documentation of GATE components | concrete | mandatory | |
| | Separating GATE metadata from the component | concrete | mandatory | |
| | Embedding output format in UIMA component metadata | concrete | mandatory | |
| | Version documentation in parallel with component/resource | concrete | recommended | |
| | Components must be assigned at least one category from the OMTD-SHARE controlled vocabulary for component types | concrete | mandatory | |
| | Encoding citable publications (for scholarly attribution) in resource metadata records | concrete | recommended | |
| | Including license text in resource packages | concrete | recommended | |
| | Provide identifiers for knowledge resource elements | concrete | recommended | |
| | Data Category Linking Vocabulary | concrete | recommended | |
| | The KR format should be in a standard format such as XML, JSON-LD or RDF/XML. | concrete | recommended | |
| | Unique identifiers and versions for components using Maven | concrete | mandatory | |
| | Declaring scaleout capability in UIMA | concrete | mandatory | |
| | Publishing components via software repositories (Maven, Docker) | concrete | mandatory | |
| | Encoding in the metadata a direct access link for content resources | concrete | mandatory | |
| | Providing access to content resources (sharing/exposing and transferring) | concrete | mandatory | |
| | Making models and annotation resources accessible as entities distinct from the components they are compatible with | concrete | recommended | |
| | Adding version information in the metadata descriptions of all resources | concrete | mandatory | |
| | Specifying access mode of resources and encoding it in the metadata descriptions | concrete | mandatory | |
| | Encoding funding information in the metadata descriptions of all resources | concrete | recommended | |
| | Encoding of format in the metadata description of content resources | concrete | mandatory | |
| | Encoding licensing terms in the metadata description of the resource | concrete | mandatory | |
| | Encoding metadata on domain/subject/ classification for all resources when applicable | concrete | recommended | |
| | Encoding language information in the metadata of content resources | concrete | mandatory | |
| | Encoding statistical information in the content resources | concrete | mandatory | |
| | Assigning a unique persistent identifier for all resources | concrete | mandatory | |
| | Annotation schema dependencies for UIMA components using Maven | concrete | mandatory | |
final (31)
ID | Requirement | Concreteness | Strength | WG’s |
---|---|---|---|---|
| | Components must be described by machine-readable metadata | abstract | mandatory | |
| | Component metadata have to be embedded into the component source code | abstract | mandatory | |
| | Component metadata must be separable from the component | abstract | mandatory | |
| | URL to actual content must be discoverable | abstract | mandatory | |
| | Components must have a fully qualified name that follows the Java class naming conventions | concrete | mandatory | |
| | Components must associate themselves with categories defined by the OpenMinTeD project | abstract | mandatory | |
| | Components must declare their annotation schema dependencies | abstract | mandatory | |
| | Components should provide documentation describing their functionality | abstract | recommended | |
| | Models/resources should be useable across different component collections/platforms | abstract | recommended | |
| | Components should be stateless | concrete | recommended | |
| | Configuration and parametrizable options of the components should be identified and documented | abstract | recommended | |
| | It should be possible to determine the source of an annotation/assigned category | abstract | recommended | |
| | Processing components should be downloadable | abstract | recommended | |
| | Licensing information must be included in the metadata | abstract | mandatory | |
| | Licensing information should be expressed in a machine-readable form | abstract | recommended | |
| | Classification metadata should be included, where applicable, in the metadata record of the resource | abstract | recommended | |
| | Information on the structural annotation (layout) of resources should be included in the metadata of the resource | abstract | recommended | |
| | Access mode of resources must be included in the metadata | abstract | mandatory | |
| | Content resources must include metadata on their format (e.g. XML, DOCX etc.) | abstract | mandatory | |
| | Content resources must include metadata on their language(s) | abstract | mandatory | |
| | S/W (tools, web services, workflows) must indicate whether they are language-independent or the language(s) of the resources they take as input and output | abstract | mandatory | |
| | Statistical metadata that allow monitoring of resource versions may accompany resources | abstract | optional | |
| | S/W (tools, web services, workflows) must indicate format of their output | abstract | mandatory | |
| | Information on funding of resources may be included in the metadata | abstract | optional | |
| | Documentation references should be versioned | abstract | recommended | |
| | Knowledge Resource Element Id | abstract | recommended | |
| | Data Category Linking Vocabulary | abstract | recommended | |
| | Interoperability between elements from different knowledge resource schemas should be expressed through RDF statements. | abstract | recommended | |
| | All KR content elements need to be added as text annotations within a TDM workflow. | abstract | mandatory | |
| | The KR should be ingestible through a URI | abstract | recommended | |
| | The KR format should be in a standard format such as XML, JSON-LD or RDF/XML. | abstract | recommended | |
By Strength
best practice (1)
ID | Requirement | Concreteness | Status | WG’s |
---|---|---|---|---|
| | Stick to widely used data compression formats | concrete | draft | |
mandatory (53)
ID | Requirement | Concreteness | Status | WG’s |
---|---|---|---|---|
| | Components must be described by machine-readable metadata | abstract | final | |
| | Component metadata have to be embedded into the component source code | abstract | final | |
| | Component metadata must be separable from the component | abstract | final | |
| | URL to actual content must be discoverable | abstract | final | |
| | Components must detail all their environmental requirements for execution | abstract | draft | |
| | Components should have a unique identifier and a version number | abstract | draft | |
| | Components must have a fully qualified name that follows the Java class naming conventions | concrete | final | |
| | Components must associate themselves with categories defined by the OpenMinTeD project | abstract | final | |
| | Components must declare their annotation schema dependencies | abstract | final | |
| | Components should specify the types of the annotations that they input and output | abstract | draft | |
| | Components must declare whether they can be scaled within a workflow | abstract | draft | |
| | Components must maintain License information | abstract | deprecated | |
| | Using/treating workflows as components | abstract | deprecated | |
| | The actual content of all content resources must be discoverable | abstract | deprecated | |
| | Version must be included in the metadata description for all resources | abstract | deprecated | |
| | Licensing information must be included in the metadata | abstract | final | |
| | All resources must include a unique persistent identifier | abstract | deprecated | |
| | Access mode of resources must be included in the metadata | abstract | final | |
| | Content resources must include metadata on their format (e.g. XML, DOCX etc.) | abstract | final | |
| | Component metadata must include standardised categories/tags that make them easy to discover | abstract | deprecated | |
| | Content resources must include metadata on their language(s) | abstract | final | |
| | S/W (tools, web services, workflows) must indicate whether they are language-independent or the language(s) of the resources they take as input and output | abstract | final | |
| | S/W (tools, web services, workflows) must indicate format of their output | abstract | final | |
| | Output resources of web services/workflows must be accompanied by provenance metadata | abstract | deprecated | |
| | All resource metadata records must include a reference to the metadata schema used for their description | abstract | deprecated | |
| | All KR content elements need to be added as text annotations within a TDM workflow. | abstract | final | |
| | Machine-readable metadata for UIMA components | concrete | draft | |
| | Embedding UIMA component metadata into the source code | concrete | draft | |
| | Separating UIMA metadata from the component | concrete | draft | |
| | Unique identifiers and versions for UIMA components | concrete | deprecated | |
| | Specifying input and output types of UIMA components | concrete | draft | |
| | Documentation of UIMA components | concrete | draft | |
| | Embedding GATE component metadata into the source code | concrete | draft | |
| | Unique identifiers and versions for GATE components | concrete | deprecated | |
| | Documentation of GATE components | concrete | draft | |
| | Separating GATE metadata from the component | concrete | draft | |
| | Unique identifier and version for components in the OMTD-SHARE schema | concrete | deprecated | |
| | Embedding language capability in UIMA component metadata | concrete | deprecated | |
| | Embedding output format in UIMA component metadata | concrete | draft | |
| | Components must be assigned at least one category from the OMTD-SHARE controlled vocabulary for component types | concrete | draft | |
| | Unique identifiers and versions for components using Maven | concrete | draft | |
| | Declaring scaleout capability in UIMA | concrete | draft | |
| | Publishing components via software repositories (Maven, Docker) | concrete | draft | |
| | Encoding in the metadata a direct access link for content resources | concrete | draft | |
| | Providing access to content resources (sharing/exposing and transferring) | concrete | draft | |
| | Adding version information in the metadata descriptions of all resources | concrete | draft | |
| | Specifying access mode of resources and encoding it in the metadata descriptions | concrete | draft | |
| | Encoding of format in the metadata description of content resources | concrete | draft | |
| | Encoding licensing terms in the metadata description of the resource | concrete | draft | |
| | Encoding language information in the metadata of content resources | concrete | draft | |
| | Encoding statistical information in the content resources | concrete | draft | |
| | Assigning a unique persistent identifier for all resources | concrete | draft | |
| | Annotation schema dependencies for UIMA components using Maven | concrete | draft | |
optional (6)
ID | Requirement | Concreteness | Status | WG’s |
---|---|---|---|---|
| | Components that use external knowledge resources should delegate access to a resource adapter instead of handling it themselves | abstract | deprecated | |
| | Metrics for the confidence level of the TDM operation should be included in the metadata | abstract | deprecated | |
| | Metrics for the performance of the TDM operation should be included in the metadata | abstract | deprecated | |
| | The metadata can include the information on which projects/workflows involve the resource | abstract | deprecated | |
| | Statistical metadata that allow monitoring of resource versions may accompany resources | abstract | final | |
| | Information on funding of resources may be included in the metadata | abstract | final | |
recommended (51)
ID | Requirement | Concreteness | Status | WG’s |
---|---|---|---|---|
| | Components should provide documentation describing their functionality | abstract | final | |
| | Citation information for component should be included in the metadata | abstract | draft | |
| | Human readable information should be provided by each resource | abstract | deprecated | |
| | Models/resources should be useable across different component collections/platforms | abstract | final | |
| | Components should be stateless | concrete | final | |
| | Workflows should be described using an uniform language | abstract | deprecated | |
| | Workflow engines should not require to see data | concrete | deprecated | |
| | Configuration and parametrizable options of the components should be identified and documented | abstract | final | |
| | The Workflow Engine Should Permit Saving Experimental Conditions in a Workflow | abstract | deprecated | |
| | The Workflow Engine should permit Licence Aggregation in Workflows | abstract | deprecated | |
| | Incorporation of multiple resources in parallel | abstract | deprecated | |
| | It should be possible to determine the source of an annotation/assigned category | abstract | final | |
| | Components should handle failures gracefully | abstract | deprecated | |
| | Processing components should be downloadable | abstract | final | |
| | Licensing information should be expressed in a machine-readable form | abstract | final | |
| | Classification metadata should be included, where applicable, in the metadata record of the resource | abstract | final | |
| | Information on the structural annotation (layout) of resources should be included in the metadata of the resource | abstract | final | |
| | Metadata of tools should contain information about the models available for them | abstract | deprecated | |
| | Documentation references should be versioned | abstract | final | |
| | License should be attached | abstract | draft | |
| | License information must be in metadata | abstract | deprecated | |
| | Licensor must be entitled to grant license | abstract | draft | |
| | Licensees should remain with a copy of the license | abstract | draft | |
| | Standard licenses should be used | abstract | draft | |
| | License should be machine readable | abstract | draft | |
| | License should be understandable by non-lawyers | abstract | draft | |
| | TDM must be explicitly allowed | abstract | draft | |
| | Right for (temporary) reproduction must be granted | abstract | draft | |
| | Boundary for derivative work must be clearly defined | abstract | draft | |
| | No restrictions on TDM results which are not derived works | abstract | draft | |
| | World-wide and irrevocable license grant | abstract | draft | |
| | License must qualify for Open Access rights | abstract | draft | |
| | License must qualify for Open Access uses | abstract | draft | |
| | License must qualify for Open Access must not restrict use in any way | abstract | draft | |
| | License must qualify for Open Access may include attribution requirements | abstract | draft | |
| | Knowledge Resource Element Id | abstract | final | |
| | Data Category Linking Vocabulary | abstract | final | |
| | Interoperability between elements from different knowledge resource schemas should be expressed through RDF statements. | abstract | final | |
| | The KR should be ingestible through a URI | abstract | final | |
| | The KR format should be in a standard format such as XML, JSON-LD or RDF/XML. | abstract | final | |
| | Common elements to represent/describe an executable workflow | concrete | deprecated | |
| | Attaching format properties to the description of the inputs and outputs that use file | concrete | deprecated | |
| | Version documentation in parallel with component/resource | concrete | draft | |
| | Encoding citable publications (for scholarly attribution) in resource metadata records | concrete | draft | |
| | Including license text in resource packages | concrete | draft | |
| | Provide identifiers for knowledge resource elements | concrete | draft | |
| | Data Category Linking Vocabulary | concrete | draft | |
| | The KR format should be in a standard format such as XML, JSON-LD or RDF/XML. | concrete | draft | |
| | Making models and annotation resources accessible as entities distinct from the components they are compatible with | concrete | draft | |
| | Encoding funding information in the metadata descriptions of all resources | concrete | draft | |
| | Encoding metadata on domain/subject/ classification for all resources when applicable | concrete | draft | |
By Concreteness
abstract (69)
ID | Requirement | Status | Strength | WG’s |
---|---|---|---|---|
| | Components must be described by machine-readable metadata | final | mandatory | |
| | Component metadata have to be embedded into the component source code | final | mandatory | |
| | Component metadata must be separable from the component | final | mandatory | |
| | URL to actual content must be discoverable | final | mandatory | |
| | Components must detail all their environmental requirements for execution | draft | mandatory | |
| | Components should have a unique identifier and a version number | draft | mandatory | |
| | Components must associate themselves with categories defined by the OpenMinTeD project | final | mandatory | |
| | Components must declare their annotation schema dependencies | final | mandatory | |
| | Components should specify the types of the annotations that they input and output | draft | mandatory | |
| | Components must declare whether they can be scaled within a workflow | draft | mandatory | |
| | Components should provide documentation describing their functionality | final | recommended | |
| | Citation information for component should be included in the metadata | draft | recommended | |
| | Components must maintain License information | deprecated | mandatory | |
| | Human readable information should be provided by each resource | deprecated | recommended | |
| | Models/resources should be useable across different component collections/platforms | final | recommended | |
| | Workflows should be described using an uniform language | deprecated | recommended | |
| | Components that use external knowledge resources should delegate access to a resource adapter instead of handling it themselves | deprecated | optional | |
| | Configuration and parametrizable options of the components should be identified and documented | final | recommended | |
| | The Workflow Engine Should Permit Saving Experimental Conditions in a Workflow | deprecated | recommended | |
| | The Workflow Engine should permit Licence Aggregation in Workflows | deprecated | recommended | |
| | Using/treating workflows as components | deprecated | mandatory | |
| | Incorporation of multiple resources in parallel | deprecated | recommended | |
| | It should be possible to determine the source of an annotation/assigned category | final | recommended | |
| | Components should handle failures gracefully | deprecated | recommended | |
| | Processing components should be downloadable | final | recommended | |
| | The actual content of all content resources must be discoverable | deprecated | mandatory | |
| | Metrics for the confidence level of the TDM operation should be included in the metadata | deprecated | optional | |
| | Metrics for the performance of the TDM operation should be included in the metadata | deprecated | optional | |
| | Version must be included in the metadata description for all resources | deprecated | mandatory | |
| | Licensing information must be included in the metadata | final | mandatory | |
| | Licensing information should be expressed in a machine-readable form | final | recommended | |
| | All resources must include a unique persistent identifier | deprecated | mandatory | |
| | Classification metadata should be included, where applicable, in the metadata record of the resource | final | recommended | |
| | Information on the structural annotation (layout) of resources should be included in the metadata of the resource | final | recommended | |
| | Access mode of resources must be included in the metadata | final | mandatory | |
| | Content resources must include metadata on their format (e.g. XML, DOCX etc.) | final | mandatory | |
| | Component metadata must include standardised categories/tags that make them easy to discover | deprecated | mandatory | |
| | Content resources must include metadata on their language(s) | final | mandatory | |
| | The metadata can include the information on which projects/workflows involve the resource | deprecated | optional | |
| | S/W (tools, web services, workflows) must indicate whether they are language-independent or the language(s) of the resources they take as input and output | final | mandatory | |
| | Statistical metadata that allow monitoring of resource versions may accompany resources | final | optional | |
| | S/W (tools, web services, workflows) must indicate format of their output | final | mandatory | |
| | Output resources of web services/workflows must be accompanied by provenance metadata | deprecated | mandatory | |
| | Information on funding of resources may be included in the metadata | final | optional | |
| | All resource metadata records must include a reference to the metadata schema used for their description | deprecated | mandatory | |
| | Metadata of tools should contain information about the models available for them | deprecated | recommended | |
| | Documentation references should be versioned | final | recommended | |
| | License should be attached | draft | recommended | |
| | License information must be in metadata | deprecated | recommended | |
| | Licensor must be entitled to grant license | draft | recommended | |
| | Licensees should remain with a copy of the license | draft | recommended | |
| | Standard licenses should be used | draft | recommended | |
| | License should be machine readable | draft | recommended | |
| | License should be understandable by non-lawyers | draft | recommended | |
| | TDM must be explicitly allowed | draft | recommended | |
| | Right for (temporary) reproduction must be granted | draft | recommended | |
| | Boundary for derivative work must be clearly defined | draft | recommended | |
| | No restrictions on TDM results which are not derived works | draft | recommended | |
| | World-wide and irrevocable license grant | draft | recommended | |
| | License must qualify for Open Access rights | draft | recommended | |
| | License must qualify for Open Access uses | draft | recommended | |
| | License must qualify for Open Access must not restrict use in any way | draft | recommended | |
| | License must qualify for Open Access may include attribution requirements | draft | recommended | |
| | Knowledge Resource Element Id | final | recommended | |
| | Data Category Linking Vocabulary | final | recommended | |
| | Interoperability between elements from different knowledge resource schemas should be expressed through RDF statements. | final | recommended | |
| | All KR content elements need to be added as text annotations within a TDM workflow. | final | mandatory | |
| | The KR should be ingestible through a URI | final | recommended | |
| | The KR format should be in a standard format such as XML, JSON-LD or RDF/XML. | final | recommended | |
concrete (42)
ID | Requirement | Status | Strength | WG’s |
---|---|---|---|---|
| | Components must have a fully qualified name that follows the Java class naming conventions | final | mandatory | |
| | Components should be stateless | final | recommended | |
| | Workflow engines should not require to see data | deprecated | recommended | |
| | Stick to widely used data compression formats | draft | best practice | |
| | Machine-readable metadata for UIMA components | draft | mandatory | |
| | Embedding UIMA component metadata into the source code | draft | mandatory | |
| | Separating UIMA metadata from the component | draft | mandatory | |
| | Unique identifiers and versions for UIMA components | deprecated | mandatory | |
| | Specifying input and output types of UIMA components | draft | mandatory | |
| | Documentation of UIMA components | draft | mandatory | |
| | Common elements to represent/describe an executable workflow | deprecated | recommended | |
| | Embedding GATE component metadata into the source code | draft | mandatory | |
| | Unique identifiers and versions for GATE components | deprecated | mandatory | |
| | Documentation of GATE components | draft | mandatory | |
| | Separating GATE metadata from the component | draft | mandatory | |
| | Unique identifier and version for components in the OMTD-SHARE schema | deprecated | mandatory | |
| | Attaching format properties to the description of the inputs and outputs that use file | deprecated | recommended | |
| | Embedding language capability in UIMA component metadata | deprecated | mandatory | |
| | Embedding output format in UIMA component metadata | draft | mandatory | |
| | Version documentation in parallel with component/resource | draft | recommended | |
| | Components must be assigned at least one category from the OMTD-SHARE controlled vocabulary for component types | draft | mandatory | |
| | Encoding citable publications (for scholarly attribution) in resource metadata records | draft | recommended | |
| | Including license text in resource packages | draft | recommended | |
| | Provide identifiers for knowledge resource elements | draft | recommended | |
| | Data Category Linking Vocabulary | draft | recommended | |
| | The KR format should be in a standard format such as XML, JSON-LD or RDF/XML. | draft | recommended | |
| | Unique identifiers and versions for components using Maven | draft | mandatory | |
| | Declaring scaleout capability in UIMA | draft | mandatory | |
| | Publishing components via software repositories (Maven, Docker) | draft | mandatory | |
| | Encoding in the metadata a direct access link for content resources | draft | mandatory | |
| | Providing access to content resources (sharing/exposing and transferring) | draft | mandatory | |
| | Making models and annotation resources accessible as entities distinct from the components they are compatible with | draft | recommended | |
| | Adding version information in the metadata descriptions of all resources | draft | mandatory | |
| | Specifying access mode of resources and encoding it in the metadata descriptions | draft | mandatory | |
| | Encoding funding information in the metadata descriptions of all resources | draft | recommended | |
| | Encoding of format in the metadata description of content resources | draft | mandatory | |
| | Encoding licensing terms in the metadata description of the resource | draft | mandatory | |
| | Encoding metadata on domain/subject/ classification for all resources when applicable | draft | recommended | |
| | Encoding language information in the metadata of content resources | draft | mandatory | |
| | Encoding statistical information in the content resources | draft | mandatory | |
| | Assigning a unique persistent identifier for all resources | draft | mandatory | |
| | Annotation schema dependencies for UIMA components using Maven | draft | mandatory | |
Compliance
By Product
Numbers exclude deprecated requirements.
ARGO (44)
Compliance | # | % |
---|---|---|
Full | 9 | 20 |
No | 15 | 34 |
Partial | 20 | 45 |
Agrovoc (26)
Compliance | # | % |
---|---|---|
Full | 21 | 81 |
No | 4 | 15 |
Partial | 1 | 4 |
Alvis (37)
Compliance | # | % |
---|---|---|
Full | 7 | 19 |
No | 15 | 41 |
Partial | 15 | 41 |
CLARIN CCR (8)
Compliance | # | % |
---|---|---|
Full | 4 | 50 |
No | 4 | 50 |
ID | Requirement | Compliance |
---|---|---|
67 | | Full |
68 | | No |
69 | | No |
71 | | No |
72 | The KR format should be in a standard format such as XML, JSON-LD or RDF/XML. | Full |
93 | | Full |
94 | | No |
95 | The KR format should be in a standard format such as XML, JSON-LD or RDF/XML. | Full |
CORE (18)
Compliance | # | % |
---|---|---|
Full | 7 | 39 |
No | 1 | 6 |
Partial | 10 | 56 |
DKPro Core (46)
Compliance | # | % |
---|---|---|
Full | 26 | 57 |
No | 6 | 13 |
Partial | 14 | 30 |
Frontiers (18)
Compliance | # | % |
---|---|---|
Full | 9 | 50 |
No | 5 | 28 |
Partial | 4 | 22 |
GATE (36)
Compliance | # | % |
---|---|---|
Full | 13 | 36 |
No | 14 | 39 |
Partial | 9 | 25 |
ILSP (44)
Compliance | # | % |
---|---|---|
Full | 14 | 32 |
No | 15 | 34 |
Partial | 15 | 34 |
JATS (13)
Compliance | # | % |
---|---|---|
Full | 2 | 15 |
N/A | 1 | 8 |
No | 5 | 38 |
Partial | 5 | 38 |
ID | Requirement | Compliance |
---|---|---|
4 | | Partial |
33 | | Partial |
38 | | No |
41 | Content resources must include metadata on their language(s) | No |
50 | | No |
67 | | Partial |
68 | | No |
69 | | N/A |
71 | | Full |
72 | The KR format should be in a standard format such as XML, JSON-LD or RDF/XML. | Full |
93 | | Partial |
94 | | No |
95 | The KR format should be in a standard format such as XML, JSON-LD or RDF/XML. | Partial |
LAPPS (25)
Compliance | # | % |
---|---|---|
Full | 12 | 48 |
No | 12 | 48 |
Partial | 1 | 4 |
Licences (4)
Compliance | # | % |
---|---|---|
Full | 1 | 25 |
Partial | 2 | 50 |
Unknown | 1 | 25 |
ID | Requirement | Compliance |
---|---|---|
33 | | Full |
34 | Licensing information should be expressed in a machine-readable form | Partial |
50 | | Partial |
89 | | Unknown |
OLiA (25)
Compliance | # | % |
---|---|---|
Full | 12 | 48 |
No | 12 | 48 |
Partial | 1 | 4 |
Ontolex (8)
Compliance | # | % |
---|---|---|
Full | 8 | 100 |
ID | Requirement | Compliance |
---|---|---|
67 | | Full |
68 | | Full |
69 | | Full |
71 | | Full |
72 | The KR format should be in a standard format such as XML, JSON-LD or RDF/XML. | Full |
93 | | Full |
94 | | Full |
95 | The KR format should be in a standard format such as XML, JSON-LD or RDF/XML. | Full |
OpenAIRE (18)
Compliance | # | % |
---|---|---|
Full | 2 | 11 |
No | 4 | 22 |
Partial | 12 | 67 |
TheSoz (26)
Compliance | # | % |
---|---|---|
Full | 20 | 77 |
No | 4 | 15 |
Partial | 2 | 8 |
schema.org (8)
Compliance | # | % |
---|---|---|
Full | 7 | 88 |
No | 1 | 13 |
ID | Requirement | Compliance |
---|---|---|
67 | | Full |
68 | | Full |
69 | | No |
71 | | Full |
72 | The KR format should be in a standard format such as XML, JSON-LD or RDF/XML. | Full |
93 | | Full |
94 | | Full |
95 | The KR format should be in a standard format such as XML, JSON-LD or RDF/XML. | Full |
Without justification
ID | Requirement | Product | Compliance |
---|---|---|---|
Interoperability between elements from different knowledge resource schemas should be expressed through RDF statements. |
TheSoz |
Full |
Requirements
1. Components must be described by machine-readable metadata
Concreteness: abstract
Strength: mandatory
Status: final
Category: WG4
In order to incorporate components into the platform in an efficient manner, the platform must be able to automatically obtain information about them. Hence, components must provide machine-readable metadata by which they describe themselves.
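As a rough illustration of what "automatically obtain information" can mean in practice, the sketch below parses a small XML descriptor into a dictionary that a platform could index. The descriptor layout, its element names and the function `read_descriptor` are hypothetical; they do not reproduce the UIMA, CREOLE or Alvis descriptor formats.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptor: element and attribute names are illustrative only.
DESCRIPTOR = """\
<component>
  <name>org.example.tagger.PosTagger</name>
  <version>1.0.0</version>
  <description>Assigns part-of-speech tags to tokens.</description>
  <parameter name="language" type="String" mandatory="true"/>
  <parameter name="modelLocation" type="String" mandatory="false"/>
</component>
"""


def read_descriptor(xml_text):
    """Extract the self-description of a component without running any of its code."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.findtext("name"),
        "version": root.findtext("version"),
        "description": root.findtext("description"),
        "parameters": [
            {
                "name": p.get("name"),
                "type": p.get("type"),
                "mandatory": p.get("mandatory") == "true",
            }
            for p in root.findall("parameter")
        ],
    }


metadata = read_descriptor(DESCRIPTOR)
print(metadata["name"], [p["name"] for p in metadata["parameters"]])
```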
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
Full |
Alvis component descriptors |
Draft |
ARGO |
0.5 |
Full |
UIMA component descriptors |
Draft |
DKPro Core |
1.8.0 |
Full |
UIMA component descriptors, Maven project descriptors |
Draft |
GATE |
8.2 |
Full |
CREOLE descriptors |
Draft |
ILSP |
1.2.1 |
Full |
UIMA component descriptors |
Draft |
2. Component metadata have to be embedded into the component source code
Concreteness: abstract
Strength: mandatory
Status: final
Category: WG4
To avoid implementation and metadata getting out-of-sync, the metadata have to be closely integrated with the component source code such that e.g. parameter names and types are obtained directly from the implementation (cf. [uimafit]).
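The idea of obtaining parameter names and types directly from the implementation can be sketched in Python as an analogy. This is not uimaFIT, which uses Java annotations; the class `TaggerConfig` and the function `describe_parameters` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class TaggerConfig:
    """Parameters of a hypothetical tagger; this class is the single source of truth."""
    language: str = "en"
    max_sentence_length: int = 200


def describe_parameters(config_class):
    """Derive parameter metadata from the class itself, so the two cannot drift apart."""
    return [
        {"name": f.name, "type": f.type.__name__, "default": f.default}
        for f in fields(config_class)
    ]


print(describe_parameters(TaggerConfig))
```

Renaming or retyping a field changes the generated metadata automatically, which is the property the requirement asks for.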
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
No |
Alvis XML descriptors manually maintained |
Draft |
ARGO |
0.5 |
No |
(I believe) - UIMA XML descriptors manually maintained |
Draft |
DKPro Core |
1.8.0 |
Full |
using uimaFIT Java annotations to automatically generate UIMA XML descriptors |
Draft |
GATE |
8.2 |
Partial |
using CREOLE Java annotations, but not yet in all components |
Draft |
ILSP |
1.2.1 |
Partial |
UIMA XML descriptors are needed for UIMA-AS integration, while uimaFIT Java annotations are used to integrate components into command-line pipelines |
Draft |
3. Component metadata must be separable from the component
Concreteness: abstract
Strength: mandatory
Status: final
Category: WG4
The component metadata should be provided in such a way that it is separable from the component. I.e. despite the canonical source of much of the metadata being the source code (and in a second instance, the compiled code), it should not be necessary to inspect the source code (or compiled code) to access the metadata or to actually invoke the component. Instead, the component should be accompanied by a metadata file that was automatically generated during build time. Component repositories for example should use this file when looking for component metadata. The file should be at a well-known location within the component artifact.
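A minimal sketch of the "well-known location" idea follows: a repository reads the generated metadata file from inside the component artifact without loading or inspecting any code. The path `META-INF/component-metadata.json`, the JSON format and both function names are assumptions made for this example; the requirement itself does not prescribe them.

```python
import io
import json
import zipfile

# Hypothetical well-known location inside the artifact.
METADATA_PATH = "META-INF/component-metadata.json"


def build_artifact(metadata):
    """Stand-in for the build step: package placeholder code plus the generated metadata file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("org/example/PosTagger.class", b"\xca\xfe\xba\xbe")  # placeholder bytes
        archive.writestr(METADATA_PATH, json.dumps(metadata))
    return buffer.getvalue()


def read_metadata(artifact_bytes):
    """Read only the metadata entry; the component code is never touched."""
    with zipfile.ZipFile(io.BytesIO(artifact_bytes)) as archive:
        return json.loads(archive.read(METADATA_PATH))


artifact = build_artifact({"name": "org.example.PosTagger", "version": "1.0.0"})
print(read_metadata(artifact))
```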
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
Full |
Alvis XML descriptors exist separately from the code. Each XML descriptor sits next to the component class it describes. By convention, the name of the XML file is formed by appending the keyword "Doc" to the component name, for example GeniaTaggerDoc.xml for the component named GeniaTagger. |
Draft |
ARGO |
0.5 |
Partial |
UIMA XML descriptors manually maintained |
Draft |
DKPro Core |
1.8.0 |
Full |
using uimaFIT to automatically generate UIMA XML descriptors. XML descriptors are always next to the respective component classes. Pointers to all XML descriptors are stored at
|
Draft |
GATE |
8.2 |
Partial |
only for those components that still use CREOLE descriptors and do not use CREOLE Java annotations. It is, however, trivial to produce separate versions as GATE contains an ANT task to write CREOLE descriptors to disk from the Java annotations and it would be trivial to add this to the build process of each plugin. |
Draft |
ILSP |
1.2.1 |
Full |
Maven POM files embedded in JARs under |
Draft |
4. URL to actual content must be discoverable
The URL of the actual content, e.g. a text file, PDF file, or other common data format, must be derivable from readily discoverable data. It must either be included in the metadata or it must be reliably constructable from the metadata URL, e.g. by appending or changing a suffix. Derivation rules must be clearly documented.
Note
|
This requirement refers only to content resources that can be used as input for a TDM process (i.e. corpora of publications etc.), or to knowledge resources (e.g. annotation schemas, typesystems, grammars etc.) that are used as ancillary resources for the operation of TDM components. The requirement is set for the proper operation of the TDM components in so far as they need to easily access the resources they operate on or with. |
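The two options named in this requirement (an explicit content URL in the metadata, or a documented derivation rule) can be sketched as follows. The record keys `content_url` and `metadata_url`, the `.xml` to `.pdf` suffix rule and the example host are all hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def derive_content_url(metadata_url, metadata_suffix=".xml", content_suffix=".pdf"):
    """Apply a documented rule: swap the metadata suffix for the content suffix."""
    parts = urlsplit(metadata_url)
    if not parts.path.endswith(metadata_suffix):
        raise ValueError(f"cannot derive content URL from {metadata_url!r}")
    path = parts.path[: -len(metadata_suffix)] + content_suffix
    return urlunsplit(parts._replace(path=path))


def content_url(record):
    """Prefer an explicit URL in the metadata; otherwise fall back to the derivation rule."""
    return record.get("content_url") or derive_content_url(record["metadata_url"])


print(content_url({"metadata_url": "https://repo.example.org/records/123.xml"}))
```

Raising an error when the rule does not apply, rather than guessing, keeps the derivation reliable in the sense the requirement asks for.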
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE |
Jun-16 |
Partial |
The actual content (PDF, text, or other) is available by constructing a URL from identifiers (internal CORE ID, OAI identifier, DOI) provided in the metadata record |
Final |
OpenAIRE |
Jun-16 |
Partial |
A link to the DOI is provided, but not always; the DOI usually resolves to a landing page, although some rules can be used for "guessing" the download link |
Final |
Frontiers |
NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 |
Full |
From the XML files, the URL can be reliably constructed using the DOI. |
Final |
TheSoz |
Jun-16 |
Full |
The URLs of the resource and of the SPARQL endpoint are provided in the void.ttl file (it should be located at http://lod.gesis.org/thesoz/void.ttl; however, it is currently not available online due to technical issues). |
Final |
Agrovoc |
21/01/2016 |
Full |
The URLs of the two versions of the resource and of the SPARQL endpoint are provided in the void.ttl file (http://aims.fao.org/aos/agrovoc/void.ttl) |
Final |
JATS |
1.1 |
Partial |
Although the JATS schemas (as DTD, RELAX NG and XSD) and elements are available at stable URIs, each in a separate folder (details are provided at https://jats.nlm.nih.gov/files.html), there is no formal metadata to indicate this |
Final |
OLiA |
Jun-16 |
Full |
all resources are at the landing page: http://acoli.cs.uni-frankfurt.de/resources/olia/ |
Final |
LAPPS |
Jun-16 |
Full |
in various formats, available from http://vocab.lappsgrid.org/ |
Final |
5. Components must detail all their environmental requirements for execution
Concreteness: abstract
Strength: mandatory
Category: WG4
Status: draft
Text and Data Mining (TDM) components are not always self-contained entities; for example, a component may be a wrapper around another library developed in a language not supported by the TDM framework. This library may have additional system dependencies that must be available at runtime: a Java TDM component wrapping a Python library, for instance, requires Python and potentially other Python libraries. Knowing the environmental requirements allows a workflow execution system to ensure that they are met before the component is initialised.
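A pre-flight check of declared environmental requirements might look like the sketch below. The requirement keys `min_python` and `executables` are a hypothetical schema invented for this example; a real platform would define its own vocabulary.

```python
import shutil
import sys


def missing_requirements(requirements):
    """Return a list of unmet requirements; an empty list means the component can be initialised."""
    problems = []
    min_python = requirements.get("min_python")
    if min_python and sys.version_info[:2] < tuple(min_python):
        problems.append(f"Python >= {min_python[0]}.{min_python[1]} required")
    for executable in requirements.get("executables", []):
        # shutil.which returns None when the executable is not found on PATH.
        if shutil.which(executable) is None:
            problems.append(f"executable not on PATH: {executable}")
    return problems


declared = {"min_python": (3, 0), "executables": ["surely-not-installed-tool-xyz"]}
print(missing_requirements(declared))
```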
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
Partial |
Some components may need external libraries or resources. Alvis works better in a Linux environment. |
Draft |
ARGO |
0.5 |
Partial |
All components are distributed as UIMA PEAR files written in Java, with some components requiring platform-specific binaries. Platform-specific binaries are provided for OS X, Windows and Linux. This appears to be the same behaviour as DKPro Core. When running a workflow on a cluster/cloud, Argo requires an expected maximum memory limit per component instance in a workflow; this is not included in the component’s metadata and has to be entered manually by users. |
Draft |
DKPro Core |
1.8.0 |
Partial |
All components are in Java; some require platform-specific binaries, which are usually provided in a pre-compiled form for Linux, Windows, and OS X. There is no metadata explaining which components rely on binaries or which platforms and OS versions are supported. E.g. some binaries (hunpos) do not work on modern Linux distributions, while others do not work on old versions. The required Java version is available through a property defined in the POM. |
Draft |
GATE |
8.2 |
Partial |
All components are written in Java although some may make use of external platform-specific binaries. Where this is the case either the binaries are provided as part of the plugin, or documentation is provided explaining how to obtain and install the appropriate components. There is no metadata to highlight the need for external components, although the metadata can (and usually does) provide a URL to the documentation where it would be discussed in detail. |
Draft |
ILSP |
1.2.1 |
Full |
All components are in Java. The required Java version is available through a property defined in the POM. |
Draft |
6. Components should have a unique identifier and a version number
Concreteness: abstract
Strength: mandatory
Category: WG4
Status: draft
Components should contain an identifier by which they can be distinguished from each other, in addition to their version. A component registry would then be able to use the combination of the identifier and version number to produce resolvable URLs from which components can be retrieved.
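How a registry could turn an identifier and a version into a resolvable URL is sketched below. The registry base URL and the `<base>/<identifier>/<version>` layout are assumptions; the requirement does not fix a URL scheme.

```python
from urllib.parse import quote


def component_url(registry_base, identifier, version):
    """Build a retrieval URL from the (identifier, version) pair."""
    return "/".join(
        [
            registry_base.rstrip("/"),
            quote(identifier, safe=""),  # escape characters that are not URL-safe
            quote(version, safe=""),
        ]
    )


print(component_url("https://registry.example.org/components/", "org.example.tagger.PosTagger", "1.8.0"))
```

Two versions of the same component differ only in the last path segment, so both remain retrievable side by side.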
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
Partial |
Components (instances of components) are well identified in a workflow but component versioning is not maintained in Alvis. |
Draft |
ARGO |
0.5 |
Partial |
Each component has a unique id under the |
Draft |
DKPro Core |
1.8.0 |
Partial |
Each component has a unique name under the |
Draft |
GATE |
8.2 |
Partial |
Each component is a Java class and as such has a unique name. Currently, however, there is no requirement for a GATE plugin or component to have a version number. In most cases we distribute GATE with a large set of plugins which are assumed to share the same version number as the framework itself. Plugins (which may contain multiple components) distributed via an update site must include a version number and all components inside that plugin share that. Experience has shown that this is far from ideal and our future plans for GATE include moving to using Maven to distribute plugins in which case each plugin will be uniquely identified by a set of Maven coordinates, although support for loading legacy plugins without a version number will remain for the foreseeable future. |
Draft |
ILSP |
1.2.1 |
Partial |
Each component has a unique name under the |
Draft |
7. Components must have a fully qualified name that follows the Java class naming conventions
Concreteness: concrete
Strength: mandatory
Category: WG4
Status: final
Using such a widely adopted naming convention assists component developers through familiarity and helps avoid naming collisions due to the use of domain names, under the control of the developers, as part of the identifier. It is also the recommended method of identifying components in existing text mining systems such as Argo and GATE.
Note
|
This requirement was previously titled "Components should be uniquely identified using a string that follows the Java fully-qualified class naming convention". In particular, the word "uniquely" has been removed because a Java fully-qualified name does not include version information and in principle there is no safeguard against having multiple Java classes by the same name on the classpath at the same time. The main point here is that users name their components within a namespace that can be considered under their control. The Java conventions then provide a best practice for choosing and structuring such a namespace. Additional best practices and techniques, e.g. the Maven best practices and version-resolving techniques, should then ensure that only a single version of a given component is available at runtime. If a system needs to use multiple versions of a component, additional precautions may be necessary. |
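A simple syntactic check for names that follow the convention (lower-case, dot-separated package segments followed by a capitalised class name) could be sketched as below. The regular expression is a deliberate simplification: it does not reject Java keywords, restricts names to ASCII, and the example names are hypothetical.

```python
import re

# One or more lower-case package segments, then a class name starting with an upper-case letter.
_FQN = re.compile(r"(?:[a-z_][a-z0-9_]*\.)+[A-Z][A-Za-z0-9_]*")


def follows_java_naming(name):
    """Return True if the name looks like a conventional fully qualified Java class name."""
    return _FQN.fullmatch(name) is not None


for candidate in ["org.example.tagger.PosTagger", "PosTagger", "org.example.posTagger"]:
    print(candidate, follows_java_naming(candidate))
```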
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
Full |
Components are Java classes organised through packages. |
Draft |
ARGO |
0.5 |
Partial |
All of our current components are identified by strings that follow the Java fully-qualified class naming convention, however as the component identifier within Argo is derived from a filepath it is entirely possible for this requirement to be broken by new components. For example, the UIMA descriptor for the Argo component Chemical Entity Recogniser has the path |
Draft |
DKPro Core |
1.8.0 |
Full |
Components are identified by a Java class name. Per Java/Maven conventions, this classname is unique within DKPro Core. There is also a version associated with each class corresponding to the version of the enclosing Maven project/JAR. |
Draft |
GATE |
8.2 |
Full |
All GATE components are implemented as Java Beans and as such fulfill this requirement. |
Draft |
ILSP |
1.2.1 |
Full |
Components are identified by a Java class name. They are by convention unique within nlp.ilsp.gr. |
Draft |
8. Components must associate themselves with categories defined by the OpenMinTeD project
Concreteness: abstract
Strength: mandatory
Category: WG4
Status: final
With a large number of components, it can be difficult for workflow creators to find the most appropriate components if they can only search for components by name or by reading each component’s documentation. Having components associated with categories would enable workflow creators to easily search for a particular type of component (e.g. part-of-speech tagger) as well as allowing associated UI tooling an additional way of ordering lists of components on-screen.
To be moved to FR in WP6
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
Partial |
Alvis does not categorize the components but can follow a categorization schema. |
Draft |
ARGO |
0.5 |
Partial |
Currently Argo puts components into 4 categories (reader, writer, analytics, web service); however, the bulk of the components reside within the analytics category, and the only method for users to find a relevant component is to search through the entire list within the UI. The reader, writer and analytics components are identified by examining their UIMA descriptor files (looking for the collectionReaderDescription, casConsumerDescription or analysisEngineDescription element). Interactive components are identified by the |
Draft |
DKPro Core |
1.8.0 |
No |
DKPro Core components try to follow a naming scheme, e.g. |
Draft |
GATE |
8.2 |
No |
GATE does not currently enforce such a categorization (or have any way internally of doing so). For the core plugins (i.e. those developed in Sheffield) we try and stick to a naming convention that gives some idea of the nature of the plugin; Lang_X, Parser_X, Tagger_X, etc. but this is purely a convention and is in no way fine grained enough to be used for indexing or other forms of automatic discovery. |
Draft |
ILSP |
1.2.1 |
No |
By convention we package all readers, exporters, and instantiations of uimafit pipelines as three different projects. A package is also used for uima service clients. The rest of the packages concern analyzers and follow an ilsp-nlp-XXX naming scheme, with XXX being a string like lemmatizer or depparser. |
Draft |
9. Components must declare their annotation schema dependencies
Concreteness: abstract
Strength: mandatory
Category: WG4
Status: final
A workflow execution system will need to know which annotation schemas to load when running a component within a workflow. Not having this information would result in the system having to load all of the annotation schemas it knows about at runtime for each component or force the workflow creators to manually specify them, both highly inefficient options. This information also forms part of the component’s documentation, assisting workflow creators.
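With declared dependencies, the set of schemas to load for a workflow is simply the union of what its components declare. The sketch below assumes a hypothetical `schemas` key on each component description and made-up schema names.

```python
def schemas_to_load(workflow):
    """Collect the declared schema dependencies of all components, without duplicates."""
    needed = []
    for component in workflow:
        for schema in component["schemas"]:
            if schema not in needed:  # keep first-seen order, skip repeats
                needed.append(schema)
    return needed


workflow = [
    {"name": "tokenizer", "schemas": ["segmentation"]},
    {"name": "tagger", "schemas": ["segmentation", "morphology"]},
]
print(schemas_to_load(workflow))
```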
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
No |
Work in progress. |
Draft |
ARGO |
0.5 |
Full |
Argo components must declare the type systems they depend upon within their UIMA descriptor file. |
Draft |
DKPro Core |
1.8.0 |
Partial |
Components live within Maven artifacts. These declare dependencies on other Maven artifacts which contain the type system definitions. Most components declare input and output types. Thus, the XML type system descriptors for a component can be obtained by scanning the Maven dependencies for XML type descriptors that declare types which are used as input/output types by a given component. |
Draft |
GATE |
8.2 |
No |
GATE does not require components to specify annotation schemas, and has no way (currently) of recording that information even if it were available. This is because GATE does not enforce a type system. This does mean that no annotation schemas have to be loaded for a GATE component to execute. This lack of an enforced type system is a feature of GATE and not a bug and so is unlikely to change, meaning any requirement to specify schemas per component is likely to only ever be optional within GATE. |
Draft |
ILSP |
1.2.1 |
Partial |
UIMA components declare the type system they depend upon within their descriptor file. The type system is a Maven dependency of all readers, analyzers and exporters. Components declare input and output types. |
Draft |
10. Components should specify the types of the annotations that they input and output
To assist in the creation of workflows, it would be helpful to know what types of annotations are required and produced by components so that UI tooling can display appropriate warnings when a component is added in an invalid position within a workflow or to help diagnose runtime exceptions whilst running a workflow.
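The kind of check that UI tooling could run on declared input and output types is sketched below: walk the workflow in order and report every component whose required types have not been produced by an earlier step. The `inputs`/`outputs` keys and the type names are hypothetical.

```python
def find_type_errors(workflow, initial_types=()):
    """Return (component name, missing types) for each component placed in an invalid position."""
    available = set(initial_types)
    errors = []
    for component in workflow:
        missing = set(component["inputs"]) - available
        if missing:
            errors.append((component["name"], sorted(missing)))
        available |= set(component["outputs"])
    return errors


tokenizer = {"name": "tokenizer", "inputs": [], "outputs": ["Token"]}
tagger = {"name": "tagger", "inputs": ["Token"], "outputs": ["POS"]}

print(find_type_errors([tokenizer, tagger]))
print(find_type_errors([tagger, tokenizer]))
```

The second call reports the tagger because it runs before any component has produced `Token` annotations.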
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
Partial |
The type system is shared by all the components. |
Draft |
ARGO |
0.5 |
Partial |
Some components declare the types that they input and output within the component’s UIMA descriptor; however, this is not always the case and is not required by Argo. |
Draft |
DKPro Core |
1.8.0 |
Full |
Most components declare the input and output types using UIMA capabilities. For some components, the types they operate on are configured through parameters. |
Draft |
GATE |
8.2 |
Partial |
GATE does not enforce any type system and as such does not require components to declare in any way their input/output types. That being said it does support XML schemas for defining annotations (and their features) and so it would be possible to bundle a set of schemas with a component (i.e. within the JAR file) that defined the expected input/output annotations. For example, the ANNIE plugin already comes with schemas which define the main NE annotations produced by the application (GATE’s terminology for a workflow). We also have a component, the Schema Enforcer, that can be used to ensure that the output of an application strictly conforms to a given set of schemas removing any annotations or features not defined by the schemas. The manual annotation aspects of GATE Developer can also be configured to only allow annotations/features defined by XML schemas to be created. |
Draft |
ILSP |
1.2.1 |
Partial |
Components declare the input and output types using UIMA capabilities. |
Draft |
11. Components must declare whether they can be scaled within a workflow
Concreteness: abstract
Strength: mandatory
Category: WG4
Status: draft
To accommodate scaling of a workflow, it is imperative to understand which components can be deployed multiple times. This is something that the developer will have to specify manually, and components that cannot be scaled should be the exception rather than the rule.
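A scheduler could use such a declaration as sketched below: components that declare themselves scalable get one instance per worker, all others get exactly one. The `scalable` key is a hypothetical stand-in for whatever flag a framework provides.

```python
def plan_instances(workflow, workers):
    """Decide how many instances of each component to deploy."""
    plan = {}
    for component in workflow:
        # The declaration is mandatory: a missing "scalable" key raises KeyError.
        plan[component["name"]] = workers if component["scalable"] else 1
    return plan


workflow = [
    {"name": "reader", "scalable": False},
    {"name": "tagger", "scalable": True},
    {"name": "writer", "scalable": False},
]
print(plan_instances(workflow, workers=4))
```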
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
No |
Alvis does not deal with this aspect. |
Draft |
ARGO |
0.5 |
Full |
Components use the UIMA flag |
Draft |
DKPro Core |
1.8.0 |
Full |
Components use the UIMA flag |
Draft |
GATE |
8.2 |
Full |
Being a Java Bean, a GATE component is inherently single-threaded, although there is no reason why a component cannot be instantiated multiple times within or across JVMs. It is, however, worth pointing out that naively creating multiple instances of a component within a JVM is probably not a good idea, as this is likely to be wasteful in terms of memory because multiple instances of large data structures might be created. GATE contains methods to duplicate components and entire workflows in a more sensible fashion to share internal structures where safe to do so. |
Draft |
ILSP |
1.2.1 |
Full |
Most of the analyzers have been deployed as UIMA-Asynchronous Scaleout services and as parts of UIMA-AS aggregate services without any obvious issues. The scaleout capabilities of the UIMA-AS framework and of DUCC have not been fully explored. |
Draft |
12. Components should provide documentation describing their functionality
Concreteness: abstract
Strength: recommended
Category: WG4
Status: final
Having some form of free-text description assists text and data miners in choosing the most appropriate components for their workflows and potentially boosts productivity, as it possibly reduces the need for experimentation if a component has configuration parameters.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
Partial |
Components have descriptions as html streams containing formatting tags. |
Draft |
ARGO |
0.5 |
Partial |
Components can declare a textual description using the |
Draft |
DKPro Core |
1.8.0 |
Full |
Most components have some kind of JavaDoc-based description. This is carried over into UIMA descriptors and Java annotations using the uimafit-maven-plugin during build time. This makes the descriptions available through Java reflection as well as in the UIMA XML descriptors. We use these descriptors as part of auto-generating the DKPro Core reference documentation. The quality of the documentation varies. |
Draft |
GATE |
8.2 |
Full |
The CREOLE metadata that describes a component includes the name and a brief description as well as a URL to the full documentation of the component. None of these are, however, mandatory and documentation varies across components (i.e. we have no control of the documentation of components developed by 3rd parties) |
Draft |
ILSP |
1.2.1 |
Partial |
Most components have a minimal textual description in the UIMA descriptor file, which is often replicated in the Maven POM. If applicable, a reference to a scientific article is also included. |
Draft |
13. Citation information for component should be included in the metadata
It is important that citable publications for each component can be obtained from the component metadata.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
No |
No citation information in general; the documentation of the modules sometimes contains citation information. |
Draft |
ARGO |
0.5 |
No |
No citable publication information available. |
Draft |
DKPro Core |
1.8.0 |
No |
A few components contain references to relevant publications in their documentation, but it is not machine readable. |
Draft |
GATE |
8.2 |
No |
GATE does not support this, although it’s likely that the URL to the documentation may well lead to a publication or documentation that references one. |
Draft |
ILSP |
1.2.1 |
Partial |
Available for some components but not always applicable or used. |
Draft |
14. Components must maintain License information
Concreteness: abstract
Strength: mandatory
Category: WG4
Status: deprecated
Important
|
This requirement has been deprecated. It is covered by REQ-33, REQ-34 and expected functionality of the OpenMinTeD platform. |
It is important that not only is a license strongly assigned to components and resources but that this information is passed upwards through workflows etc. so that a license can be assigned to an aggregate or to newly created resources that result. To this end storing license information within component metadata seems the most sensible way forward.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
No |
Licence information is currently provided as free text in the tool documentation. |
Draft |
ARGO |
0.5 |
Partial |
Components can declare licensing information with the Argo Descriptor File, however the vast majority of components do not make use of this facility. |
Draft |
DKPro Core |
1.8.0 |
Partial |
DKPro Core components have license metadata about their own code in their POM hierarchy. It is also available for those dependencies that provide license information in their respective POMs (not all dependencies declare such license information in their POMs). For most models/resources, we currently have no license information. |
Draft |
GATE |
8.2 |
No |
Currently GATE components do not have a license assigned as part of the metadata. The next version of GATE will use Maven for plugin/component distribution and will have access to license information documented in the pom.xml for a plugin. |
Draft |
ILSP |
1.2.1 |
Partial |
ILSP components do not have a license assigned as part of the metadata. ILSP components are provided free for research purposes via UIMA, SOAP and/or REST services. The license information for these services is declared via appropriate metadata, as for example in https://goo.gl/yDynbu |
Draft |
15. Human readable information should be provided by each resource
Important
|
This requirement has been deprecated. It overlaps with REQ-12 and WG1-5 and is not sufficiently clear. |
Capturing, computing and presenting simplified information about resources (e.g., availability, reliability, QoS, provider, class) helps users decide which resource to choose over others. For example, based on aspects such as availability or QoS, one may create a specific view to filter out some specific services.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
Partial |
Available for the common features (e.g., name, description, provider,…) |
Draft |
ARGO |
0.5 |
Full |
Argo components have a human readable description. |
Draft |
DKPro Core |
1.8.0 |
Full |
Components should come with a description (primarily obtained from the JavaDoc of the component). That information may not be particularly comprehensive though. Availability, reliability, and the other example information given above are not relevant for DKPro Core as it is not service-based. |
Draft |
GATE |
8.2 |
Full |
Each component has a human readable name and description as part of the metadata |
Draft |
ILSP |
1.2.1 |
No |
No availability or QoS information is provided for components available as web services. Readable names, descriptions, Terms Of Service, provider etc. are provided as metadata for these services. |
Draft |
16. Models/resources should be useable across different component collections/platforms
Concreteness: abstract
Strength: recommended
Category: WG4
Status: final
Different platforms/component collections that wrap the same tools should be able to be configured in similar or uniform ways to use models/resources. E.g. all wrappers should permit loading models from an arbitrary location (on disk, JAR, URL), and not require that they be packaged with the component. Support for a common repository infrastructure from which to obtain models/resources automatically would be beneficial.
Note
|
Maven is sometimes used today to store models/resources. Although Maven is popular mainly in the Java world, there is also a client implementation for Python (jip) which would allow e.g. NLTK to use models stored in Maven repositories. |
Source: WG 4 Scenario 1 — Transferability of components between ecosystems
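A wrapper that accepts models from an arbitrary location, rather than only from its own package, could resolve locations as sketched below. The `jar:<archive>!/<entry>` notation is a simplified, hypothetical variant of Java's JAR URL syntax, and `open_model` is an illustrative name; the HTTP branch is shown for completeness and is not exercised by the demo.

```python
import io
import os
import tempfile
import urllib.request
import zipfile


def open_model(location):
    """Open a model for reading from a file path, a JAR/ZIP entry, or an HTTP(S) URL."""
    if location.startswith("jar:"):
        archive_path, _, entry = location[len("jar:"):].partition("!/")
        with zipfile.ZipFile(archive_path) as archive:
            return io.BytesIO(archive.read(entry))
    if location.startswith(("http://", "https://")):
        return urllib.request.urlopen(location)
    return open(location, "rb")


# Demo: the same model bytes, once as a plain file and once inside an archive.
with tempfile.TemporaryDirectory() as tmp:
    model_file = os.path.join(tmp, "tagger.model")
    with open(model_file, "wb") as handle:
        handle.write(b"model-bytes")
    jar_file = os.path.join(tmp, "models.jar")
    with zipfile.ZipFile(jar_file, "w") as archive:
        archive.write(model_file, "models/tagger.model")

    with open_model(model_file) as stream:
        from_file = stream.read()
    with open_model(f"jar:{jar_file}!/models/tagger.model") as stream:
        from_jar = stream.read()

print(from_file == from_jar)
```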
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
Partial |
Alvis has a stream manager that makes it possible to characterize the models/resources to load or store; it supports file system, web access, zip/jar,… and is extensible. |
Draft |
ARGO |
0.5 |
Partial |
Argo components that offer configurable models/resources make use of standard UIMA configuration parameters. The majority of these configuration parameters accept a local file path, which relates to a user’s file store within the Argo ecosystem. |
Draft |
DKPro Core |
1.8.0 |
Full |
DKPro Core components that use resources can typically be configured through two mechanisms: 1) via the language/variant coordinates which internally translate to a classpath lookup, and may make use of the built-in auto-download mechanism for models. 2) via the PARAM_MODEL_LOCATION which supports locally available resources (file system, classpath, ZIP/JAR files, but not arbitrary remote URLs). |
Draft |
GATE |
8.2 |
Full |
GATE components that are configured through models/resources etc. simply require a URL to the resources. This could be a file or jar URL as well as a normal http URL. Resources may be bundled inside jar files for distribution but these are simply the default files and can always be replaced. |
Draft |
ILSP |
1.2.1 |
Partial |
ILSP UIMA-based components are typically configured via uimaFIT ConfigurationParameter and/or ExternalResource annotations. When components are used as services, users can only configure parameters such as the export format. |
Draft |
17. Components should be stateless
Concreteness: concrete
Strength: recommended
Category: WG4
Status: final
To enable scaling of components across networked machines, a component should avoid any form of local state with regard to the processing of a document.
Note
|
If a component is required to be stateful, it would make sense for the platform to offer a solution to allow scaling of stateful components. |
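Why statelessness helps scaling can be seen in a small sketch: a component whose result depends only on the document it receives can be run on many workers at once and still produce the same results as a sequential run. The component `count_tokens` and the sample documents are hypothetical, and threads stand in for networked machines.

```python
from concurrent.futures import ThreadPoolExecutor


def count_tokens(document):
    """A stateless 'component': the result depends only on the input document."""
    return len(document.split())


documents = [
    "Components should be stateless",
    "No shared state",
    "Each call is independent",
]

sequential = [count_tokens(d) for d in documents]
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = list(pool.map(count_tokens, documents))  # map preserves input order

print(sequential == parallel)
```

A component that accumulated results across documents, such as a writer aggregating all output into one file, would not give this guarantee when split across workers.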
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
Partial |
In general, a component has no local state and determinism is guaranteed. However, aspects such as resource availability and resource access may raise state issues that the Alvis core engine has to manage. |
Draft |
ARGO |
0.5 |
Partial |
Most Argo components are stateless. It is possible to indicate that a component is stateful (which will result in only a single instance of that component being permitted within a distributed execution) by setting the multipleDeploymentAllowed value to false in the UIMA descriptor file. Argo also automatically treats a subset of components as being stateful, which includes Reader, Writers and Interactive Components. |
Draft |
DKPro Core |
1.8.0 |
Partial |
Most DKPro Core components are stateless in the described sense: no information needs to be shared between multiple instances of the same component. There is the case of sharing models between multiple instances and this is experimentally supported by DKPro Core but can only be used if the models themselves are threadsafe/stateless. Note that components are not threadsafe! A single component instance cannot be safely invoked from multiple threads! A notable exception to statelessness are writer components which are able to aggregate information, e.g. writing all pipeline output to a single file. Such components cannot be sensibly scaled out. They are marked with the UIMA operational property multipleDeploymentAllowed = false. |
Draft |
GATE |
8.2 |
No |
Never. GATE components are all Java beans. This means that the document to process is provided to the component via a set method, making it part of the state of the component. Multiple instances of a component can of course be created to enable processing of multiple documents in parallel. Naively creating multiple instances should, however, be avoided: it wastes memory because it requires multiple copies of internal state that could be shared. The framework supports intelligent duplication of components and applications to share appropriate state across threads. |
Draft |
ILSP |
1.2.1 |
Full |
ILSP components deployed as services avoid any form of local state. Multiple instances of a service deployed on the same machine can share lexical resources and models. |
Draft |
18. Workflows should be described using an uniform language
Concreteness: abstract
Strength: recommended
Category: WG4
Status: deprecated
Important
|
This requirement has been deprecated. It should be covered by the functionality of the OpenMinTeD platform, namely Galaxy workflow editor. |
A workflow represents an experiment that users describe, visualize, execute, modify and share. To facilitate this lifecycle, workflows should be described in a uniform way so that the instantiated components, parameters and data flow are understandable by users as well as by the engines.
Note - we may wish to be more specific with this requirement
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
Partial |
Alvis has a workflow representation that only accepts a sequence of modules and parameter configuration on the individual modules. The workflows are expressed using XML constructs that the Alvis engine can interpret. The workflow expression can be modified, transferred and executed in different Alvis instances. However, the workflow representation is only used within Alvis. |
Draft |
ARGO |
0.5 |
Partial |
Possibly misinterpreting this requirement, but Argo contains a visual web-based workflow editor which allows a user to easily compose a workflow using components, which is then translated into a format understood by the execution engine within Argo. Argo is only partially compliant as the workflows it produces cannot currently be shared outside of the Argo platform. |
Draft |
DKPro Core |
1.8.0 |
Unknown |
DKPro Core is built on UIMA and can be invoked through UIMA by any workflow manager or execution environment that supports UIMA. We do not mind which language that environment uses to describe its workflows. Many users like using DKPro Core components with uimaFIT (often either using SimplePipeline or CpeBuilder); DKPro Core provides examples for this in various languages (Java, Groovy, Jython). LAPPS Grid integrates services based on DKPro Core components with their Galaxy workflow engine. |
Draft |
GATE |
8.2 |
Unknown |
GATE has a standard description for workflows (as will any existing framework) but I really don’t understand what this requirement is asking. |
Draft |
ILSP |
1.2.1 |
Unknown |
ILSP components are combined in workflows using UIMA (AS) aggregate descriptors and/or using the workflow functionalities of processing infrastructures like CLARIN and METASHARE/QT21. |
Draft |
19. Components that use external knowledge resources should delegate access to a resource adapter instead of handling it themselves
Important
|
This requirement has been deprecated following a discussion in WG4 that came to the conclusion the proposed approach would be over-engineering. Taking a gazetteer as an example, it was considered the main functionality of a specific gazetteer implementation to perform lookups. So there may be a DatabaseGazeteer, a FileGazeteer, or a SPARQLGazeteer. Adding another layer of abstraction to create a GenericGazeteer to be configured with a DatabaseLookup or FileLookup was not seen as important. |
A component that uses an external knowledge resource, e.g. a dictionary, should access that resource through an adapter instead of accessing it directly. E.g. a Gazetteer component could be configured with a WordListFileResourceAdapter that looks up words from a simple word list file, or with a SparqlQueryAdapter that performs the lookup via SPARQL. Thus, providing access to different knowledge sources should not require changes to the component implementation itself, but only the implementation of lightweight adapters. It is not important that all implementations use the same adapters, but rather that this abstraction exists in the first place. E.g. a GATE component may use a GATE-specific adapter implementation and a UIMA component may use a UIMA-specific implementation.
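The proposed abstraction could look roughly like the following sketch. All names are hypothetical, and the second adapter merely wraps a callable as a stand-in for a remote lookup; no real SPARQL client is used here. The gazetteer depends only on the `contains` method, so switching the knowledge source does not require changes to the gazetteer itself.

```python
from typing import Callable, Iterable, List, Tuple


class WordListAdapter:
    """Looks words up in an in-memory word list (stand-in for a word list file)."""

    def __init__(self, words: Iterable[str]) -> None:
        self._words = {w.lower() for w in words}

    def contains(self, word: str) -> bool:
        return word.lower() in self._words


class CallbackAdapter:
    """Delegates the lookup to a callable, e.g. one that would query an endpoint."""

    def __init__(self, query: Callable[[str], bool]) -> None:
        self._query = query

    def contains(self, word: str) -> bool:
        return bool(self._query(word))


class Gazetteer:
    """Marks tokens found in the knowledge source behind the adapter."""

    def __init__(self, adapter) -> None:
        self._adapter = adapter

    def annotate(self, tokens: List[str]) -> List[Tuple[int, str]]:
        return [(i, t) for i, t in enumerate(tokens) if self._adapter.contains(t)]


tokens = ["Visit", "Athens", "and", "Berlin"]
from_list = Gazetteer(WordListAdapter(["athens", "berlin"])).annotate(tokens)
# Toy rule standing in for a remote query: capitalised words longer than five letters.
from_callback = Gazetteer(
    CallbackAdapter(lambda w: w.istitle() and len(w) > 5)
).annotate(tokens)
```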
Note
|
UIMA External Resources could be used for this and I am pretty sure GATE also has one or more abstractions for such things. |
Source: WG 2 Scenario 3 — The relation between documents and knowledge bases through keywords
Required actions: While UIMA has the general abstraction, there are no common implementations for different types of knowledge resources. I think GATE has some typical/common abstractions like a gazetteer and probably others. These should be inspected and additional recommendations for specific resource types may be given in future requirements. Mind that even if GATE/UIMA have such abstractions, it does not mean that they are consistently used or used at all. This should be brought to the attention of the component collection providers.
Note
|
This requirement may not be necessary in small-scale resource lookup. |
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
Unknown |
Unknown |
Draft |
ARGO |
0.5 |
Unknown |
Components developed for Argo can make use of UIMA’s external resources feature, however I don’t believe there are many (if any) Argo components that do. On the Argo development roadmap, we plan on using this approach with Argo reader and writer components. Currently for every serialisation format (e.g. CAS XMI, RDF/XML) and storage platform (e.g. Argo File Store, SFTP) combination we require a new component to be created. Instead, we plan to allow the addition of serialisation formats and storage platforms independently. |
Draft |
DKPro Core |
1.8.0 |
Partial |
We experimented with such an approach, e.g. in the SemanticFieldAnnotator, where a resource abstraction for semantic tagging resources exists. The idea was, e.g., to read semantic tags either from a plain text file or alternatively from a Uby database. The ResourceObjectProvider abstraction used internally by DKPro Core to resolve and load models is also related to this requirement. E.g. there is an experimental function to enable behind-the-scenes sharing of loaded models between multiple component resources. This works provided the models themselves are thread-safe. |
Draft |
GATE |
8.2 |
Unknown |
Unknown |
Draft |
ILSP |
1.2.1 |
Partial |
Some of the classes that provide access to resource data for ILSP components implement the org.apache.uima.resource.SharedResourceObject interface. |
Draft |
20. Workflow engines should not require to see data
Important
|
This requirement has been deprecated. It continues to exist as the functional requirement FS/WFEX/07. |
When assembling a workflow and running it, the workflow engine should not require that all data be transferred between the components by passing it through the engine. In the best case, this would only incur a performance overhead. In the worst case, it would require potentially sensitive or proprietary data to leave its source infrastructure, even if the analysis components themselves were deployed on that same infrastructure (assuming that the workflow engine does not also run there). This could be solved, e.g., by processing components talking directly to each other.
Source: WG1 Scenario 2 — SME running research analytics for funders within the European Research Area
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
Partial |
Alvis uses a shared data structure, so the workflow engine is required to manage the data and the modules during the execution steps. However, the Alvis modules could be encapsulated together with the Alvis engine in an OpenMinTeD workflow. |
Draft |
ARGO |
0.5 |
Partial |
On certain infrastructures with shared storage, it is possible for data to bypass the workflow engine. The Argo development roadmap does contain a plan to allow components to speak directly to one another. |
Draft |
DKPro Core |
1.8.0 |
N/A |
DKPro Core itself does not provide workflow functionality. Its components can in principle be used in a workflow setup where data is passed directly between components, i.e. without going through a central workflow manager. Providing such a manager is beyond the scope of DKPro Core. |
Draft |
GATE |
8.2 |
Partial |
GATE components don’t talk to each other, passing data is the job of a pipeline (which is akin to a workflow). However, if within an OpenMinTeD workflow there were a sequence of GATE components then these could be aggregated into a GATE pipeline and deployed as a single unit meaning that data would be kept within the pipeline. |
Draft |
ILSP |
1.2.1 |
Partial |
For ILSP components deployed as UIMA AS services, data transferred from, say, component A to component B, both remotely deployed on node X, are not first passed through a message broker running on node Y
Draft |
21. Configuration and parametrizable options of the components should be identified and documented
Concreteness: abstract
Strength: recommended
Category: WG4
Status: final
Components in systems such as Alvis offer several parameters used to configure the behavior of the component execution. In order to help users understand a specific parameter and its role, the parameters should be well identified and described. The descriptions should contain information to help users understand how and in what context a parameter is used, and also provide the data types (e.g., boolean, integer, string) accepted by a parameter.
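As an illustration, a parameter declaration that carries a name, a data type and a description could be sketched as follows. The `ParameterSpec` structure, the `validate` and `describe` helpers and the three example parameters are assumptions made for this sketch; they do not reproduce the parameter model of Alvis, UIMA or GATE.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class ParameterSpec:
    name: str
    type: type
    description: str
    mandatory: bool = False
    default: Any = None


def validate(specs: List[ParameterSpec], values: Dict[str, Any]) -> Dict[str, Any]:
    """Return the effective configuration or raise ValueError on the first problem."""
    unknown = sorted(set(values) - {s.name for s in specs})
    if unknown:
        raise ValueError("unknown parameter(s): " + ", ".join(unknown))
    effective = {}
    for spec in specs:
        if spec.name not in values:
            if spec.mandatory:
                raise ValueError("missing mandatory parameter: " + spec.name)
            effective[spec.name] = spec.default
            continue
        value = values[spec.name]
        # bool is a subclass of int in Python, so reject it for integer parameters.
        wrong_type = not isinstance(value, spec.type) or (
            spec.type is int and isinstance(value, bool)
        )
        if wrong_type:
            raise ValueError(spec.name + " must be of type " + spec.type.__name__)
        effective[spec.name] = value
    return effective


def describe(specs: List[ParameterSpec]) -> str:
    """Render one human-readable line per parameter."""
    lines = []
    for s in specs:
        flag = "mandatory" if s.mandatory else "optional, default=" + repr(s.default)
        lines.append(f"{s.name} ({s.type.__name__}, {flag}): {s.description}")
    return "\n".join(lines)


SPECS = [
    ParameterSpec("language", str, "ISO 639-1 code of the document language", mandatory=True),
    ParameterSpec("lowercase", bool, "Lowercase tokens before lookup", default=False),
    ParameterSpec("max_tokens", int, "Skip documents longer than this", default=1000),
]

config = validate(SPECS, {"language": "en"})
help_text = describe(SPECS)
```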
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
Full |
Parameters are well-typed in Alvis; they may be simple (e.g., integer, String, boolean) or complex (e.g., list, map). A description is planned for each parameter, even though some parameters currently lack descriptions. |
Draft |
ARGO |
0.5 |
Full |
As Argo components are also UIMA components containing UIMA XML descriptor files, all configuration parameters have a name, a type (String, Integer, Float, or Boolean), an optional description, can be multivalued and also made to be mandatory. Argo also allows an additional layer of information on top of the UIMA configuration parameters, such as indicating whether a string represents a file, a folder, or an enumeration. This is intended to assist in both validation of configuration values and in providing the most appropriate graphical widgets within the Argo UI. |
Draft |
DKPro Core |
1.8.0 |
Full |
Parameters are implemented in DKPro Core through the uimaFIT |
Draft |
GATE |
8.2 |
Full |
Each parameter is typed and has a name and an optional description (most have a description but it is currently optional) |
Draft |
ILSP |
1.2.1 |
Full |
Configuration parameters are declared via either UIMA XML descriptors and/or as uimaFIT |
Draft |
22. The Workflow Engine Should Permit Saving Experimental Conditions in a Workflow
Important
|
This requirement is deprecated. It should be added back as a functional requirement for the workflow execution service (WFEX). |
For reproducibility, it should be possible to extract and save/export the exact experimental conditions of a workflow, such as which input files were used (knowledge base/corpus), which components were used and in what order, the values of the parameters required per component, etc. (Requirement extracted from Interoperability Scenario WG1-4)
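One possible shape for such an export is sketched below. The record layout, the `snapshot` function and the component names are hypothetical; the sketch only assumes that input files can be read as bytes and that each step knows its component name, version and parameter values. Inputs are recorded by checksum so that a later run can detect that a corpus or knowledge base has changed.

```python
import hashlib
import json
from typing import Any, Dict, List


def snapshot(inputs: Dict[str, bytes], steps: List[Dict[str, Any]]) -> str:
    """Serialise inputs (by checksum), component order and parameters as JSON."""
    record = {
        "inputs": [
            {"name": name, "sha256": hashlib.sha256(data).hexdigest()}
            for name, data in sorted(inputs.items())
        ],
        "steps": [
            {
                "position": position,
                "component": step["component"],
                "version": step["version"],
                "parameters": step.get("parameters", {}),
            }
            for position, step in enumerate(steps)
        ],
    }
    return json.dumps(record, indent=2, sort_keys=True)


steps = [
    {"component": "example-segmenter", "version": "1.0", "parameters": {"language": "en"}},
    {"component": "example-tagger", "version": "2.3"},
]
first = snapshot({"corpus.txt": b"some text"}, steps)
again = snapshot({"corpus.txt": b"some text"}, steps)
changed = snapshot({"corpus.txt": b"some other text"}, steps)
```

As the Argo entry below points out, a record of this kind cannot by itself guarantee identical results when a component depends on an external service whose content changes between runs.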
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
Partial |
Alvis allows this through its workflow expression. |
Draft |
ARGO |
0.5 |
Partial |
Argo stores the configuration parameters, order of components etc. alongside each workflow, so it can be run again with the same values at a later date. However, this may not always allow the exact same conditions to be reproduced, due to the nature of the components in the workflow. For example, a reader component that returns documents based on a web service search may provide different documents on a subsequent run if the index behind the web service has been updated in the meantime. Argo also lacks the ability to store sets of configuration parameter values against workflows. Instead, if a user wants to change the configuration values of a workflow whilst keeping the previous set, they must first duplicate the workflow and make changes to the copy. |
Draft |
DKPro Core |
1.8.0 |
Partial |
DKPro Core readers add a DocumentMetaData annotation to each document, which contains basic information about where the document was obtained from and how it can be identified (usually a file URL). Minimal information about the components/models that were used is added as annotations to the documents as well. This is mainly used as supplementary information to the tagset descriptions that DKPro Core usually adds to the document (i.e. which component/model was used to generate the annotations in a given layer and which tagset was used). A comprehensive recording of components and parameters is left to the workflow engine (i.e. UIMA) and/or to the executing environment. E.g. for building experiments, we sometimes use DKPro Lab, which also keeps track of configuration information of individual components. |
Draft |
GATE |
8.2 |
Partial |
A GATE pipeline contains all the information required to recreate it from scratch (which components, in which order, with which parameter settings). Multiple different configurations are not supported though so changing a parameter to carry out an experiment would require saving a new copy of the pipeline. |
Draft |
ILSP |
1.2.1 |
No |
Components only add annotations regarding original source (typically a text file on disk). |
Draft |
23. The Workflow Engine should permit Licence Aggregation in Workflows
Important
|
This requirement has been deprecated because it is considered to be a functional requirement of the workflow manager rather than an interoperability requirement. |
A workflow must aggregate all licences, both from the components of which it is composed and from the language resources/knowledge bases that are used within it, and inform the user about them. (Requirement extracted from Interoperability scenario WG3-2)
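A minimal sketch of such an aggregation is given below, assuming each workflow item declares a single licence identifier. The item names are invented for illustration, and the sketch only collects and groups the licences; it does not attempt to decide whether they are compatible with one another.

```python
from typing import Dict, List, NamedTuple


class WorkflowItem(NamedTuple):
    name: str
    kind: str  # "component" or "resource"
    licence: str


def aggregate_licences(items: List[WorkflowItem]) -> Dict[str, List[str]]:
    """Group the names of all workflow items by the licence they declare."""
    by_licence: Dict[str, List[str]] = {}
    for item in items:
        by_licence.setdefault(item.licence, []).append(item.name)
    return {licence: sorted(names) for licence, names in sorted(by_licence.items())}


workflow = [
    WorkflowItem("example-tokenizer", "component", "Apache-2.0"),
    WorkflowItem("example-parser", "component", "GPL-3.0-only"),
    WorkflowItem("example-lexicon", "resource", "CC-BY-4.0"),
    WorkflowItem("example-tagger", "component", "Apache-2.0"),
]
licences = aggregate_licences(workflow)
```

The resulting mapping is what a user interface could present for confirmation before the workflow is executed.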
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
No |
No licence information handled in workflows |
Draft |
ARGO |
0.5 |
Full |
A user will be shown a full list of licenses for a workflow when attempting to run it for the first time after any component changes are made to the workflow, to which they must agree before the workflow can be executed. |
Draft |
DKPro Core |
1.8.0 |
No |
Presently, DKPro Core does not record license information in processed documents. |
Draft |
GATE |
8.2 |
No |
GATE currently doesn’t record any license information at all |
Draft |
ILSP |
1.2.1 |
No |
Components do not add license information to processed documents. |
Draft |
24. Using/treating workflows as components
Concreteness: abstract
Strength: mandatory
Category: WG4
Status: deprecated
Important
|
This requirement has been deprecated. It is considered a platform functionality requirement (WP6). |
In order to incorporate an entire workflow (not only a single component), which may use a specific workflow engine/implementation or flow configuration that cannot be imported into the user's workflow editor of choice, it must be possible to treat the whole workflow as a component, applying encapsulation and information hiding. This nevertheless includes the ability to configure individual components in the workflow, e.g. by exposing their parameters as workflow parameters.
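The idea can be sketched as a composite: a workflow offers the same `process` method as a single component and forwards selected parameters to its steps. The classes below are hypothetical toy components operating on plain strings; they are not the UIMA or GATE APIs.

```python
class Lowercase:
    def process(self, text: str) -> str:
        return text.lower()


class Replace:
    def __init__(self, old: str, new: str) -> None:
        self.old = old
        self.new = new

    def process(self, text: str) -> str:
        return text.replace(self.old, self.new)


class Workflow:
    """A sequence of components that itself offers `process`, so it can be nested."""

    def __init__(self, steps, exposed=None) -> None:
        self.steps = list(steps)
        # Maps a workflow-level parameter name to (step index, attribute name).
        self.exposed = dict(exposed or {})

    def set_parameter(self, name: str, value) -> None:
        index, attribute = self.exposed[name]
        setattr(self.steps[index], attribute, value)

    def process(self, text: str) -> str:
        for step in self.steps:
            text = step.process(text)
        return text


inner = Workflow(
    [Lowercase(), Replace("colour", "color")],
    exposed={"replacement": (1, "new")},
)
outer = Workflow([inner, Replace(" ", "_")])  # the inner workflow is used as a component

before = outer.process("Colour Chart")
inner.set_parameter("replacement", "hue")  # configure a nested step via the workflow
after = outer.process("Colour Chart")
```

Only the parameters listed in `exposed` can be changed from outside, which keeps the remaining internals of the nested workflow hidden.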
Note
|
Relevant approaches are e.g. the UIMA Aggregate Analysis Engines or GATE Applications. |
Source: WG 4 Scenario 1 — Transferability of components between ecosystems
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
Full |
Workflows are special components; they can be used in other workflows, and Alvis offers means to handle the internal parameters. |
Draft |
ARGO |
0.5 |
No |
At present Argo doesn’t allow this. There is some support for this within the execution engine, however it hasn’t been fully tested and changes within the UI will be required. Current support within the execution engine would limit distribution of an embedded workflow to a single node, as there is no ability to distribute its components. |
Draft |
UIMA |
2.8.1 |
Full |
UIMA supports the concept of aggregate analysis engines whose delegates can be either primitive analysis engines or further aggregate engines. Aggregate engines may declare parameters whose values can be forwarded to delegates. |
Draft |
DKPro Core |
1.8.0 |
N/A |
This is out of the scope of DKPro Core. See UIMA. |
Draft |
GATE |
8.2 |
Full |
In GATE every application is also a component allowing arbitrary nesting of applications. |
Draft |
ILSP |
1.2.1 |
Partial |
See UIMA for UIMA-based ILSP components. |
Draft |
25. Incorporation of multiple resources in parallel
Concreteness: abstract
Strength: recommended
Category: WG4
Status: deprecated
Important
|
This requirement has been deprecated per discussion in WG4. If multiple resources of a type need to be added to a workflow, simply add the same component multiple times, once configured for each resource in question. Engineering components in general such that they support using multiple resources would impose unnecessary complexity on them and was considered over-engineering. |
Whenever possible, components should allow multiple resources to be incorporated in parallel, e.g. looking up a word from multiple sources or using multiple classification models in parallel. Examples:
- The Stanford CoreNLP named entity recognizer can load multiple NER models in parallel
- A Gazetteer should be able to access not only a single knowledge resource, but also e.g. multiple SPARQL endpoints
Components may choose or have to disambiguate results if there are multiple hits, e.g. using a relevance or confidence score, in particular if the annotation scheme being used does not allow for multiple categories to be assigned to a single element (e.g. a Named Entity) or does not allow multiple elements to exist at the same location.
Another alternative would be to add the same component multiple times to a workflow, each time using a different resource. This would require that, at least temporarily, multiple categories/annotation elements be allowed. A subsequent component could be used to select the most relevant one(s).
Caveat: Having relevance/confidence scores from different sources bears a great risk of the scores being on different scales and not comparable.
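The caveat can be made concrete with a small sketch. It assumes two hypothetical sources, one scoring on a 0 to 1 scale and one on a 0 to 100 scale, and that the maximum of each scale is known. Dividing by the scale maximum is a simple rescaling chosen here for illustration; it makes the ranges equal but does not guarantee that the scores are calibrated against each other.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, str, float]  # (source, category, raw score)


def pick_naive(hits: List[Hit]) -> str:
    """Choose the category with the highest raw score, ignoring the scale."""
    return max(hits, key=lambda hit: hit[2])[1]


def pick_rescaled(hits: List[Hit], scale_max: Dict[str, float]) -> str:
    """Choose the category with the highest score relative to its source's scale."""
    return max(hits, key=lambda hit: hit[2] / scale_max[hit[0]])[1]


hits = [
    ("source_a", "Location", 0.9),  # 0.9 on a 0..1 scale
    ("source_b", "Person", 40.0),   # 40 on a 0..100 scale
]
scale_max = {"source_a": 1.0, "source_b": 100.0}

naive = pick_naive(hits)
rescaled = pick_rescaled(hits, scale_max)
```

Comparing the raw scores favours the second source purely because of its larger scale, while the rescaled comparison favours the first.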
Note
|
Parallel access to resources could be externalized into a resource adapter, such that the component itself would not have to deal explicitly with handling multiple adapters in parallel. |
Source: WG 2 Scenario 3 — The relation between documents and knowledge bases through keywords
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
No |
Not supported |
Draft |
ARGO |
0.5 |
No |
It is possible to model the alternative suggestion using Argo (i.e. using multiple instances of the same component but applying different resources to each one) however there is no subsequent component to determine the most relevant output. Not sure how this would be implemented at a framework level; seems very use-case specific. |
Draft |
DKPro Core |
1.8.0 |
No |
The most pervasively used mechanism to load models in DKPro Core is the ResourceObjectProvider, controlled through the parameters PARAM_MODEL_LOCATION, PARAM_LANGUAGE, and PARAM_VARIANT. This is presently set up in a way that only a single value is permitted for each of these parameters. Hence, e.g. the CoreNlpNamedEntityRecognizer would have to be added multiple times to a pipeline in order to be used with multiple models. The preference resolution mechanism that is normally part of the CoreNLP NER component would have to be modelled as a separate pipeline component (which is currently not offered by DKPro Core). In a few cases, components require multiple models (e.g. OpenNlpSegmenter), in which case we currently use multiple parameters, e.g. PARAM_TOKENIZATION_MODEL_LOCATION and PARAM_SEGMENTATION_MODEL_LOCATION. A few older components not using the ResourceObjectProvider may support multiple resources to be loaded, e.g. the StopWordRemover component. |
Draft |
GATE |
8.2 |
Unknown |
You could write a GATE component in this way, but it isn’t supported directly by the framework and I know of none that do this. |
Draft |
ILSP |
1.2.1 |
No |
Not supported in a generic way. |
Draft |
26. It should be possible to determine the source of an annotation/assigned category
Concreteness: abstract
Strength: recommended
Category: WG4
Status: final
It should be possible to trace back which component produced an annotation element or assigned a category and which knowledge resources may have been involved in the process. This does not entail that every annotation element must have explicit provenance metadata. The requirement could also be solved by assuming that all annotations of a certain type or within a certain view were created by one component and providing just provenance at the level of the type or view.
The ability to trace back could e.g. be relevant in order to perform a post-hoc linking/disambiguation of categories obtained from different knowledge resources or to adjust/harmonize relevance/confidence scores based on different scales.
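Provenance at the level of the annotation type, as described above, could be sketched as follows. The document class, the component names and the resource name are hypothetical. The sketch relies on the assumption stated in the requirement that all annotations of a given type come from a single component, and it rejects a second producer for the same type.

```python
from typing import Dict, List, Optional, Tuple

Annotation = Tuple[int, int, str, str]  # (begin, end, type, value)


class AnnotatedDocument:
    def __init__(self, text: str) -> None:
        self.text = text
        self.annotations: List[Annotation] = []
        # One provenance record per annotation type, not per annotation.
        self.provenance: Dict[str, Dict[str, Optional[str]]] = {}

    def add_layer(self, ann_type, spans, component, resource=None) -> None:
        """Add all annotations of one type together with their producer."""
        if ann_type in self.provenance:
            raise ValueError(f"type {ann_type!r} already has a producer")
        self.provenance[ann_type] = {"component": component, "resource": resource}
        self.annotations.extend((b, e, ann_type, v) for b, e, v in spans)

    def source_of(self, annotation: Annotation) -> Dict[str, Optional[str]]:
        """Trace an annotation back via its type."""
        return self.provenance[annotation[2]]


doc = AnnotatedDocument("Athens is warm")
doc.add_layer(
    "Token",
    [(0, 6, "Athens"), (7, 9, "is"), (10, 14, "warm")],
    component="example-tokenizer",
)
doc.add_layer(
    "NamedEntity",
    [(0, 6, "LOC")],
    component="example-gazetteer",
    resource="example-city-list",
)
entity_source = doc.source_of(doc.annotations[3])
```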
Note
|
DKPro Core includes tagset information in the annotated document (in most cases even for tags that were not seen, but are present in the model). U-Compare had strong support for provenance. Some annotation schemes have support for relevance/confidence scores, but others do not. |
Source: WG 2 Scenario 3 — The relation between documents and knowledge bases through keywords
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
Partial |
The creator of an annotation is identified, but there is no trace of the resources or annotation categories involved. |
Draft |
ARGO |
0.5 |
No |
There is no information output from an execution that identifies which components produced individual annotations. |
Draft |
DKPro Core |
1.8.0 |
Partial |
Most DKPro Core components add tagset information. This mechanism currently assumes that only a single component in a workflow creates annotations of a given layer (i.e., the information is maintained at the level of layers, not at the level of individual annotations). Further, the mechanism requires that the tagset information can actually be extracted from the model being used. |
Draft |
GATE |
8.2 |
Partial |
Some existing components (specifically the JAPE transducer) add information to annotations about what created them, but currently this isn’t standard across all components, although I think I can see a way of doing so. |
Draft |
ILSP |
1.2.1 |
No |
No relevant annotation added. |
Draft |
27. Components should handle failures gracefully
Concreteness: abstract
Strength: recommended
Category: WG4
Status: deprecated
Important
|
This requirement has been deprecated. We don’t feel this requirement is relevant to the interoperability of components. These issues will be raised in WP6. |
In the event that a component fails (e.g. due to a loss of network connectivity when writing to a database), the component should ensure that there are no side-effects (e.g. a database left in an inconsistent state) that prevent the component from being re-run at a later date from the point at which it failed. This would be required if the execution platform were to offer checkpointing.
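For a component that writes files rather than to a database, one way to avoid such side-effects is sketched below. The function names are hypothetical and the approach (write to a temporary file, then rename it into place) is a common technique chosen for illustration, not something prescribed by this requirement. A failure leaves either a complete output file or none, so a later run can skip finished documents and resume with the one that failed.

```python
import json
import os
import tempfile


def write_atomically(path: str, payload: dict) -> None:
    """Write to a temporary file in the target directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        # The rename replaces the target in one step when both paths are on the
        # same filesystem, so readers never see a partially written file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def run_with_resume(doc_ids, out_dir, process):
    """Process documents in order, skipping those whose output already exists."""
    done = []
    for doc_id in doc_ids:
        target = os.path.join(out_dir, f"{doc_id}.json")
        if os.path.exists(target):
            continue
        write_atomically(target, process(doc_id))
        done.append(doc_id)
    return done
```

Calling `run_with_resume` again with the same arguments after a failure processes only the documents that have no output file yet.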
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
No |
However, Alvis can re-run a component from a data dump. |
Draft |
ARGO |
0.5 |
Partial |
There is no framework support for this (e.g. when an exception occurs in a component, a rollback/onError callback could be made on the component to handle any issues), however when developing components for Argo we always try to ensure that this requirement is implemented. There aren’t many components in Argo, if any at all, that will produce unwanted side-effects on the occurrence of an exception. |
Draft |
DKPro Core |
1.8.0 |
N/A |
DKPro Core does not provide any workflow functionality. It relies on such functionality to be provided by a workflow engine. Components are usually implemented fail-fast: as soon as a problem occurs, an exception is thrown. How the workflow engine (or application embedding DKPro Core components) handles these is up to them. |
Draft |
GATE |
8.2 |
No |
Currently, if a component fails (in a way not expected by the developer), then the exception will propagate back up to the workflow and halt execution. |
Draft |
ILSP |
1.2.1 |
No |
Not supported. |
Draft |
28. Processing components should be downloadable
Concreteness: abstract
Strength: recommended
Category: WG4
Status: final
Processing components should be downloadable. The user should be able to choose on which infrastructure to run them, e.g. to ensure that the processed data does not leave a particular institution. Components that are not downloadable should clearly indicate that (e.g. using an "I am a service" flag).
Source: WG1 Scenario 2 — SME running research analytics for funders within the European Research Area
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
Full |
AlvisEngine and its components can be downloaded. Docker images for Alvis are in progress. |
Draft |
ARGO |
0.5 |
No |
Currently workflows and components only exist within the Argo platform. |
Draft |
DKPro Core |
1.8.0 |
Full |
All released DKPro Core components come as downloadable Maven artifacts and are distributed via Maven Central. |
Draft |
GATE |
8.2 |
Full |
Components are created by loading a plugin from a given URL. This results in the jar files being downloaded to the local machine in some way (dependent on a number of factors) and then added to the JVM. This means all processing occurs where the framework is being run and data does not have to leave the local machine. |
Draft |
ILSP |
1.2.1 |
No |
Processing components are available only as services. |
Draft |
29. The actual content of all content resources must be discoverable
Caution
|
This requirement has been deprecated in favor of 4. URL to actual content must be discoverable. |
The URL of the actual content of a resource, e.g. a text file, PDF file, or other common data format, must be derivable from readily discoverable data. It must either be included in the metadata or it must be reliably constructable from the metadata URL, e.g. by appending or changing a suffix; derivation rules must be clearly documented.
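A documented derivation rule of the "change a suffix" kind could be as simple as the following sketch. The suffixes and the example.org URL are assumptions made for illustration; an actual provider would publish its own rule.

```python
from urllib.parse import urlsplit, urlunsplit


def content_url(metadata_url: str, metadata_suffix: str = ".xml",
                content_suffix: str = ".pdf") -> str:
    """Derive the content URL by swapping the suffix of the metadata URL's path."""
    parts = urlsplit(metadata_url)
    if not parts.path.endswith(metadata_suffix):
        raise ValueError(f"path does not end with {metadata_suffix!r}: {metadata_url}")
    path = parts.path[: -len(metadata_suffix)] + content_suffix
    # Keep scheme, host, query and fragment unchanged; only the path is rewritten.
    return urlunsplit(parts._replace(path=path))


derived = content_url("https://example.org/records/123.xml?lang=en")
```

Working on the parsed path rather than on the raw string keeps a query string such as `?lang=en` from interfering with the suffix check.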
Note
|
This requirement refers only to content resources that can be used as input for a TDM process (i.e. corpora of publications etc.), or to knowledge resources (e.g. annotation schemas, typesystems, grammars etc.) that are used as ancillary resources for the operation of TDM components. The requirement is set for the proper operation of the TDM components in so far as they need to easily access the resources they operate on or with. |
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE |
Jun-16 |
Partial |
Actual content (either PDF, text, or other) is available by constructing a URL based on identifiers (internal CORE id, OAI identifier, DOI) provided in the metadata record |
Final |
OpenAIRE |
Jun-16 |
Partial |
A link to the DOI is often, but not always, present, and the DOI usually resolves to a landing page; however, there are some rules that can be used for "guessing" the download link |
Final |
Frontiers |
NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 |
Full |
From the XML files, the URL can be reliably constructed using the DOI. |
Final |
TheSoz |
Jun-16 |
Full |
The URL of the resource and the SPARQL endpoint are provided in the void.ttl file (however, the file is not available online at the moment due to technical issues). |
Final |
Agrovoc |
21/01/2016 |
Full |
Final |
|
JATS |
1.1 |
Partial |
Although the JATS schemas (as DTDs, RELAX NG & XSD) and elements are available at stable URIs, each in a separate folder (info is provided at https://jats.nlm.nih.gov/files.html), there is no formal metadata to indicate this |
Final |
OLiA |
Jun-16 |
Full |
Final |
|
LAPPS |
Jun-16 |
Full |
Final |
30. Metrics for the confidence level of the TDM operation should be included in the metadata
Important
|
This requirement has been deprecated, as it is premature and requires an evaluation framework to be defined first. |
It is important to include metadata stating the confidence level of the TDM operation in order to help users select the appropriate components.
Note
|
This requirement is considered premature; although it is important, we need first to define the appropriate metrics, methods and tools and perform evaluation on components in an objective way and then add it to their metadata. |
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
No |
Alvis does not provide this aspect. Alvis does not yet deal with metadata. |
Final |
ARGO |
0.5 |
No |
Argo components don’t have any specific metadata relating to confidence levels that would help users select components when developing a workflow. We would expect this information to reside within the main textual description of a component; however, none of the existing Argo components provide this. There are a number of Argo components that produce confidence level values against annotations they produce during a workflow execution; however, this wouldn’t appear to directly relate to this requirement. |
Final |
DKPro Core |
1.8.0 |
No |
The confidence/quality of models (NOT necessarily tools) typically depends heavily on the processed data. It would be a lot of effort to provide such information. The main problem, however, is that there is not a wide range of free suitable gold standard corpora that could be used to generate such information. |
Final |
GATE |
8.2 |
Partial |
Confidence level is available for tools where this is appropriate. In some cases, e.g. a tool for converting data from one format to another, it is not appropriate |
Final |
ILSP |
1.2.1 |
Partial |
Some ILSP tools provide confidence level attributes for annotations, e.g. named entities. |
Final |
31. Metrics for the performance of the TDM operation should be included in the metadata
Important
|
This requirement has been deprecated, as it is premature and requires an evaluation framework to be defined first. |
It is important to include metadata related to the performance of the TDM operation in order to help users select the appropriate components.
Note
|
This requirement is considered premature; although it is important, we need first to define the appropriate metrics, methods and tools and perform evaluation on components in an objective way and then add it to their metadata. |
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
No |
Alvis does not provide this aspect. |
Final |
ARGO |
0.5 |
No |
No performance information given by existing components |
Final |
DKPro Core |
1.8.0 |
No |
See above |
Final |
GATE |
8.2 |
Partial |
Performance is sometimes available for specific tools over specific collections, but not generally. Quoting performance would also need us to quote the conditions under which it was obtained |
Final |
ILSP |
1.2.1 |
No |
Not supported |
Final |
32. Version must be included in the metadata description for all resources
Important
|
This requirement has been deprecated. It is covered by 6. Components should have a unique identifier and a version number and 102. Adding version information in the metadata descriptions of all resources. |
Version information must be included in the metadata of all resources.
Note
|
At this point, we only wish to check whether versioning is represented in the metadata, and the recommended action can be to at least add the last update date; however, to ensure interoperability, we need to create guidelines for defining the notion of "version" for the same resource type and promote standardisation of its representation, at least per resource type; we also need to distinguish between versions of the metadata records and versions of the resources themselves |
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE |
Jun-16 |
Partial |
The metadata format harvested at the moment (oai_dc) does not contain such information. However, other metadata formats planned for future integration (such as RIOXX / OpenAIRE) do contain it; if the source repository provides it, it will be captured by CORE (and therefore propagated from there to OMTD). |
Final |
OpenAIRE |
Jun-16 |
No |
not in the metadata of the original data provider |
Final |
Frontiers |
NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 |
Partial |
Licensing and version are included in the body of the article. For supplementary materials, the most recent version is used. Frontiers has not published an article using tools. |
Final |
TheSoz |
Jun-16 |
Full |
owl:versionInfo in void.ttl file |
Final |
Agrovoc |
21/01/2016 |
Partial |
modification date (dcterms:modified) is available |
Final |
JATS |
1.1 |
Full |
Provided in the schema |
Final |
OLiA |
Jun-16 |
Partial |
no version id, but a textual description of the resource’s evolution. |
Final |
LAPPS |
Jun-16 |
No |
|
Final |
Licences |
Jun-16 |
Full |
Reasons of (legal) clarity and certainty require the most clear and reliable information to be attached to the resource/s |
Final |
Alvis |
0.5rc |
No |
Alvis does not provide this aspect. |
Final |
ARGO |
0.5 |
No |
Argo components have versions specified in both their POM and UIMA XML files, however these are usually not updated by developers, are not guaranteed to be synchronised and are not exploited by the Argo system. For example, an Argo instance will only ever use the latest version of a component when executing a workflow. |
Final |
DKPro Core |
1.8.0 |
Full |
Version is included in the POM file and in the UIMA XML descriptors for components and in the POM file and properties file for each model. |
Final |
GATE |
8.2 |
No |
Planned |
Final |
ILSP |
1.2.1 |
Partial |
Only for components integrated in infrastructures like CLARIN and METASHARE. For example https://goo.gl/yDynbu |
Final |
33. Licensing information must be included in the metadata
All resources must be accompanied with a clear indication of their licensing terms; this can be indicated by the licence name, reference to a url or document that is accessible to the user and, if possible, a user-friendly summary.
Note
|
For this version, we will allow also "open access" or "other" etc. as statements in the metadata; in the next versions, this needs to be standardised, promoting licence values. It would also be preferable to attach the full document of the licence for (legal) certainty reasons |
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE |
Jun-16 |
Partial |
Corpus coming from Open Access repositories, explicit licensing information included where available |
Final |
OpenAIRE |
Jun-16 |
Partial |
All publications are harvested from open access sources, but licence values are not always present in the metadata of the original data provider |
Final |
Frontiers |
NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 |
Full |
CC BY licensing and version are included in the body of the article. |
Final |
TheSoz |
Jun-16 |
Full |
|
Final |
Agrovoc |
21/01/2016 |
Full |
|
Final |
JATS |
1.1 |
Partial |
Although not in the metadata, the documentation includes a FAQ stating "The Standard (both PDF and HTML versions) is copyrighted by NISO, but all of the non-normative information found on this site is in the public domain. That includes all of the schemas and the Tag Libraries. The Tag Sets may be used freely and without permission from either the NLM or NISO." |
Final |
OLiA |
Jun-16 |
No |
Not yet; planned to be released under a Creative Commons Attribution-ShareAlike licence as soon as a reference publication has appeared |
Final |
LAPPS |
Jun-16 |
No |
The LAPPS Web Service Exchange Vocabulary does not include any types for license information. |
Final |
Licences |
Jun-16 |
Full |
Only for standard public licences |
Final |
Alvis |
0.5rc |
No |
Alvis does not provide this aspect, but some licence information is provided as free text in the tool descriptions or in the README of some tools. |
Final |
ARGO |
0.5 |
Partial |
Licensing information regarding components can be included within the Argo Descriptor File; however, the majority of existing components within Argo do not have this information. When available, it is in free text. |
Final |
DKPro Core |
1.8.0 |
Partial |
Licensing information for our code is included in the POM file. Licenses for transitive dependencies can be reached (where available) by traversing the dependency hierarchy and inspecting the POM files of the dependencies. For models, we have no license information most of the time. |
Final |
GATE |
8.2 |
Partial |
The tool metadata does not provide licensing information, but license information is distributed with the tool |
Final |
ILSP |
1.2.1 |
Partial |
Only for components integrated in infrastructures like CLARIN and METASHARE. For example https://goo.gl/yDynbu |
Final |
34. Licensing information should be expressed in a machine-readable form
Licensing terms should be expressed in a machine-readable form so that the s/w can automatically compute permitted uses (cf. REL & CC-like licensing elements).
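As a minimal sketch of what "automatically compute permitted uses" could look like, the following Python snippet models licences with CC REL-style elements (permissions, requirements, prohibitions) and combines them across the resources used together in a workflow. The `Licence` class, the `combined_terms` function and the two sample records are illustrative assumptions and not part of any OpenMinTeD schema; only the element names are borrowed from the CC REL vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Licence:
    """Hypothetical machine-readable licence record using CC REL-style terms."""
    name: str
    permits: frozenset
    requires: frozenset = frozenset()
    prohibits: frozenset = frozenset()


# Illustrative records; a real registry would load these from RDF (e.g. CC REL or ODRL).
CC_BY = Licence(
    name="CC BY 4.0",
    permits=frozenset({"Reproduction", "Distribution", "DerivativeWorks"}),
    requires=frozenset({"Attribution", "Notice"}),
)
CC_BY_NC_ND = Licence(
    name="CC BY-NC-ND 4.0",
    permits=frozenset({"Reproduction", "Distribution"}),
    requires=frozenset({"Attribution", "Notice"}),
    prohibits=frozenset({"CommercialUse"}),
)


def combined_terms(licences):
    """Return the terms that hold when all given resources are used together.

    A use is permitted only if every licence permits it; requirements and
    prohibitions accumulate across licences.
    """
    licences = list(licences)
    if not licences:
        raise ValueError("at least one licence is required")
    permits = frozenset.intersection(*(lic.permits for lic in licences))
    requires = frozenset().union(*(lic.requires for lic in licences))
    prohibits = frozenset().union(*(lic.prohibits for lic in licences))
    return Licence("combined", permits, requires, prohibits)
```

Calling `combined_terms([CC_BY, CC_BY_NC_ND])` yields a record whose permissions no longer include `DerivativeWorks` and whose prohibitions include `CommercialUse`. Taking the intersection of permissions is a deliberately conservative choice for this sketch; real licence compatibility analysis is more involved.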
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Licences |
Jun-16 |
Partial |
A few standard licences are available in machine-readable form, in RDF (CC-REL/ODRL), but they are not all officially recognised; additionally, for licences of DKPro Core, there is usually a URL reference to the license text in the POM, which is used by the Maven SPDX plugin to generate an RDF description of the licenses. |
Final |
DKPro Core |
1.8.0 |
Partial |
Licensing information for our code is included in the POM file. Licenses for transitive dependencies can be reached (where available) by traversing the dependency hierarchy and inspecting the POM files of the dependencies. For models, we have no license information most of the time. |
Final |
35. All resources must include a unique persistent identifier
Important
|
This requirement has been deprecated. It is covered by REQ-6 and WG1-8. |
All resources must include an identifier (e.g. PID, DOI etc.) that allows them to be uniquely, non-ambiguously and persistently identified.
Note
|
for this version, we only need a unique identifier, of whichever scheme the provider already uses; for next versions, we should recommend a scheme |
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE |
Jun-16 |
Full |
Offer a list of identifiers (internal CORE id, source’s OAI identifier and where available or resolved via external service a DOI) |
Final |
OpenAIRE |
Jun-16 |
Partial |
We keep the original OAI identifier and all identifiers harvested from the metadata of the original data provider; but the identifiers may not always be obligatory |
Final |
Frontiers |
NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 |
Full |
A DOI is used as the identifier. All resources follow the PMC naming convention. |
Final |
TheSoz |
Jun-16 |
No |
|
Final |
Agrovoc |
21/01/2016 |
Full |
|
Final |
JATS |
1.1 |
Full |
see point 1 |
Final |
OLiA |
Jun-16 |
Full |
|
Final |
LAPPS |
Jun-16 |
Full |
|
Final |
Licences |
Jun-16 |
Full |
For standard licences, the URL can also be used; this would meet the demand for a clear and stable locus of the resource/s |
Final |
Alvis |
0.5rc |
Partial |
A local identification of the resources is done. |
Final |
ARGO |
0.5 |
Full |
All components and type systems have unique identifiers. |
Final |
DKPro Core |
1.8.0 |
Full |
Components can be uniquely identified using their corresponding (<groupId>, <artifactId>, <version>) tuple |
Final |
GATE |
8.2 |
Full |
Fully qualified class name of processing resources is unique. |
Final |
ILSP |
1.2.1 |
Partial |
Only for components integrated in infrastructures like CLARIN and METASHARE. For example https://goo.gl/yDynbu |
Final |
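The DKPro Core entry above identifies components by their Maven coordinates. As a small sketch of how such an identifier could be handled in metadata tooling, the following Python snippet parses the common `groupId:artifactId:version` string form into a tuple. The `Coordinates` type and the `parse_coordinates` function are hypothetical helpers, the sample string in the usage note is purely illustrative, and longer coordinate forms (with packaging or classifier) are deliberately not handled.

```python
from typing import NamedTuple


class Coordinates(NamedTuple):
    """Maven-style coordinates used as a unique component identifier."""
    group_id: str
    artifact_id: str
    version: str


def parse_coordinates(text):
    """Parse 'groupId:artifactId:version'; reject anything else."""
    parts = text.strip().split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected groupId:artifactId:version, got {text!r}")
    return Coordinates(*parts)
```

For example, `parse_coordinates("org.example:example-tokenizer:1.0.0")` returns a `Coordinates` tuple that can serve directly as a dictionary key in a registry.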
36. Classification metadata should be included, where applicable, in the metadata record of the resource
It is highly recommended to include classification metadata for resources, as applicable (e.g. domain, text type, genre etc.); the preferred form should be in accordance with one of the recommended controlled vocabularies/authority lists; if not, a link to a URL that contains the list of values used for the specific classification should be included; in all cases, the source of the values used for the classification must be properly indicated (through the name of or a link to the vocabulary) and available to the user; mappings between the controlled vocabularies are recommended.
Note
|
For this version, we include in the metadata what exists in the resource metadata; for next versions, we promote standardisation through at least the marking of the classification scheme. It is also important to distinguish between the various classification information, e.g. subtypes of documents or tools vs. domain of documents and domain covered by a tool. |
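The rule that every classification value must indicate its source vocabulary lends itself to a simple automated check. The sketch below assumes a hypothetical record layout in which each classification entry is a dictionary with a `value` and either a `scheme_name` or a `scheme_uri`; these field names are invented for illustration and do not come from any existing schema.

```python
def unsourced_classifications(classifications):
    """Return the classification entries that neither name nor link their vocabulary."""
    return [
        entry
        for entry in classifications
        if not (entry.get("scheme_name") or entry.get("scheme_uri"))
    ]
```

An empty result means that every classification in the record points to its vocabulary by name or by link.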
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE |
Jun-16 |
Partial |
Very limited classification metadata: only what the original metadata record offers (keywords, topics, or others that are usually not part of a defined vocabulary) |
Final |
OpenAIRE |
Jun-16 |
Partial |
Mainly keywords from the original metadata record; but this is not always present; in some subsets we have also used topic classification algorithms |
Final |
Frontiers |
NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 |
Partial |
Frontiers is in the middle of deploying an XML format based on JATS, i.e. article types are mapped to JATS. The domain, field, specialty and taxonomy used are Frontiers-defined. |
Final |
TheSoz |
Jun-16 |
Full |
via DBpedia URIs. The following values are present in the current |
Final |
Agrovoc |
21/01/2016 |
Full |
DBpedia URIs |
Final |
Alvis |
0.5rc |
No |
Not yet |
Final |
ARGO |
0.5 |
No |
No metadata regarding component classification; however, the name of a component is generally a good indicator |
Final |
DKPro Core |
1.8.0 |
Partial |
Tool classification is recorded in the documentation (e.g. at https://dkpro.github.io/dkpro-core/releases/1.7.0/components/ ) but not in the metadata. However, the documentation is automatically created (since 1.8.0) and a classification can be automatically derived from the component names. |
Final |
GATE |
8.2 |
Partial |
Tool classification is partially described by naming conventions |
Final |
ILSP |
1.2.1 |
No |
Not supported |
Final |
37. Information on the structural annotation (layout) of resources should be included in the metadata of the resource
Concreteness: abstract
Strength: recommended
Category: WG1
Status: final
Resources, such as publications & text files from corpora, that are annotated at the structural level (i.e. are tagged as to elements of their internal structure) should include in their metadata a reference to the annotation scheme used (e.g. TEI, JATS etc.), through a name or a link to a URL.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE |
Jun-16 |
Partial |
Structure of resources not included as part of metadata record but described separately |
Final |
OpenAIRE |
Jun-16 |
Partial |
If the resource is included, it may be PDF, HTML or Word, but without any structural encoding |
Final |
Frontiers |
NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 |
Full |
Article metadata include references to the annotations used. Frontiers uses the NLM DTD. |
Final |
38. Access mode of resources must be included in the metadata
The access mode of all resources (e.g. for s/w, whether they are downloadable or accessible as web services/workflows, and for L/KRs, whether they are accessible through interfaces, SPARQL endpoints etc.) must be included in the metadata.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE |
Jun-16 |
Full |
Resources are available, accessible and downloadable by default. If resources are not available, this is explicitly stated in a separate field (marked as deleted/disabled) |
Final |
OpenAIRE |
Jun-16 |
Partial |
If the resource is available, it can only be downloadable |
Final |
Frontiers |
NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 |
Full |
For researchers, the page includes access to the resources (PDF, ePUB, XML, readcube). For crawlers, the FTP site to access resources can be provided upon request. |
Final |
TheSoz |
Jun-16 |
Full |
SPARQL endpoint and URI lookup endpoint in |
Final |
Agrovoc |
21/01/2016 |
Full |
void:sparqlEndpoint, void:dataDump |
Final |
JATS |
1.1 |
No |
No formal metadata description for JATS |
Final |
OLiA |
Jun-16 |
Full |
Downloadable URLs are given in rdf:about and xml:base attributes |
Final |
LAPPS |
Jun-16 |
Full |
The LAPPS Web Service Exchange Vocabulary defines the type |
Draft |
Alvis |
0.5rc |
No |
No explicit and formal specification of access mode |
Final |
ARGO |
0.5 |
Partial |
Components can be declared as native to Argo (allowing them to be executed within the Argo execution framework) or as web services, however there are also existing native components that simply wrap web services, so these types are not always mutually exclusive. |
Final |
DKPro Core |
1.8.0 |
Full |
All of the components are portable software packages available via Maven artifact repository |
Final |
GATE |
8.2 |
Full |
All components are downloadable packages |
Final |
ILSP |
1.2.1 |
Full |
Only for components integrated in infrastructures like CLARIN and METASHARE. For example https://goo.gl/yDynbu |
Final |
39. Content resources must include metadata on their format (e.g. XML, DOCX etc.)
Concreteness: abstract
Strength: mandatory
Category: WG1
Status: final
Content resources must include metadata on their format (e.g. XML, DOCX etc.), preferably in accordance with the IANA MIME types; where applicable, further information on the specific type of the format (e.g. XML compatible with a specific XSD schema) should be included; when the resource is the output of an OMTD operation, the metadata should automatically receive the appropriate format values.
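As a rough illustration of the format metadata described here, the following Python sketch derives a MIME type from a file name using the standard-library `mimetypes` module and optionally records the more specific schema. The `describe_format` function and the `mimeType`/`schema` keys are assumptions made for this example; they are not taken from the OMTD metadata schema.

```python
import mimetypes


def describe_format(filename, schema=None):
    """Build a hypothetical format metadata entry for a content resource."""
    # guess_type() maps the file extension to a MIME type, or None if unknown.
    mime_type, _encoding = mimetypes.guess_type(filename)
    entry = {"mimeType": mime_type or "application/octet-stream"}
    if schema is not None:
        # Further detail on the format, e.g. the schema an XML file conforms to.
        entry["schema"] = schema
    return entry
```

A call such as `describe_format("paper.xml", schema="JATS 1.1")` records both the generic format and the schema it conforms to. Guessing from the file extension is only a fallback; where the provider states the format explicitly, that value should take precedence.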
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE |
Jun-16 |
Partial |
Resources contain their format type in their response Content-type header. Not included in the metadata record |
Final |
OpenAIRE |
Jun-16 |
No |
Various formats (doc, pdf, html etc.) but not specified in the metadata |
Final |
Frontiers |
NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 |
Full |
Available in XML in NLM 3.0, available as JATS. Also available in ePUB |
Final |
40. Component metadata must include standardised categories/tags that make them easy to discover
Important
|
This requirement has been deprecated. It is covered by REQ-8. |
The classification and description of components (web services and workflows, but also tools) must include metadata that describe the task they perform (e.g. annotation, named entity recognition), preferably in the form of a standard(ised) vocabulary.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
No |
No metadata |
Final |
ARGO |
0.5 |
Partial |
Components can be grouped by type (readers, writers, analytics, web services) and by the type systems they use. |
Final |
DKPro Core |
1.8.0 |
Partial |
Tool classification is recorded in the documentation (e.g. at https://dkpro.github.io/dkpro-core/releases/1.7.0/components/ ) but not in the metadata. However, the documentation is automatically created (since 1.8.0) and a classification can be automatically derived from the component names. |
Final |
GATE |
8.2 |
No |
|
Final |
ILSP |
1.2.1 |
Partial |
Only for components integrated in infrastructures like CLARIN and METASHARE. For example https://goo.gl/yDynbu |
Final |
41. Content resources must include metadata on their language(s)
Content resources including language, knowledge and text files must include metadata on their language(s)
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE |
Jun-16 |
Partial |
Described only in the metadata record, not for each content resource separately |
Final |
OpenAIRE |
Jun-16 |
Partial |
If in the metadata of the original data provider, the document language is specified |
Final |
Frontiers |
NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 |
Full |
Meta-text available for language. Other fields defined. |
Final |
TheSoz |
Jun-16 |
Full |
via language attribute for each entity in the thesaurus |
Final |
Agrovoc |
21/01/2016 |
Full |
Concept labels include language attributes, see http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus |
Final |
JATS |
1.1 |
No |
Only in English as far as I have checked |
Final |
OLiA |
Jun-16 |
No |
Language only indirectly deducible from the description of the resource in the data type property 'comment' value. |
Final |
LAPPS |
Jun-16 |
No |
No |
Final |
42. The metadata can include the information on which projects/workflows involve the resource
Important
|
This requirement is deprecated. It should be added back as a functional requirement for the workflow execution service (WFEX). The functional requirement should state that the registry should be able to collect and display all resources (e.g. workflows, components, etc.) that use a given resource (e.g. a component, model, KR, etc.). This information must be dynamically collected and displayed in the registry through a query. It cannot be part of static metadata. |
The metadata can include information on which projects/workflows involve the resource, making it possible to gauge resource "popularity", increase reliability, and improve the interoperability of TDM outputs.
Note
|
For use inside OMTD, this can be moved to the platform functions; for use out of OMTD, this can be left to the providers/users, so we can at best recommend it. This information can be used for evaluation purposes. |
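The deprecation notice for this requirement asks for usage information to be collected dynamically through a query rather than stored in static metadata. A minimal sketch of such a query, assuming the registry can supply a mapping from each resource to the resources it uses, is an inverted index. The function name and the identifiers in the usage note are hypothetical.

```python
from collections import defaultdict


def build_usage_index(dependencies):
    """Invert a 'resource -> resources it uses' mapping.

    Returns a dict mapping each used resource to the sorted list of
    resources that depend on it directly.
    """
    used_by = defaultdict(set)
    for user, used in dependencies.items():
        for resource in used:
            used_by[resource].add(user)
    return {resource: sorted(users) for resource, users in used_by.items()}
```

Given `{"workflow-a": ["tokenizer", "ner-model"], "workflow-b": ["tokenizer"]}`, the index lists both workflows under `"tokenizer"`. Because the index is computed from the current registry contents, it never goes stale in the way a static metadata field would.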
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE |
Jun-16 |
Partial |
Only if the source includes this information. CORE covers the RIOXX format (which offers this information and is the OpenAIRE equivalent for the UK), but only a small subset of UK repositories is compliant with it (and UK repositories are only a small subset of the global coverage, so this is essentially a very minor portion of the whole corpus) |
Final |
OpenAIRE |
Jun-16 |
Partial |
Only as regards funding |
Final |
TheSoz |
Jun-16 |
No |
Unknown |
Final |
Agrovoc |
21/01/2016 |
No |
Unknown |
Final |
JATS |
1.1 |
No |
Unknown |
Final |
OLiA |
Jun-16 |
No |
Unknown |
Final |
LAPPS |
Jun-16 |
No |
Unknown |
Final |
Licences |
Jun-16 |
Unknown |
Unknown |
Final |
Alvis |
0.5rc |
No |
No metadata |
Final |
ARGO |
0.5 |
No |
This would be fairly simple to implement on a per-instance basis of Argo; all of the data required to conform to this requirement is held within an SQL database used by Argo. |
Final |
DKPro Core |
1.8.0 |
No |
It would be unfeasible to maintain a list of which project/workflows make use of our components. We do not know who is using the software and for which purposes. |
Final |
GATE |
8.2 |
No |
Unknown |
Final |
ILSP |
1.2.1 |
No |
Unknown |
Final |
43. S/W (tools, web services, workflows) must indicate whether they are language-independent or the language(s) of the resources they take as input and output
S/W (tools, web services, workflows) must indicate whether they are language-independent or the language(s) of the resources they take as input and output
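To show how such a declaration could be consumed, the sketch below checks whether a component can process a resource in a given language. It assumes a hypothetical convention in which a component declares either `None` (language-independent) or a collection of language codes; neither the convention nor the function name comes from an existing metadata schema.

```python
def supports_language(component_languages, resource_language):
    """Check whether a component can process a resource in the given language.

    component_languages: None for a language-independent component,
    otherwise an iterable of language codes such as "en" or "de".
    """
    if component_languages is None:
        return True
    declared = {code.lower() for code in component_languages}
    return resource_language.lower() in declared
```

A workflow editor could use a check like this to warn users before they connect, say, an English-only tagger to a German corpus.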
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
No |
Not supported |
Final |
ARGO |
0.5 |
No |
UIMA descriptors allow the definition of languages supported by a component, however Argo doesn’t make use of this information at present and currently no Argo components declare this metadata. |
Final |
DKPro Core |
1.8.0 |
Partial |
Most DKPro Core components take a model as a parameter. The model then usually determines the language. Also for components that are locked to a specific language, this information is currently not included in the component metadata. |
Final |
GATE |
8.2 |
No |
Not currently provided |
Final |
ILSP |
1.2.1 |
Partial |
Only for components integrated in infrastructures like CLARIN and METASHARE. For example https://goo.gl/yDynbu |
Final |
44. Statistical metadata that allow monitoring of resource versions may accompany resources
Resources may include statistical data (e.g. size of resource, total number of concepts, etc.) which allow monitoring of changes across versions; this will help measure the performance of components with different versions of the same resource.
Note
|
For version 1 this is set to optional, but for version 2 we should increase the strength for specific metadata elements. |
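As a sketch of how statistical metadata could support monitoring across versions, the following Python function compares the statistics of two versions of a resource and reports the metrics that changed. The function name and the metric names in the usage note are made up for illustration.

```python
def compare_statistics(old, new):
    """Return {metric: (old_value, new_value, delta)} for changed metrics.

    Only metrics present in both versions are compared; metrics whose
    value did not change are omitted.
    """
    return {
        metric: (old[metric], new[metric], new[metric] - old[metric])
        for metric in sorted(old.keys() & new.keys())
        if old[metric] != new[metric]
    }
```

For instance, comparing `{"concepts": 100, "labels": 250}` with `{"concepts": 120, "labels": 250}` reports only the `concepts` metric, with a delta of 20.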
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
TheSoz |
Jun-16 |
Partial |
number of triples contained inside the resource and number of triples aligned with other resources (e.g. dbpedia) |
Final |
Agrovoc |
21/01/2016 |
Full |
number of concepts, labels |
Final |
OLiA |
Jun-16 |
No |
The OLiA website does not specify any statistics, or metadata elements covering statistics. http://nachhalt.sfb632.uni-potsdam.de/owl/ |
Final |
LAPPS |
Jun-16 |
No |
The LAPPS Web Service Exchange Vocabulary does not specify any metadata elements covering statistics. |
Final |
45. S/W (tools, web services, workflows) must indicate format of their output
s/w must indicate the specific formats of their output (e.g. statistical data, annotated corpora etc.)
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
No |
Information is limited |
Final |
ARGO |
0.5 |
Partial |
Components can optionally declare what types they will output. |
Final |
DKPro Core |
1.8.0 |
Partial |
via @TypeCapability tags and also sometimes declared as javadoc. Mind that some components cannot clearly state the supported inputs/outputs until they actually process a document and have loaded the necessary models/rules. It may also be the case that despite declaring a certain input, the component may work without it or that despite declaring a certain output, the component may not actually produce it (e.g. because the user disabled the creation of the output or because a document simply contains no instance of the specific type) |
Final |
GATE |
8.2 |
No |
GATE components do not specify any information regarding their input and output, mostly as there is no defined type system in GATE. Essentially we use the document as a blackboard so the input and output is the document with extra information added to it in the form of annotations and/or features, but each component is free to use that blackboard as they wish. |
Final |
ILSP |
1.2.1 |
Partial |
Only for components integrated in infrastructures like CLARIN and METASHARE. For example https://goo.gl/yDynbu |
Final |
46. Output resources of web services/workflows must be accompanied by provenance metadata
Important
|
This requirement is deprecated. It should be added back as a functional requirement for the workflow execution service (WFEX). |
Output resources of web services/workflows must automatically be accompanied by appropriate metadata that can be derived from the usage of the service/workflow (e.g. format, language etc.) & indicating the processing thereof (provenance).
Note
|
The mechanism of adding this information is a function of the OMTD platform, but the actual values of the elements that need to be enriched or added are dependent on the s/w, so this information must be present in the s/w metadata that describes the output |
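The division of labour described in this note (the platform adds the metadata, the s/w metadata supplies the values) can be sketched as follows. The dictionary layout, including the `id`, `version`, `output` and `provenance` keys, is an assumption made for this example only.

```python
def derive_output_metadata(input_metadata, component_metadata):
    """Return metadata for a component's output without modifying the inputs.

    Values declared under the component's "output" key (e.g. format) override
    the input values; everything else (e.g. language) is inherited.
    """
    output = dict(input_metadata)
    output.update(component_metadata.get("output", {}))
    step = {
        "component": component_metadata["id"],
        "version": component_metadata["version"],
    }
    # Append this processing step to any provenance chain already present.
    output["provenance"] = list(input_metadata.get("provenance", [])) + [step]
    return output
```

Applying the function once per workflow step builds up a provenance chain, so the final output records every component (and version) that touched it.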
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
Partial |
Alvis can provide some information, for example data formats and logs |
Final |
ARGO |
0.5 |
No |
Not supported |
Final |
DKPro Core |
1.8.0 |
Partial |
Some information is recorded in the TagSetDescription, TagDescription and DocumentMetaData. |
Final |
GATE |
8.2 |
No |
Not supported |
Final |
ILSP |
1.2.1 |
Partial |
Only for components integrated in infrastructures like CLARIN and METASHARE. For example https://goo.gl/yDynbu |
Final |
47. Information on funding of resources may be included in the metadata
Information on funding sources (e.g. projects, funders data etc.) may be included in the metadata of the resources
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE |
Jun-16 |
Partial |
same as point 17 |
Final |
OpenAIRE |
Jun-16 |
Full |
If the original provider includes it, we harvest it |
Final |
Frontiers |
NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 |
Partial |
New articles have the source of funding, if authors declare (i.e. connection to fundref) |
Final |
TheSoz |
Jun-16 |
No |
No funding information included in metadata |
Final |
Agrovoc |
21/01/2016 |
No |
only its creator (FAO) |
Final |
JATS |
1.1 |
No |
No formal metadata exist for JATS |
Final |
OLiA |
Jun-16 |
No |
Some ontologies reference the project in which the resource was created. |
Final |
LAPPS |
Jun-16 |
No |
The LAPPS Web Service Exchange Vocabulary does not specify any metadata elements related to funding information. |
Final |
Alvis |
0.5rc |
No |
No funding information included in metadata |
Final |
ARGO |
0.5 |
No |
Not implemented as we have not foreseen a need for this. We would expect this type of information to exist in the textual description of a component. No plans to implement. |
Final |
DKPro Core |
1.8.0 |
No |
DKPro Core is a community project. |
Final |
GATE |
8.2 |
No |
We do support a help URL though where funding information may be listed, but nothing explicit. |
Final |
ILSP |
1.2.1 |
No |
Not a standard practice |
Final |
48. All resource metadata records must include a reference to the metadata schema used for their description
Note
|
Deprecated because this is a requirement on each metadata schema rather than the actual resources |
All resource metadata records must include a reference to the metadata schema used for their description.
Note
|
for provenance information, we need to know the metadata schema from which we are extracting/converting information. |
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE |
Jun-16 |
Partial |
Schema not described formally in .xsd format but as part of API documentation |
Final |
OpenAIRE |
Jun-16 |
Full |
OpenAIRE schema |
Final |
Frontiers |
NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 |
Partial |
XML provides the reference to the metadata schema. In a multiple metadata scenario, Frontiers provides the DOI of the related articles. |
Final |
TheSoz |
Jun-16 |
Full |
defined in the header |
Final |
Agrovoc |
21/01/2016 |
Full |
see rdf prefixes |
Final |
JATS |
1.1 |
No |
no metadata |
Final |
OLiA |
Jun-16 |
Full |
List of namespaces |
Final |
LAPPS |
Jun-16 |
No |
|
Final |
Alvis |
0.5rc |
No |
No metadata support |
Final |
ARGO |
0.5 |
Partial |
Argo components are defined using UIMA XML descriptors, which do have a schema reference, and with an optional Argo XML descriptor, which does not have a schema reference. |
Final |
DKPro Core |
1.8.0 |
Partial |
UIMA XML descriptors and POMs contain a schema reference. Ad-hoc DKPro Core-specific metadata is currently not governed by a formal schema. |
Final |
GATE |
8.2 |
No |
|
Final |
ILSP |
1.2.1 |
No |
Not supported |
Final |
49. Metadata of tools should contain information about the models available for them
Important
|
This requirement is deprecated. Neither the model creators nor the component creators may know of all models/components they are compatible with. We discussed keeping the component metadata, the model metadata, and the linking between components and models separate. A suggestion is to rephrase the requirement as "resource providers (e.g. components) should provide a list of other resources (e.g. models) they are compatible with". The concrete format of such a compatibility list likely needs to be defined. Additionally, the requirement should be turned into a functional requirement stating that the registry should be able to collect and display all resources (e.g. workflows, components, etc.) that are compatible with a given resource (e.g. a component, model, KR, etc.). This information must be dynamically collected and displayed in the registry through a query. It cannot be part of static metadata. This is different from REQ-42. REQ-42 is about actually depending on/making use of other resources. REQ-49 is about being compatible with (i.e. potentially making use of) other resources. |
Information about the models available for the tools (e.g. location, language, domain, etc.) must be included in the metadata (cf. https://dkpro.github.io/dkpro-core/releases/1.7.0/models/ as an example).
Note
|
This is important for the operation of the components; so, we need to standardize naming/reference to the models and add it to the metadata. |
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis |
0.5rc |
No |
No metadata support |
Final |
ARGO |
0.5 |
No |
Not supported and not planned. Would require a mechanism adding to Argo which allows users to register models against components. |
Final |
DKPro Core |
1.8.0 |
Full |
The POM of a tool contains references to the known available models. This information can be used to automatically derive which models are available for which tools. We do this as part of our automatically generated documentation. |
Final |
GATE |
8.2 |
No |
|
Final |
ILSP |
1.2.1 |
No |
Not supported |
Final |
50. Documentation references should be versioned
When there is a link to documentation, that documentation should be versioned and should be about the version of a component/resource which carries the reference metadatum. The link should not simply point at the latest version of the documentation which may or may not be relevant for the given component/resource version in question.
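One simple way to check this automatically, assuming the documentation URL embeds the version as a path segment (as the DKPro Core links cited elsewhere in this document do), is sketched below. The `doc_link_matches_version` function is a hypothetical helper, and other versioning schemes (e.g. query parameters or version-specific host names) would need different checks.

```python
from urllib.parse import urlsplit


def doc_link_matches_version(doc_url, version):
    """Return True if `version` appears as a whole path segment of `doc_url`."""
    segments = [segment for segment in urlsplit(doc_url).path.split("/") if segment]
    return version in segments
```

For example, `https://dkpro.github.io/dkpro-core/releases/1.7.0/components/` matches version `1.7.0` but not `1.8.0`, so a metadata validator could flag a 1.8.0 component record that still carries the 1.7.0 link.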
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
TheSoz |
Jun-16 |
No |
Documentation is not versioned (http://lod.gesis.org/thesoz/de/help.html) |
Final |
Agrovoc |
21/01/2016 |
No |
Documentation (e.g. FAQ and other documents) listed on http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus are not versioned |
Final |
JATS |
1.1 |
No |
Documentation is not versioned |
Final |
OLiA |
Jun-16 |
No |
See http://nachhalt.sfb632.uni-potsdam.de/owl/ - where documentation is provided, it is not versioned |
Final |
LAPPS |
Jun-16 |
No |
The LAPPS Web Service Exchange Vocabulary is not versioned itself and thus there is also no versioned documentation. |
Final |
Licences |
Jun-16 |
Partial |
Identifying the version of the associated documentation is optimal, but the documentation could still be linked to "any later version" (as long as it is specified that the version number in question applies to it and to any later version) |
Final |
Alvis |
0.5rc |
No |
No version |
Final |
ARGO |
0.5 |
No |
There is currently no mechanism in place for linking changes in a version of a component with any links to its documentation. |
Final |
DKPro Core |
1.8.0 |
Full |
Documentation is versioned in parallel with the source code |
Final |
GATE |
8.2 |
No |
There is currently no mechanism in place for linking changes in a version of a component with any links to its documentation. |
Final |
ILSP |
1.2.1 |
No |
Not supported |
Final |
51. License should be attached
Concreteness: abstract
Strength: recommended
Category: WG3
Status: draft
The licence (full legal document) must be attached to the resource, or be publicly available (e.g. URI to permanent web page).
The licence should be attached to the resource, since there is no guarantee that the permanent URI will not change in the future. Public projects such as CC, GNU and Apache provide reasonable certainty that the licence will remain publicly available at a permanent location. This has become standard practice, but for smaller projects and/or individual providers a URI may not be a sufficient guarantee.
52. License information must be in metadata
Important
|
This requirement has been deprecated as a duplicate of 33. Licensing information must be included in the metadata. |
All resources must have a licence clearly indicated in the metadata description of the resource.
53. Licensor must be entitled to grant license
Concreteness: abstract
Strength: recommended
Category: WG3
Status: draft
The licence must be applied by the right holder or an authorised party.
54. Licensees should remain with a copy of the license
Concreteness: abstract
Strength: recommended
Category: WG3
Status: draft
Researchers, data providers and repositories should retain a copy of the text of the licence.
55. Standard licenses should be used
Concreteness: abstract
Strength: recommended
Category: WG3
Status: draft
Licences should be expressed in standard form (Public Licences such as CCPL, GNU GPL, etc). Custom-devised licences create additional transaction costs in terms of compatibility.
56. License should be machine readable
Concreteness: abstract
Strength: recommended
Category: WG3
Status: draft
The licence should be expressed not only in the full standard legal document, but also in a machine readable form, preferably using one of the standard Rights Expression Languages (RELs) and relevant metadata schemas and vocabularies.
57. License should be understandable by non-lawyers
Concreteness: abstract
Strength: recommended
Category: WG3
Status: draft
The licence should also be expressed in a statement that makes it easy for non-lawyers to understand the main rights and obligations (similar to the CC “commons deed”).
58. TDM must be explicitly allowed
Concreteness: abstract
Strength: recommended
Category: WG3
Status: draft
To be TDMable a licence must specifically allow Text and Data Mining for (commercial; non commercial; research only; any;) purposes.
Alternatively, [not sure how to express this condition] to be TDMable a licence must specifically identify the rights included in its scope and the uses permitted.
59. Right for (temporary) reproduction must be granted
Concreteness: abstract
Strength: recommended
Category: WG3
Status: draft
In order to allow TDM a licence must permit the reproduction, at least temporary, of the content protected by (copyright, SGDR, related rights).
60. Boundary for derivative work must be clearly defined
Concreteness: abstract
Strength: recommended
Category: WG3
Status: draft
The licence must clarify what activities can be done with the results of TDM which can qualify as a derivative work of the original protected work.
61. No restrictions on TDM results which are not derived works
Concreteness: abstract
Strength: recommended
Category: WG3
Status: draft
The licence should not unduly restrict what can be done with results that are not derivative works of the original protected works.
62. World-wide and irrevocable license grant
Concreteness: abstract
Strength: recommended
Category: WG3
Status: draft
To be TDMable a licence should not include restrictions as to the geographical or temporal dimension of its validity.
The licence must be irrevocable in order to grant the maximum level of reusability.
63. License must qualify for Open Access rights
Concreteness: abstract
Strength: recommended
Category: WG3
Status: draft
Open Access compliant licences must include in their scope all rights covered by copyright and related rights, such as reproduction, redistribution, communication to the public, making available to the public, public performance, translation, adaptation, etc.
64. License must qualify for Open Access uses
Concreteness: abstract
Strength: recommended
Category: WG3
Status: draft
Open Access compliant licences must cover all subject matter related to copyright and related rights such as: performers’ performances, broadcasts of broadcasting organisations, sound recordings (Rome), first fixation of a film (EU), SGDR (EU), and employ an open ended wording to include also additional (usually national or regional) related rights (e.g.: non original photographs, scientific editions of public domain works, typographical arrangements, etc).
65. License must qualify for Open Access must not restrict use in any way
Concreteness: abstract
Strength: recommended
Category: WG3
Status: draft
Open Access compliant licences must permit the exercise of the aforementioned rights by anyone for any purpose, without limitations regarding temporal or geographical dimensions, nor should the purpose of the use be limited to non commercial, research only, academic uses, and similar unduly restrictive wording.
66. License must qualify for Open Access may include attribution requirements
Concreteness: abstract
Strength: recommended
Category: WG3
Status: draft
Open Access compliant licences may only feature limitations connected with the requirement to acknowledge authorship (paternity) and to guarantee the integrity of the work.
67. Knowledge Resource Element Id
Concreteness: abstract
Strength: recommended
Category: WG2
Status: final
For the purposes of resource access and resource schema harmonization, schema elements in Knowledge Resources should be discoverable, identifiable and dereferencable/resolvable through a URI. This is in line with the Linked Data paradigm. If deemed necessary, multiple ids are allowed for the same knowledge resource element if these are linked by means of an equivalence relation in the registry.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
TheSoz |
June 2016 |
Full |
Schema elements are from SKOS vocabulary and an extension of it defined by http://lod.gesis.org/thesoz/ext/thesoz_ext.rdf |
Final |
Agrovoc |
21/01/2016 |
Full |
URI concept identifiers, see http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus |
Final |
JATS |
1.1 |
Partial |
JATS is available in the form of a suite of DTD’s. All elements and attributes are documented with their own html page but do not have a URI assigned to them. |
Final |
OLiA |
June 2016 |
Full |
OLiA is available as OWL models, with URI concept identifiers |
Final |
LAPPS |
June 2016 |
Full |
The LAPPS Web Service Exchange Vocabulary is specified in RDF and the concept identifiers are URLs that point to their respective documentation. |
Final |
Ontolex |
June 2016 |
Full |
Documentation is versioned in parallel with the source code |
Final |
CLARIN CCR |
June 2016 |
Full |
All concepts have their own id, a uri issued by handle. |
Final |
schema.org |
June 2016 |
Full |
Final |
68. Data Category Linking Vocabulary
Concreteness: abstract
Strength: recommended
Category: WG2
Status: final
Linking elements from different schemas can be modelled in various ways. The most straightforward is the set of coarse grained lightweight thesaural mapping relations expressed by SKOS (https://www.w3.org/2004/02/skos/). This is our preferred linking approach, in combination with the following Owl and RDF object properties: owl:sameAs, rdfs:subClassOf, and rdfs:subPropertyOf, owl:equivalentClass, and owl:equivalentProperty. The compliance assessment checks the use of mentioned linking vocabularies within the KRs themselves, requesting an answer to the question: does the resource contain any of these links?
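The compliance question above ("does the resource contain any of these links?") can be approximated mechanically. The following is a minimal sketch, assuming the KR is available as N-Triples text and that the SKOS mapping relations meant are `exactMatch`, `closeMatch`, `broadMatch`, `narrowMatch` and `relatedMatch`. The class and method names are hypothetical, and the line-based scan is deliberately naive; a real check would use an RDF parser.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Naive scan for cross-schema linking properties in N-Triples text (illustrative only). */
final class LinkingVocabularyCheck {

    private static final String OWL = "http://www.w3.org/2002/07/owl#";
    private static final String RDFS = "http://www.w3.org/2000/01/rdf-schema#";
    private static final String SKOS = "http://www.w3.org/2004/02/skos/core#";

    // OWL/RDFS properties named in the requirement, plus the SKOS mapping relations (assumed set).
    static final List<String> LINKING_PROPERTIES = List.of(
            OWL + "sameAs", OWL + "equivalentClass", OWL + "equivalentProperty",
            RDFS + "subClassOf", RDFS + "subPropertyOf",
            SKOS + "exactMatch", SKOS + "closeMatch", SKOS + "broadMatch",
            SKOS + "narrowMatch", SKOS + "relatedMatch");

    /** Returns the linking properties that occur in predicate position in the given lines. */
    static Set<String> findLinkingProperties(List<String> nTriplesLines) {
        Set<String> found = new TreeSet<>();
        for (String line : nTriplesLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            // "subject predicate object ." -> the predicate is the second token
            String[] parts = trimmed.split("\\s+", 3);
            if (parts.length < 3) {
                continue;
            }
            String predicate = parts[1];
            if (predicate.startsWith("<") && predicate.endsWith(">")) {
                String iri = predicate.substring(1, predicate.length() - 1);
                if (LINKING_PROPERTIES.contains(iri)) {
                    found.add(iri);
                }
            }
        }
        return found;
    }
}
```

A non-empty result would correspond to a "Full" answer in the table below; an empty result only shows that none of the listed properties occurs as a predicate.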
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
TheSoz |
June 2016 |
Full |
Relations are specified using |
Final |
Agrovoc |
21/01/2016 |
Full |
RDF object properties used, see http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus |
Final |
JATS |
1.1 |
No |
No linking to other resources. |
Final |
OLiA |
June 2016 |
Full |
OLiA makes use of the relevant OWL and RDF properties, see http://purl.org/olia/olia.owl |
Final |
LAPPS |
June 2016 |
Full |
The LAPPS Web Service Exchange Vocabulary includes several pointers to related concepts, e.g. in IsoCat. |
Final |
DKPro Core |
1.8.0 |
Full |
The DKPro Core type system documentation includes several pointers to related concepts, e.g. in IsoCat, LAPPS, and others. Since not all of the resources pointed to have proper identifiers (e.g. type name conventions in GATE), some pointers are ad-hoc. |
Final |
Ontolex |
June 2016 |
Full |
Documentation is versioned in parallel with the source code |
Final |
CLARIN CCR |
June 2016 |
No |
The CCR does not support linking between concepts; apparently, this will be done through a Relation Registry (which is not yet ready) or at the component/profile level in CMDI. |
Final |
schema.org |
June 2016 |
Full |
RDF properties - see http://schema.org/docs/schema_org_rdfa.html |
Final |
69. Interoperability between elements from different knowledge resource schemas should be expressed through RDF statements.
Concreteness: abstract
Strength: recommended
Category: WG2
Status: final
Description
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
TheSoz |
June 2016 |
Full |
Final |
|
Agrovoc |
21/01/2016 |
Full |
RDF statements used, see http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus |
Final |
JATS |
1.1 |
N/A |
No linking to other resources |
Final |
OLiA |
June 2016 |
Full |
RDF statements are used, see http://purl.org/olia/olia.owl |
Final |
LAPPS |
June 2016 |
Full |
The LAPPS Web Service Exchange Vocabulary expresses relations with other resources (specifically IsoCat) using |
Final |
Ontolex |
June 2016 |
Full |
See https://www.w3.org/community/ontolex/wiki/Final_Model_Specification |
Final |
CLARIN CCR |
June-2016 |
No |
No linking to other resources |
Final |
schema.org |
1.2.1 |
No |
RDF is used, see http://schema.org/docs/schema_org_rdfa.html |
Final |
70. All KR content elements need to be added as text annotations within a TDM workflow.
Concreteness: abstract
Strength: mandatory
Category: WG2
Status: final
When working with a knowledge resource (KR), its relevant elements need to be added as text metadata. In other words, any text annotations created from information in the KR should be linked to the URI of the relevant KR element. This enables the TDM modules to work in a uniform manner with the informational content the KRs provide. This is in line with the Linked Data paradigm.
This is a methodological requirement and as such not suited for the compliance check.
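As an illustration of the linking described above, a text annotation can simply carry the URI of the KR element it was created from. The sketch below is an assumption about what such a data structure could look like; the class `KrAnnotation` and its fields are hypothetical and not part of any OpenMinTeD API.

```java
import java.net.URI;

/** A text annotation that records which knowledge resource element it was derived from. */
final class KrAnnotation {
    final int begin;     // inclusive character offset in the document text
    final int end;       // exclusive character offset in the document text
    final URI krElement; // URI of the KR element this annotation is linked to

    KrAnnotation(int begin, int end, URI krElement) {
        if (begin < 0 || end < begin) {
            throw new IllegalArgumentException("invalid span: " + begin + ".." + end);
        }
        if (!krElement.isAbsolute()) {
            // A relative reference could not be resolved outside its original context.
            throw new IllegalArgumentException("KR element URI must be absolute");
        }
        this.begin = begin;
        this.end = end;
        this.krElement = krElement;
    }

    /** Returns the text span covered by this annotation. */
    String coveredText(String document) {
        return document.substring(begin, end);
    }
}
```

Because every annotation exposes the same `krElement` field regardless of which KR it came from, downstream modules can treat annotations from different KRs uniformly.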
71. The KR should be ingestible through a URI
Concreteness: abstract
Strength: recommended
Category: WG2
Status: final
The KR should be downloadable through a URI for ingest (possibly after adaptation) into the TDM workflow.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
TheSoz |
June 2016 |
Full |
SPARQL endpoint specified in |
Final |
Agrovoc |
21/01/2016 |
Full |
See links to downloads and endpoints linked from http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus |
Final |
JATS |
1.1 |
Full |
In XSD (and RELAX-NG & DTD formats) and accessible as three different schemas but the user has to select one of these. |
Final |
OLiA |
June 2016 |
Full |
Available as an OWL download, see http://purl.org/olia/olia.owl |
Final |
LAPPS |
June 2016 |
Full |
The LAPPS Web Service Exchange Vocabulary exists in various formats on the website and can be downloaded from there. |
Final |
Ontolex |
June 2016 |
Full |
Final |
|
CLARIN CCR |
June-2016 |
No |
No download link provided; only browsing mode |
Final |
schema.org |
1.2.1 |
Full |
All schemas are hosted at schema.org. Downloads are also available, see e.g. http://schema.org/version/3.1/ |
Final |
72. The KR format should be in a standard format such as XML, JSON-LD or RDF/XML.
Concreteness: abstract
Strength: recommended
Category: WG2
Status: final
Standard formats enable standard ingestion into the TDM workflow.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
TheSoz |
June 2016 |
Full |
RDF |
Final |
Agrovoc |
21/01/2016 |
Full |
RDF, see http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus |
Final |
JATS |
1.1 |
Full |
Standard format: XSD/RELAX-NG/DTD |
Final |
OLiA |
June 2016 |
Full |
RDF/XML - see http://purl.org/olia/olia.owl |
Final |
LAPPS |
June 2016 |
Full |
The LAPPS Web Service Exchange Vocabulary is available in JSON-LD, RDF/XML and other formats from their website. The RDF/XML and JSON-LD versions can also be obtained by specifying the appropriate Accept header in the HTTP request when accessing |
Final |
Ontolex |
June 2016 |
Full |
See https://www.w3.org/community/ontolex/wiki/Final_Model_Specification |
Final |
CLARIN CCR |
June-2016 |
Full |
Collections as SKOS/RDF |
Final |
schema.org |
1.2.1 |
Full |
RDFa, other formats available |
Final |
73. Stick to widely used data compression formats
Concreteness: concrete
Strength: best practice
Category: WG4
Status: draft
Compression algorithms are used to reduce the size of data. To avoid problems in working with such data (creating or consuming), only algorithms/formats with broad support across platforms and programming languages should be used. These are:
-
gzip (also known as gz)
-
zip
-
xz
-
tar (including tar.gz)
Specifically to be avoided are commercial algorithms and their implementations, e.g.
-
rar
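Broad support means, in practice, that a format can be read and written with the standard library alone. The following minimal sketch shows a gzip round trip using only `java.util.zip`; the class name `GzipExample` is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Round trip through gzip using only the Java standard library. */
final class GzipExample {

    /** Compresses the UTF-8 encoding of the given text with gzip. */
    static byte[] compress(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the gzip stream writes the trailer, so the buffer is read only afterwards.
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Decompresses gzip data and decodes it as UTF-8 text. */
    static String decompress(byte[] data) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The same package also covers zip (`ZipInputStream`, `ZipOutputStream`). To our knowledge, xz and tar are not covered by the Java standard library and need a third-party library.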
74. Machine-readable metadata for UIMA components
The UIMA framework defines a series of so-called resource specifiers that contain metadata about components. The following types of components and associated resource specifiers are supported by OpenMinTeD:
Component | Resource specifier |
---|---|
Reader component |
|
Analysis engine / Writer component |
It is expected that these descriptors can be located within the same JAR file that also contains the component they describe.
The descriptor file should be located next to the Java class file that implements the described component, as indicated by the implementationName
element within the descriptor.
The JAR file must contain a locator file called META-INF/org.apache.uima.fit/components.txt
which contains pointers to all component descriptor files within a JAR.
classpath*:my/organizations/domain/name/package/Component.xml
The locator file should be automatically generated using the generate
goal of the uimaFIT Maven Plugin as described in the uimaFIT documentation.
There must be at most one resource specifier per component class, i.e. no two resource specifiers may use the same implementation class.
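A consumer of the locator file needs to turn its entries into classpath-relative descriptor paths. The following is a minimal sketch, assuming one entry per line in the form shown in the example above; the class `ComponentLocatorParser` is hypothetical and not part of uimaFIT.

```java
import java.util.ArrayList;
import java.util.List;

/** Extracts descriptor paths from the lines of a components.txt locator file. */
final class ComponentLocatorParser {

    // Prefix used by the example entry; assumed to be optional here.
    private static final String PREFIX = "classpath*:";

    /** Returns the classpath-relative descriptor paths, skipping blank lines. */
    static List<String> descriptorPaths(List<String> lines) {
        List<String> paths = new ArrayList<>();
        for (String line : lines) {
            String entry = line.trim();
            if (entry.isEmpty()) {
                continue;
            }
            paths.add(entry.startsWith(PREFIX) ? entry.substring(PREFIX.length()) : entry);
        }
        return paths;
    }
}
```

For the example entry above, the result is the single path `my/organizations/domain/name/package/Component.xml`, which can then be looked up inside the JAR.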
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
ARGO |
0.5 |
Partial |
Argo components are described using standard UIMA XML Descriptor files, however the descriptor files are not required to be located next to the implementation class. Argo components don’t contain the |
Draft |
DKPro Core |
1.8.0 |
Full |
Resource specifier files automatically generated using uimaFIT Maven plugin. |
Draft |
ILSP |
1.2.1 |
Partial |
Resource specifiers (XML component descriptors) are manually maintained. There is one such specifier in the src/main/resources directory of each component or of a package grouping a set of similar components, e.g. a set of reader or |
Draft |
75. Embedding UIMA component metadata into the source code
The uimaFIT library provides Java annotations to embed most types of UIMA metadata as Java annotations directly into the source code of UIMA components. Particularly relevant for OpenMinTeD are the following uimaFIT annotations:
uimaFIT annotation | Description |
---|---|
|
For reader components: defines the supported input media types; for writer components: defines the supported output media types. The media type should be an official IANA media type. |
|
Defines the UIMA types that a component can consume or produce. |
|
Defines the languages a UIMA component supports. The language should be specified according to IETF BCP47. |
|
Identifies a component parameter, whether the parameter is mandatory/optional, and may define a default parameter value. |
Component metadata other than the ones specified are automatically derived from the source code and generated into a UIMA resource specifier by using the uimaFIT Maven plugin goal generate
, e.g. the parameter types, implementation name, etc.
The uimaFIT @ResourceMetaData
annotation must not be used, as it would interfere with keeping the Maven POM and the UIMA resource specifier metadata synchronized.
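The BCP47 requirement for language capabilities mentioned in the table above can be checked with the Java standard library. The sketch below uses `Locale.Builder.setLanguageTag`, which rejects ill-formed tags, whereas `Locale.forLanguageTag` silently discards ill-formed portions. The class name `LanguageTagCheck` is hypothetical, and the check covers well-formedness only, not whether the subtags are actually registered.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

/** Well-formedness check for IETF BCP47 language tags. */
final class LanguageTagCheck {

    /** Returns true if the tag is a non-empty, well-formed BCP47 language tag. */
    static boolean isWellFormedBcp47(String tag) {
        if (tag == null || tag.isEmpty()) {
            // setLanguageTag would merely reset the builder for these inputs.
            return false;
        }
        try {
            new Locale.Builder().setLanguageTag(tag);
            return true;
        } catch (IllformedLocaleException e) {
            return false;
        }
    }
}
```

A tag such as `en-GB` passes, while the Java-style `en_GB` (with an underscore) is rejected.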
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
ARGO |
0.5 |
Partial |
Some Argo components make use of the uimaFIT annotations, although it is not a requirement of Argo for components to do so. It is planned that, at some point, we will convert all of our existing components to use uimaFIT annotations. |
Draft |
DKPro Core |
1.8.0 |
Partial |
uimaFIT annotations are used. Language codes are expected to be ISO 639-1. Many components use private media types because many supported formats do not have a corresponding media type. |
Draft |
ILSP |
1.2.1 |
Partial |
The main requirement is that components are described with UIMA XML descriptors, accompanied by UIMA AS deployment descriptors that are used when components are deployed as services. |
Draft |
76. Separating UIMA metadata from the component
Storing the component metadata as an XML resource specifier within the same JAR file that also hosts the described component facilitates separating the metadata from the component.
Specifically, the JAR file can be downloaded and the resource specifier can be located within the JAR using META-INF/org.apache.uima.fit/components.txt
without having to download any (transitive) dependencies. The resource specifier can be immediately used, e.g. for ingestion into the OpenMinTeD registry.
Mind that the uimaFIT annotations embedded in the component’s class file cannot be used as easily. Extracting the metadata from the class file requires resolving the class file, which in turn requires having access to the transitive dependencies. This would entail a much larger download and would be impracticable for the OpenMinTeD registry crawler.
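The point that no class loading is needed can be illustrated with plain JAR access. The following minimal sketch reads the locator file directly from a downloaded JAR using `java.util.jar.JarFile`; the class `JarLocatorReader` is hypothetical, and the assumption that the file is UTF-8 text with one entry per line is ours.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

/** Reads the uimaFIT component locator file from a JAR without loading any classes. */
final class JarLocatorReader {

    static final String LOCATOR = "META-INF/org.apache.uima.fit/components.txt";

    /** Returns the non-empty lines of the locator file, or an empty list if the JAR has none. */
    static List<String> readLocator(Path jarPath) {
        try (JarFile jar = new JarFile(jarPath.toFile())) {
            JarEntry entry = jar.getJarEntry(LOCATOR);
            if (entry == null) {
                return List.of();
            }
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(jar.getInputStream(entry), StandardCharsets.UTF_8))) {
                return reader.lines()
                        .map(String::trim)
                        .filter(line -> !line.isEmpty())
                        .collect(Collectors.toList());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The resource specifiers named by the returned entries can be read from the same `JarFile` in the same way, so only the single JAR has to be downloaded.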
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
ARGO |
0.5 |
Full |
UIMA resource specifiers are manually maintained by the resource developers. |
Draft |
DKPro Core |
1.8.0 |
Full |
UIMA resource specifiers are auto-generated using the uimaFIT Maven plugin. |
Draft |
ILSP |
1.2.1 |
Full |
UIMA Component Descriptor XML files are manually maintained. |
Draft |
77. Unique identifiers and versions for UIMA components
Important
|
This requirement has been deprecated. It has been merged with REQ-82 into REQ-96. |
The unique identifier for a UIMA component is constructed from the following coordinates that are either obtained from the Maven POM or from the UIMA resource specifier.
Coordinate | Description | Source |
---|---|---|
groupId |
Group ID of the Maven artifact containing the component |
Maven POM |
artifactId |
Artifact ID of the Maven artifact containing the component |
Maven POM |
version |
Version of the Maven artifact containing the component |
Maven POM |
packaging |
optional Packaging of the Maven artifact containing the component |
Maven POM |
classifier |
optional Classifier of the Maven artifact containing the component |
Maven POM |
class |
Name of the class implementing the component |
UIMA resource specifier |
The identifier has the following forms depending on the presence of the optional coordinates:
uima-mvn:{groupId}:{artifactId}:{version}#{class}
uima-mvn:{groupId}:{artifactId}:{classifier}:{version}#{class}
uima-mvn:{groupId}:{artifactId}:{packaging}:{classifier}:{version}#{class}
uima-mvn:my.organization.tdm:dictionary-tools:1.0.1#my.organization.tdm.DictionaryLookup
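The three identifier forms can be produced by one small function. The following is a minimal sketch; the class `UimaMavenId` is hypothetical. Since no form is listed for a packaging without a classifier, the sketch rejects that combination, which is our interpretation rather than something the requirement states.

```java
/** Builds uima-mvn identifiers from Maven coordinates and the component class name. */
final class UimaMavenId {

    /** The packaging and classifier arguments are optional and may be null. */
    static String of(String groupId, String artifactId, String packaging,
            String classifier, String version, String className) {
        if (packaging != null && classifier == null) {
            // Assumption: none of the listed forms has a packaging without a classifier.
            throw new IllegalArgumentException("packaging requires a classifier");
        }
        StringBuilder id = new StringBuilder("uima-mvn:")
                .append(groupId).append(':').append(artifactId);
        if (packaging != null) {
            id.append(':').append(packaging);
        }
        if (classifier != null) {
            id.append(':').append(classifier);
        }
        return id.append(':').append(version).append('#').append(className).toString();
    }
}
```

Calling `UimaMavenId.of("my.organization.tdm", "dictionary-tools", null, null, "1.0.1", "my.organization.tdm.DictionaryLookup")` yields the example identifier shown above.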
Note
|
The version is also available from the UIMA resource specifier. It must be in sync with the version specified in the Maven POM. The uimaFIT Maven plugin goal generate ensures this by determining the version in the generated UIMA resource specifier from the version specified in the Maven POM. The overrideComponentVersion and componentVersion settings in the uimaFIT Maven plugin must not be used. Neither must the uimaFIT @ResourceMetaData annotation be used to manually provide a version.
|
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
ARGO |
0.5 |
Unknown |
Unknown |
Draft |
DKPro Core |
1.8.0 |
Full |
UIMA components are provided as Maven artifacts. |
Draft |
ILSP |
Unknown |
Unknown |
Draft |
78. Specifying input and output types of UIMA components
The input and output of UIMA components should be declared in the component’s resource specifier using type capabilities. Capabilities should be declared in the component’s source code using the uimaFIT @TypeCapability
annotation.
@TypeCapability(
inputs = {
"de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token",
"de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence" },
outputs = {
"de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS" })
It may not always be known what types a component produces or consumes. For example, the production/consumption of some types might be optional. Optional types should always be declared, even if they are not always produced/consumed.
The types being produced or consumed by truly generic components may depend on their specific parameter settings or on the models they are configured with (e.g. a rule-based transducer). Such components should not be expected to declare input/output types.
UIMA type capabilities only specify the names of input and output types, but not where to locate the type system information or which version of the type system is used.
OpenMinTeD may therefore check the dependencies of the Maven artifact containing the component for type system declarations using the META-INF/org.apache.uima.fit/types.txt
locator mechanism.
Note
|
Searching for types in the transitive dependencies of a component requires downloading all the transitive dependencies. This is something we explicitly wanted to avoid in 3. Component metadata must be separable from the component. |
Note
|
Annotation types for input and output resources can be mapped to the OMTD-SHARE element annotationLevel embedded in the inputContentResourceInfo and the outputResourceInfo respectively. |
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
ARGO |
0.5 |
Partial |
All components have the ability (within their UIMA XML descriptor) to declare their input/output types, however existing Argo components generally lack this information. This is something we will correct. |
Draft |
DKPro Core |
1.8.0 |
Full |
Input/output capabilities declared where possible. |
Draft |
ILSP |
1.2.1 |
Full |
input and output types of UIMA components are specified in each component’s XML descriptor. |
Draft |
79. Documentation of UIMA components
The documentation of UIMA components should be maintained as JavaDoc in the component source code. The uimaFIT Maven plugin goal enhance
picks the documentation of components and parameters up from there and embeds it into the compiled class. The goal generate
finally transfers the documentation into the automatically generated UIMA resource specifiers.
The description properties of the uimaFIT @ConfigurationParameter
and of the @ResourceMetaData
annotations should not be used.
Component documentation should be provided as JavaDoc on the component class itself.
/**
* Part-of-Speech annotator using OpenNLP.
*/
public class OpenNlpPosTagger
extends JCasAnnotator_ImplBase
{
Parameter documentation should be placed before a constant declaration which defines the parameter’s name, as shown in the example below. This ensures that IDEs show the parameters and associated documentation when auto-completion is invoked on a UIMA component. It also allows the uimaFIT Maven plugin goal enhance
to pick up the documentation and to automatically transfer it into the description property.
/**
* Use this language instead of the document language to resolve the model.
*/
public static final String PARAM_LANGUAGE = "language";
@ConfigurationParameter(name = PARAM_LANGUAGE, mandatory = false)
protected String language;
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
ARGO |
0.5 |
Partial |
Documentation is manually maintained within the UIMA resource specifiers. |
Draft |
DKPro Core |
1.8.0 |
Full |
Documentation is maintained as JavaDoc and transferred into UIMA resource specifiers using the uimaFIT Maven plugin. |
Draft |
ILSP |
1.2.1 |
Partial |
Partial documentation of component source code is maintained and generated as JavaDoc. |
Draft |
80. Common elements to represent/describe an executable workflow
Concreteness: concrete
Strength: recommended
Category: WG4
Status: deprecated
Important
|
This requirement has been deprecated. Its parent abstract requirement (REQ-18) should be covered by the functionality of the OpenMinTeD platform, namely the Galaxy workflow editor. |
A uniform workflow language makes it possible to use the same language elements to represent/describe a workflow (see 18. Workflows should be described using an uniform language). For example, the Common Workflow Language (CWL) is a workflow language used in many systems, such as Galaxy. The common parts of an executable workflow representation are the following:
Parts | Description |
---|---|
Main_desc |
To provide information about the workflow. |
Main_inputs |
To define the inputs parameters of the workflow. |
Main_outputs |
To define the output parameters of the workflow. |
Steps |
To define the steps of the workflows. |
Main_desc
is the part of the workflow representation that gives information about the workflow. It contains the following fields:
Field | Description |
---|---|
name |
Name of the workflow. |
command |
Define the command to use for invoking the workflow. |
version |
Version of the workflow. |
label |
A human readable description of the workflow. |
Main_inputs
and Main_outputs
are the parts of the workflow representation that define, respectively, the input and output parameters
of the workflow. Each parameter
contains at least the following fields:
Field | Description |
---|---|
id |
to identify this parameter. |
type |
type of values this parameter accepts e.g. float, string, File … |
value |
to specify a default value for this parameter, if exists. |
Steps
is the part of the workflow representation that defines the list of modules
and the list of dependencies
between modules of the workflow. Each module
is defined with the following fields :
Field | Description |
---|---|
name |
name of the module. |
command |
define the command to use for invoking this module. |
inputs |
define the inputs parameters of this module. See the |
outputs |
define the outputs parameters of this module. See the |
The dependencies between the modules define the dataflow of the workflow. They specify which outputs are fed into which inputs during execution. Each dependency
is defined with the following fields:
Field | Description |
---|---|
module.output.id |
Is the |
module.input.id |
Is the |
Note
|
All elements described are supported by CWL, which is built on technologies such as JSON-LD for data modeling and Docker for portable runtime environments. A further part, only partially supported by CWL, could be added to express control structures such as loops and conditions in the workflow.
|
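To make the role of the dependencies concrete, the sketch below derives an execution order for the modules from the dependency list (a topological sort). It is an illustration under our own assumptions: the class `WorkflowSketch` and the representation of a dependency as a pair of module/parameter identifiers are hypothetical and are not taken from CWL or Galaxy.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Derives a module execution order from the dependencies of a workflow. */
final class WorkflowSketch {

    /** An output of one module feeds an input of another module. */
    static final class Dependency {
        final String fromModule;
        final String outputId;
        final String toModule;
        final String inputId;

        Dependency(String fromModule, String outputId, String toModule, String inputId) {
            this.fromModule = fromModule;
            this.outputId = outputId;
            this.toModule = toModule;
            this.inputId = inputId;
        }
    }

    /** Orders the modules so that each one runs after all modules that feed it. */
    static List<String> executionOrder(List<String> modules, List<Dependency> dependencies) {
        // Number of not yet satisfied incoming dependencies per module.
        Map<String, Integer> pending = new LinkedHashMap<>();
        for (String module : modules) {
            pending.put(module, 0);
        }
        for (Dependency d : dependencies) {
            if (!pending.containsKey(d.fromModule) || !pending.containsKey(d.toModule)) {
                throw new IllegalArgumentException("dependency refers to an unknown module");
            }
            pending.merge(d.toModule, 1, Integer::sum);
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String module = ready.remove();
            order.add(module);
            for (Dependency d : dependencies) {
                // Each satisfied dependency brings the consumer closer to being runnable.
                if (d.fromModule.equals(module)
                        && pending.merge(d.toModule, -1, Integer::sum) == 0) {
                    ready.add(d.toModule);
                }
            }
        }
        if (order.size() != pending.size()) {
            throw new IllegalStateException("the dependencies contain a cycle");
        }
        return order;
    }
}
```

A cycle is reported as an error because this sketch has no notion of loops; as noted above, such control structures would require an additional part in the workflow representation.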
81. Embedding GATE component metadata into the source code
GATE provides a number of Java annotations that can be used to embed CREOLE metadata into the source of GATE components. The annotations relevant to OpenMinTeD are as follows:
Annotation | Description |
---|---|
|
used to annotate a class to show that it is a GATE resource. The type of resource (processing resource, language resource, etc.) depends upon the class hierarchy. |
|
used to annotate a bean setter method to indicate that the property defines a CREOLE parameter. A parameter can be provided with a default value using the |
All other metadata is determined at runtime via inspection of the class.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
GATE |
8.2 |
Full |
Components can be fully specified through Java annotations |
draft |
82. Unique identifiers and versions for GATE components
Important
|
This requirement has been deprecated. It has been merged with REQ-77 into REQ-96. |
Each GATE component is a Java class and as such can be uniquely identified by its class name. Components are to be distributed via Maven and versioning information will be recorded as part of the POM. Together these two sources provide the following information:
Coordinate | Description | Source |
---|---|---|
groupId |
Group ID of the Maven artifact containing the component |
Maven POM |
artifactId |
Artifact ID of the Maven artifact containing the component |
Maven POM |
version |
Version of the Maven artifact containing the component |
Maven POM |
class |
Name of the class implementing the component |
GATE CREOLE Metadata (derived from classname) |
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
GATE |
8.2 |
Partial |
Components are not yet distributed by Maven and as such versioning info is not available |
draft |
83. Documentation of GATE components
GATE components and their parameters are documented in two ways: CREOLE metadata and JavaDoc comments.
It is possible to give each component and parameter a name and a human readable comment (used as a tooltip in the GUI) via the CREOLE metadata. Components can also be given a URL of more extensive documentation that the user can be redirected to. Combined with the way the GUI restricts configuration fields based upon automatically extracted metadata, this documentation should be enough to aid regular users in configuring and using a component.
JavaDoc comments are used to provide more API-style documentation of each component and parameter. It is envisaged that this information is useful to those trying to embed GATE within another application rather than using the GUI.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
GATE |
8.2 |
Partial |
While GATE provides ways of documenting components it is difficult to enforce this on all components especially those developed by 3rd parties |
draft |
84. Separating GATE metadata from the component
CREOLE metadata can be completely defined using an XML file (called creole.xml
located at the root of a GATE plugin),
however, we don’t recommend this, as it is too easy for the embedded metadata and the XML version to go out of sync.
A standalone copy of the metadata is nevertheless useful: extracting the metadata from the class file requires
resolving the class, which in turn requires access to its transitive dependencies, and that would be inefficient in
situations such as the OpenMinTeD registry crawler.
We recommend that during the build process a fully expanded copy of the metadata is produced and stored within the
resulting JAR as META-INF/gate/creole.xml
so that other systems, such as the OpenMinTeD registry crawler, can read
the metadata directly without reference to any of the class files of the components or their transitive dependencies.
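To illustrate why a standalone copy helps a crawler, the following minimal sketch reads META-INF/gate/creole.xml straight from a JAR using only the Java standard library, without loading any component class. The class name StandaloneCreoleReader is hypothetical, and the sketch assumes the file is UTF-8 encoded.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Optional;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

/** Hypothetical helper: reads the standalone CREOLE metadata without loading any class. */
final class StandaloneCreoleReader {

    /** Location recommended above for the fully expanded metadata. */
    static final String ENTRY_NAME = "META-INF/gate/creole.xml";

    /** Returns the metadata XML, or an empty Optional if the JAR does not ship it. */
    static Optional<String> read(Path jar) throws IOException {
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            JarEntry entry = jarFile.getJarEntry(ENTRY_NAME);
            if (entry == null) {
                return Optional.empty();
            }
            try (InputStream in = jarFile.getInputStream(entry)) {
                // Assumption: the file is UTF-8 encoded.
                return Optional.of(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            }
        }
    }
}
```

A crawler could call StandaloneCreoleReader.read(path) for each downloaded artifact and fall back to class-based extraction only when the result is empty.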
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
GATE | 8.2 | Partial | While we provide the tools needed to produce the standalone metadata it is not currently part of the standard build process and so rarely happens. This will change with the next version of GATE. | Draft |
85. Unique identifier and version for components in the OMTD-SHARE schema
Important: This requirement has been deprecated. It has been replaced with REQ-96.
The OMTD-SHARE schema includes specific distinct metadata elements for identifier and version of all resources. More specifically:
Name | Element/Attribute | Type | Strength | Repeatability | Description |
---|---|---|---|---|---|
identifier | element | xs:string | mandatory | yes | Reference to a PID, DOI or any kind of identifier used by the resource provider for the resource; must obligatorily be used with the attribute resourceIdentifierSchemeName |
resourceIdentifierSchemeName | attribute | open controlled vocabulary | mandatory | no | Reference to the resource scheme that has been used for the identifier (e.g. DOI, handle etc.); the list of values includes the most widespread schemes used for resources |
schemeURI | attribute | URI | mandatory when applicable | no | Reference to a URI of the resource scheme that has been used for the identifier, providing more information about the resource scheme |
version | attribute | xs:string | recommended | no | Any string, usually a number, that identifies the version of a resource |
Components that already include an identifier and version from previous metadata records can map them to these elements.
Note: For components collected from Maven, the Maven coordinates (groupId, artifactId, version) can be used as the identifier, with a reference to the Maven scheme (i.e. use the attribute schemeURI with the value "https://maven.apache.org/pom.html#Maven_Coordinates").
Note: There is an ongoing discussion about a unique Persistent Identifier (PID) and the procedure for assigning it, which may lead to further specifications.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis | | Unknown | Unknown | Draft |
ARGO | 0.5 | Unknown | Unknown | Draft |
DKPro Core | 1.8.0 | Unknown | Unknown | Draft |
GATE | 8.2 | Unknown | Unknown | Draft |
ILSP | | Unknown | Unknown | Draft |
86. Attaching format properties to the description of the inputs and outputs that use files
Important: This requirement has been deprecated. It is not necessary, as the associated abstract REQ-45 is implemented by REQ-88.
Some components (tools) will produce and consume files. To reduce mismatches and help handle the files properly, the following properties/features of the file formats are relevant (to be attached to the description of the inputs/outputs that use files):
Feature | Strength | Description |
---|---|---|
id | required | Identifies the format |
label | optional | Label or name to display |
mimeType | optional | MIME type as described in the metadata (e.g., ms:mimeType) |
encoding | optional | Character encoding as described in the metadata (e.g., ms:characterEncoding) |
parents | optional | Formats from which this format is derived, which makes it possible to maintain a hierarchy of formats (Galaxy administrators manage data types based on such a hierarchy) |
doc_url | optional | Link to the complete documentation provided in the metadata (e.g., ms:documentationURL) |
The features are normally present in the metadata (e.g., ms:dataFormat) and in the embedded annotations (e.g., Java annotations). To simplify, we can assume that id is the only required element, even though all of them may be important for users (e.g., to identify a specific file), developers (e.g., to understand a file for implementation) or tools (e.g., to automatically process a file).
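As a rough illustration of the feature list, the sketch below models a format description in which only id is required and parents forms a hierarchy that can be walked. The class FileFormat, its method isA and the registry map are hypothetical names chosen for this example, and the hierarchy is assumed to be free of cycles.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical holder for the format features listed above; only id is required. */
final class FileFormat {
    final String id;            // required
    final String label;         // optional, may be null
    final String mimeType;      // optional, may be null
    final String encoding;      // optional, may be null
    final List<String> parents; // optional, ids of the formats this one derives from
    final String docUrl;        // optional, may be null

    FileFormat(String id, String label, String mimeType, String encoding,
               List<String> parents, String docUrl) {
        if (id == null || id.isBlank()) {
            throw new IllegalArgumentException("id is the only required feature");
        }
        this.id = id;
        this.label = label;
        this.mimeType = mimeType;
        this.encoding = encoding;
        this.parents = parents == null ? List.of() : List.copyOf(parents);
        this.docUrl = docUrl;
    }

    /** True if this format is, or transitively derives from, the given format id. */
    boolean isA(String ancestorId, Map<String, FileFormat> registry) {
        if (id.equals(ancestorId)) {
            return true;
        }
        for (String parentId : parents) {
            FileFormat parent = registry.get(parentId);
            // Assumption: the parents hierarchy contains no cycles.
            if (parent != null && parent.isA(ancestorId, registry)) {
                return true;
            }
        }
        return false;
    }
}
```

With such a hierarchy, a tool that accepts a generic text format could also accept any format derived from it.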
87. Embedding language capability in UIMA component metadata
Important: The abstract requirement is now covered by requirement 75.
This requirement can be fulfilled with the implementation of requirement 75. However, it is common for components to take a model as an input parameter (for instance in the case of DKPro Core), so that the languages supported by the component are determined by the available models. In such cases language information cannot be encoded in the component metadata. One option is to implement functionality in the registry that looks up the available models for each component and updates the component’s language capabilities based on the models’ languages. This function can be triggered every time a new model is added to the registry.
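The registry-side option described above could look roughly like the following sketch, in which language capabilities are derived from the models registered for a component. LanguageCapabilityIndex, onModelAdded and languagesOf are hypothetical names; the actual registry API is not specified in this document.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical registry helper deriving component languages from registered models. */
final class LanguageCapabilityIndex {
    // component id -> languages of the models known to work with that component
    private final Map<String, SortedSet<String>> languagesByComponent = new HashMap<>();

    /** Intended to be triggered every time a new model is added to the registry. */
    void onModelAdded(String componentId, String modelLanguage) {
        languagesByComponent
                .computeIfAbsent(componentId, key -> new TreeSet<>())
                .add(modelLanguage);
    }

    /** Returns the languages currently derivable for a component (possibly empty). */
    SortedSet<String> languagesOf(String componentId) {
        SortedSet<String> languages = languagesByComponent.get(componentId);
        return languages == null
                ? Collections.emptySortedSet()
                : Collections.unmodifiableSortedSet(languages);
    }
}
```

Because the set is recomputed incrementally, the component description never has to list languages that depend on models created after the component was released.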
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
ARGO | 0.5 | Partial | Argo components can embed language metadata in their UIMA XML descriptors; however, most components do not declare this information. | Draft |
DKPro Core | 1.8.0 | Partial | Most DKPro Core components use language-specific models. It is impossible to know the full set of models for a component, as new models can be created by users at any time. For those components where the language support is hard-coded (i.e. not determined by the model used), the component usually bears language metadata in the form of a uimaFIT LanguageCapability annotation, which translates into a language capability declaration in the component’s UIMA descriptor. | Draft |
ILSP | 1.2.1 | Unknown | Unknown | Draft |
88. Embedding output format in UIMA component metadata
This requirement can be fulfilled with the implementation of requirement 75. The output format of UIMA components can be defined using the @MimeTypeCapability annotation.
Note: This can be mapped to the OMTD-SHARE schema element mimeType, which should take values from the IANA MIME type controlled vocabulary, and, in case this is not sufficient, to dataFormatSpecific (a free text field for specifying the name of a data format) and documentationURL (a link to a URL with more information about the format), all embedded in outputResourceInfo.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
ARGO | 0.5 | No | At present users can only deduce the output format of a UIMA component from either the name of a component or its textual description. Argo components will eventually be able to use the @MimeTypeCapability annotation when uimaFIT becomes a first-class citizen within Argo. | Draft |
DKPro Core | 1.8.0 | Full | Components define their output format using @MimeTypeCapability annotation. | Draft |
ILSP | 1.2.1 | No | Not supported | Draft |
89. Version documentation in parallel with component/resource
To keep the source material (i.e. component/resource) and its documentation in sync, it is good practice to version the documentation along with the source material, using identical versions. This means that the documentation should reside in a subdirectory inside the source directory. It is also recommended that the source directory is stored in a version control repository. Every time a release is made, the contents of the release should be identical to a specific state of the repository at a point in time. This guarantees that each release of the source material contains its corresponding documentation and that they are always kept in sync.
For software (i.e. components) the documentation should be stored in a module dedicated to this purpose (e.g. for softwareX there could be a softwareX-documentation module).
In some cases a certain version of the documentation needs to be updated, for instance to fix errors or provide more examples (not typical for software). In this scenario, it is recommended to publish the updated documentation with the same version identifier but with an updated revision number. For example, if the first documentation is released with version v2.0 rev.1, then the updated documentation should be published with version v2.0 rev.2. As an alternative, the update date of the document can be attached instead of the revision identifier.
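The revision scheme from the example can be sketched as follows. The label layout "<version> rev.<n>" is taken from the example above; the class DocRevision and its method are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for labels such as "v2.0 rev.1". */
final class DocRevision {
    // Group 1 is the version shared with the source material, group 2 the revision number.
    private static final Pattern LABEL = Pattern.compile("(\\S+) rev\\.(\\d+)");

    /** Keeps the version part and increments the revision number by one. */
    static String nextRevision(String label) {
        Matcher matcher = LABEL.matcher(label);
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Unexpected label: " + label);
        }
        int next = Integer.parseInt(matcher.group(2)) + 1;
        return matcher.group(1) + " rev." + next;
    }
}
```

The version part stays untouched, so the documentation remains attributable to the same release of the source material.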
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
TheSoz | Jun-16 | No | Documentation is not versioned (http://lod.gesis.org/thesoz/de/help.html) | Final |
Agrovoc | 21/01/2016 | No | Documentation (such as FAQs and service descriptions) is not versioned, see http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus | Final |
OLiA | Jun-16 | No | No documentation | Final |
LAPPS | Jun-16 | No | The LAPPS Web Service Exchange Vocabulary resides in a version control system, but, unlike the Groovy-based API, it does not have a release cycle. Rather, the vocabulary is updated and extended without explicit versioning. It is not clear which version of the API is compatible with which state of the vocabulary. | Final |
Licences | Jun-16 | Unknown | Version may be provided, yet documentation may use reference to "any later version" (see also 50. Documentation references should be versioned) | Final |
Alvis | 0.5rc | No | No version | Final |
ARGO | 0.5 | No | No versioning of the documentation is performed or enforced by Argo; however, this is something that could be adopted. | Draft |
DKPro Core | 1.8.0 | Full | Documentation is versioned in parallel with the source code | Final |
GATE | 8.2 | No | Possible for all GATE components to be written this way, although we don’t currently enforce it in any way | Final |
ILSP | 1.2.1 | No | No version | Draft |
90. Components must be assigned at least one category from the OMTD-SHARE controlled vocabulary for component types
The OMTD-SHARE metadata schema includes the element componentType which can be used for classifying components with values taken from a controlled vocabulary created in the OpenMinTeD project (implemented in SKOS: pending link). Each component must be assigned at least one value that best describes its function/task; the assignment of multiple values is possible, allowing for multiple classification. The current set of values focuses on terms of functions/tasks used by NLP and TDM experts, but it will be extended to include terms for non-expert users, catering for the description of integrated complex tasks.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis | 0.5rc | No | Not supported yet, will be done manually for each component. | Draft |
ARGO | 0.5 | No | This requirement could be fulfilled through a change to the specification of the Argo Descriptor file. | Draft |
DKPro Core | 1.8.0 | No | DKPro Core uses naming conventions for components and DKPro Meta implements a set of rules to derive component types from these naming conventions. However, the component types used by DKPro Meta (and consequently the DKPro Core reference documentation) do not fully correspond to the OMTD-SHARE component types. Also, the component type is not explicitly included with the respective component metadata. It can only be found in the reference documentation. | Draft |
GATE | 8.2 | No | There are some naming conventions used with the GATE plugins developed in Sheffield that could be mapped to these categories but in general we don’t currently support this requirement in any way. | Draft |
ILSP | 1.2.1 | No | Components are not assigned categories yet. | Draft |
91. Encoding citable publications (for scholarly attribution) in resource metadata records
It is customary for resource creators/owners to ask for scholarly attribution through the citation of publications; these are usually publications that describe the resource, its functionalities, design principles etc. The OMTD-SHARE metadata schema includes a specific metadata element for this, namely 'mustBeCitedWith', which is recommended and can be filled in with
- the identifier of a publication (‘publicationIdentifier’) or
- a free text field (‘documentUnstructured’), which can be used for recording a whole bibliographic record.
The preferred option is adding the identifier of the publication (with the appropriate attributes).
Name | Element/Attribute | Type | Strength | Repeatability | Description |
---|---|---|---|---|---|
publicationIdentifier | element | xs:string | mandatory (choice with documentUnstructured) | yes | Reference to a PID, DOI or any kind of identifier used for the publication; must obligatorily be used with the attribute publicationIdentifierSchemeName |
publicationIdentifierSchemeName | attribute | open controlled vocabulary | mandatory | no | Reference to the publication scheme that has been used for the identifier (e.g. DOI, handle, arXiv, bibcode etc.); the list of values includes the most widespread schemes used for publications |
schemeURI | attribute | URI | mandatory when applicable | no | Reference to a URI of the resource scheme that has been used for the identifier, providing more information about the resource scheme |
documentUnstructured | element | xs:string | mandatory (choice with publicationIdentifier) | no | Used either as a free text description or as an unstructured (free form) presentation of a bibliographic record; alternative to the structured presentation of a document |
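The choice between the two elements can be made explicit in code. The sketch below is a simplified, hypothetical model of mustBeCitedWith: it holds either one identifier with its scheme name or one free text record, whereas the schema allows publicationIdentifier to be repeated.

```java
/** Hypothetical, simplified model of the mustBeCitedWith choice described above. */
final class MustBeCitedWith {
    final String publicationIdentifier;           // e.g. a DOI; null for free text records
    final String publicationIdentifierSchemeName; // mandatory alongside the identifier
    final String documentUnstructured;            // free text bibliographic record; null otherwise

    private MustBeCitedWith(String identifier, String schemeName, String freeText) {
        this.publicationIdentifier = identifier;
        this.publicationIdentifierSchemeName = schemeName;
        this.documentUnstructured = freeText;
    }

    /** Preferred option: an identifier together with the name of its scheme. */
    static MustBeCitedWith byIdentifier(String identifier, String schemeName) {
        if (identifier == null || identifier.isBlank()
                || schemeName == null || schemeName.isBlank()) {
            throw new IllegalArgumentException("identifier and scheme name are both required");
        }
        return new MustBeCitedWith(identifier, schemeName, null);
    }

    /** Alternative: a whole bibliographic record as free text. */
    static MustBeCitedWith byFreeText(String bibliographicRecord) {
        if (bibliographicRecord == null || bibliographicRecord.isBlank()) {
            throw new IllegalArgumentException("the bibliographic record must not be empty");
        }
        return new MustBeCitedWith(null, null, bibliographicRecord);
    }

    boolean isStructured() {
        return publicationIdentifier != null;
    }
}
```

The two factory methods make it impossible to create a record that has neither form, or an identifier without its scheme name.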
Further specifications according to this requirement:
- For components, in accordance with 2. Component metadata have to be embedded into the component source code, this information should be provided in the component metadata. The recommended way to do this is by adding Java annotations in the GATE and UIMA descriptors specific to this element. We recommend the usage of the @citation element. Within this element the author should place one or more bibliographic records, preferably in BibTeX format.
Note: For clarification, this relation is different from the ‘datacite:relationType’ isCitedBy, which records the actual citation of the resource in various publications; in fact, it is closer to isDocumentedIn and dc:bibliographicCitation, but this relation type doesn’t highlight the preferential nature of the citable publication.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis | 0.5rc | No | Not supported for the moment. | Draft |
ARGO | 0.5 | No | No plans to support this. | Draft |
DKPro Core | 1.8.0 | No | Occasionally, pointers to relevant papers are included in the documentation or metadata of components, but this information is only indicative. The DKPro Core website explicitly asks users to look up relevant papers for the components they use or, more specifically, the tools they wrap and the models that are used with them. | Draft |
GATE | 8.2 | No | There may be citations within the documentation but nothing explicit. If there is a Java annotation that could be used then we’d have no issue with adding this, where appropriate, to existing components and recommending it for future component development | Draft |
ILSP | 1.2.1 | No | The descriptors of certain components may include links to papers, but not in specific resource metadata records | Draft |
TheSoz | June-16 | No | No citing information encoded inside metadata | Final |
Agrovoc | 21/01/2016 | No | Publications are available at http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus but are not cited in metadata | Draft |
OLiA | June-16 | No | References are available at http://nachhalt.sfb632.uni-potsdam.de/owl/ but are not encoded in metadata | Draft |
LAPPS | June-16 | No | The LAPPS Web Service Exchange Vocabulary does not define any metadata elements related to attribution, and no reference is provided on the download site. | Draft |
92. Including license text in resource packages
The license text of a resource should be included with the packaged resource. The exact way this is done depends on the type of packaging.
In ZIP files, the license should be located at the root of the archive in a plain text file called LICENSE. The file may optionally have the suffix .txt. If multiple licenses apply, all of them should be concatenated in that file.
In Java JAR files, the license should be located in a plain text file META-INF/LICENSE. The file may optionally have the suffix .txt. If multiple licenses apply, all of them should be concatenated in that file.
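The two packaging rules can be checked mechanically. The following sketch looks for the license file at the location expected for a ZIP or a JAR; the class LicenseLocator is hypothetical, and the archive type is assumed to be recognisable from the .jar file extension.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Locale;
import java.util.zip.ZipFile;

/** Hypothetical checker for the license locations described above. */
final class LicenseLocator {

    /** Returns the name of the license entry, or null if it is not at the expected location. */
    static String find(Path archive) throws IOException {
        // Assumption: JAR files are recognised by their file extension.
        String fileName = archive.getFileName().toString().toLowerCase(Locale.ROOT);
        String base = fileName.endsWith(".jar") ? "META-INF/LICENSE" : "LICENSE";
        try (ZipFile zip = new ZipFile(archive.toFile())) {
            // The .txt suffix is optional, so both spellings are accepted.
            for (String candidate : new String[] { base, base + ".txt" }) {
                if (zip.getEntry(candidate) != null) {
                    return candidate;
                }
            }
        }
        return null;
    }
}
```

A registry could run such a check at submission time and report resources whose license text is missing.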
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis | 0.5rc | Partial | Exists for the corpus | Draft |
ARGO | 0.5 | Partial | Some existing Argo components contain a LICENSE.txt file; however, this is not always the case and these license files don’t get included with the packaged components. | Draft |
DKPro Core | 1.8.0 | Partial | JAR files for DKPro Core components contain a LICENSE.txt file. JAR files for models so far usually do not contain a LICENSE.txt file. | Draft |
GATE | 8.2 | No | Some plugins do this, but in general it’s not the case. The next version of GATE will use Maven for plugin build/distribution, at which time the license will be listed in the pom.xml. We don’t currently envisage adding the full license text into the META-INF folder, mostly as users will never have access to it, but information from the pom.xml may well be exposed. | Draft |
ILSP | 1.2.1 | No | License text of resources is not included. | Draft |
93. Provide identifiers for knowledge resource elements
Concreteness: concrete
Strength: recommended
Category: WG2
Status: draft
When making use of multiple knowledge resources, TDM workflows may need to identify individual knowledge resource elements and knowledge resource schema elements. For example,
- A workflow may need to access information about an element.
- Multiple resources may be harmonised, requiring the relationships between schema elements to be specified.
For this identification purpose, knowledge resource schema elements should be identified with a URI. For Linked Data resources, the following identifiers should be used:
- JSON-LD - the @id keyword
- RDF/XML - the attributes xml:base, rdf:ID and rdf:about
- XML - the xml:id attribute
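For RDF/XML, the URI of an element results from combining xml:base with rdf:ID or rdf:about. The sketch below shows this resolution with java.net.URI; the class RdfXmlIdentifiers and the example.org URIs are illustrative assumptions, not part of any resource named here.

```java
import java.net.URI;

/** Hypothetical helper showing how RDF/XML attributes yield element URIs. */
final class RdfXmlIdentifiers {

    /** rdf:ID="x" identifies the resource at the base URI with fragment x. */
    static URI fromRdfId(URI xmlBase, String rdfId) {
        return xmlBase.resolve("#" + rdfId);
    }

    /** rdf:about may hold a relative reference, which is resolved against xml:base. */
    static URI fromRdfAbout(URI xmlBase, String about) {
        return xmlBase.resolve(about);
    }

    public static void main(String[] args) {
        URI base = URI.create("http://example.org/vocab");
        // prints http://example.org/vocab#Noun
        System.out.println(fromRdfId(base, "Noun"));
    }
}
```

An absolute URI in rdf:about is returned unchanged, so the same call covers both relative and absolute references.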
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
TheSoz | June 2016 | Full | knowledge resource elements are identified via | Final |
Agrovoc | 21/01/2016 | Full | The usual RDF identifiers are used, see http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus | Final |
JATS | 1.1 | Partial | No URI but JATS is a formal DTD and each element/attribute has its own id. | Final |
OLiA | June 2016 | Full | Knowledge resource elements are identified with URIs. See http://nachhalt.sfb632.uni-potsdam.de/owl/ | Final |
LAPPS | June 2016 | Full | The LAPPS Web Service Exchange Vocabulary is available in JSON-LD and RDF/XML and uses the respective mechanisms for the identification of elements via a URI. | Final |
Ontolex | June 2016 | Full | See https://www.w3.org/community/ontolex/wiki/Final_Model_Specification | Final |
CLARIN CCR | June 2016 | Full | Each concept is assigned a handle URI | Final |
schema.org | June 2016 | Full | | Final |
94. Data Category Linking Vocabulary
Concreteness: concrete
Strength: recommended
Category: WG2
Status: draft
Knowledge resource authors should provide linkage between their own resource, other resources in their domain, and generic resources such as citation standards. This enables TDM workflows that make use of multiple resources to resolve relationships between those resources.
Where linking between knowledge resources is modeled by a Linked Data resource, mapping should be expressed through RDF statements, using relations from SKOS, together with the following OWL and RDF object properties:
- owl:sameAs
- owl:equivalentClass
- owl:equivalentProperty
- rdfs:subClassOf
- rdfs:subPropertyOf
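A mapping statement using one of these properties is an ordinary RDF triple. The following sketch serialises such statements as N-Triples lines; the enum LinkProperty is hypothetical, the property URIs are the standard OWL and RDFS namespace URIs, and the subject and object are assumed to be valid absolute IRIs that need no escaping.

```java
/** Hypothetical enumeration of the OWL and RDFS linking properties listed above. */
enum LinkProperty {
    OWL_SAME_AS("http://www.w3.org/2002/07/owl#sameAs"),
    OWL_EQUIVALENT_CLASS("http://www.w3.org/2002/07/owl#equivalentClass"),
    OWL_EQUIVALENT_PROPERTY("http://www.w3.org/2002/07/owl#equivalentProperty"),
    RDFS_SUB_CLASS_OF("http://www.w3.org/2000/01/rdf-schema#subClassOf"),
    RDFS_SUB_PROPERTY_OF("http://www.w3.org/2000/01/rdf-schema#subPropertyOf");

    final String uri;

    LinkProperty(String uri) {
        this.uri = uri;
    }

    /** Serialises one mapping statement as an N-Triples line (no IRI escaping is applied). */
    String link(String subjectUri, String objectUri) {
        return "<" + subjectUri + "> <" + uri + "> <" + objectUri + "> .";
    }
}
```

For instance, LinkProperty.RDFS_SUB_CLASS_OF.link(a, b) states that the class identified by a is a subclass of the class identified by b.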
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
TheSoz | June 2016 | Full | Interlinking to the following resources are defined in | Final |
Agrovoc | 21/01/2016 | Full | OWL and RDF properties used, see http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus | Final |
JATS | 1.1 | No | No linking with other resources | Final |
OLiA | June 2016 | Full | OLiA makes use of the relevant OWL and RDF properties, see http://purl.org/olia/olia.owl | Final |
LAPPS | June 2016 | Full | The LAPPS Web Service Exchange Vocabulary expresses relations with other resources (specifically IsoCat) using | Final |
Ontolex | June 2016 | Full | See https://www.w3.org/community/ontolex/wiki/Final_Model_Specification | Final |
CLARIN CCR | June 2016 | No | No linking with other resources | Final |
schema.org | June 2016 | Full | | Final |
95. The KR format should be in a standard format such as XML, JSON-LD or RDF/XML.
Concreteness: concrete
Strength: recommended
Category: WG2
Status: draft
Standard formats for knowledge resources enable TDM to make use of common software infrastructure for ingesting and using knowledge resources, and can ease knowledge resource interoperability. Knowledge resources should use an XML- or JSON-based syntax. For Linked Data resources, any RDF format may be used, for example JSON-LD or RDF/XML. Other formats are acceptable for resources that are not based on Linked Data.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
TheSoz | June 2016 | Full | RDF | Final |
Agrovoc | 21/01/2016 | Full | Published as RDF/XML, see http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus | Final |
JATS | 1.1 | Partial | Standard format: XSD/RELAX-NG/DTD | Final |
OLiA | June 2016 | Full | RDF/XML - see http://purl.org/olia/olia.owl | Final |
LAPPS | June 2016 | Full | The LAPPS Web Service Exchange Vocabulary is available in JSON-LD, RDF/XML and other formats from their website. The RDF/XML and JSON-LD versions can also be obtained by specifying the appropriate Accept header in the HTTP request when accessing | Final |
Ontolex | June 2016 | Full | See https://www.w3.org/community/ontolex/wiki/Final_Model_Specification | Final |
CLARIN CCR | June-2016 | Full | SKOS/RDF | Final |
schema.org | 1.2.1 | Full | RDFa - see http://schema.org/docs/datamodel.html | Final |
96. Unique identifiers and versions for components using Maven
Each GATE or UIMA component is a Java class and as such can be uniquely identified by its class name. Components are to be distributed via Maven, and versioning information will be recorded as part of the POM. Together, these two sources provide the following information:
Coordinate | Description | Source |
---|---|---|
groupId | Group ID of the Maven artifact containing the component | Maven POM |
artifactId | Artifact ID of the Maven artifact containing the component | Maven POM |
version | Version of the Maven artifact containing the component | Maven POM |
packaging | Packaging of the Maven artifact containing the component (optional) | Maven POM |
classifier | Classifier of the Maven artifact containing the component (optional) | Maven POM |
class | Name of the class implementing the component | UIMA resource specifier or GATE CREOLE Metadata (derived from classname) |
In UIMA, the identifier has the following forms depending on the presence of the optional coordinates:
uima-mvn:{groupId}:{artifactId}:{version}#{class}
uima-mvn:{groupId}:{artifactId}:{classifier}:{version}#{class}
uima-mvn:{groupId}:{artifactId}:{packaging}:{classifier}:{version}#{class}
uima-mvn:my.organization.tdm:dictionary-tools:1.0.1#my.organization.tdm.DictionaryLookup
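The three identifier forms can be produced and taken apart with a few lines of code. The sketch below follows the forms listed above; the class UimaMavenId and its methods are hypothetical, and it assumes that none of the coordinates contains ':' or '#'.

```java
/** Hypothetical helper for the uima-mvn identifier forms listed above. */
final class UimaMavenId {
    private static final String PREFIX = "uima-mvn:";

    /** Builds an identifier; packaging and classifier may be null when not used. */
    static String format(String groupId, String artifactId, String packaging,
                         String classifier, String version, String className) {
        if (packaging != null && classifier == null) {
            // The listed forms only include packaging together with a classifier.
            throw new IllegalArgumentException("packaging requires a classifier");
        }
        StringBuilder id = new StringBuilder(PREFIX).append(groupId).append(':').append(artifactId);
        if (packaging != null) {
            id.append(':').append(packaging);
        }
        if (classifier != null) {
            id.append(':').append(classifier);
        }
        return id.append(':').append(version).append('#').append(className).toString();
    }

    /** Splits an identifier into { groupId, artifactId, packaging, classifier, version, class }. */
    static String[] parse(String id) {
        int hash = id.indexOf('#');
        if (!id.startsWith(PREFIX) || hash < 0) {
            throw new IllegalArgumentException("Not a uima-mvn identifier: " + id);
        }
        String[] c = id.substring(PREFIX.length(), hash).split(":");
        String className = id.substring(hash + 1);
        switch (c.length) {
            case 3: return new String[] { c[0], c[1], null, null, c[2], className };
            case 4: return new String[] { c[0], c[1], null, c[2], c[3], className };
            case 5: return new String[] { c[0], c[1], c[2], c[3], c[4], className };
            default: throw new IllegalArgumentException("Unexpected number of coordinates: " + id);
        }
    }
}
```

Calling format with the coordinates my.organization.tdm, dictionary-tools, 1.0.1 and the class my.organization.tdm.DictionaryLookup, and with null for packaging and classifier, reproduces the example identifier above.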
Note: The version is also available from the UIMA resource specifier. It must be in sync with the version specified in the Maven POM. The uimaFIT Maven plugin goal generate ensures this by determining the version in the generated UIMA resource specifier from the version specified in the Maven POM. The overrideComponentVersion and componentVersion settings in the uimaFIT Maven plugin must not be used. Neither must the uimaFIT @ResourceMetaData annotation be used to manually provide a version.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis | 0.5rc | Partial | Alvis is provided as Maven artifacts. Decomposing each component into a Maven sub-artifact is not done. | Draft |
ARGO | 0.5 | Full | All Argo components are available as Maven artifacts; however, the Argo platform itself only supports UIMA PEAR files, which are produced from these Maven artifacts. | Draft |
DKPro Core | 1.8.0 | Full | UIMA components are provided as Maven artifacts. | Draft |
ILSP | 1.2.1 | Partial | Most components are developed as Maven artifacts but are not distributed via any publicly available Maven repository | Draft |
GATE | 8.2 | Partial | Components are not yet distributed by Maven and as such versioning info is not available | Draft |
97. Declaring scaleout capability in UIMA
Concreteness: concrete
Strength: mandatory
Category: WG4
Status: draft
Whether or not a component may be replicated/scaled up within a workflow by the workflow engine can be declared in UIMA using the multipleDeploymentAllowed flag. If this flag is set to true, then the workflow engine may create multiple instances of the component and run them in parallel, with the implicit effect that each instance may only see a subset of the documents to be processed. Components that write/export data should set this flag to false.
The flag can be set using a uimaFIT annotation on the component class.
multipleDeploymentAllowed using uimaFIT:
@OperationalProperties(multipleDeploymentAllowed=false)
class MyComponentClass
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
ARGO | 0.5 | Full | UIMA components can be deployed multiple times within a workflow execution if this flag is set and the component is not an instance of CollectionReader, CasConsumer, WebServiceReader, WebServiceConsumer or InteractiveAnalysisEngine | Draft |
DKPro Core | 1.8.0 | Full | Flag is (un)set depending on the type of component. Processing components can usually be scaled while writers cannot. | Draft |
ILSP | 1.2.1 | Full | UIMA-based components have the | Unknown |
98. Publishing components via software repositories (Maven, Docker)
Concreteness: concrete
Strength: mandatory
Category: WG4
Status: draft
In order to make components downloadable, they must be published either via a Maven repository or via a Docker repository. The respective repositories used should be Maven Central and DockerHub as these are by default supported by related tooling.
The process of publishing software via Maven Central is detailed in the OSSRH Guide.
The process of publishing software via DockerHub is detailed here.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis | 0.5rc | Partial | The Alvis engine is published via a Maven repository; Dockerizing Alvis is in progress. Note that, in Alvis, an executable component is the pair <Alvis engine + the component>. | Draft |
ARGO | 0.5 | No | Components are installed into an Argo instance using a script with UIMA PEAR files; Argo does not directly interact with a Maven/Docker repository. | Draft |
DKPro Core | 1.8.0 | Full | DKPro Core is fully available from Maven Central. | Draft |
ILSP | 1.2.1 | No | Components are deployed as UIMA-AS services. They are not available from Maven/Docker repositories. | Draft |
GATE | 8.5-SNAPSHOT | Full | Historically, plugins could not be hosted in Maven due to the way the GATE plugin mechanism works. This has changed with the latest development release, and now both GATE itself and the plugins can be stored in Maven. | Draft |
99. Encoding in the metadata a direct access link for content resources
Concreteness: concrete
Strength: mandatory
Category: WG1
Status: draft
The contents of a resource that is going to be processed for TDM purposes (scholarly publications in OpenMinTeD) must be directly accessible by components in a single step. In order to avoid confusion and misconceptions, the access location of the resource must be provided in a metadata element conceived for this purpose and not through the identifier(s).
The following recommendations, therefore, are in order:
- aggregators (OpenAIRE and CORE) and other content resource providers that register their resources directly to the OpenMinTeD registry must use the element 'downloadURL' for encoding only direct URLs, i.e. URLs leading to the file(s) of the resource, and not for landing pages
- publishers, journals, literature repositories, archives, etc. that wish to provide access to their publications for TDM purposes can do so through OpenAIRE and CORE, following their respective guidelines.
Note: The element 'downloadURL' can be repeated for multiple exact copies of the same publication (e.g. PDF versions of the publication distributed at various websites). For different variants of the same publication, e.g. in PDF and HTML format, the whole 'documentDistributionInfo' module can be repeated, with the relevant information encoded in the 'downloadURL' and 'mimetype' elements.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE | | Partial | The access link can be crafted using the internal identifiers (same as requirement 4), so content resources (in CORE’s case the referenced PDF) should be accessible from CORE’s API: /articles/get/{coreId}/download/pdf | Draft |
OpenAIRE | | Partial | Provision of the access link in the metadata is supported in the OpenAIRE record schema (result/fulltext). | Draft |
Frontiers | | Full | Frontiers contents are discoverable by attached images (single step) or links (multiple steps). Discoverability through OpenAIRE will be covered in this project. | Draft |
100. Providing access to content resources (sharing/exposing and transferring)
Concreteness: concrete
Strength: mandatory
Category: WG1
Status: draft
The contents of a resource that is going to be processed for TDM purposes (scholarly publications in OpenMinTeD) must be directly accessible by components in a single step. Publications are registered to OpenMinTeD via OpenAIRE and CORE.
- End-users … (user corpus building process via RESTful APIs)
- OpenAIRE and CORE: harvesting process (OAI-PMH)
- content connector from publishers: ResourceSync
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE | | Full | Access is provided via the CORE API (recommended): https://core.ac.uk/api-v2/articles/get/{coreid}/download/pdf?apiKey=20hIsS1F5j4D2C2iXrg4Wxf7VTp4Xt1j or by manually crafting the download URL: http://dev.core.ac.uk/download/pdf/{coreid}.pdf (supported but not recommended) | Draft |
OpenAIRE | | No | Direct access to the content of a resource is currently restricted. | Draft |
Frontiers | | Partial | Full compliance by May. Today, Frontiers resources are available in repositories such as PubMed and CrossRef. Deposit to DOAJ is in progress for completion by May. Frontiers’ own repository is being implemented for completion by May. This means Frontiers resources are publicly available provided credentials are validated. Direct access links to content resources via OpenAIRE will be covered in this project. | Draft |
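The recommended CORE access path in the table above is a URL template with the CORE identifier and an API key filled in. A minimal sketch of building such a link follows; the class CoreDownloadLink is hypothetical, and it assumes that CORE identifiers are numeric and that the API key is supplied by the caller.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical builder for the CORE API download link quoted in the table above. */
final class CoreDownloadLink {

    /** Fills the URL template with a CORE identifier (assumed numeric) and an API key. */
    static URI viaApi(long coreId, String apiKey) {
        return URI.create("https://core.ac.uk/api-v2/articles/get/" + coreId
                + "/download/pdf?apiKey=" + URLEncoder.encode(apiKey, StandardCharsets.UTF_8));
    }
}
```

The resulting URI is a direct link to the PDF, which is the kind of single-step access the requirement asks for.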
101. Making models and annotation resources accessible as entities distinct from the components they are compatible with
To ensure that this requirement is fully specified, it is recommended that models and resources that can be used for annotating content resources (e.g. tagsets, lexica, ontologies etc.) are made accessible and described as separate entities from the components they can be deployed with. This means that for models and other resources embedded in Maven artifacts together with components, they should be packaged as separate entities but can still be stored in Maven, where they will be collected from. Further to this, a set of metadata elements that can be used for the automatic identification of compatible ancillary resources and s/w components must be filled in their descriptions; these include mainly:
-
'language' (for all resources)
-
'variant' (for models)
-
format, i.e. 'mimetype' and, optionally, 'dataFormatSpecific' & 'documentationURL' (for all resources)
-
'relationType' with value isCompatibleWith: a relation that can be used to encode compatible resources, e.g. a model that can potentially be used with two or more taggers, or a tagger that can use all models in a specific format; it is encoded as an external relation linking two resources, through the resource identifier (preferably) or the resource name.
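As an illustration of how these elements could drive automatic matching, the following minimal sketch selects the models that are compatible with a component. The `Resource` class, the identifiers, the mimetype values and the matching rule (same language, plus either an explicit isCompatibleWith relation or a shared mimetype) are assumptions made for this example; they are not part of the OMTD-SHARE schema.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical, simplified metadata record (not the OMTD-SHARE schema)."""
    identifier: str
    language: str
    mimetype: str
    variant: str = ""
    # Identifiers of resources linked through relationType=isCompatibleWith.
    is_compatible_with: set = field(default_factory=set)


def compatible_models(component, models):
    """Return the models in the component's language that are either
    explicitly related to it or distributed in the format it declares."""
    matches = []
    for model in models:
        if model.language != component.language:
            continue
        explicit = (component.identifier in model.is_compatible_with
                    or model.identifier in component.is_compatible_with)
        if explicit or model.mimetype == component.mimetype:
            matches.append(model)
    return matches


# Illustrative records; all values are placeholders.
tagger = Resource("tagger-a", "en", "application/x-example-model")
models = [
    Resource("model-en-news", "en", "application/x-example-model", "news"),
    Resource("model-de-news", "de", "application/x-example-model", "news"),
    Resource("model-en-bio", "en", "application/octet-stream", "bio",
             {"tagger-a"}),
]
print([m.identifier for m in compatible_models(tagger, models)])
```

In this sketch the German model is filtered out by language, while the second English model is kept only because of its explicit relation to the tagger.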
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis | | Partial | Possible but not done yet | Draft |
ARGO | 0.5 | No | Argo doesn't currently have a concept of resources, other than components or type systems. Models are either bundled into components (which are distributed as UIMA PEAR files) or live within a user's Argo filestore and are passed to a component through configuration parameters, but there is no standard for this. | Draft |
DKPro Core | 1.8.0 | Partial | Models are shipped as separate Maven artifacts (there may be very few exceptions). Compatibility between components and models is recorded in the POM of the component in DKPro Core via the Maven Managed Dependencies mechanism. Mimetype and dataFormatSpecific are not recorded. Some models include links to the relevant websites. This is planned to be extended in future. | Draft |
GATE | 8.2 | No | Currently, (default) resources are stored in the same artifact as the components. Also, while GATE 9 can load components from Maven artifacts, artifacts that are not CREOLE plugins cannot be loaded, and models/resources are not treated as plugins. A workflow system that uses the GATE API could, however, pull in resources via Maven which are not GATE plugins. | Draft |
ILSP | 1.1.2 | No | Not supported | Draft |
TheSoz | June-16 | Partial | Language attributes are stored per concept and term, not on the whole resource. Vocabulary and schema of the resource are declared in the | Final |
Agrovoc | 21/01/2016 | Full | Agrovoc ships as a resource independent of any processing component; see http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus | Draft |
OLiA | June-16 | No | Not shipped as a Maven artefact | Draft |
LAPPS | June-16 | No | The LAPPS Web Service Exchange Vocabulary does not include any tagset information. The description of models, tagsets, or other resources is apparently beyond its scope. For more details, refer to this issue. | Draft |
102. Adding version information in the metadata descriptions of all resources
Although versioning is considered important for all resource types, the practices used for its encoding differ across them. Therefore, in OpenMinTeD the general rule is that the version of the resource must be clearly indicated in its metadata description. This is done with the element 'version' (of type xs:string) and, if the providers wish, they can also provide more information about the types of changes between the versions in the element 'revision' and the date of the last update ('lastDateUpdated').
However, the values that the 'version' element takes can differ depending on the source and resource type; given that the OMTD-SHARE element 'version' is of type xs:string, this can be accommodated without problems in the mapping process:
-
For publications: following the OpenAIRE guidelines (new version to be released), the values of the OpenAIRE element 'version' must be taken from a controlled vocabulary (accepted, published, draft, submitted, updated)
-
For all other resources, the most popular practice is to provide a number (e.g. 1.0, 3.0.1 etc.) and this is endorsed by OpenMinTeD; semantic versioning principles are recommended where possible for its assignment; the element, though, where this information is encoded in existing metadata descriptions may differ:
-
For components collected from Maven repositories: according to 96. Unique identifiers and versions for components using Maven, this information must be added in the 'version' element in the pom.xml file
-
For resources described with metadata in RDF, the 'owl:versionInfo' is the equivalent to the OMTD-SHARE 'version' element
-
If there is no information on the metadata record of a resource harvested into OMTD, the version element is to be filled in with the creation date of the resource and, if this is again missing, the date of the registration of the resource in OpenMinTeD.
Note: In the next version of the schema, the "version" element will be restricted to the numeric type of encoding versions, following the semantic versioning principles, and another element will be added that will be of type "controlled vocabulary" that will take values such as accepted, published, submitted etc. for publications, but also alpha, beta, testing etc. for s/w resources.
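The fallback rule for harvested records without version information can be sketched as follows. This is only an illustration: the dictionary keys `version` and `creationDate` and the function name are hypothetical stand-ins for whatever the harvested metadata record actually provides.

```python
def resolve_version(record, registration_date):
    """Pick the value for the 'version' element of a harvested record.

    `record` is a plain dict standing in for a harvested metadata record;
    the keys 'version' and 'creationDate' are illustrative names only.
    """
    for key in ("version", "creationDate"):
        value = (record.get(key) or "").strip()
        if value:
            return value
    # Last resort: the date the resource was registered in OpenMinTeD.
    return registration_date


print(resolve_version({"version": "3.0.1"}, "2017-02-01"))
print(resolve_version({"creationDate": "2016-01-21"}, "2017-02-01"))
print(resolve_version({}, "2017-02-01"))
```

Because the 'version' element is of type xs:string, both version numbers and dates can be stored without conversion.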
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE | | Partial | Versions exist for the components | Draft |
OpenAIRE | | No | Version information is currently not evaluated | Draft |
Frontiers | | No | Available Q3 2017 | Final |
Alvis | 0.5rc | Partial | Versions exist for the components. In general, version for resources has to be adapted to follow the semantic versioning principles. | Draft |
ARGO | 0.5 | Partial | Type Systems and Components have versioning information, however other resources do not. | Draft |
DKPro Core | 1.8.0 | Full | Version information is contained in the pom file | Draft |
GATE | 9 | Full | Components distributed via maven include version information within the pom.xml file. | Draft |
ILSP | 1.2.1 | No | Not supported yet | Draft |
TheSoz | June-16 | Full | | Final |
Agrovoc | 21/01/2016 | Partial | modification date (dcterms:modified) is available | Final |
OLiA | June-16 | Partial | no version id, but a textual description of the resource's evolution. | Final |
LAPPS | June-16 | No | The LAPPS Web Service Exchange Vocabulary resides in a version control system, but it does not have a release cycle, unlike the Groovy-based API. Rather, the vocabulary is updated and extended without explicit versioning. | Final |
103. Specifying access mode of resources and encoding it in the metadata descriptions
All content resources to be mined for OpenMinTeD purposes as well as ancillary resources (tagsets, lexica, models etc.) must be downloadable and accessible in a single step (cf. 4. URL to actual content must be discoverable, 99. Encoding in the metadata a direct access link for content resources). The element 'distributionMedium', with values downloadable, accessibleThroughInterface, etc. (cf. link to the controlled vocabulary to be provided), is reserved for the access mode.
Components can be accessed as web services and/or as downloadable tools, and this must be indicated in the 'componentDistributionMedium' element with values webService, sourceCode, executableCode or sourceAndExecutableCode; if a component is available in more than one mode, it is recommended that each mode is described as a distinct distribution (i.e. with one 'distributionInfo' element per mode), which allows better encoding of all relevant details (e.g. potentially different licensing terms, different access locations etc.). See also 98. Publishing components via software repositories (Maven, Docker) for making components downloadable.
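A component offered in two access modes could be described along the following lines. This is a minimal sketch: the element names 'distributionInfo' and 'componentDistributionMedium' come from the text above, whereas the wrapper element `component`, the child names `accessLocation` and `licence`, and all URLs and licence labels are placeholders assumed for the example rather than taken from the OMTD-SHARE schema.

```python
import xml.etree.ElementTree as ET


def distribution(medium, location, licence):
    """Build one 'distributionInfo' element for a single access mode."""
    info = ET.Element("distributionInfo")
    ET.SubElement(info, "componentDistributionMedium").text = medium
    # 'accessLocation' and 'licence' are placeholder child names.
    ET.SubElement(info, "accessLocation").text = location
    ET.SubElement(info, "licence").text = licence
    return info


# One distribution per access mode, each with its own location and terms.
component = ET.Element("component")
component.append(
    distribution("webService", "https://example.org/api/tagger", "licence-A"))
component.append(
    distribution("executableCode", "https://example.org/tagger.jar", "licence-B"))

print(ET.tostring(component, encoding="unicode"))
```

Keeping the two distributions separate means that differing licensing terms or access locations never have to share a single element.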
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE | | Full | Resources are always available through the recommended interface (CORE API) or via direct URL (see req. 100). Therefore accessMode is at the same time downloadable (if direct URL is used) and accessibleThroughInterface (if API is used) | Draft |
OpenAIRE | | Partial | Access and download of content resources is in principle available through the OpenAIRE API, but currently restricted. Alternatively, the links to the original URLs to access and download content are provided in the metadata. | Draft |
Frontiers | | Partial | Encoding partially available; Full deployment by Q3 2017 | Final |
Alvis | 0.5rc | Full | The Alvis engine, its components and the available resources are downloadable from repositories and web sites. | Draft |
ARGO | 0.5 | Partial | Source code for our open source components is available on GitHub. | Draft |
DKPro Core | 1.8.0 | Full | DKPro Core components are Maven modules which can always be accessed by downloading from Maven repositories. | Draft |
GATE | 8.2 | Full | All GATE plugins are either loaded via a single URL which points to a directory containing creole.xml (this tells us what other files we might need) or, in the current dev version, via Maven coordinates. | Draft |
ILSP | 1.2.1 | Full | ILSP components are available as web services | Draft |
TheSoz | June-16 | Full | SPARQL endpoint and URI lookup endpoint in | Final |
Agrovoc | 21/01/2016 | Full | void:sparqlEndpoint, void:dataDump | Final |
OLiA | June-16 | Full | Downloadable URLs are given in rdf:about and xml:base attributes | Final |
LAPPS | June-16 | Full | The LAPPS Web Service Exchange Vocabulary can be downloaded in various formats from the webpage. Components that are offered through LAPPSGrid are offered as web services. | Final |
104. Encoding funding information in the metadata descriptions of all resources
Funding information in metadata descriptions of all resources is optional; we recommend adopting the DataCite recommendations for all resources, which are also endorsed by OpenAIRE:
-
'fundingReference': 'funderName', and optional 'funderIdentifier' (with attribute 'type': ISNI, GRID, Crossref funder Id, other), 'awardNumber' (with attribute 'awardURI') and 'awardTitle'.
Note: In the current version of OMTD-SHARE (v1.1), funding information is encoded in the 'fundingProject' element, which can be filled in with a 'projectIdentifier' or a 'projectName'. Projects can also be uploaded to the OpenMinTeD registry with a set of specific metadata elements which include information on the funder(s), types of awards etc. The next version will be aligned with this requirement.
To fulfill this requirement, further actions are in order:
-
addition in the component metadata of the required information; this can be done through the following recommendations:
-
Usage of the following Java annotations (in line with OpenAIRE's standards)
-
@funderName
-
@funderIdentifier (optional)
-
@funderIdentifierType (mandatory if funderIdentifier is present)
-
@awardNumber
-
@awardURI (optional)
-
@awardTitle
-
resources coming from the Linked Data community:
-
the 'foaf:fundedBy' property can be used to record the funding organization; if possible, adding an identifier (e.g. ISNI) for the organization would be an asset.
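The dependency between the funding fields can be made explicit with a small validation sketch. The class below is purely illustrative: the field names mirror the elements listed above, the set of identifier types is copied from the 'type' attribute values, and treating 'funderName' as the only required field is this example's reading of the recommendation, not a statement about the schema.

```python
from dataclasses import dataclass
from typing import Optional

# Values listed above for the 'type' attribute of 'funderIdentifier'.
ALLOWED_ID_TYPES = {"ISNI", "GRID", "Crossref funder Id", "other"}


@dataclass
class FundingReference:
    """Hypothetical container mirroring the 'fundingReference' elements."""
    funder_name: str
    award_number: Optional[str] = None
    award_title: Optional[str] = None
    award_uri: Optional[str] = None
    funder_identifier: Optional[str] = None
    funder_identifier_type: Optional[str] = None

    def __post_init__(self):
        if not self.funder_name:
            raise ValueError("funderName is required")
        # The identifier type becomes mandatory once an identifier is given.
        if (self.funder_identifier
                and self.funder_identifier_type not in ALLOWED_ID_TYPES):
            raise ValueError(
                "funderIdentifier requires a type from ALLOWED_ID_TYPES")


# Placeholder values, for illustration only.
ref = FundingReference(
    "Example Funder",
    award_number="000000",
    funder_identifier="0000 0000 0000 0000",
    funder_identifier_type="ISNI",
)
print(ref)
```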
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE | | No | CORE holds no funding information about the shared resources. As part of future developments this shall be retrieved from aggregated metadata or inferred (mined) from fulltext (see OpenMinTeD Scholarly Communication use case) | Draft |
OpenAIRE | | Full | Funder and project information is provided in the metadata exposed via the OpenAIRE API. | Draft |
Frontiers | | No | Available Q3 2017 | Final |
Alvis | 0.5rc | No | The information does not exist or is not formalized. | Draft |
ARGO | 0.5 | No | Argo components don't contain any funding information and there are no plans to support this. | Draft |
DKPro Core | 1.8.0 | No | DKPro Core is now a community effort based on volunteer contributions. There is no dedicated funding. So at the level of DKPro Core itself, this requirement does not apply. Regarding the third-party tools that we wrap, we do not keep track of third-party funding information. We are not aware of third-parties providing such information in an automatically processable way. If it were possible with reasonable effort to extract funding from the third-party components, we could consider surfacing this information e.g. in our component overview documentation. | Final |
GATE | 8.2 | No | We don't currently track this information and have no plans to do so in any systematic way, although documentation for specific components may give funding information. | Draft |
ILSP | 1.2.1 | No | This information is not systematically added to components | Draft |
TheSoz | June-16 | No | Only resource publisher is declared | Final |
Agrovoc | 21/01/2016 | No | only its creator (FAO) | Final |
OLiA | June-16 | No | Some ontologies reference the project in which the resource was created. | Final |
LAPPS | June-16 | No | The LAPPS Web Service Exchange Vocabulary does not specify any metadata elements relating to funding information. | Final |
105. Encoding of format in the metadata description of content resources
Concreteness: concrete
Strength: mandatory
Category: WG1
Status: draft
All content resources (i.e. publications, but also ancillary resources, such as lexica, tagsets etc.) that are involved in the TDM process must include in the metadata description information on the format with which they are distributed (e.g. docx, xml, rdf/xml etc.). The OMTD-SHARE schema includes the following metadata elements:
-
'mimeType': mandatory; must be filled in with one of the media types registered with IANA (Internet Assigned Numbers Authority, http://www.iana.org/assignments/media-types/media-types.xhtml)
-
'dataFormatSpecific': this element is optional and can be used to provide further information about the format of the resource, when the 'mimetype' does not suffice, e.g. for xml files structured following a specific schema (e.g. structural annotation schema, such as TEI or JATS, or a simple CSV file with columns ordered in a specific way); to help the resource provider, the element takes a list of values with the data formats currently covered by partners of the OpenMinTeD consortium (link to the controlled vocabulary to be provided).
-
'documentationURL': for data formats that are not included in the previous list, the resource providers are recommended to include a link to a URL where the format is documented (e.g. with an XSD or other formal explanation of the format, examples etc.).
In addition, standard file extensions are highly recommended when publishing a content resource, as it helps in discovering the correct mimetype.
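The link between file extensions and mimetypes can be illustrated with Python's standard `mimetypes` module, which maps well-known extensions to IANA media types. The file names are invented for the example, and a lookup of this kind is only a hint; the value recorded in the metadata remains the authoritative one.

```python
import mimetypes


def guess_mimetype(filename):
    """Return the media type implied by the file extension, or None."""
    mimetype, _encoding = mimetypes.guess_type(filename)
    return mimetype


for name in ("article.pdf", "record.json", "article"):
    print(name, "->", guess_mimetype(name))
```

A file published without an extension gives such a lookup nothing to work with, which is why standard extensions are recommended.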
Note: If there are multiple distributions of the same resource with different formats from different locations, they should be described as separate distributions with the 'mimetype' repeated inside the 'distributionInfo' wrapper element. If the same location is used for all distributions (i.e. the access location is a landing page), providers can simply repeat the 'mimetype' inside the 'textFormatInfo' element at the resource level. For corpora composed of resources with various formats, again the 'mimetype' element inside the 'textFormatInfo' can be repeated.
Note: As a general rule, standard text formats should be preferred, e.g. for structural annotation, the JATS article tag suite (https://jats.nlm.nih.gov/) or TEI (http://www.tei-c.org/index.xml).
Note: Under consideration: discuss how the page with the controlled vocabulary of dataFormatSpecific can include more information on each format so that it's directly usable for components.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE | | Full | Content resource is distributed using mimeType: http://www.iana.org/assignments/media-types/application/json for the metadata and http://www.iana.org/assignments/media-types/application/pdf for the content. documentationURL (for the metadata case) is https://core.ac.uk/docs/ | Draft |
OpenAIRE | | Partial | Format or mimetype of content resources are provided for content accessible through OpenAIRE interfaces. | Draft |
Frontiers | | No | Available Q3 2017 | Final |
106. Encoding licensing terms in the metadata description of the resource
It is mandatory to include an indication of the licensing terms under which a resource is made accessible. In the OMTD-SHARE schema, this can be implemented in one of the following ways, in preferential order:
-
Selection from a list of values of the most popular standard licences in the 'licence' element (link to the controlled vocabulary to be provided) and indication of the licence version in the 'version' element;
-
For licences not included in the list, selection of the 'other' value in the 'licence' element and usage of the elements 'nonStandardLicenceTermsName' to provide a name for the licence and 'nonStandardLicenceTermsURL' to provide a link to the licence text or 'nonStandardLicenceTermsText' to add the licence text.
-
For legacy resources that come without a clear indication of licence but which can be classified into one of the Rights Statements discussed in OpenMinTeD, the last option is to use the element 'rightsStmtName' and 'rightsStmtURL' (with a link to the definition of the rights statement)
Note: For resources that are produced through OpenMinTeD operations (e.g. user queries for creating corpora, resources annotated by workflows in the platform etc.), the user will be shown an aggregation of all licence values of the resources.
To fulfill the above requirements, further recommendations:
-
Components coming from Maven repositories: the wrapper element <license>, with nested elements <name> and <url> for the full title of the licence (including its version) and for its URL respectively, must be filled in; if possible, use of one of the OpenMinTeD recommended licences is preferred.
-
Resources from the Linked Data community can use the dc:license or dc:rights elements to provide a link to the licence text or a licence name. The 'dc:license' (which requires a licence document) is preferable to 'dc:rights' (which can be filled in with a free text statement); it is essential to also attach the full legal document containing the licensing terms to the packaged resource and add the "licenseURI" in the metadata description.
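The preferential order of the three encoding options can be sketched as a simple decision function. The element names are those given above; the contents of `STANDARD_LICENCES` are invented placeholders, since the controlled vocabulary is still to be provided, and the function name and parameters are likewise assumptions of this example.

```python
# Placeholder for the controlled vocabulary of standard licences,
# which the text leaves "to be provided"; the entries are hypothetical.
STANDARD_LICENCES = {"example-licence-a", "example-licence-b"}


def encode_licence(name=None, version=None, url=None, text=None,
                   rights_stmt_name=None, rights_stmt_url=None):
    """Choose the metadata elements for licensing terms, in preferential order."""
    # 1. A licence from the list of standard licences.
    if name in STANDARD_LICENCES:
        return {"licence": name, "version": version}
    # 2. Any other named licence, with a link to or a copy of its text.
    if name:
        encoded = {"licence": "other", "nonStandardLicenceTermsName": name}
        if url:
            encoded["nonStandardLicenceTermsURL"] = url
        elif text:
            encoded["nonStandardLicenceTermsText"] = text
        return encoded
    # 3. Legacy resources: fall back to a rights statement.
    if rights_stmt_name:
        return {"rightsStmtName": rights_stmt_name,
                "rightsStmtURL": rights_stmt_url}
    raise ValueError("an indication of the licensing terms is mandatory")


print(encode_licence("example-licence-a", version="4.0"))
print(encode_licence("In-house terms", url="https://example.org/terms"))
print(encode_licence(rights_stmt_name="Example rights statement",
                     rights_stmt_url="https://example.org/rights"))
```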
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE | | Partial | License is provided where license was explicitly stated in the aggregated resource, otherwise omitted | Draft |
OpenAIRE | | No | Currently not supported | Draft |
Frontiers | | No | Available Q3 2017 | Final |
Alvis | 0.5rc | Partial | Provided in free-text mode. Identification of the licences according to the most popular standard licences in progress. | Draft |
ARGO | 0.5 | Partial | Licenses of components may be declared in the component metadata, in a LICENSE.txt file at the root of a component's source code project or in the Maven POM file. | Draft |
DKPro Core | 1.8.0 | Partial | For components, license information is in the POM file and more detailed in the LICENSE.txt file included in each Maven artifact. For models, the information may presently be missing or only be available in a DKPro Core-specific metadata file associated with the model. | Draft |
GATE | 9 | Full | License information should be included in the pom.xml, from where it can be automatically extracted | Draft |
ILSP | 1.2.1 | Partial | Components provided as web services are accompanied by a statement that they are provided free for research purposes. | Draft |
TheSoz | June-16 | Full | | Final |
Agrovoc | 21/01/2016 | Full | dc:license in void.ttl file | Final |
OLiA | June-16 | No | Not yet; planned to be released under a Creative Commons Attribution Sharealike licence as soon as a reference publication has appeared | Final |
LAPPS | June-16 | No | The LAPPS Web Service Exchange Vocabulary neither provides a license nor specifies any metadata elements relating to licensing information. | Final |
107. Encoding metadata on domain/subject classification for all resources when applicable
Domain/subject classification is valuable both for the discovery of resources of particular interest to the users and for ensuring better results for the TDM processes. However, the vocabularies used for the classification of resources are numerous (cf. http://id.loc.gov/vocabulary/classSchemes.html and https://www.loc.gov/standards/sourcelist/subject.html), and there is not yet any consensus on adopting only a few of them.
Therefore, OpenMinTeD (a) sets the relevant metadata elements at the recommended level, (b) promotes the use of widespread classification vocabularies and (c) mandates that at least the classification vocabulary, if any, used for the specific resource is described with a name and URI.
The above requirements translate into the following metadata elements:
-
recommended 'domain', with attributes mandatory 'classificationSchemeName' and optional 'schemeURI'
-
recommended 'subject', with attributes mandatory 'classificationSchemeName' and optional 'schemeURI'
-
recommended 'keyword', which can be used as a last resort to at least ingest keywords currently used for the classification of resources.
The attribute 'classificationSchemeName' uses a controlled vocabulary (link to be provided) with widespread classification schemes to help users in filling in this information.
The 'dc:subject' element can be used for encoding classification, but on its own it does not provide the required information on the classification scheme.
Note: Classification is also possible with the metadata elements 'textGenre', 'textType', and 'register', which also carry the same attributes as domain and subject.
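A 'domain' or 'subject' element with its scheme attributes might be produced as in the following sketch. The element and attribute names follow the list above; the helper function, the scheme name, the URI and the subject value are placeholders invented for the example.

```python
import xml.etree.ElementTree as ET


def classification(element, value, scheme_name, scheme_uri=None):
    """Build a 'domain' or 'subject' element with its scheme attributes."""
    if element not in ("domain", "subject"):
        raise ValueError("expected 'domain' or 'subject'")
    if not scheme_name:
        raise ValueError("classificationSchemeName is mandatory")
    node = ET.Element(element, {"classificationSchemeName": scheme_name})
    if scheme_uri:
        # 'schemeURI' is optional and only written when supplied.
        node.set("schemeURI", scheme_uri)
    node.text = value
    return node


# Placeholder scheme name, URI and value.
subject = classification("subject", "Example subject", "example-scheme",
                         "https://example.org/scheme")
print(ET.tostring(subject, encoding="unicode"))
```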
Components are also included in the compliance table, although in the vast majority, they are domain-independent. The reason for their inclusion is to indicate whether they come with specific resources (e.g. models) that are trained with corpora of specific domains.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE | | Partial | (i) 'domain' is left empty, as at the moment CORE does not classify publications into research areas. (ii) The 'subject' field is populated where applicable according to the aggregated metadata of the resource (e.g. what is provided in the dc:subject tag). (iii) 'keyword' is left empty, as at the moment CORE does not perform keyword extraction. (i) and (iii) are part of CORE's future endeavours and implementations, to happen with the aid of the OpenMinTeD platform | Draft |
OpenAIRE | | Partial | keywords and subjects are provided in the metadata. | Draft |
Frontiers | | Partial | Domain/subject classification based on PMC naming conventions | Draft |
Alvis | | No | Not currently | Draft |
ARGO | 0.5 | No | Currently no support for this. | Draft |
DKPro Core | 1.8.0 | No | For some models, the "variant" coordinate includes relevant information. | Draft |
GATE | 8.2 | No | No current built-in support, but I assume this is another OMTD-SHARE level requirement which has no bearing on the actual component implementations | Draft |
ILSP | 1.2.1 | No | Not supported. | Draft |
TheSoz | June-16 | Full | via DbPedia URIs. The following values are present in the current | Final |
Agrovoc | 21/01/2016 | Full | DbPedia URIs | Final |
OLiA | June-16 | No | No current support provided | Draft |
LAPPS | June-16 | No | The LAPPS Web Service Exchange Vocabulary does not specify any metadata elements relating to domain/subject classification, and neither does the LAPPS web site give a description of that domain, beyond saying that it is for linguistic data. | Draft |
108. Encoding language information in the metadata of content resources
Language is one of the most important features for discovering content resources and for ensuring the proper operation of processing components. Therefore, it is important not only to include it in the metadata descriptions of resources but also to normalise it.
For normalisation purposes, OpenMinTeD recommends the use of the IETF's BCP 47 (https://tools.ietf.org/html/bcp47), which specifies the syntax for language tags, so as to include language, region, script and variant tags in one code.
Language information is mandatory and can be repeated for multilingual resources.
The metadata elements to be used are:
-
'language' for the language of the contents of the resource: a mandatory language tag, composed of a mandatory language identifier (from the ISO 639 codes) and optional script, region and variant tags
-
'metalanguage': can be optionally used for lexical/conceptual resources and language descriptions, for the language that is used to describe a resource (e.g. a lexicon of Greek terms with definitions in English is considered as having 'language' Greek and 'metalanguage' English).
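How a single language tag is assembled from its parts can be shown with a short sketch. BCP 47 orders the subtags as language, script, region, variant; the function name is an assumption of this example, the casing applied is the conventional one (tags are case-insensitive), and the function does not check the subtags against the IANA subtag registry.

```python
def language_tag(language, script=None, region=None, variant=None):
    """Join subtags in the order BCP 47 prescribes:
    language, then script, then region, then variant."""
    if not language:
        raise ValueError("the language subtag is mandatory")
    subtags = [language.lower()]
    if script:
        subtags.append(script.title())   # conventional casing, e.g. 'Latn'
    if region:
        subtags.append(region.upper())   # conventional casing, e.g. 'RS'
    if variant:
        subtags.append(variant.lower())
    return "-".join(subtags)


print(language_tag("el"))
print(language_tag("sr", script="latn", region="rs"))
print(language_tag("de", region="ch", variant="1901"))
```

The three calls yield `el`, `sr-Latn-RS` and `de-CH-1901`; a multilingual resource would simply carry several such tags, one per 'language' element.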
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE | | Full | Language is provided where available through the metadata of the aggregated resource or inferred through the full text | Draft |
OpenAIRE | | Full | Language is provided where available through the metadata of the aggregated resource. | Draft |
Frontiers | | No | English only supported | Draft |
TheSoz | June-16 | Full | via language attribute for each entity in the thesaurus | Final |
Agrovoc | 21/01/2016 | Full | Labels are associated with language attributes, see models linked from http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus | Final |
OLiA | June-16 | No | Language only indirectly deducible from the description of the resource in the data type property 'comment' value. | Final |
LAPPS | June-16 | Partial | The LAPPS Web Service Exchange Vocabulary defines a | Final |
109. Encoding statistical information in the content resources
Statistical metrics are one of the ways that one can use to decide on whether a certain resource is of interest as well as to monitor versions of the same resource. Size is a feature that can be considered stable enough to use for this reason; however, size is recorded with various units of measurement, depending on the type of the resource as well as on community practices.
For this reason, OpenMinTeD uses a set of two mandatory metadata elements, 'size' and 'sizeUnit', which are combined in the 'sizeInfo' element. 'sizeInfo' is obligatory and can be repeated to record different size dimensions, if the user wishes. 'size' records, as a number, the size of a resource with regard to the 'sizeUnit', while 'sizeUnit' takes values from a controlled vocabulary (link to be provided), which includes a list of the most widespread values for resources.
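Repeating 'sizeInfo' for several size dimensions could look like the following sketch. The dictionary representation, the function name and the unit values `texts` and `tokens` are assumptions made for illustration, since the controlled vocabulary for 'sizeUnit' is not reproduced here.

```python
def size_info(size, size_unit):
    """Return one 'sizeInfo' entry; both parts are mandatory."""
    if isinstance(size, bool) or not isinstance(size, (int, float)):
        raise TypeError("'size' must be a number")
    if not size_unit:
        raise ValueError("'sizeUnit' is mandatory")
    return {"size": size, "sizeUnit": size_unit}


# The same resource measured along two dimensions (placeholder units).
sizes = [size_info(1200, "texts"), size_info(3500000, "tokens")]
print(sizes)
```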
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE | | Full | Size and sizeUnit of the resource are provided | Draft |
OpenAIRE | | Partial | Size information is supported for "datasets" only. | Draft |
Frontiers | | Full | Number of words per publication | Final |
TheSoz | June-16 | Full | number of triples contained inside the resource and number of triples aligned with other resources (e.g. dbpedia) | Final |
Agrovoc | 21/01/2016 | Full | number of concepts, labels | Final |
OLiA | June-16 | No | The OLiA website does not specify any statistics, or metadata elements covering statistics. http://nachhalt.sfb632.uni-potsdam.de/owl/ | Final |
LAPPS | June-16 | No | The LAPPS Web Service Exchange Vocabulary does not define any metadata elements related to statistical information, and the LAPPS website does not give this information in any other form. | Final |
110. Assigning a unique persistent identifier for all resources
All resources must be assigned an identifier (e.g. DOI, Handle PID, URI etc.) that allows them to be uniquely, non-ambiguously and persistently identified.
Resources that already have a PID, can include this in their metadata descriptions, in the element 'identifier'/'publicationIdentifier' using the 'resourceIdentifierSchemeName'/'publicationIdentifierSchemeName' attribute to specify the corresponding scheme, for instance:
-
publications that already have a DOI can include this in the 'publicationIdentifier' with the value doi for 'publicationIdentifierSchemeName'; DOIs are the recommended identifiers for publications
-
datasets, lexica, etc. that already have a DOI or a Handle id, will include these in the 'resourceIdentifier' with 'resourceIdentifierSchemeName' doi or hdl respectively
-
knowledge resources that come from the Linked Data community and have a URI, can encode this in the 'resourceIdentifier' with 'resourceIdentifierSchemeName' uri
-
s/w components, in accordance with 96. Unique identifiers and versions for components using Maven, can include the component coordinates and add the value other for 'resourceIdentifierSchemeName' and https://maven.apache.org/pom.html#Maven_Coordinates for 'schemeURI'.
Resources without PIDs will be assigned one according to the procedure specified in the OMTD policies.
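The mapping from an identifier to its scheme name can be approximated with a few heuristics, as sketched below. This is not a prescribed procedure: the regular expressions are simplifications, the `hdl:` prefix convention and the result key `schemeName` (standing for 'resourceIdentifierSchemeName' or 'publicationIdentifierSchemeName') are assumptions of the example, and the Maven coordinates shown are invented.

```python
import re

MAVEN_SCHEME_URI = "https://maven.apache.org/pom.html#Maven_Coordinates"


def describe_identifier(value):
    """Guess the scheme of an identifier string (simplified heuristics)."""
    if re.match(r"^10\.\d{4,9}/\S+$", value):
        return {"value": value, "schemeName": "doi"}
    if value.startswith("hdl:"):
        # Assumed convention: handles are written with an 'hdl:' prefix.
        return {"value": value[len("hdl:"):], "schemeName": "hdl"}
    if re.match(r"^https?://", value):
        return {"value": value, "schemeName": "uri"}
    if re.match(r"^[\w.\-]+:[\w.\-]+:[\w.\-]+$", value):
        # Maven coordinates in the form groupId:artifactId:version.
        return {"value": value, "schemeName": "other",
                "schemeURI": MAVEN_SCHEME_URI}
    raise ValueError("unrecognised identifier: " + value)


for identifier in ("10.1016/j.procs.2014.06.015",
                   "http://lod.gesis.org/thesoz/",
                   "org.example:example-component:1.0.0"):
    print(describe_identifier(identifier))
```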
Note: The element 'identifier' is not reserved for PIDs; it can be used for any kind of identifier, including organization-internal identifiers.
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
CORE | | Full | CORE publications are uniquely and persistently identified (using CORE's internal ID) | Draft |
OpenAIRE | | Partial | Whenever possible, resources are assigned persistent identifiers; in addition, OpenAIRE object identifiers are assigned, whose persistence depends on the original resource identifier | Draft |
Frontiers | | Full | DOIs for all publications | Final |
Alvis | 0.5rc | Full | Components can be uniquely and persistently identified | Draft |
ARGO | 0.5 | Full | All Argo components and type systems are identifiable using Maven coordinates. | Draft |
DKPro Core | 1.8.0 | Full | UIMA components are provided as Maven artifacts. | Draft |
GATE | 9 | Full | Components distributed via Maven can be uniquely and persistently identified via their Maven coordinates and class name | Draft |
ILSP | 1.2.1 | Full | UIMA components are developed as Maven artifacts. | Draft |
TheSoz | June-16 | Full | http://lod.gesis.org/thesoz/ ( | Final |
Agrovoc | 21/01/2016 | Full | URI identifiers are provided for all resources | Draft |
OLiA | June-16 | Full | PURLs are used, see http://nachhalt.sfb632.uni-potsdam.de/owl/ | Draft |
LAPPS | June-16 | Full | The LAPPS Web Service Exchange Vocabulary defines URLs for all its metadata elements. Web services reside at known URLs. The vocabulary itself resides at known URIs. | Draft |
111. Annotation schema dependencies for UIMA components using Maven
Concreteness: concrete
Strength: mandatory
Category: WG4
Status: draft
A UIMA component, which is managed using Maven, will declare any required type systems as Maven dependencies.
The Maven artifacts, containing the type systems, will have UIMA resource specifiers describing the type systems (in terms of types and type features). These resource specifiers can be located in two ways:
-
The Maven type system artifacts can contain a locator file called META-INF/org.apache.uima.fit/types.txt, which declares pointers to all of the type system descriptor files existing within the artifact, for example:
classpath*:my/organizations/domain/name/package/TypeSystem.xml
-
The resource specifier of the UIMA component can explicitly declare the names of type systems it requires. These names should be written as fully qualified Java-style dotted names and refer to XML files at the given locations in the classpath.
<typeSystemDescription>
  <imports>
    <import name="my.organizations.domain.name.package.TypeSystem" />
  </imports>
</typeSystemDescription>
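The two location mechanisms point at the same descriptor file, which the following sketch makes explicit. It is written in Python purely for illustration and does not reproduce how UIMA or uimaFIT resolve type systems; the function names are invented, and only the `classpath*:` prefix shown in the example above is handled.

```python
def import_name_to_path(name):
    """Map a dotted import name to the classpath location of its XML file."""
    return name.replace(".", "/") + ".xml"


def read_types_txt(content):
    """Return the classpath patterns listed in a types.txt file,
    ignoring lines that do not start with the 'classpath*:' prefix."""
    prefix = "classpath*:"
    patterns = []
    for line in content.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            patterns.append(line[len(prefix):])
    return patterns


name = "my.organizations.domain.name.package.TypeSystem"
types_txt = "classpath*:my/organizations/domain/name/package/TypeSystem.xml\n"

print(import_name_to_path(name))
print(read_types_txt(types_txt))
```

For the example values used in this section, the dotted import name and the types.txt entry resolve to the same classpath location.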
Product | Version | Compliant | Justification | Status |
---|---|---|---|---|
Alvis | 0.5rc | Partial | Currently the Alvis type system exists only as Java classes. A path to the main Java class can be specified but for an automatic interpretation of the type system some conventions need be formalized. | Draft |
ARGO | 0.5 | Full | All Argo components include type systems as Maven dependencies and explicitly declare the name of the type systems used within their XML descriptor. | Draft |
DKPro Core | 1.8.0 | Full | All DKPro Core components include type systems as Maven dependencies and explicitly declare the name of the type systems used within their XML descriptor. | Draft |
ILSP | 1.2.1 | Full | UIMA-based ILSP components include Maven dependencies to a project providing a common type system. | Draft |
Bibliography
-
[uimafit] Philip V. Ogren, Steven J. Bethard (2009). Building Test Suites for UIMA Components. Proceedings of the Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing (SETQA-NLP 2009). June 2009. (PDF)
-
[Prokopidis2011] Prokopidis, P., Georgantopoulos, B. & Papageorgiou, H. (2011). A suite of NLP tools for Greek. In the 10th International Conference of Greek Linguistics. Komotini, Greece. (PDF)
-
CORE FAQs: https://core.ac.uk/about#faqs
-
Knoth, P. (2013). From Open Access Metadata to Open Access Content: Two Principles for Increased Visibility of Open Access Content. Open Repositories 2013, Charlottetown, Prince Edward Island, Canada. (PDF)
-
Príncipe, Pedro, Najla Rettberg, Eloy Rodrigues, Mikael K. Elbæk, Jochen Schirrwagen, Nikos Houssos, Lars Holm Nielsen, Brigitte Jörg (2014). OpenAIRE Guidelines: Supporting Interoperability for Literature Repositories, Data Archives and CRIS. Procedia Computer Science. Special issue on 12th International Conference on Current Research Information Systems, CRIS 2014 — Managing data intensive science – The role of Research Information Systems in realising the digital agenda. Elsevier. http://doi.org/10.1016/j.procs.2014.06.015
-
OpenAIRE guidelines: https://guidelines.openaire.eu/en/latest/literature/index.html