OMTD-SHARE ontology for resources related to Text and Data Mining initiated in the framework of the OpenMinTeD project (https://www.openminted.eu). It currently focuses on TDM functions (tasks performed by TDM software), annotation types (types of information extracted or annotated by TDM software), TDM methods (classification of the theoretical method used in the TDM algorithm), and data formats of the resources that can be processed by TDM software. The ontology is deployed in the OMTD-SHARE metadata schema for the description of such resources (https://openminted.github.io/releases/omtd-share/) and is based on work done previously in the framework of the META-SHARE schema for language resources (http://www.meta-share.org/knowledgebase/homePage) and the FOSTER taxonomy of TDM (https://www.fosteropenscience.eu/resources).
OMTD-SHARE ontology
Richard Eckart de Castilho
Claire Nedellec
Dimitris Galanis
Katerina Gkirtzou
Marta Villegas
Penny Labropoulou
Petr Knoth
Sophie Aubin
2017-12-22
omtds
The OMTD-SHARE ontology is intended for resources related to Text and Data Mining.
It currently focuses on TDM functions (tasks performed by TDM software), annotation types (types of information extracted or annotated by TDM software), TDM methods (classification of the theoretical method used in the TDM algorithm), and data formats of the resources that can be processed by TDM software. The ontology is deployed in the OMTD-SHARE metadata schema for the description of such resources (https://openminted.github.io/releases/omtd-share/).
The ontology has been initiated in the framework of the OpenMinTeD project (https://www.openminted.eu). It is based on work done previously in the framework of the META-SHARE schema for language resources (http://www.meta-share.org/knowledgebase/homePage) and the FOSTER taxonomy of TDM (https://www.fosteropenscience.eu/resources).
The ontology is still in a working status.
OMTD-SHARE ontology
v1.0.0
Concept A is used frequently in domain B
deployed in domain
Used to classify annotations
has annotation type
Relates a data format to the IANA mimetype; it can be the exact or a broader mimetype; unofficial mimetypes are also used, but this relation will be revisited
has mimetype
Component A performs Operation B
performs Operation
performs Task
owl:equivalentClass
The URL link in which a concept is documented
documentation URL
The file extension usually associated with a specific data format (e.g. txt for plain text files, pdf for PDF files etc.)
has file extension
A component that provides access to data resources, e.g. reads a resource or writes the output of a process in a certain format
Access Component
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-AclAnthology
Data format specific to the ACL Anthology Reference Corpus (http://acl-arc.comp.nus.edu.sg/), most probably version 20080325
ACL Anthology Corpus format
Any kind of annotation pertaining to entities of the agricultural domain; the use of the AGROVOC thesaurus is recommended
Agricultural entity
The science, art, or practice of cultivating the soil, producing crops, and raising livestock and in varying degrees the preparation and marketing of the resulting products [https://www.merriam-webster.com/dictionary/agriculture]
Agriculture
https://builds.openminted.eu/job/WP%205.2%20-%20Typesystem%20alignment/eu.openminted.interop$mapping-conversion/doclinks/1/components.html#Aimed_Collection_Reader
Format of the AIMed corpus (225 abstracts from MEDLINE) with gold standard sentence, protein and protein-protein interaction annotations.
AIMED corpus format
A component that detects and annotates equivalence relations between items (corpora, texts, paragraphs, sentences, phrases, words) in two languages
Aligner
ALLBUS variable
ALLBUS variable
http://www.lrec-conf.org/proceedings/lrec2006/pdf/742_pdf.pdf
Format for linguistic annotations of documents used for the ALVIS framework
ALVIS Enriched Document format
A component that is used for analyzing an input text in order to extract specific features/information (e.g. word list), or to produce statements over the whole text (e.g. classify it by topic)
Analyzer
Extractor
A note by way of explanation or comment added to a text or diagram [OED, https://en.oxforddictionaries.com/definition/annotation]. Text or corpus annotation refers to the interpretative linguistic information grounded in a knowledge resource that is added manually or automatically to a text or corpus respectively.
The process/task of adding annotations to an item
Annotation
Labelling
Tagging
Any format used for annotated textual documents
Annotation format
The task/process of marking compounds (single words composed of two or more free morphemes) and their parts
Annotation of compounds
The task/process of adding annotations relevant to the derivational level of analysis (e.g. recognizing derivational affixes, tagging their meaning etc.)
Annotation of derivational features
The task/process of annotating the internal structure of a document (e.g. book chapters, sections in a journal article, title, preface, images/figures etc.)
Annotation of document structure
The task/process of annotating multi-word units, i.e. combinations of words that are considered as one
Annotation of multi-word units
Category/class of the annotations (metadata) that are added to the data/text that is processed
Annotation type
Label
Tag
A component that annotates any data (text, video, audio etc.), i.e. adds any descriptive or analytic notations (structural, linguistic, etc.) to raw data
Annotator
Tagger
A component that annotates the tokens of a text with Semantic Role labels
Annotator of semantic role labels
The act or process of forming reasons and of drawing conclusions and applying them to a case in discussion [https://www.merriam-webster.com/dictionary/argumentation]
Argumentation
adapted from Wikipedia (https://en.wikipedia.org/wiki/Artificial_neural_network)
A computational model based on a large collection of simple neural units (artificial neurons), loosely analogous to the observed behavior of a biological brain's axons. These systems are self-learning and trained, rather than explicitly programmed.
Artificial Neural Network
A machine learning method used for discovering relationships among variables in databases and expressing them in the form of rules.
Association Rule Learning
ANN
The response of the target recipients (audience) to a system, process or event
Audience reaction
adapted from Wikipedia (https://en.wikipedia.org/wiki/Bayesian_inference)
A method in probability and statistics based on Bayes' theorem, mainly related to statistical inference.
Bayesian
The annotation of words with morphological information besides the part of speech and dependent upon it (e.g. for nouns: gender, number and case; for verbs: tense, number, person etc.)
Below PoS Tagging
Annotation of morphological features
B-PoS Tagging
The task/process of inducing word translations from monolingual or comparable corpora in two languages
Bilingual lexicon induction
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-de.tudarmstadt.ukp.dkpro.core.io.bincas-asl
Binary format used for CAS data
Binary CAS
UIMA Binary CAS
Any format of a computer file in which information is stored in the form of ones and zeros, or in some other binary (two-state) sequence; used mainly for executable files or files that need to be interpreted by a computer program
Binary format
Biological activity
Biological activity
Any kind of annotation pertaining to entities of biology
Biological entity
https://builds.openminted.eu/job/WP%205.2%20-%20Typesystem%20alignment/eu.openminted.interop$mapping-conversion/doclinks/1/components.html#_bionlp_shared_task_2
File format used for the BioNLP Shared Task
BioNLP
Formats used for BioNLP shared tasks
BioNLP formats
https://builds.openminted.eu/job/WP%205.2%20-%20Typesystem%20alignment/eu.openminted.interop$mapping-conversion/doclinks/1/components.html#_bionlp_st_2013_a1_a2_1
Format used in BioNLP Shared Task 2013
BioNLP-ST 2013 a1/a2
bioNLP; format-variant=ST2013a1_a2
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-de.tudarmstadt.ukp.dkpro.core.io.bliki-asl
The Java Wikipedia API (Bliki engine) is a parser library for converting Wikipedia wikitext notation to HTML.
blikiWikipedia
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-de.tudarmstadt.ukp.dkpro.core.io.bnc-asl
Data format for the XML version of the British National Corpus (http://www.natcorp.ox.ac.uk/)
BNC format
Part of the brain
Brain region
http://brat.nlplab.org/standoff.html
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-de.tudarmstadt.ukp.dkpro.core.io.brat-asl
BRAT stand-off format for annotations (BRAT is an online environment for collaborative text annotation, cf. http://brat.nlplab.org/)
BRAT
https://builds.openminted.eu/job/WP%205.2%20-%20Typesystem%20alignment/eu.openminted.interop$mapping-conversion/doclinks/1/components.html#ExportCadixeJSON
AlvisAE protocol format
Cadixe/JSON
Degree of certainty about the validity of what is being asserted in the text
Certainty level
http://talkbank.org/manuals/CHAT.pdf
CHAT (Codes for the Human Analysis of Transcripts) transcription format; used by CHILDES corpora
CHAT
Codes for the Human Analysis of Transcripts
Any substance (as an acid) that is formed when two or more other substances act upon one another or that is used to produce a change in another substance [https://www.merriam-webster.com/dictionary/chemical]
Chemical
Any kind of annotation pertaining to entities from chemistry
Chemical entity
Group of words that function together; a chunk normally includes a head and some consecutive (i.e. without gaps) preceding words
Chunk
A component that groups tokens of a text into chunks
Chunker
The task/process of dividing a sentence into chunks (non-overlapping text segments consisting of a head and preceding function words and/or modifiers)
Chunking
Light parsing
Shallow parsing
Reference to a book, paper, or author, especially in a scholarly work.
Citation
A clause is a subdivision of a sentence containing a subject (argument) and predicate. It is possible to have a word that implies or refers to a predicate rather than one explicitly stated. [Pei & Gaynor 1980: 40, http://linguistics-ontology.org/gold/2010/Clause]
Clause
adapted from Wikipedia (https://en.wikipedia.org/wiki/Cluster_analysis)
Any method used in clustering or cluster analysis, i.e. in grouping a set of objects in such a way that objects in the same group (cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
Clustering Method
k-means, k-nearest neighbours
A component that annotates tokens of a text with coreference labels, marking expressions that refer to the same entity in the text
Co-reference annotator
The process/task of determining all linguistic expressions that refer to the same entity in a certain text or across texts.
Co-reference resolution
https://gate.ac.uk/sale/tao/splitch23.html#sec:creole:pubmed
Format used in Cochrane texts
Cochrane
Specifies the type of a component, in terms of the function/task it performs
Component type
A single word composed of two or more free morphemes
Compound
The process/task of identifying and representing argumentation in text, so that systems have the ability to use them in tasks, such as automated logical reasoning
Computational argumentation
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-Conll2000
The CoNLL 2000 format represents POS and Chunk tags. Fields in a line are separated by spaces. Sentences are separated by a blank new line.
CoNLL-2000
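The layout described for CoNLL-2000 can be read with a few lines of code. The following is a minimal sketch, assuming each line holds exactly three space-separated fields in the order word, POS tag, chunk tag; the function name `parse_conll2000` and the sample text are illustrative only and are not part of the ontology or of DKPro Core.

```python
from typing import List, Tuple

Token = Tuple[str, str, str]  # (word, POS tag, chunk tag)


def parse_conll2000(text: str) -> List[List[Token]]:
    """Split CoNLL-2000 style text into sentences of (word, pos, chunk) triples."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumption: exactly three space-separated fields per token line.
        word, pos, chunk = line.split()
        current.append((word, pos, chunk))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical two-sentence sample in the assumed layout.
SAMPLE = """He PRP B-NP
reckons VBZ B-VP

It PRP B-NP
"""

if __name__ == "__main__":
    for sentence in parse_conll2000(SAMPLE):
        print(sentence)
```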
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-Conll2002
The CoNLL 2002 format encodes named entity spans. Fields are separated by a single space. Sentences are separated by a blank new line.
CoNLL-2002
https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20Core%20Documentation%20(GitHub)/de.tudarmstadt.ukp.dkpro.core$de.tudarmstadt.ukp.dkpro.core.doc-asl/doclinks/5/format-reference.html#format-Conll2003
The CoNLL 2003 format encodes named entity spans and chunk spans. Fields are separated by a single space. Sentences are separated by a blank new line. Named entities and chunks are encoded in the IOB1 format, i.e. a B prefix is used only when a span immediately follows another span of the same category.
CoNLL-2003
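To make the IOB1 encoding concrete, the sketch below decodes a tag sequence into spans. It assumes the usual IOB1 convention: spans normally start with an I- tag, and a B- tag appears only where a span directly follows another span of the same category. The function name `iob1_to_spans` and the sample tags are hypothetical and serve only as an illustration.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, category)


def iob1_to_spans(tags: List[str]) -> List[Span]:
    """Decode a sequence of IOB1 tags into (start, end, category) spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        if tag == "O":
            prefix, cat = "O", None
        else:
            prefix, cat = tag.split("-", 1)
        # An open span ends at "O", at an explicit B- tag,
        # or when the category changes.
        if label is not None and (cat is None or prefix == "B" or cat != label):
            spans.append((start, i, label))
            start, label = None, None
        # Open a new span if we are inside a tagged token and none is open.
        if cat is not None and label is None:
            start, label = i, cat
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


if __name__ == "__main__":
    # B-LOC separates two adjacent LOC spans; elsewhere I- suffices.
    print(iob1_to_spans(["I-PER", "I-PER", "O", "I-LOC", "B-LOC", "I-LOC", "I-ORG"]))
```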
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-Conll2006
The CoNLL 2006 (aka CoNLL-X) format targets dependency parsing. Columns are tab-separated. Sentences are separated by a blank new line.
CoNLL-2006
CoNLL-2007
CoNLL-X
https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20Core%20Documentation%20(GitHub)/de.tudarmstadt.ukp.dkpro.core$de.tudarmstadt.ukp.dkpro.core.doc-asl/doclinks/5/format-reference.html#format-Conll2008
The CoNLL 2008 format targets syntactic and semantic dependencies. Columns are tab-separated. Sentences are separated by a blank new line.
CoNLL-2008
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-Conll2009
The CoNLL 2009 format targets semantic role labeling. Columns are tab-separated. Sentences are separated by a blank new line.
CoNLL-2009
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-Conll2012
The CoNLL 2012 format targets semantic role labeling and coreference. Columns are tab-separated. Sentences are separated by a blank new line.
CoNLL-2012
Formats used in the CoNLL Shared Tasks
CoNLL format
http://universaldependencies.org/docs/format.html
CoNLL-U format, a revised version of the CoNLL-X format used for Universal Dependencies treebanks.
CoNLL-U
A component that builds a constituency tree from typically token and part-of-speech annotations
Constituency parser
The task/process of identifying and marking constituents (phrases, governed by a head and including function words and/or modifiers) in a text or text segment
Constituency parsing
Phrase parsing
An ordered, rooted tree that represents the syntactic structure of a string according to a constituency grammar (= phrase structure grammar). It distinguishes between terminal and non-terminal nodes. The interior nodes are labeled by non-terminal categories of the grammar (phrases), while the leaf nodes are labeled by terminal categories (parts of speech). [adapted from https://en.wikipedia.org/wiki/Parse_tree]
Constituency tree
A word or group of words that function as a single unit in a syntactic structure
Constituent
The automated analysis of large volumes of content of any form or medium (e.g. text, images, videos, graphs, metadata etc.) that leads to the discovery of previously undiscovered information (e.g. identification of relationships between entities).
Content Mining
A set of statements that contradict each other (i.e. one of them asserts the truth and the other the falsity of the proposition)
Contradiction
The task/process of identifying conflicting statements (contradictions) in a dataset
Contradiction detection
A component that tries to automatically recognize elements that reveal contradiction in a text
Contradiction detector
could also be an annotator
A component that performs conversion between formats of a resource
Converter
Coreference is the reference in one expression to the same referent in another expression. [http://www.glossary.sil.org/term/coreference]
Coreference
Co-reference
As defined here it refers more to the phenomenon than the actual annotation types; referent and co-referent might be more appropriate
A format used by a specific type of corpus (collection of texts)
Corpus format
A component that supports humans in accessing the contents of a corpus
Corpus viewer
The task/process of viewing the contents of a corpus as performed by human beings
Corpus viewing
A component that crawls the web and collects data from various web sites
Crawler
The use of bots that crawl the web (crawlers) in order to spot content that matches user-set criteria and download it to create large datasets
Crawling
Web crawling
The practice of obtaining needed services, ideas, or content by soliciting contributions from a large group of people and especially from the online community rather than from traditional employees or suppliers
Crowdsourcing
A component that supports crowdsourcing operations
Crowdsourcing component
Data format with comma-separated values
CSV
Comma-separated values
The process of gathering and measuring information on targeted variables in an established systematic fashion, which then enables one to answer relevant questions and evaluate outcomes.
Data collection
A component that collects (retrieves) data from various sources
Data collector
dc:format
The format of a computer file storing data
Data format
Data type
File format
A component that supports data merging from various sources
Data merger
The task/process of merging (combining) together data from various sources
Data merging
A component that performs data splitting for cross validation purposes
Data splitter
The task/process of splitting (partitioning) available data into parts, usually for cross-validatory purposes, e.g. in order to use one part for training purposes and the other for evaluation.
Data splitting
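The data splitting task can be sketched as a seeded random partition into a training part and an evaluation part. This is an illustration under stated assumptions only: the function name `split_data`, the 80/20 default and the use of Python's `random` module are choices made here, not something prescribed by the ontology.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_data(
    items: Sequence[T], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle items reproducibly and cut them into (training, evaluation) parts."""
    if not 0.0 <= train_fraction <= 1.0:
        raise ValueError("train_fraction must lie between 0 and 1")
    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    train, test = split_data(list(range(10)), train_fraction=0.8, seed=1)
    print(len(train), len(test))
```

For k-fold cross-validation the same idea would be repeated over k disjoint evaluation parts; the sketch covers only the single train/evaluation case mentioned in the definition.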
Formats used for databases
Database format
https://gate.ac.uk/sale/tao/splitch23.html#x28-59500023.32
Common format for social media data from http://datasift.com
DataSift/JSON
A text unit that denotes a date, a specific point in time
Date
A component that is used in the debugging process
Debugger
The task/process of removing errors from a computer programme
Debugging
adapted from scikit-learn (http://scikit-learn.org/stable/modules/tree.html) and Wikipedia (https://en.wikipedia.org/wiki/Decision_tree)
A non-parametric supervised learning method used for classification and regression. The goal is to create a tree-like graph or model of decisions and their possible consequences by learning simple decision rules inferred from the data features.
Decision Trees
adapted from Wikipedia (https://en.wikipedia.org/wiki/Deep_learning)
A branch of machine learning based on deep neural networks. A deep neural network (DNN) is an artificial neural network (ANN) with multiple hidden layers of units between the input and output layers.
Deep Learning
A type of syntactic relation that holds between linguistic units, where we try to recognise the head (governor) and its dependents
Dependency
The task/process of converting constituency structures to dependency trees
Dependency conversion
A component that converts a constituency tree into a dependency tree
Dependency converter
A component that generates a dependency tree from typically token and part-of-speech annotations
Dependency parser
adapted from https://nlp.stanford.edu/software/nndep.shtml
The task/process of identifying and marking the grammatical structure of a sentence, establishing relationships between "head" words and words that modify those heads
Dependency parsing
A tree that represents the dependency relations in a sentence, i.e. showing the governor (head) and its dependents with directed links
Dependency tree
The analysis of a word in order to identify its derivation, i.e. whether and how it has been formed on the basis of another word (e.g. through the use of affixes)
Derivational Analysis
Any feature relevant to the derivation process of a word (e.g. marking affixes, their meaning etc.)
Derivational feature
A dialogue act has two main components: a communicative function and a semantic content. The semantic content specifies the objects, relations, actions, events, etc. that the dialogue act is about; the communicative function can be viewed as a specification of the way an addressee uses the semantic content to update his or her information state when he or she understands the corresponding stretch of dialogue. [http://www.lrec-conf.org/proceedings/lrec2010/pdf/560_Paper.pdf]
Dialogue act
https://www.iso.org/standard/51967.html
Format following Dialogue Act Markup Language (DiAML) which is defined within the ISO standard 24617-2
DIAML
adapted from Wikipedia (https://en.wikipedia.org/wiki/Dimensionality_reduction)
A method based on reducing the number of random variables under consideration, via obtaining a set of principal variables.
Dimensionality Reduction
A component that is used to disambiguate between two or more ambiguous items
Disambiguator
The relation that holds between two segments of discourse; e.g. causal, temporal etc.
Discourse relation
A method of analysing the structure of texts or utterances longer than one sentence, taking into account both their linguistic content and their sociolinguistic context; analysis performed using this method. [OED, https://en.oxforddictionaries.com/definition/discourse_analysis]
Discourse analysis
The task/process of adding annotations relevant to discourse, such as discourse structure, discourse markers etc.
Discourse annotation
Any type of annotation relevant to discourse
Discourse annotation type
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-TokenizedText
DKPro format for tokenized files containing one sentence per line and tokens separated by whitespace.
DKPro tokenized
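Reading a file in this one-sentence-per-line layout needs very little code. The sketch below assumes plain text input and skips empty lines; the function name `read_tokenized` is hypothetical and this is not the DKPro Core reader itself.

```python
from typing import List


def read_tokenized(text: str) -> List[List[str]]:
    """Return one token list per non-empty line, splitting tokens on whitespace."""
    return [line.split() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    print(read_tokenized("This is a sentence .\nHere is another one .\n"))
```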
Any kind of annotation that is used to describe a document (e.g. identifier, size, location, language etc.)
Document annotation type
Document information
Document metadata
A component that tries to classify a document into one or more categories
Document classifier
Any format used for documents (textual resources)
Document format
Any subdivision of a document, e.g. a chapter, abstract, etc.
Document section
Area of interest or activities
Domain
Any kind of annotation that is used for specific domains (e.g. genes and proteins from the biomedical domain, plants from agriculture etc.)
Domain-specific annotation type
The task/process of changing the contents of a resource
Editing
A component that allows humans to edit the contents of a resource
Editor
Data format according to the EMMA (Extensible MultiModal Annotation markup language) specifications, cf. https://www.w3.org/TR/2007/CR-emma-20071211/
EMMA
An affective state of consciousness in which joy, sorrow, fear, hate, or the like, is experienced, as distinguished from cognitive and volitional states of consciousness [http://www.dictionary.com/browse/emotion]
Emotion
The process/task of identifying types of feelings (e.g. anger, fear, happiness, sadness, etc.) in the linguistic expression of texts or facial expressions
Emotion detection
Emotion recognition
A component that tries to recognize and annotate emotions (e.g. fear, anger, happiness etc.) from text, video, audio and image
Emotion recognizer
Emotion detector
could also be an annotator
adapted from Wikipedia (https://en.wikipedia.org/wiki/Ensemble_learning)
Any method that uses multiple learning algorithms in an attempt to improve predictive performance not obtainable otherwise with any of the constituent learning algorithms.
Ensemble Method
The pair of an entity and all the mentions of this entity formulated in various ways; used in co-reference resolution
Entity-Mention pair
The task/process of detecting in a text mentions of a specific class of entities (e.g. biochemical entities, historical persons)
Entity mention recognition
unclear definition
The task/process of assessing the quality of a resource, e.g. based on the contents (for a dataset) or performance (for a tool or service)
Evaluation
A component that is used in the evaluation of the performance of a component
Evaluator
A thing that happens or takes place, especially one of importance [https://en.oxforddictionaries.com/definition/event]
Event
The process/task of identifying events in data (text, video, images etc.), usually combined with their classification into types of events and recognition of the event attributes (e.g. time, place, participants and duration)
Event detection
Event extraction
A component that tries to extract information related to incidents referred to in a text
Event extractor
could also be an annotator
A type of search which, in contrast to traditional lookup search, covers a broad class of activities, such as investigating, evaluating, comparing, and synthesizing
Exploratory search
Extraction of information that pertains to specific domains/disciplines; it can be used combined with "Annotation type" to specify the type of information extracted
Extraction of domain-specific information
The task/process of detecting in a text and extracting information relevant to funding (e.g. funding programme, award, funder etc.)
Extraction of funding information
Mining of funding information
https://builds.openminted.eu/job/WP%205.2%20-%20Typesystem%20alignment/eu.openminted.interop$mapping-conversion/doclinks/1/components.html#_factored_tag_lem_1
Factored tag lemma format
Factored tag lem format
https://gate.ac.uk/sale/tao/splitch23.html#x28-59400023.31
A compressed binary encoding of GATE XML
Fast Infoset
Feature extraction consists in transforming arbitrary data, such as text or images, into numerical features usable for machine learning
Feature extraction
A component that is used for extracting features
Feature extractor
could also be under analyzer as a general term
A component that is used for filtering text input or annotations based on specific criteria
Filter
In data communications, flow control is the process of managing the rate of data transmission between two nodes to prevent a fast sender from overwhelming a slow receiver. It provides a mechanism for the receiver to control the transmission speed, so that the receiving node is not overwhelmed with data from the transmitting node.
Flow control
A component that supports controlling flows
Flow controller
https://proycon.github.io/folia/
FoLiA is an XML-based annotation format, suitable for the representation of linguistically annotated language resources
FoLiA
Format for Linguistic Annotation
The task/process of converting (changing) the format of a resource into another (e.g. PDF to TXT or XML)
Format conversion
Data conversion
File conversion
Annotation related to the funding of a resource (e.g. funder, funding project, etc.)
Funding
Formats used for the GATE framework
GATE format
XML-based format for GATE components
GATE XML
https://gate.ac.uk/sale/tao/splitch17.html
A Twitter-style JSON format used for GATE documents
GATE/JSON
Twitter/JSON
A component that allows matching of elements based on a gazetteer
Gazetteer based matcher
The task/process of performing a comparison between a text/dataset and a gazetteer and identifying in the text/dataset units that are included in the gazetteer
Gazetteer based matching
Specific sequence of nucleotides along a molecule of DNA (or, in the case of some viruses, RNA) which represents functional units of heredity [http://artemide.art.uniroma2.it:8081/agrovoc/agrovoc/en/page/c_3214]
Gene
A gene family is a set of several similar genes, formed by duplication of a single original gene, and generally with similar biochemical functions [https://en.wikipedia.org/wiki/Gene_family]
Gene family
A component that generates (semi-)automatically natural language texts (based on non-linguistic data, keywords, logical forms, knowledge bases...)
Generator
http://dl.acm.org/citation.cfm?id=1642060
https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20Core%20Documentation%20(GitHub)/de.tudarmstadt.ukp.dkpro.core$de.tudarmstadt.ukp.dkpro.core.doc-asl/doclinks/5/format-reference.html#format-Graf
GrAF (Graph Annotation Format) is an extension of the Linguistic Annotation Framework (LAF)
GrAF
Graph Annotation Format
A component that corrects grammatical mistakes in a text
Grammar checker
A type of grape
Grape variety
The place or environment where an organism, plant or animal naturally or normally lives and grows
Habitat
Historical event
Historical event
HTML format
HTML
https://www.w3.org/TR/microdata/
Format according to the specifications of HTML5 Microdata
HTML5 Microdata
https://builds.openminted.eu/job/WP%205.2%20-%20Typesystem%20alignment/eu.openminted.interop$mapping-conversion/doclinks/1/components.html#I2B2Reader
https://www.i2b2.org/NLP/RDoCforPsychiatry/PreviousChallenges.php
Format of the I2B2 challenge
I2B2
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-de.tudarmstadt.ukp.dkpro.core.io.imscwb-asl
A tab-separated format with limited markup (e.g. for sentences, documents, but not recursive structures like parse-trees) used by the IMS Open Corpus Workbench.
imsCwb
IMS Corpus Workbench
The process/task of automatically extracting structured information from unstructured and/or semi-structured data
Information extraction
A component that automatically extracts structured information from unstructured and/or semi-structured machine-readable documents
Information extractor
The process/task of removing (filtering out) redundant or unwanted information from an information stream using (semi)automated or computerized methods prior to presentation to a human user; the selection of the items is based on the correlation between the content of the items and the user’s preferences (content-based filtering) or the correlation between people with similar preferences (collaborative filtering)
Information filtering
The delivery of information in the form of suggestions by recommender systems; recommender systems seek to predict the "rating" or "preference" that a user would give to an item
Information filtering by recommender systems
The activity of obtaining information resources relevant to an information need from a collection of information resources; searches can be based on full-text or other content-based indexing
Information retrieval
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-de.tudarmstadt.ukp.dkpro.core.io.xml-asl
Inline XML file format
Inline XML
adapted from Wikipedia (https://en.wikipedia.org/wiki/Instance-based_learning)
A family of learning algorithms that, instead of performing explicit generalization, compares new problem instances with instances seen in training, which have been stored in memory.
Instance-based Learning
Examples of instance-based learning algorithms are the k-nearest neighbor algorithm, kernel machines and RBF networks.
Ion channel
A single protein or protein complex that traverses the lipid bilayer of the cell membrane and forms a channel to facilitate the movement of ions through the membrane according to their electrochemical gradient [http://www.biology-online.org/dictionary/Ion_channel]
Ionic channel
Ion conductance
Ionic conductance
Ionic conductance
Ion current
The influx and/or efflux of ions through an ion channel
Ionic current
https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20Core%20Documentation%20(GitHub)/de.tudarmstadt.ukp.dkpro.core$de.tudarmstadt.ukp.dkpro.core.doc-asl/doclinks/5/format-reference.html#format-Jdbc
For JDBC databases
JDBC
Java Database Connectivity
Superclass of JSON formats
JSON
https://builds.openminted.eu/job/WP%205.2%20-%20Typesystem%20alignment/eu.openminted.interop$mapping-conversion/doclinks/1/components.html#GeniaJSONReader
JSON format of the Genia dataset
JSON/Genia
Data format encoding Linked Data using JSON
JSON/LD
http://kyoto-project.eu/xmlgroup.iit.cnr.it/kyoto/indexdd46.html?option=com_content&view=article&id=141&Itemid=130
KAF (also known as Knowledge Annotation Format) is a language neutral annotation format representing both morpho-syntactic and semantic annotation of documents through a stand-off multilayered structure
KAF
KYOTO Annotation Format
Knowledge Annotation Format
https://builds.openminted.eu/job/WP%205.2%20-%20Typesystem%20alignment/eu.openminted.interop$mapping-conversion/doclinks/1/components.html#_kea_corpus_1
KEA-style (Keyphrase Extraction Algorithm) corpus
KEA corpus
adapted from Wikipedia (https://en.wikipedia.org/wiki/Kernel_method)
Any method used in pattern analysis that relies on kernel functions, which enable it to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space.
Kernel Method
A word or group of words used to describe or index the contents of a document
Keyword
The task/process of identifying keywords (words deemed indicative of the topic/subject) in a text/corpus
Keyword extraction
A component that tries to extract keywords from a given text
Keyword extractor
The process/task of extracting, organising and systematising knowledge usually of a specific domain from external sources so that it can be used in a knowledge-based system
Knowledge acquisition
The task/process of automatically searching large volumes of data for patterns that can be considered knowledge about the data
Knowledge discovery
The task/process of representing information about entities in a form that machines are capable of understanding
Knowledge Representation
The task/process of guessing what natural language a text or text segment is written in.
Language Identification
A component that identifies the language of a given text based on its contents
Language identifier
The construction of statistical or Machine Learning language models
Language modelling
https://www.latex-project.org/about/
Data format for documents using LaTeX (a high-quality typesetting system very popular for scientific documents)
LATEX
The canonical or citation form used for referring to a word and its inflected forms
Lemma
Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. [Wikipedia]
Lemmatization
Lemmatisation
A component that annotates the tokens of a text with lemma information
Lemmatizer
A relation holding between two or more words based on their meanings
Lexical semantic relation
Semantic relation
The task/process of accessing lexical/conceptual resources (either by humans or computer programs)
Lexicon access
The task/process of constructing lexical resources from corpora
Lexicon acquisition from corpora
The task/process of improving (e.g. increasing the size of entries, improving the information, adding new types of information, etc.) a lexicon
Lexicon enhancement
The task/process of constructing lexical resources based on the restructuring of lexical information contained in lexica (e.g. by parsing definitions or using syntactic information attached to lemmas)
Lexicon extraction from lexica
A component that extracts lexical information from corpora in order to produce structured lexical resources
Lexicon extractor from corpora
A component that extracts specific lexical information contained in other lexica
Lexicon extractor from lexica
The task/process of converting the format of a lexical/conceptual resource into another (e.g. from TSV to XML)
Lexicon format conversion
The task/process of merging (combining together) information coming from various lexical/conceptual resources
Lexicon merging
A component that supports humans in accessing the contents of a lexical/conceptual resource
Lexicon viewer
The task/process of viewing the contents of a lexicon as performed by human beings
Lexicon viewing
The task/process of visualizing information (e.g. using diagrams, 3-d images, word clouds etc.) contained in lexical/conceptual resources
Lexicon visualization
Lexicon visualisation
A branch of science (such as biology, medicine, and sometimes anthropology or sociology) that deals with living organisms and life processes; usually used in the plural
Life sciences
Any operation that aims at the analysis of language or its structure
Linguistic analysis
Any kind of annotation pertaining to entities of linguistics; the use of OLIA is recommended
Linguistic entity
Formats used for linked data
Linked data format
https://builds.openminted.eu/job/WP%205.2%20-%20Typesystem%20alignment/eu.openminted.interop$mapping-conversion/doclinks/1/components.html#_lll_1
Format of the LLL challenge
LLL
A word or group of words that denotes a geographical entity
Location
Methods and techniques used either in machine learning or statistical learning
Machine and Statistical Learning Method
adapted from https://www.sas.com/en_us/insights/analytics/machine-learning.html
A method of data analysis that automates model building. Using algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed where to look.
Machine Learning Method
A component that is used in predicting based on machine learning models
Machine Learning predictor
maybe create another class for predictors, analytics
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-de.tudarmstadt.ukp.dkpro.core.mallet-asl
Topic proportions in the shape [\t]\t\t...
Mallet LDA Topic Proportions
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-MalletTopicsProportionsSorted
Topic proportions in the shape [\t]\t\t... sorted
Mallet LDA Topic Proportions Sorted
Marker
Marker
A component that allows matching of elements
Matcher
The task/process of identifying similar elements in two resources
Matching
The main means of mass communication (broadcasting, publishing, and the Internet) regarded collectively [https://en.oxforddictionaries.com/definition/media]
Media
https://www.mediawiki.org/wiki/Help:Formatting
Wiki markup for formatting
MediaWiki markup
Any substance involved in metabolism (= the chemical processes in the body needed for life) [https://dictionary.cambridge.org/dictionary/english/metabolite]
Metabolite
Research method
Method of research
Method of research
The IANA mimetype that can be used for the data format; it can be the exact or a broader mimetype; unofficial mimetypes are also used, but this will be revisited
Mimetype
Model organism/species
Model organism/species
The analysis of the structure of words and their relations to other words as regards their form and derivation
Morphological Analysis
The task/process of adding annotations pertaining to the morphological level of analysis (e.g. gender, number, person etc.)
Morphological annotation
Any type of annotation pertaining to the morphological level
Morphological annotation type
Property of a word that is expressed in its inflected form; examples include person, tense, gender, case etc.
Morphological feature
Grammatical category
Grammatical feature
Morphosyntactic feature
A component that annotates tokens of a text with morphological information (part-of-speech and morphological features)
Morphological tagger
The task/process of adding morphosyntactic tags to words in a text, i.e. part-of-speech and, optionally, morphological features per part-of-speech.
Morphosyntactic tagging
Morphosyntactic annotation
mdb
Data format for Microsoft Access database
MS-Access database
Data format for Microsoft Excel documents
MS-Excel
doc
Data format for Microsoft Word documents
MS-Word
A combination of words that are considered as forming one semantic unit
Multi-word unit
http://wordpress.let.vupr.nl/naf/
https://github.com/newsreader/NAF
The NAF format is a linguistic annotation format designed for complex NLP pipelines. NAF combines strengths of the Linguistic Annotation Framework (LAF) as described in Ide et al. (2003) and the NLP Interchange Format (Hellmann et al. 2013, NIF).
NAF
NLP Annotation Format
A component that seeks to locate and classify elements in a text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, discipline-specific classes, etc
Named entity recognizer
A word or phrase referring to an entity, identified and annotated as such with a name (label); examples include organizations, persons, places etc.
Named entity
A subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Named Entity Recognition
can also be annotation
http://www.coli.uni-saarland.de/~thorsten/publications/Brants-CLAUS98.pdf
Export format for annotated corpora in the NeGra project
NeGra export
A nerve cell that carries information between the brain and other parts of the body
Neuron
Any kind of annotation pertaining to entities of neuroscience
Neuroscience entity
http://persistence.uni-leipzig.org/nlp2rdf/
The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations; it consists of specifications, ontologies and software, which are combined under the version identifier "NIF 2.0", but are versioned individually
NIF
NLP Interchange Format
The task/process of editing a text in order to remove unwanted material (e.g. quotation marks, hyphenations etc.) or to substitute/represent specific items (tokens, dates, etc.) with normalized values
Normalization
A component that removes unwanted material from text (e.g. quotation marks, hyphenations etc.) or performs edits so that specific items (tokens, dates, etc.) are substituted/represented with normalized values
Normalizer
http://owlcollab.github.io/oboformat/doc/obo-syntax.html
Serialization format for ontologies according to the Open Biomedical Ontologies model.
OBO
Official text
Official text
The task/process of creating an ontology based on other resources (corpora, other lexical resources, etc.)
Ontology acquisition
The task/process of improving an ontology, typically by adding new relations or entities
Ontology enhancement
The action that a software program performs or is meant to perform
Operation
Function
Task
An individual animal, plant, or single-celled life form [https://en.oxforddictionaries.com/definition/organism]
Organism
A word or group of words that denotes an organization, such as a company, association, institution etc.
Organization
Superclass for formats used for OWL
OWL
XML format for OWL ontologies
OWL/XML
A division of a text, usually about a single theme, consisting of one or more sentences and marked by a new line, indentation or other conventions.
Paragraph
The task/process of segmenting a text into paragraphs and marking their boundaries
Paragraph splitting
Paragraph segmentation
A task/process whereby a text fragment is reproduced with another text fragment that conveys the same or similar information
Paraphrasing
A component that takes as input text and returns a form of data structure (e.g. syntactic parse as a tree, or bracketed structure etc.)
Parser
Syntactic analyzer
The task/process of recognizing and marking the syntactic structure of a text or text segment
Parsing
Syntactic analysis
Syntactic annotation
Syntactic parsing
A division of words based on common grammatical features
Part of Speech
Grammatical category
Morphosyntactic category
Word category
A component that annotates tokens of a text with part-of-speech information
Part of speech tagger
PoS tagger
pdf
Data format for PDF files (Portable Document Format)
PDF
A word or group of words that refers to a person
Person
A word or phrase used for persuasion purposes
Persuasive expression
A component that tries to identify persuasive expressions in a given text
Persuasive expression miner
could also be an annotator
The task/process of identifying and extracting (especially from political speech texts) pieces of text that aim to persuade
Persuasive expression mining
The physical appearance or biochemical characteristic of an organism as a result of the interaction of its genotype and the environment [http://www.biology-online.org/dictionary/Phenotype]
Phenotype
A phrase is a syntactic structure that consists of more than one word but lacks the subject-predicate organization of a clause. [http://www.glossary.sil.org/term/phrase]
Phrase
Physical and chemical property of substances
Physico-chemical property
https://www.iana.org/assignments/media-types/application/pls+xml
Data format according to the Pronunciation Lexicon Specification (PLS)
PLS
http://ufal.mff.cuni.cz/jazz/PML/index_en.html
https://builds.openminted.eu/job/WP%205.2%20-%20Typesystem%20alignment/eu.openminted.interop$mapping-conversion/doclinks/1/components.html#_prague_markup_language_1
Format according to the Prague Markup Language (http://ufal.mff.cuni.cz/jazz/PML/index_en.html); PML is a generic data format based on XML intended for storing linguistically annotated data, such as the Prague Dependency Treebank, also annotation lexicons, etc.
PML
Prague Markup Language
A feature that distinguishes between positive, negative or neutral; in sentiment analysis, it refers to determining whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral. [adapted from Wikipedia]
Polarity
The task/process of marking words with the part of speech (word category, e.g. noun, verb etc.) to which they belong
PoS Tagging
Grammatical annotation
Grammatical tagging
https://www.iana.org/assignments/media-types/application/postscript
ps
Data format for PostScript files
postscript
A component that is used at pre- or post-processing stages in order to normalize input/output
Pre- or Post-Processor
In Machine Learning, it refers to the use of algorithms that learn from previous data in order to make predictions on data (by estimating probabilities from previous data)
Prediction
needs better definition
A component that is used in processing operations
Processor
Any of various naturally occurring extremely complex substances that consist of amino-acid residues joined by peptide bonds, contain the elements carbon, hydrogen, nitrogen, oxygen, usually sulfur, and occasionally other elements (such as phosphorus or iron), and include many essential biological compounds (such as enzymes, hormones, or antibodies) [https://www.merriam-webster.com/dictionary/protein]
Protein
A protein family is a group of proteins that share a common evolutionary origin, reflected by their related functions and similarities in sequence or structure [https://www.ebi.ac.uk/training/online/course/introduction-protein-classification-ebi/protein-classification/what-are-protein-families]
Protein family
Penn Treebank formats
PTB
https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20Core%20Documentation%20(GitHub)/de.tudarmstadt.ukp.dkpro.core$de.tudarmstadt.ukp.dkpro.core.doc-asl/doclinks/5/format-reference.html#format-PennTreebankChunked
Penn Treebank chunked format
PTB-chunked
Penn Treebank - chunked
ptb; format-variant=chunked
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-PennTreebankCombined
Penn Treebank combined format
PTB-combined
Penn Treebank - combined
ptb; format-variant=combined
https://gate.ac.uk/sale/tao/splitch23.html#sec:creole:pubmed
Textual format used for PubMed articles
PubMed
The task/process where computer systems try to automatically answer questions posed by users in the form of natural language.
Question Answering
The segment of a question that describes the entity about which the question is made
Question Topical Target
QTT
Question topic
https://www.w3.org/TR/REC-rdf-syntax/
Formats for RDF (Resource Description Framework) resources
RDF formats
https://www.w3.org/TR/REC-rdf-syntax/
Data format for RDF (Resource Description Framework) resources serialised as XML; RDF/XML is a serialisation of RDF
RDF/XML
The ease with which a reader can understand a written text. [https://en.wikipedia.org/wiki/Readability]
Readability
A component that annotates the tokens of a text with readability scores
Readability annotator
A component that reads content of various types (pdf, txt, xml etc.)
Reader
Getting access to the contents of an input resource
Reading
A task/process that intends to recognize for two text fragments whether the meaning of one text is entailed in that of the other, i.e. whether the truth of one text fragment follows from that of the other fragment.
Recognizing Textual Entailment
Wikipedia (https://en.wikipedia.org/wiki/Regression_analysis)
A statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors').
Regression Analysis
adapted from Wikipedia (https://en.wikipedia.org/wiki/Regularization_(mathematics))
A process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting.
Regularisation
Regularization
Any type of relation that holds between two or more entities of a specific domain
Relation
The process/task of identifying and classifying relation mentions between entities in text and/or data.
Relation extraction
can also be annotation
Any operation that enables accessing a resource
Resource access
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-Reuters21578Sgml
Reuters-21578 corpus in SGML format
Reuters21578 SGML
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-Reuters21578Txt
Reuters-21578 corpus transformed into text format using ExtractReuters in the lucene-benchmarks project
Reuters21578 Txt
Ribonucleic acid
Any of various nucleic acids that contain ribose and uracil as structural components and are associated with the control of cellular chemical activities
RNA
rtf
Rich Text Format; proprietary data format of Microsoft
RTF
A type of method that makes use of set(s) of rules to perform the relevant task.
Rule-based Method
Scholarly communication can be defined as “the system through which research and other scholarly writings are created, evaluated for quality, disseminated to the scholarly community, and preserved for future use. The system includes both formal means of communication, such as publication in peer-reviewed journals, and informal channels, such as electronic listservs.” (Association of College & Research Libraries, “Principles and Strategies for the Reform of Scholarly Communication 1,” 2003)
Scholarly communication
Any type of annotation that is relevant to scholarly analytics (e.g. citations, funding information etc.)
Scholarly analytics entity
Scientific unit
Scientific unit
Scientific value
Scientific value
A component that performs analysis tasks based on a script
Script-based analyser
The task/process of analysing a resource following a script
Script-based analysis
A component that segments a text into structural units (chapters, paragraphs, sentences, words, tokens etc.)
Segmenter
can also be a support tool
Any type of annotation pertaining to the semantic level
Semantic annotation type
A component that annotates the tokens of a text with semantic features
Semantic annotator
A division of words into classes based on their common semantic features
Semantic class
Semantic type
A schematic representation of a situation involving various participants, props and other conceptual roles, each of which is a frame element
Semantic frame
Frame
A semantic role is the underlying relationship that a participant has with the main verb in a clause [http://www.glossary.sil.org/term/semantic-role]
Semantic role
A type of search that seeks to improve search accuracy by understanding the searcher's intent and the contextual meaning of terms as they appear in the searchable dataspace, whether on the Web or within a closed system, to generate more relevant results.
Semantic search
A group of words, usually containing a verb, that expresses a thought in the form of a statement, question, instruction, or exclamation and starts with a capital letter when written [https://dictionary.cambridge.org/dictionary/english/sentence]
Sentence
A component that splits a text into sentences
Sentence splitter
The task/process of recognizing and tagging sentence boundaries in a text
Sentence splitting
Segmentation into sentences
The affective state (judgement, feeling) of a person or group towards an entity or event
Sentiment
Opinion
The process/task of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral
Sentiment analysis
Opinion extraction
Opinion mining
A component that tries to identify sentences that express the author’s negative or positive feelings on something
Sentiment analyzer
Opinion mining tool
could also be an annotator
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-SerializedCas
The CAS is the native data model used by UIMA; there are various ways of saving CAS data, using XMI, XCAS, or binary formats; this is for the serialized format
Serialized CAS
SGML format
SGML
A component that outputs a simpler rendition of a given item (sentence, text etc.)
Simplifier
The social sciences are academic disciplines concerned with the study of the social life of human groups and individuals including anthropology, economics, geography, history, political science, psychology, social studies, and sociology. The social sciences consist of the scientific study of the human aspects of the world. [https://en.wikipedia.org/wiki/Category:Social_sciences]
Social sciences
Any kind of annotation that pertains to entities of social sciences; the use of TheSoz is recommended
Social sciences entity
A technology that supports the development of software components and data resources required for their operation
Software development environment
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-de.tudarmstadt.ukp.dkpro.core.io.solr-asl
Solr format
Solr
A set of animals or plants in which the members have similar characteristics to each other and can breed with each other
Species
Spectral data is essentially data derived by the use of spectroscopic instruments
Spectral data
A speech act is an act that a speaker performs when making an utterance, including the following: (a) A general act (illocutionary act) that a speaker performs, analyzable as including: the uttering of words (utterance acts), making reference and predicating (propositional acts), and a particular intention in making the utterance (illocutionary force). (b) An act involved in the illocutionary act, including utterance acts and propositional acts, (c) The production of a particular effect in the addressee (perlocutionary act) [http://www.glossary.sil.org/term/speech-act]
Speech act
A component that corrects spelling mistakes in a text
Spelling checker
"R: Statistical learning primer", https://www.pressreader.com/australia/linux-format/20161025/283626759591113
The process of using statistics and techniques related to statistics in order to understand and learn from your data so that you can predict its future.
Statistical Learning Method
A stem is the root or roots of a word, together with any derivational affixes, to which inflectional affixes are added. [http://www.glossary.sil.org/term/stem]
Stem
A component that extracts stems from words in a text, usually by removing the most common morphological and inflectional endings from words
Stemmer
The task/process of cutting off the ends of words (mainly inflectional affixes but sometimes also derivational affixes) aiming to relate words to a base form.
Stemming
The task/process of segmenting a text and recognizing textual structural units (paragraphs, sentences, words etc.)
Structural annotation
Segmentation
Any type of annotation that pertains to the structure of a document
Structural annotation type
The number and types of syntactic arguments required by a certain lexical item (mainly verbs, but also nouns and adjectives)
Subcategorization frame
Argument structure
Subcategorisation frame
The linguistic expression of somebody’s opinions, sentiments, emotions, evaluations, beliefs, speculations (private states, i.e. states that are not open to objective observation or verification). [http://www.mavir.net/docs/JWiebe-Subjectivity-nov2010.pdf]
Subjectivity
The process/task of reducing one or more textual documents with a computer program in order to create a summary that retains the most important points of the original document(s).
Summarization
Summarisation
A component that produces a natural language synopsis of a longer text
Summarizer
A component that provides support to developers
Support component
Helper
Any operation that can support tasks that are accomplished through crowdsourcing
Support of crowdsourcing tasks
Collection of data, their transformation and organization into crowdsourcing units; automatic generation of reusable crowdsourcing interfaces for specific tasks (e.g. annotation)
Any operation that is used to support TDM tasks, either for creating TDM workflows or for executing them
Support operation
Any operation that intends to support the creation, curation or use of a knowledge resource
Support operation for knowledge resources
A specialized structure or junction that allows cell to cell communication [http://www.biology-online.org/dictionary/Synapse]
Synapse
Any type of annotation that pertains to the syntactic level
Syntactic annotation type
A link between the syntactic unit and the semantic unit (sense) of a word
Syntactico-semantic link
Any format based on columns
Tabular format
https://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/The_TCF_Format
An XML data exchange format developed within the WebLicht architecture to facilitate efficient interoperability between the tools; it allows the various linguistic annotations produced by the tools within WebLicht to be stored in one document; it supports incremental enrichment of linguistic annotations at various levels of analysis in a stand-off XML‐based format
TCF
Text Corpus Format
The method used by a TDM algorithm
TDM Method
http://www.tei-c.org/index.xml
Data format for TEI-encoded (Text Encoding Initiative) texts
TEI
Text Encoding Initiative
A linguistic expression (word, group of words, group of numbers etc.) that denotes time (a point in time, duration, frequency)
Temporal expression
The task/process of identifying temporal expressions (also called timex) in a text in order to extract temporal information
Temporal expression recognition
A term is a designation consisting of one or more words representing a general concept in a special language in a specific subject field [ISO 704:2009]
Term
The act/process of identifying and extracting candidate terms from a domain-specific corpus
Term extraction
Terminology extraction
A component that tries to extract terms from a corpus
Term extractor
Terminology extractor
https://en.wikipedia.org/wiki/TeX
Data format for documents using TeX (a typesetting system)
TEX
txt
Default value for the format of textual files; a textual file should be human-readable and must not contain binary data
Text
The process/task of converting unstructured text and data into high-quality structured data that can be further analysed to extract knowledge, support decision making etc.
Text and Data Analytics
The automated processing of unstructured text and/or structured data leading to the extraction of previously hidden knowledge.
Text and Data Mining
The process/task of adding annotations (notes or comments) to a text; in TDM, the annotations refer mainly to the interpretative linguistic information grounded in a knowledge resource that is added manually or automatically to a text
Text annotation
Linguistic annotation
The task/process of assigning documents into classes or categories
Text categorization
Document categorisation
Document categorization
Document classification
Text categorisation
Text classification
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-de.tudarmstadt.ukp.dkpro.core.io.tgrep-gpl
Format for TGrep2 (search engine for searching syntactic parse trees represented as bracketed structures)
TGrep2
Theoretical frame
Theoretical frame
http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/TIGERSearch/doc/html/TigerXML.html
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-de.tudarmstadt.ukp.dkpro.core.io.tiger-asl
The TIGER XML format was created for encoding syntactic constituency structures in the German TIGER corpus. It has since been used for many other corpora as well. TIGERSearch is a linguistic search engine specifically targeting this format. The format was later extended to also support semantic frame annotations.
Tiger XML
http://www.ttt.org/oscarstandards/tmx/tmx14-20020710.htm
The purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools and/or translation vendors, while introducing little or no loss of critical data during the process.
TMX
Translation Memory Exchange
A set of characters surrounded by spaces or punctuation marks, as well as punctuation marks themselves
Token
The task/process of recognizing and tagging tokens (words, punctuation marks, digits etc.) in a text
Tokenization
A component that recognizes and tags tokens (words, punctuation marks, digits etc.) in a text
Tokenizer
The subject of a text or conversation, what it is about
Topic
The task/process of identifying the topic of a text or dataset (e.g. by clustering keywords or using topic models)
Topic detection
Topic extraction
A component that guesses the topic of a text
Topic extractor
A component that is used in training models for machine learning
Trainer of Machine Learning models
ML models trainer
ML trainer
Adapted from http://homepages.inf.ed.ac.uk/lzhang10/slm.html
The task/process of training (statistical) language models that can estimate the distribution of natural language as accurately as possible.
Training of language models
adapted from http://docs.aws.amazon.com/machine-learning/latest/dg/training-ml-models.html
The task/process of creating Machine Learning (ML) models by providing an ML algorithm with training data that help the algorithm discover patterns in the data and construct the appropriate models using these discoveries
Training of Machine Learning models
Format for files with tab-separated values
TSV
Tab-separated values
https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20Core%20Documentation%20(GitHub)/de.tudarmstadt.ukp.dkpro.core$de.tudarmstadt.ukp.dkpro.core.doc-asl/doclinks/5/format-reference.html#format-Tuepp
Format of the Tübingen Partially Parsed Corpus of Written German (TüPP-D/Z) XML files; TüPP-D/Z (http://www.sfs.uni-tuebingen.de/de/ascl/ressourcen/corpora/tuepp-dz.html) is a collection of articles from the German newspaper taz (die tageszeitung) annotated and encoded in an XML format.
Tuepp
https://www.w3.org/TR/turtle/
Textual syntax for RDF that allows an RDF graph to be completely written in a compact and natural text form, with abbreviations for common usage patterns and datatypes.
Turtle
Formats used for the UIMA CAS (Common Analysis System) objects
UIMA CAS format
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-de.tudarmstadt.ukp.dkpro.core.io.json-asl
UIMA serialisation in JSON
UIMA/JSON
The task/process of confirming that a system/data resource meets the specifications and fulfills its intended purpose
Validation
A component used to confirm that a system/resource meets the specifications and fulfills its intended purpose
Validator
Variables detection component
A component that tries to identify variables (in social sciences) in a text
Variables detector
A component that supports humans in accessing the contents of a resource
Viewer
The task/process of viewing the contents of a resource as performed by human beings
Viewing
A component or interface that renders the contents of a resource in a graphic way for visualisation purposes
Visualiser
The representation of an object, situation, or set of information as a chart, diagram or any other image that helps end users understand the contents or message
Visualization
Visualisation
https://catalog.ldc.upenn.edu/LDC2006T13
https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20Core%20Documentation%20(GitHub)/de.tudarmstadt.ukp.dkpro.core$de.tudarmstadt.ukp.dkpro.core.doc-asl/doclinks/5/format-reference.html#format-Web1T
File format used by the Web1T n-gram corpus, a huge collection of n-grams collected from the internet.
Web1T
https://www.w3.org/TR/annotation-model/
A structured model and format to enable annotations to be shared and reused across different hardware and software platforms.
Web annotation format
Wheat-related species
Wheat-related species
Superclass for wiki formats
Wiki formats
https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20Core%20Documentation%20(GitHub)/de.tudarmstadt.ukp.dkpro.core$de.tudarmstadt.ukp.dkpro.core.doc-asl/doclinks/5/format-reference.html#format-WikipediaArticle
Format for Wikipedia articles
Wikipedia article
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-WikipediaArticleInfo
Format for general information about Wikipedia articles
Wikipedia article info
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-WikipediaDiscussion
Format for Wikipedia discussion pages
Wikipedia discussion
Formats used for Wikipedia
Wikipedia format
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-WikipediaLink
Format for Wikipedia links
Wikipedia link
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-WikipediaPage
Format of Wikipedia pages in the database (articles, discussions, etc.)
Wikipedia page
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-WikipediaQuery
Format for Wikipedia article pages that match a query built from the reader's parameters
Wikipedia query
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-WikipediaRevision
Format for Wikipedia revision pages
Wikipedia revision
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-WikipediaRevisionPair
Format for pairs of adjacent revisions of Wikipedia articles
Wikipedia revision pair
https://dkpro.github.io/dkpro-core/releases/1.8.0/docs/format-reference.html#format-WikipediaTemplateFilteredArticle
Format for Wikipedia pages that contain or do not contain the templates specified in the template whitelist and template blacklist
Wikipedia template filtered article
A word is a unit which is a constituent at the phrase level and above. It is sometimes identifiable according to such criteria as (a) being the minimal possible unit in a reply, (b) having features such as a regular stress pattern, and phonological changes conditioned by or blocked at word boundaries, (c) being the largest unit resistant to insertion of new constituents within its boundaries, or (d) being the smallest constituent that can be moved within a sentence without making the sentence ungrammatical. A word is sometimes placed, in a hierarchy of grammatical constituents, above the morpheme level and below the phrase level. [http://www.glossary.sil.org/term/word]
In annotation, words are often used as equivalent to tokens; thus, for instance, punctuation marks (traditionally not considered as words) will also be annotated as "word".
Word
The task/process of segmenting (cutting) a word into root and affixes
Word segmentation
Corresponds to the structural part of a lexical entry that contains the relevant semantic, grammatical, and anthropological information for a lexical unit. [adapted from http://www.glossary.sil.org/term/sense]
Word sense
The process/task of identifying which sense of a word with multiple meanings is used in a particular context; the selection of the sense is made from a list of the word's senses.
Word Sense Disambiguation
WSD
A component that tries to identify which sense of a word (i.e. meaning) is used in a sentence, when the word has multiple meanings
Word sense disambiguator
Could also be classified as an annotator and as a support component (used in many processes)
A component that writes processing results in various formats
Writer
The task/process of producing the output results of a process/workflow in various formats
Writing
Data format for documents and corpora using the XCES standard (Corpus Encoding Standard for XML), cf. http://www.xces.org/
XCES
A variant of XCES implemented for documents
XCES ILSP variant
xces; format-variant=ilsp
https://www.w3.org/TR/xhtml1/
html
Data format for XHTML (Extensible HyperText Markup Language)
XHTML
https://www.iana.org/assignments/media-types/application/vnd.xmi+xml
xmi
Data format for the XML Metadata Interchange (XMI), which is an Object Management Group (OMG) standard for exchanging metadata information via Extensible Markup Language (XML)
XMI
Superclass for grouping together XML formats
XML
http://bioc.sourceforge.net/
BioC is a simple format to share text data and annotations.
XML BioC
https://www.w3.org/TR/xpath/
https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20Core%20Documentation%20(GitHub)/de.tudarmstadt.ukp.dkpro.core$de.tudarmstadt.ukp.dkpro.core.doc-asl/doclinks/5/format-reference.html#format-XmlXPath
XPath is a language for addressing parts of an XML document, designed to be used by both XSLT and XPointer.
XPath
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.morph.MorphologicalFeatures
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.api.ner.type.Date
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.api.ner.type.Event
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.api.ner.type.Location
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.api.ner.type.Organization
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.api.ner.type.Person
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Compound
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Stem
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.api.semantics.type.SemanticArgument
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.api.semantics.type.SemanticField
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.api.semantics.type.SemanticPredicate
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.api.semantics.type.WordSense
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.api.syntax.type.chunk.Chunk
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.api.syntax.type.constituent.Constituent
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.mallet.type.TopicDistribution
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.sentiment.type.StanfordSentimentAnnotation
http://dkpro.github.io/dkpro-core/releases/1.8.0/docs/typesystem-reference.html#de.tudarmstadt.ukp.dkpro.core.type.ReadabilityScore
application/emma+xml
application/json
application/ld+json
application/ld+json;profile="http://www.w3.org/ns/anno.jsonld"
application/msword
application/pdf
application/pls+xml
application/postscript
application/rdf+xml
application/rtf
application/tei+xml
application/vnd.ms-excel
application/vnd.msaccess
application/vnd.xmi+xml
application/x-latex
application/x-msaccess
application/x-tex
application/x-tmx+xml
application/x-xces+xml
application/x-xml+alvis
application/x.kaf+json
application/x.org.dkpro.graf+xml
application/x.org.dkpro.negra3
application/x.org.dkpro.negra4
application/x.org.dkpro.reuters21578+sgml
application/x.org.dkpro.tgrep2
application/x.org.dkpro.tiger+xml
application/x.org.dkpro.tuepp+xml
application/x.org.dkpro.uima+binary
application/x.org.dkpro.uima+json
application/xhtml+xml
application/xml
text/csv
text/html
text/sgml
text/tab-separated-values
text/tcf+xml
text/turtle
text/x-cochrane
text/x-pubmed
text/x.org.dkpro.conll-2000
text/x.org.dkpro.conll-2002
text/x.org.dkpro.conll-2003
text/x.org.dkpro.conll-2006
text/x.org.dkpro.conll-2008
text/x.org.dkpro.conll-2009
text/x.org.dkpro.conll-2012
text/x.org.dkpro.conll-u
text/x.org.dkpro.imscwb
text/x.org.dkpro.ngram
text/x.org.dkpro.ptb-chunked
text/x.org.dkpro.ptb-combined
text/x.org.dkpro.reuters21578
text/xml