OpenMinTeD

A processing resource that takes document and corpus parameters

GATE

Majority-vote consensus builder (annotation)

Example application for the linguistic simplifier

GATE

Lucene IR Engine

No description

GATE

Lupedia Service PR

Runs a Lupedia annotation service on a GATE document

GATE

Process results of a crowd annotation task to find where annotators agree and disagree.

GATE

MergeLayers

Creates a new layer in each section containing all annotations in source layers.

AlvisNLP

MergeSections

Merge several sections into a single one.

AlvisNLP

MetaMap Annotator

This plugin uses the MetaMap Java API to send GATE document content to MetaMap skrmedpostctl server and PrologBeans mmserver instances running on the given machine/port

GATE

MetaphonePhoneticTranscriptor

Metaphone phonetic transcription based on Apache Commons Codec.

DKPro Core (UIMA)

MutationFinder

GATE MutationFinder Wrapper

GATE

NGramAnnotator

N-gram annotator.

DKPro Core (UIMA)

NGrams

Computes annotation n-grams.

AlvisNLP

NeMine

No description

NaCTeM (UIMA)

NewCount

Counts element occurrences and writes the results to a file, including tf-idf.

AlvisNLP

OBOMapper

synopsis

AlvisNLP

OBOProjector

Projects OBO terms and synonyms on sections.

AlvisNLP

OWLIM Ontology

Ontology created as a temporary OWLIM3 in-memory repository

GATE

OWLIM Ontology DEPRECATED

Ontology created as a temporary OWLIM3 in-memory repository, for backwards compatibility only

GATE

OntoReif

synopsis

AlvisNLP

OpenNLPNEDetector

Detects named entities in text and creates corresponding entity annotations that span the found entities.

NaCTeM (UIMA)

OpenNLPSentenceDetector

Detect sentence boundaries and create sentence annotations that span these boundaries.

NaCTeM (UIMA)

OrthoRef

An orthographic coreferencer

GATE

OscarMER

Runs Oscar 3 with maximum entropy based recogniser with syntactic tokens as input

NaCTeM (UIMA)

PMI Bank

Pointwise Mutual Information from corpora

GATE

PMI Example (English)

Example application for the PMI (pointwise mutual information) tool

GATE

PatternMatcher

Matches a regular expression-like pattern on the sequence of annotations in a given layer.

AlvisNLP

ProminentConceptReporter

synopsis

AlvisNLP

Quality Assurance PR

The Quality Assurance PR provides the functionality of the Corpus QA Tool in GATE Developer

GATE

QuickHTML

synopsis

AlvisNLP

RO_FDGBank

This reader performs the transformation of the CONLL tab separated text format to the CAS ConllDependency format.

NaCTeM (UIMA)

Reference Evaluator

Reports annotation performance comparing views (sofas) to one selected reference view.

NaCTeM (UIMA)

RegExp

Matches a regular expression against section contents and creates an annotation for each match.

AlvisNLP

Regex Annotator

Annotates spans of text based on a custom regular expression.

NaCTeM (UIMA)

RemoveContents

synopsis

AlvisNLP

RemoveEquivalent

Removes duplicate elements.

AlvisNLP

RemoveOverlaps

Removes overlapping annotations from a given layer.

AlvisNLP

Romanian Transducer

A module for executing Jape grammars

GATE

SFTP BioNLP Shared Task Data Provider

Reads a corpus in BioNLP Shared Task format from a remote directory on a user-specified server via SFTP.

NaCTeM (UIMA)

SQLImport

synopsis

AlvisNLP

SeSMig

Detects sentence boundaries and creates one annotation for each sentence. This module assumes WoSMig processed the same sections.

AlvisNLP

Search Results

Viewer for IR search results

GATE

SearchPR

Provides IR functionality.

GATE

Sequence_Impl

Sequence of modules.

AlvisNLP

Show/Hide Resources

Show resources that would otherwise be hidden, e.g. resources created for internal use by other resources

GATE

SimpleProjector

Projects a simple dictionary on sections.

AlvisNLP

SimpleProjector2

Projects a simple dictionary on sections.

AlvisNLP

SoundexPhoneticTranscriptor

Soundex phonetic transcription based on Apache Commons Codec.

DKPro Core (UIMA)

Species

Calls the Species taxon tagger.

AlvisNLP

SplitOverlaps

Splits overlapping annotations.

AlvisNLP

TermRaider English Term Extraction

Example application showing typical set-up for the TermRaider tools

GATE

Termbank Score Copier

Copy scores from Termbanks back to their source annotations

GATE

TextRazor Service PR

Runs the TextRazor annotation service (http://textrazor.com) on a GATE document

GATE

TfIdfTermbank

TermRaider Termbank derived from vectors in document features

GATE

TfidfAnnotator

This component adds Tfidf annotations consisting of a term and a tfidf weight.

DKPro Core (UIMA)

TomapProjector

synopsis

AlvisNLP

TomapTrain

synopsis

AlvisNLP

TyDIProjector

Projects terms from a TyDI export.

AlvisNLP

Type Mapper

No description

NaCTeM (UIMA)

UAICDiacriticsDescriptor

No description

NaCTeM (UIMA)

UAICLemmav1

Assigns base forms to tokenised text.

NaCTeM (UIMA)

UAICLemmav2

Assigns base forms in Romanian text, given POS-tagged text.

NaCTeM (UIMA)

UAICSegV1

Splits texts into fragments

NaCTeM (UIMA)

UMLS Full Dictionary Feature Extractor

Extracts Dictionary features from a UMLS-sourced dictionary

NaCTeM (UIMA)

WapitiLabel

synopsis

AlvisNLP

WapitiTrain

synopsis

AlvisNLP

WoSMig

Performs word segmentation on section contents.

AlvisNLP

WordNet

GATE

WordNet 1.6

Princeton WordNet 1.6.

GATE

YateaProjector

synopsis

AlvisNLP

Zemanta Service PR

Runs a Zemanta annotation service on a GATE document

GATE

Chunker (7)

Component	Description	Framework
ANNIE VP Chunker	ANNIE VP Chunker component.	GATE
ILSP Chunker	No description	ILSP (UIMA)
Noun Phrase Chunker	Ready-made NP chunking application	GATE
Noun Phrase Chunker	Implementation of the Ramshaw and Marcus base noun phrase chunker	GATE
OpenNLP Chunker	Chunker using an OpenNLP maxent model	GATE
OpenNlpChunker	Chunk annotator using OpenNLP.	DKPro Core (UIMA)
TreeTaggerChunker	Chunk annotator using TreeTagger.	DKPro Core (UIMA)

Classifier (8)

Component	Description	Framework
Entity Classification Job Builder	Build a CrowdFlower job asking users to select the right label for entities	GATE
Entity Classification Results Importer	Import judgments from a CrowdFlower job created by the Entity Classification Job Builder as GATE annotations.	GATE
Majority-vote consensus builder (classification)	Process results of a crowd annotation task to find where annotators agree and disagree.	GATE
SelectingElementClassifier	Searches for discriminating attributes with Weka.	AlvisNLP
TaggingElementClassifier	Classifies elements with a Weka classifier.	AlvisNLP
Text Categorization PR	Classify text based on a semantic space	GATE
Textalytics Text Classification	Textalytics Text Classification	GATE
TrainingElementClassifier	Trains a Weka classifier where examples are elements.	AlvisNLP

Coreference (3)

Component	Description	Framework
ANNIE Nominal Coreferencer	Nominal Coreference resolution component	GATE
ANNIE Pronominal Coreferencer	Pronominal Coreference resolution component.	GATE
StanfordCoreferenceResolver	No description	DKPro Core (UIMA)

CrowdSourcing (1)

Component	Description	Framework
Entity Annotation Job Builder	Build a CrowdFlower job asking users to annotate entities within a snippet of text	GATE

Developers/Debugging (9)

Component	Description	Framework
DependencyDumper	Dump dependencies to screen.	DKPro Core (UIMA)
DocumentMetaDataStripper	Removes fields from the document meta data which may be different depending on the machine a test is run on.	DKPro Core (UIMA)
EDT Monitor	Warns whenever an AWT component is updated from anywhere other than the event dispatch thread	GATE
JCasHolder	Utility analysis engine for use with CAS multipliers in uimaFIT pipelines.	DKPro Core (UIMA)
Java Heap Dumper	Dumps the Java heap to the specified file	GATE
Log4J Level: ALL	Allows the Log4J log level to be set to ALL from within the GUI	GATE
Stopwatch	Can be used to measure how long the processing between two points in a pipeline takes.	DKPro Core (UIMA)
TagsetDescriptionStripper	No description	DKPro Core (UIMA)
Unload Unused Plugins	Unloads all plugins for which we cannot find any loaded instances	GATE

Evaluation (2)

Component	Description	Framework
CompareElements	Compares two sets of elements.	AlvisNLP
IAA Computation PR	Compute inter-annotator agreement (IAA).	GATE

Filtering (6)

Component	Description	Framework
AnnotationByLengthFilter	Removes annotations that do not conform to minimum or maximum length constraints.	DKPro Core (UIMA)
AnnotationByTextFilter	Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words.	DKPro Core (UIMA)
Boilerpipe Content Detection	Uses boilerpipe to determine which sections of a document are interesting content and which are just boilerplate	GATE
PosFilter	Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech.	DKPro Core (UIMA)
RegexTokenFilter	Remove every token that does or does not match a given regular expression.	DKPro Core (UIMA)
StopWordRemover	Remove all of the specified types from the CAS if their covered text is in the stop word dictionary.	DKPro Core (UIMA)

Flow (8)

Component	Description	Framework
Annotation Merging PR	Merge Annotations from different annotators.	GATE
Annotation Set Transfer	Annotation set transfer component.	GATE
Combine Members PR	Combines documents in a composite document.	GATE
Delete Member PR	Deletes one member document from a compound doc.	GATE
Document Reset PR	Remove named annotation sets or reset the default annotation set	GATE
Scriptable Controller	A controller whose execution strategy is controlled by a Groovy script	GATE
Segment Processing PR	Processes individual segments as separate documents	GATE
Switch Member PR	Sets the focus of a compound document to a specified member document.	GATE

Gazetteer (16)

Component	Description	Framework
ANNIE Gazetteer	A list lookup component.	GATE
Arabic Gazetteer	A list lookup component.	GATE
Arabic Infered Gazetteer	A list lookup component.	GATE
Cebuano Gazetteer	A list lookup component.	GATE
DictionaryAnnotator	Takes a plain text file with phrases as input and annotates the phrases in the CAS file.	DKPro Core (UIMA)
Flexible Gazetteer	A more flexible list lookup component.	GATE
Hash Gazetteer	A list lookup component implemented by OntoText Lab.	GATE
Hindi Gazetteer	A list lookup component.	GATE
Hindi Tokeniser Gazetteer	A list lookup component.	GATE
Inflectional gazetteer	Gazetteer with support for inflectional morphology	GATE
Large KB Gazetteer	KIM KB based alias-lookup component	GATE
Onto Root Gazetteer	An ontology lookup component	GATE
OntoGazetteer	A list lookup component based on mapping between ontology classes and gazetteer lists.	GATE
Romanian Gazetteer	A list lookup component.	GATE
Russian Gazetteer	Customised version of the hash gazetteer	GATE
Sharable Gazetteer	A list lookup component.	GATE

Irrelevant (1)

Component	Description	Framework
The Duplicator	Duplicate any resource with a right click menu option	GATE

Keywords/Terms (3)

Component	Description	Framework
KEA Keyphrase Extractor	A Keyphrase Extractor by Eibe Frank.	GATE
KeywordsSelector	Selects most relevant keywords in documents.	AlvisNLP
YateaExtractor	Extract terms from the corpus using the YaTeA term extractor.	AlvisNLP

Language Identifier (7)

Component	Description	Framework
LangDetectLanguageIdentifier	Langdetect language identifier based on character n-grams.	DKPro Core (UIMA)
LanguageDetectorWeb1T	Language detector based on n-gram frequency counts, e.g. as provided by Web1T	DKPro Core (UIMA)
LanguageIdentifier	Detection based on character n-grams.	DKPro Core (UIMA)
LingPipe Language Identifier PR	GATE PR for language identification using LingPipe	GATE
TextCat Fingerprint Generator	Generate language fingerprints for use with the TextCat Language Identification PR	GATE
TextCat Language Identification	Recognizes the document language using TextCat	GATE
Textalytics Language Identification	Textalytics Language Identification	GATE

Lemmatizer (7)

Component	Description	Framework
ClearNlpLemmatizer	Lemmatizer using Clear NLP.	DKPro Core (UIMA)
GateLemmatizer	Wrapper for the GATE rule based lemmatizer.	DKPro Core (UIMA)
ILSP Lemmatizer	ILSP Lemmatizer assigns lemmas to tokens from Greek texts.	ILSP (UIMA)
LanguageToolLemmatizer	Naive lexicon-based lemmatizer.	DKPro Core (UIMA)
MateLemmatizer	DKPro Annotator for the MateToolsLemmatizer.	DKPro Core (UIMA)
MorphaLemmatizer	Lemmatize based on a finite-state machine.	DKPro Core (UIMA)
StanfordLemmatizer	Stanford Lemmatizer component.	DKPro Core (UIMA)

Machine Learning (2)

Component	Description	Framework
Batch Learning PR	Supports training, application and evaluation of machine learning models for NLP tasks	GATE
Machine Learning PR	Trains a machine learning algorithm from a corpus.	GATE

MorphTagger (3)

Component	Description	Framework
GATE Morphological analyser	Morphological Analyzer for the English Language.	GATE
RASP2 Morphological Analyser	RASP morphological analyser, which adds lemma and suffix to the WordForm annotations produced by the RASP POS tagger (or the ANNIE POS tagger plus the RASP converter)	GATE
SfstAnnotator	Sfst morphological analyzer.	DKPro Core (UIMA)

Named Entity Recognizer (11)

Component	Description	Framework
ABNER	Wraps the ABNER entity identification system into the UIMA framework.	NaCTeM (UIMA)
CRF++ Trainer	Produces a Conditional Random Fields model.	NaCTeM (UIMA)
ILSP NERC	This module uses a Maximum Entropy NER engine targeting Greek (EL) or English (EN) news text.	ILSP (UIMA)
LingPipe NER PR	LingPipe Named Entity Recognizer	GATE
OpenNLP NER	NER PR using a set of OpenNLP maxent models	GATE
OpenNlpNamedEntityRecognizer	OpenNLP name finder wrapper.	DKPro Core (UIMA)
SVMLight Trainer	Produces an SVMLight model based on user-specified learning parameters.	NaCTeM (UIMA)
Stanford NER	Stanford Named Entity Recogniser	GATE
StanfordNER	synopsis	AlvisNLP
StanfordNamedEntityRecognizer	Stanford Named Entity Recognizer component.	DKPro Core (UIMA)
Yeast Metabliner	This service is to annotate yeast metabolites with a supervised NER system using CRF.	NaCTeM (UIMA)

Normalizer (19)

Component	Description	Framework
ApplyChangesAnnotator	Applies changes annotated using a SofaChangeAnnotation.	DKPro Core (UIMA)
Backmapper	After processing a file with the ApplyChangesAnnotator this annotator can be used to map the annotations created in the cleaned view back to the original view.	DKPro Core (UIMA)
CapitalizationNormalizer	Takes a text and replaces wrong capitalization	DKPro Core (UIMA)
CjfNormalizer	Converts traditional Chinese to simplified Chinese or vice-versa.	DKPro Core (UIMA)
Date Annotation Normalizer	Provides normalized values for all existing date annotations	GATE
Date Normalizer	Provides normalized values for all known dates	GATE
DictionaryBasedTokenTransformer	Reads a tab-separated file containing mappings from one token to another.	DKPro Core (UIMA)
Document normalizer	Normalize document content to remove "smart quotes" etc.	GATE
ExpressiveLengtheningNormalizer	Takes a text and shortens extra long words	DKPro Core (UIMA)
FileBasedTokenTransformer	Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT.	DKPro Core (UIMA)
HyphenationRemover	Simple dictionary-based hyphenation remover.	DKPro Core (UIMA)
RegexBasedTokenTransformer	A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on regular expressions.	DKPro Core (UIMA)
ReplacementFileNormalizer	Takes a text and replaces desired expressions. This class should not work on tokens, as some expressions might span several tokens.	DKPro Core (UIMA)
SharpSNormalizer	Takes a text and replaces sharp s	DKPro Core (UIMA)
SpellingNormalizer	Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation.	DKPro Core (UIMA)
StanfordPtbTransformer	Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style.	DKPro Core (UIMA)
TokenCaseTransformer	Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen.	DKPro Core (UIMA)
Tweet Normaliser	Normalise text in tweets (convert spelling mistakes, colloquialisms, typing variations and so on into standard English)	GATE
UmlautNormalizer	Takes a text and checks for umlauts written as "ae", "oe", or "ue" and normalizes them if they really are umlauts depending on a frequency model.	DKPro Core (UIMA)

Parser (24)

Component	Description	Framework
BerkeleyParser	Berkeley Parser annotator.	DKPro Core (UIMA)
CCGParser	Syntax parsing with CCG Parser.	AlvisNLP
ClearNlpParser	Clear parser annotator.	DKPro Core (UIMA)
English Dependency Parser	Ready-made application for Stanford English parser	GATE
English POS Tagger and Dependency Parser	Ready-made application for Stanford English POS tagger and parser	GATE
Enju Parser	A syntactic parser for English.	NaCTeM (UIMA)
EnjuParser	Parses sentences with the ENJU dependency parser.	AlvisNLP
EnjuParser2	synopsis	AlvisNLP
FreelingShallowParser	Performs tokenisation, lemmatisation, POS tagging and shallow parsing (chunking).	NaCTeM (UIMA)
GENIA Dependency Parser	A dependency parser for biomedical text.	NaCTeM (UIMA)
ILSP Dependency Parser	ILSP Dependency Parser is a tool trained on the Greek Dependency Treebank (Prokopidis et al., 2005), a resource which comprises data annotated at several linguistic levels.	ILSP (UIMA)
MaltParser	Dependency parsing using MaltParser.	DKPro Core (UIMA)
MateParser	DKPro Annotator for the MateToolsParser.	DKPro Core (UIMA)
Minipar Wrapper	MiniPar is a shallow parser.	GATE
MstParser	Dependency parsing using MSTParser.	DKPro Core (UIMA)
OpenNLP Parser	Syntactic parser from Apache OpenNLP	GATE
OpenNLPParser	Parse the document and create phrasal and clausal annotations over the text.	NaCTeM (UIMA)
OpenNlpParser	OpenNLP parser.	DKPro Core (UIMA)
RASP2 Parser	RASP dependency parser	GATE
Stanford Dependency Parser	Generates Stanford-style dependencies together with POS tokens for English.	NaCTeM (UIMA)
StanfordDependencyConverter	Converts a constituency structure into a dependency structure.	DKPro Core (UIMA)
StanfordParser	Stanford parser wrapper	GATE
StanfordParser	Stanford Parser component.	DKPro Core (UIMA)
Textalytics Lemmatization, PoS and Parsing	Textalytics Lemmatization, PoS and Parsing	GATE

Pre-built Workflows (12)

Component	Description	Framework
Arabic IE System	Ready-made Arabic IE application	GATE
Cebuano IE System	Ready-made Cebuano IE application	GATE
Chinese IE System	Ready-made Chinese IE application	GATE
French IE System	Ready-made French IE application	GATE
German IE System	Ready-made German IE application	GATE
Measurements	Ready-made application for measurement annotator	GATE
Romanian IE System	Ready-made Romanian IE application	GATE
RussIE	Basic version of the RussIE application	GATE
RussIE + Inflectional Gazetteer & OrthoMatcher	RussIE application with orthomatcher and inflexional gazetteer	GATE
RussIE + Inflectional Gazetter	RussIE application with inflexional gazetteer	GATE
RussIE + OrthoMatcher	RussIE application with orthomatcher	GATE
TwitIE (EN)	English TwitIE application	GATE

Readability (1)

Component	Description	Framework
ReadabilityAnnotator	Assign a set of popular readability scores to the text.	DKPro Core (UIMA)

SRL (2)

Component	Description	Framework
ClearNlpSemanticRoleLabeler	ClearNLP semantic role labeller.	DKPro Core (UIMA)
MateSemanticRoleLabeler	DKPro Annotator for the MateTools Semantic Role Labeler.	DKPro Core (UIMA)

Scripted analytics (6)

Component	Description	Framework
Groovy scripting PR	Runs a Groovy script as a processing resource	GATE
JAPE Transducer	A module for executing Jape grammars.	GATE
JAPE-Plus Transducer	An optimised, JAPE-compatible transducer.	GATE
RunProlog	Runs a Prolog program with the corpus data structure encoded as facts.	AlvisNLP
Script	Runs a script.	AlvisNLP
UIMA Analysis Engine	Wrapper for a Text Analysis Engine from UIMA.	GATE

Segmenter (55)

Component	Description	Framework
ANNIE English Tokeniser	A customisable English tokeniser.	GATE
ANNIE Sentence Splitter	ANNIE sentence splitter.	GATE
Arabic Tokeniser	A customisable English tokeniser.	GATE
ArktweetTokenizer	ArkTweet tokenizer.	DKPro Core (UIMA)
Banner Base Tokenizer	Tokens returned by this class consist primarily of contiguous alphanumeric characters or single punctuation marks; however, certain constructs, such as real numbers and percentages, are recognized and returned as a single token.	NaCTeM (UIMA)
Banner Simple Tokenizer	Tokens output by this tokenizer consist of a contiguous block of alphanumeric characters or a single punctuation mark.	NaCTeM (UIMA)
Banner Whitespace Tokenizer	Instances of this class tokenize sentences only at whitespace characters.	NaCTeM (UIMA)
BreakIteratorSegmenter	BreakIterator segmenter.	DKPro Core (UIMA)
Cafetiere Sentence Splitter	Uses a set of heuristics and patterns to find sentence boundaries.	NaCTeM (UIMA)
CamelCaseTokenSegmenter	Split up existing tokens again if they are camel-case text.	DKPro Core (UIMA)
Cebuano Gazetteer Tokeniser	A list lookup component.	GATE
Cebuano Tokeniser	A customisable English tokeniser.	GATE
Chinese Segmenter PR	Segment the Chinese text into words, based on the PAUM learning algorithm.	GATE
ClearNlpSegmenter	Tokenizer using Clear NLP.	DKPro Core (UIMA)
CompoundAnnotator	Annotates compound parts and linking morphemes.	DKPro Core (UIMA)
Freeling Sentence Splitter	Performs tokenisation.	NaCTeM (UIMA)
FreelingTokenizer	Performs tokenisation.	NaCTeM (UIMA)
GATE Unicode Tokeniser	A customisable Unicode tokeniser.	GATE
GENIA Sentence Splitter	A processing resource that takes document and corpus parameters	GATE
GENIA Sentence Splitter	Machine learning-based sentence splitter optimized for biomedical texts.	NaCTeM (UIMA)
Hashtag Tokenizer	Tokenizes Multi-Word Hashtags	GATE
Hindi Splitter	A Sentence Splitter.	GATE
Hindi Tokeniser	A customisable Hindi tokeniser.	GATE
ILSP Paragraph, Sentence and Token Segmentor	This module is a regex- and abbreviation-based segmentor targeting texts written in Greek.	ILSP (UIMA)
IULATokenizer	Performs paragraph splitting, sentence splitting, and tokenisation.	NaCTeM (UIMA)
JTokSegmenter	JTok segmenter.	DKPro Core (UIMA)
LanguageToolSegmenter	Segmenter using LanguageTool to do the heavy lifting.	DKPro Core (UIMA)
LineBasedSentenceSegmenter	Annotates each line in the source text as a sentence.	DKPro Core (UIMA)
LingPipe Sentence Splitter	Sentence splitter based on LingPipe models.	NaCTeM (UIMA)
LingPipe Sentence Splitter PR	Provides an interface to LingPipe sentence splitter API.	GATE
LingPipe Tokenizer PR	Provides a LingPipe tokenizer.	GATE
MLRS Maltese Tokeniser	Tokenises Maltese text	NaCTeM (UIMA)
MLRS Paragraph Splitter	Identifies the paragraphs in the text, creating a Paragraph annotation for each one	NaCTeM (UIMA)
MLRS Sentence Splitter	Identifies the sentences in the text, creating a Sentence annotation for each	NaCTeM (UIMA)
OSCAR 4 Tokeniser	Segments text into tokens.	NaCTeM (UIMA)
OgmiosTokenizer	Tokenizes the sections contents according to the Ogmios tokenizer specifications.	AlvisNLP
OpenNLP Sentence Splitter	Sentence splitter using an OpenNLP maxent model	GATE
OpenNLP Tokenizer	Tokenizer using an OpenNLP maxent model	GATE
OpenNLPTokenizer	Tokenize the text and create token annotations that span the tokens.	NaCTeM (UIMA)
OpenNlpSegmenter	Tokenizer and sentence splitter using OpenNLP.	DKPro Core (UIMA)
ParagraphSplitter	This class creates paragraph annotations for the given input document.	DKPro Core (UIMA)
PatternBasedTokenSegmenter	Split up existing tokens again at particular split-chars.	DKPro Core (UIMA)
Penn BioTokenizer	Tokenizer for biomedical text	GATE
RASP2 Tokenizer	RASP2 Tokenizer.	GATE
RegEx Sentence Splitter	A sentence splitter based on regular expressions.	GATE
RegexTokenizer	This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries.	DKPro Core (UIMA)
Romanian Tokeniser	A customisable Romanian tokeniser.	GATE
Stanford PTB Tokenizer	Stanford Penn Treebank v3 Tokenizer, for English	GATE
StanfordSegmenter	No description	DKPro Core (UIMA)
TokenMerger	Merges any Tokens that are covered by a given annotation type.	DKPro Core (UIMA)
TokenTrimmer	Remove prefixes and suffixes from tokens.	DKPro Core (UIMA)
TrailingCharacterRemover	Removing trailing character (sequences) from tokens, e.g. punctuation.	DKPro Core (UIMA)
Twitter Tokenizer (EN)	Tokenizer tuned for Tweets	GATE
UAICTokenizerDescriptor	No description	NaCTeM (UIMA)
WhitespaceTokenizer	A strict whitespace tokenizer, i.e. tokenizes according to whitespaces and linebreaks only.	DKPro Core (UIMA)
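
Several entries above describe segmentation purely in terms of character offsets (for example, tokenizing at whitespace only, or re-splitting existing tokens that are camel-case text). The following is a minimal Python sketch of those two ideas, not the implementation of any listed component. The function names, the `Span` tuple layout, and the rule used for camel-case boundaries (a lowercase ASCII letter followed by an uppercase one) are assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical stand-off span: (begin offset, end offset, covered text).
Span = Tuple[int, int, str]

# Assumed camel-case rule: a boundary lies between a lowercase and an uppercase ASCII letter.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")


def whitespace_tokenize(text: str) -> List[Span]:
    """Return one span per maximal run of non-whitespace characters."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]


def split_camel_case(tokens: List[Span]) -> List[Span]:
    """Split existing tokens again at camel-case boundaries, keeping document offsets."""
    result: List[Span] = []
    for begin, _end, covered in tokens:
        # Cut positions inside the token: start, every boundary, end.
        cuts = [0] + [m.start() for m in _CAMEL_BOUNDARY.finditer(covered)] + [len(covered)]
        for lo, hi in zip(cuts, cuts[1:]):
            result.append((begin + lo, begin + hi, covered[lo:hi]))
    return result


if __name__ == "__main__":
    tokens = whitespace_tokenize("parse getFooBar now")
    for span in split_camel_case(tokens):
        print(span)
```

Keeping offsets alongside the text mirrors the stand-off annotation style used by frameworks such as GATE and UIMA, where a later component can refine earlier annotations without modifying the underlying document text.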

Semantics (2)

Component	Description	Framework
Semantic Enrichment PR	The Semantic Enrichment PR allows adding new data to semantic annotations by querying external RDF (Linked Data) repositories.	GATE
SemanticFieldAnnotator	This Analysis Engine annotates English single words with semantic field information retrieved from an ExternalResource.	DKPro Core (UIMA)

Sentiment (1)

Component	Description	Framework
Textalytics Sentiment Analysis	Textalytics Sentiment Analysis	GATE

Spelling/Grammar (5)

Component	Description	Framework
CorrectionsContextualizer	This component assumes that some spell checker has already been applied upstream (e.g.	DKPro Core (UIMA)
JazzyChecker	This annotator uses Jazzy for the decision whether a word is spelled correctly or not.	DKPro Core (UIMA)
LanguageToolChecker	Detects grammatical errors in text using LanguageTool, a rule-based grammar checker.	DKPro Core (UIMA)
NorvigSpellingCorrector	Creates SofaChangeAnnotations containing corrections for previously identified spelling errors.	DKPro Core (UIMA)
Textalytics Spell, Grammar and Style Proofreading	Textalytics Spell, Grammar and Style Proofreading	GATE

Stemmer (4)

Component	Description	Framework
BulStem	This plugin is an implementation of the BulStem stemmer algorithm for Bulgarian developed by Preslav Nakov.	GATE
PorterStemmer	synopsis	AlvisNLP
SnowballStemmer	UIMA wrapper for the Snowball stemmer.	DKPro Core (UIMA)
Stemmer PR	Wrapper for the Snowball stemmer.	GATE

Tagger (52)

Component	Description	Framework
ABNER Tagger	GATE wrapper over ABNER	GATE
ANNIE POS Tagger	Mark Hepple's Brill-style POS tagger	GATE
Anatomical Entity Tagger	Tags anatomical entities using Brown, UMLS and OBO Anatomy dictionary features	NaCTeM (UIMA)
ArktweetPosTagger	Wrapper for Twitter Tokenizer and POS Tagger.	DKPro Core (UIMA)
BANNER CRF Tagger	A UIMA wrapper for BANNER entity tagger.	NaCTeM (UIMA)
BioCreative Gene Mention Tagger	Tags Gene mentions using a model trained on BioCreative GM task data, with Entrez Gene and UMLS dictionary features.	NaCTeM (UIMA)
CCGPosTagger	Applies the CCG POS tagger on annotations.	AlvisNLP
CRF++ Tagger	Uses Conditional Random Fields model for labeling.	NaCTeM (UIMA)
Cebuano POS Tagger	Mark Hepple's Brill-style POS tagger, adapted for languages where entries are multiword	GATE
Chemistry Tagger	A tagger for chemical names.	GATE
ClearNlpPosTagger	Part-of-Speech annotator using Clear NLP.	DKPro Core (UIMA)
FreelingTagger	Performs tokenisation, lemmatisation and POS tagging.	NaCTeM (UIMA)
GENIA Tagger	Tags biological named entities: proteins, cell lines, cell types, DNAs, and RNAs.	NaCTeM (UIMA)
GenericTagger	The Generic Tagger is Generic!	GATE
GeniaTagger	Runs Genia Tagger on annotations.	AlvisNLP
Hepple POS Tagger	Mark Hepple's POS tagger, from dragontools/Banner toolkit.	NaCTeM (UIMA)
HepplePosTagger	GATE Hepple part-of-speech tagger.	DKPro Core (UIMA)
Hindi POS Tagger	Mark Hepple's Brill-style POS tagger, adapted for languages where entries are multiword	GATE
HunPosTagger	Part-of-Speech annotator using HunPos.	DKPro Core (UIMA)
ILSP FBT Tagger	ILSP FBT Tagger is an adaptation of the Brill tagger trained on Greek text.	ILSP (UIMA)
IULATagger	Performs paragraph splitting, sentence splitting, tokenisation and POS tagging.	NaCTeM (UIMA)
LingPipe POS Tagger PR	Provides a LingPipe part of speech tagger.	GATE
MateMorphTagger	DKPro Annotator for the MateToolsMorphTagger.	DKPro Core (UIMA)
MatePosTagger	DKPro Annotator for the MateToolsPosTagger	DKPro Core (UIMA)
MeCabTagger	Annotator for the MeCab Japanese POS Tagger.	DKPro Core (UIMA)
Measurement Tagger	A measurement tagger based upon GNU Units	GATE
Medical Condition Tagger	A tagger that recognises mentions of medical conditions.	NaCTeM (UIMA)
NormaGene Tagger	A processing resource that takes document and corpus parameters	GATE
Numbers Tagger	Finds numbers (in both words and digits) and annotates them with their numeric value	GATE
OpenCalais Tagger	An OpenCalais based semantic annotator	GATE
OpenNLP POS Tagger	POS Tagger using an OpenNLP maxent model	GATE
OpenNlpPosTagger	Part-of-Speech annotator using OpenNLP.	DKPro Core (UIMA)
POS Mapper	Map complex Russian morphology tags into simpler POS categories	GATE
Penn BioTagger	Ready-made application for the Penn BioTagger	GATE
Penn BioTagger: Genes	Penn BioTagger for Genes	GATE
Penn BioTagger: Malignancy	Penn BioTagger for malignancy types	GATE
Penn BioTagger: Variation	Penn BioTagger for variations	GATE
PosMapper	Maps existing POS tags from one tagset to another using a user provided properties file.	DKPro Core (UIMA)
RASP POS Converter	Converts from PennTreebank POS tags to the C2 tagset used by RASP.	GATE
RASP2 POS Tagger	RASP part-of-speech tagger, creating WordForm annotations	GATE
RfTagger	Rftagger morphological analyzer.	DKPro Core (UIMA)
Roman Numerals Tagger	Finds and annotates Roman numerals	GATE
Russian POS Tagger	Part-of-speech tagger for Russian	GATE
SVMLight Tagger	Applies an SVMLight-trained model on instances.	NaCTeM (UIMA)
Species Tagger	Tags species	NaCTeM (UIMA)
Stanford POS Tagger	Stanford Part-of-Speech Tagger	GATE
StanfordPosTagger	Stanford Part-of-Speech tagger component.	DKPro Core (UIMA)
Stepp Tagger	No description	NaCTeM (UIMA)
TreeTagger	Runs tree-tagger.	AlvisNLP
TreeTaggerPosTagger	Part-of-Speech and lemmatizer annotator using TreeTagger.	DKPro Core (UIMA)
Twitter POS Tagger (EN)	Stanford POS tagger trained on Tweets	GATE
UaicPosTagger	Carries out sentence splitting, tokenisation, POS tagging and lemmatisation on plain text.	NaCTeM (UIMA)

Topics (3)

Component	Description	Framework
MalletTopicModelEstimator	Estimate an LDA topic model using Mallet and write it to a file.	DKPro Core (UIMA)
MalletTopicModelInferencer	Infers the topic distribution over documents using a Mallet ParallelTopicModel.	DKPro Core (UIMA)
Textalytics Topics Extraction	Textalytics Topics Extraction	GATE

Validation (1)

Component	Description	Framework
Schema Enforcer	Produces an annotation set whose content is restricted by the specified set of schemas	GATE

Viewer/Editor (18)

Component	Description	Framework
Compound Document Editor	Editor for compound documents.	GATE
GATE Ontology Editor	Ontology editing tool.	GATE
GAZE	Gazetteer viewer and editor	GATE
Gazetteer Editor	Gazetteer viewer and editor.	GATE
JAPE-Plus Viewer	A JAPE grammar file viewer	GATE
Jape Viewer	A JAPE grammar file viewer	GATE
OAT	Ontology Annotation Tool.	GATE
Pairbank Viewer	viewer for the TermRaider Pairbank	GATE
RAT-C	Relation Annotation Tool Class view.	GATE
RAT-I	Relation Annotation Tool Instance view.	GATE
Schema Annotations Editor	An annotation editor restricted by schemas.	GATE
Script Editor	Editor for the Groovy script behind this PR	GATE
Shell	Starts an interactive shell for querying the corpus data structure.	AlvisNLP
Shell2	Starts an interactive shell for querying the corpus data structure.	AlvisNLP
Simple Schema Viewer	A Simple Annotation Schema Viewer	GATE
Syntax tree viewer	Viewer for syntax trees generated by a parser.	GATE
Termbank Viewer	viewer for the TermRaider Termbank	GATE
WordNet Viewer	WordNet viewer	GATE

Analytics by product

(original) AlvisNLP (52)

The components listed here could not be associated with a known third-party tool collection and are assumed to be original components.

Component	Description	Framework
Ab3P	synopsis	AlvisNLP
Action	Applies action expressions on selected elements.	AlvisNLP
AggregateValues	synopsis	AlvisNLP
AlvisREPrepareCrossValidation	synopsis	AlvisNLP
AnchorTuples	Creates tuples with a common argument.	AlvisNLP
AntecedentChoice	Biotopes-specific module: chooses an antecedent.	AlvisNLP
Assert	Tests an assertion on specified elements.	AlvisNLP
AttestedTermsProjector	Projects a list of terms given in tree-tagger format.	AlvisNLP
CartesianProductTuples	Creates tuples for each element of a Cartesian product.	AlvisNLP
CompareElements	Compares two sets of elements.	AlvisNLP
DisambiguateAlternatives	Disambiguate features that have multiple values.	AlvisNLP
ElementMapper	Maps elements according to a collection of mapping elements.	AlvisNLP
ElementProjector	Searches for entries in a dictionary generated by an expression.	AlvisNLP
ElementProjector2	synopsis	AlvisNLP
FileMapper	Maps the value of an annotation feature according to a mapping file.	AlvisNLP
FileMapper2	Maps elements according to a tab-separated mapping file.	AlvisNLP
InsertContents	synopsis	AlvisNLP
KeywordsSelector	Selects most relevant keywords in documents.	AlvisNLP
LayerComparator	Compares annotations in two different layers.	AlvisNLP
MergeLayers	Creates a new layer in each section containing all annotations in source layers.	AlvisNLP
MergeSections	Merge several sections into a single one.	AlvisNLP
NGrams	Computes annotation n-grams.	AlvisNLP
NewCount	Counts element occurrences and writes the results in a file, including tfidf.	AlvisNLP
OBOMapper	synopsis	AlvisNLP
OBOProjector	Projects OBO terms and synonyms on sections.	AlvisNLP
OntoReif	synopsis	AlvisNLP
PatternMatcher	Matches a regular expression-like pattern on the sequence of annotations in a given layer.	AlvisNLP
ProminentConceptReporter	synopsis	AlvisNLP
QuickHTML	synopsis	AlvisNLP
RegExp	Matches a regular expression on section contents and creates an annotation for each match.	AlvisNLP
RemoveContents	synopsis	AlvisNLP
RemoveEquivalent	Removes duplicate elements.	AlvisNLP
RemoveOverlaps	Removes overlapping annotations from a given layer.	AlvisNLP
RunProlog	Runs a Prolog program with the corpus data structure encoded as facts.	AlvisNLP
SQLImport	synopsis	AlvisNLP
Script	Runs a script.	AlvisNLP
SeSMig	Detects sentence boundaries and creates one annotation for each sentence. This module assumes WoSMig processed the same sections.	AlvisNLP
SelectingElementClassifier	Searches for discriminating attributes with Weka.	AlvisNLP
Sequence_Impl	Sequence of modules.	AlvisNLP
Shell	Starts an interactive shell for querying the corpus data structure.	AlvisNLP
Shell2	Starts an interactive shell for querying the corpus data structure.	AlvisNLP
SimpleProjector	Projects a simple dictionary on sections.	AlvisNLP
SimpleProjector2	Projects a simple dictionary on sections.	AlvisNLP
SplitOverlaps	Splits overlapping annotations.	AlvisNLP
TaggingElementClassifier	Classifies elements with a Weka classifier.	AlvisNLP
TomapProjector	synopsis	AlvisNLP
TomapTrain	synopsis	AlvisNLP
TrainingElementClassifier	Trains a Weka classifier where examples are elements.	AlvisNLP
TyDIProjector	Projects terms from a TyDI export.	AlvisNLP
WapitiLabel	synopsis	AlvisNLP
WapitiTrain	synopsis	AlvisNLP
WoSMig	Performs word segmentation on section contents.	AlvisNLP

Component

Description

Framework

Ab3P

synopsis

AlvisNLP

Action

Applies action expressions on selected elements.

AlvisNLP

AggregateValues

synopsis

AlvisNLP

AlvisREPrepareCrossValidation

synopsis

AlvisNLP

AnchorTuples

Creates tuples with a common argument.

AlvisNLP

AntecedentChoice

Biotopes-specific module: chooses an antecedent.

AlvisNLP

Assert

Tests an assertion on specified elements.

AlvisNLP

AttestedTermsProjector

Projects a list of terms given in tree-tagger format.

AlvisNLP

CartesianProductTuples

Creates tuples for each element of a Cartesian product.

AlvisNLP

CompareElements

Compares two sets of elements.

AlvisNLP

DisambiguateAlternatives

Disambiguate features that have multiple values.

AlvisNLP

ElementMapper

Maps elements according to a collection of mapping elements.

AlvisNLP

ElementProjector

Searches for entries in a dictionary generated by an expression.

AlvisNLP

ElementProjector2

synopsis

AlvisNLP

FileMapper

Maps the value of an annotation feature according to a mapping file.

AlvisNLP

FileMapper2

Maps elements according to a tab-separated mapping file.

AlvisNLP

InsertContents

synopsis

AlvisNLP

KeywordsSelector

Selects most relevant keywords in documents.

AlvisNLP

LayerComparator

Compares annotations in two different layers.

AlvisNLP

MergeLayers

Creates a new layer in each section containing all annotations in source layers.

AlvisNLP

MergeSections

Merge several sections into a single one.

AlvisNLP

NGrams

Computes annotation n-grams.

AlvisNLP

NewCount

Counts element occurrences and writes the results in a file, including tfidf.

AlvisNLP

OBOMapper

synopsis

AlvisNLP

OBOProjector

Projects OBO terms and synonyms on sections.

AlvisNLP

OntoReif

synopsis

AlvisNLP

PatternMatcher

Matches a regular expression-like pattern on the sequence of annotations in a given layer.

AlvisNLP

ProminentConceptReporter

synopsis

AlvisNLP

QuickHTML

synopsis

AlvisNLP

RegExp

Matches a regular expression on sections contents and create an annotation for each match.

AlvisNLP

RemoveContents

synopsis

AlvisNLP

RemoveEquivalent

Removes duplicate elements.

AlvisNLP

RemoveOverlaps

Removes overlapping annotations from a given layer.

AlvisNLP

RunProlog

Runs a Prolog program with the corpus data structure encoded as facts.

AlvisNLP

SQLImport

synopsis

AlvisNLP

Script

Runs a script.

AlvisNLP

SeSMig

Detects sentence boundaries and creates one annotation for each sentence.This module assumes WoSMig processed the same sections.

AlvisNLP

SelectingElementClassifier

Searches for discrimminating attributes with Weka.

AlvisNLP

Sequence_Impl

Sequence of modules.

AlvisNLP

Shell

Starts an interactive shell that allows to query the corpus data structure.

AlvisNLP

Shell2

Starts an interactive shell that allows to query the corpus data structure.

AlvisNLP

SimpleProjector

Projects a simple dictionary on sections.

AlvisNLP

SimpleProjector2

Projects a simple dictionary on sections.

AlvisNLP

SplitOverlaps

Splits overlapping annotations.

AlvisNLP

TaggingElementClassifier

Classifies elements with a Weka classifier.

AlvisNLP

TomapProjector

synopsis

AlvisNLP

TomapTrain

synopsis

AlvisNLP

TrainingElementClassifier

Trains a Weka classifier where examples are elements.

AlvisNLP

TyDIProjector

Projects terms from a TiDI export.

AlvisNLP

WapitiLabel

synopsis

AlvisNLP

WapitiTrain

synopsis

AlvisNLP

WoSMig

Performs word segmentation on section contents.

AlvisNLP

(original) DKPro Core (UIMA) (52)

The components listed here could not be associated with a known third-party tool collection and are assumed to be original components.

Component	Description	Framework
AnnotationByLengthFilter	Removes annotations that do not conform to minimum or maximum length constraints.	DKPro Core (UIMA)
AnnotationByTextFilter	Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words.	DKPro Core (UIMA)
ApplyChangesAnnotator	Applies changes annotated using a SofaChangeAnnotation.	DKPro Core (UIMA)
AssertAnnotations$InternalJCasHolder	Descriptor automatically generated by uimaFIT	DKPro Core (UIMA)
Backmapper	After processing a file with the ApplyChangesAnnotator this annotator can be used to map the annotations created in the cleaned view back to the original view.	DKPro Core (UIMA)
BerkeleyParser	Berkeley Parser annotator.	DKPro Core (UIMA)
CamelCaseTokenSegmenter	Split up existing tokens again if they are camel-case text.	DKPro Core (UIMA)
CapitalizationNormalizer	Takes a text and replaces wrong capitalization	DKPro Core (UIMA)
ColognePhoneticTranscriptor	Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec.	DKPro Core (UIMA)
CompoundAnnotator	Annotates compound parts and linking morphemes.	DKPro Core (UIMA)
CorrectionsContextualizer	This component assumes that some spell checker has already been applied upstream (e.g.	DKPro Core (UIMA)
DependencyDumper	Dump dependencies to screen.	DKPro Core (UIMA)
DictionaryAnnotator	Takes a plain text file with phrases as input and annotates the phrases in the CAS file.	DKPro Core (UIMA)
DictionaryBasedTokenTransformer	Reads a tab-separated file containing mappings from one token to another.	DKPro Core (UIMA)
DocumentMetaDataStripper	Removes fields from the document meta data which may be different depending on the machine a test is run on.	DKPro Core (UIMA)
DoubleMetaphonePhoneticTranscriptor	Double-Metaphone phonetic transcription based on Apache Commons Codec.	DKPro Core (UIMA)
ExpressiveLengtheningNormalizer	Takes a text and shortens extra long words	DKPro Core (UIMA)
FileBasedTokenTransformer	Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT.	DKPro Core (UIMA)
GateLemmatizer	Wrapper for the GATE rule based lemmatizer.	DKPro Core (UIMA)
GermanSeparatedParticleAnnotator	Annotator to be used for post-processing of German corpora that have been lemmatized and POS-tagged with the TreeTagger, based on the STTS tagset.	DKPro Core (UIMA)
HyphenationRemover	Simple dictionary-based hyphenation remover.	DKPro Core (UIMA)
IOTestRunner$Validator	Descriptor automatically generated by uimaFIT	DKPro Core (UIMA)
JCasHolder	Utility analysis engine for use with CAS multipliers in uimaFIT pipelines.	DKPro Core (UIMA)
LineBasedSentenceSegmenter	Annotates each line in the source text as a sentence.	DKPro Core (UIMA)
MalletTopicModelEstimator	Estimate an LDA topic model using Mallet and write it to a file.	DKPro Core (UIMA)
MateParser	DKPro Annotator for the MateToolsParser.	DKPro Core (UIMA)
MetaphonePhoneticTranscriptor	Metaphone phonetic transcription based on Apache Commons Codec.	DKPro Core (UIMA)
MstParser	Dependency parsing using MSTParser.	DKPro Core (UIMA)
NGramAnnotator	N-gram annotator.	DKPro Core (UIMA)
NorvigSpellingCorrector	Creates SofaChangeAnnotations containing corrections for previously identified spelling errors.	DKPro Core (UIMA)
ParagraphSplitter	This class creates paragraph annotations for the given input document.	DKPro Core (UIMA)
PatternBasedTokenSegmenter	Split up existing tokens again at particular split-chars.	DKPro Core (UIMA)
PosFilter	Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech.	DKPro Core (UIMA)
PosMapper	Maps existing POS tags from one tagset to another using a user provided properties file.	DKPro Core (UIMA)
ReadabilityAnnotator	Assign a set of popular readability scores to the text.	DKPro Core (UIMA)
RegexBasedTokenTransformer	A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on regular expressions.	DKPro Core (UIMA)
RegexTokenFilter	Remove every token that does or does not match a given regular expression.	DKPro Core (UIMA)
RegexTokenizer	This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries.	DKPro Core (UIMA)
ReplacementFileNormalizer	Takes a text and replaces desired expressions. This class should not work on tokens, as some expressions might span several tokens.	DKPro Core (UIMA)
SharpSNormalizer	Takes a text and replaces sharp s	DKPro Core (UIMA)
SoundexPhoneticTranscriptor	Soundex phonetic transcription based on Apache Commons Codec.	DKPro Core (UIMA)
SpellingNormalizer	Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation.	DKPro Core (UIMA)
StopWordRemover	Remove all of the specified types from the CAS if their covered text is in the stop word dictionary.	DKPro Core (UIMA)
Stopwatch	Can be used to measure how long the processing between two points in a pipeline takes.	DKPro Core (UIMA)
TagsetDescriptionStripper	No description	DKPro Core (UIMA)
TfidfAnnotator	This component adds Tfidf annotations consisting of a term and a tfidf weight.	DKPro Core (UIMA)
TokenCaseTransformer	Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen.	DKPro Core (UIMA)
TokenMerger	Merges any Tokens that are covered by a given annotation type.	DKPro Core (UIMA)
TokenTrimmer	Remove prefixes and suffixes from tokens.	DKPro Core (UIMA)
TrailingCharacterRemover	Removes trailing characters (or character sequences) from tokens, e.g. punctuation.	DKPro Core (UIMA)
UmlautNormalizer	Takes a text and checks for umlauts written as "ae", "oe", or "ue" and normalizes them if they really are umlauts depending on a frequency model.	DKPro Core (UIMA)
WhitespaceTokenizer	A strict whitespace tokenizer, i.e. tokenizes according to whitespaces and linebreaks only.	DKPro Core (UIMA)

(original) GATE (135)

The components listed here could not be associated with a known third-party tool collection and are assumed to be original components.

Component	Description	Framework
ANNIE English Tokeniser	A customisable English tokeniser.	GATE
ANNIE Gazetteer	A list lookup component.	GATE
ANNIE NE Transducer	ANNIE named entity grammar.	GATE
ANNIE Nominal Coreferencer	Nominal Coreference resolution component	GATE
ANNIE OrthoMatcher	ANNIE orthographical coreference component.	GATE
ANNIE Pronominal Coreferencer	Pronominal Coreference resolution component.	GATE
ANNIE Sentence Splitter	ANNIE sentence splitter.	GATE
ANNIE VP Chunker	ANNIE VP Chunker component.	GATE
ANNIE+Measurements	Ready-made application for ANNIE plus the measurement tagger	GATE
Annotation Merging PR	Merge Annotations from different annotators.	GATE
Annotation Set Transfer	Annotation set transfer component.	GATE
Arabic Gazetteer	A list lookup component.	GATE
Arabic Gazetteer Collector	No description	GATE
Arabic IE System	Ready-made Arabic IE application	GATE
Arabic Infered Gazetteer	A list lookup component.	GATE
Arabic Main Grammar	A module for executing Jape grammars.	GATE
Arabic OrthoMatcher	ANNIE orthographical coreference component.	GATE
Arabic Tokeniser	A customisable English tokeniser.	GATE
BDM Computation PR	Compute BDM score for each pair of concepts in the given ontology.	GATE
Batch Learning PR	Supports training, application and evaluation of machine learning models for NLP tasks	GATE
Boilerpipe Content Detection	Uses boilerpipe to determine which sections of a document are interesting content and which are just boilerplate	GATE
CSV Corpus Populater	Populate a corpus from CSV files	GATE
Cebuano Gazetteer	A list lookup component.	GATE
Cebuano Gazetteer Tokeniser	A list lookup component.	GATE
Cebuano IE System	Ready-made Cebuano IE application	GATE
Cebuano Tokeniser	A customisable English tokeniser.	GATE
Cebuano Transducer	A module for executing Jape grammars.	GATE
Cebuano Transducer Postprocessor	A module for executing Jape grammars.	GATE
Chemistry Tagger	A tagger for chemical names.	GATE
Chinese IE System	Ready-made Chinese IE application	GATE
Chinese Segmenter PR	Segment the Chinese text into words, based on the PAUM learning algorithm.	GATE
Combine Members PR	Combines documents in a composite document.	GATE
Compound Document	GATE Compound Document.	GATE
Compound Document Editor	Editor for compound documents.	GATE
Compound Document From Xml	GATE Compound Document.	GATE
ConnectSesameOntology	Connect to a repository containing an ontology	GATE
Control Script	Editor for the Groovy script controlling a scriptable controller	GATE
Copy Anns to Another Doc PR	Copy the annotations from one document to another document.	GATE
Corpus Indexing Support	No description	GATE
Crawler PR	GATE implementation of the Websphinx crawling API	GATE
CreateSesameOntology	Create an ontology from a Sesame configuration file for a repository	GATE
Date Annotation Normalizer	Provides normalized values for all existing date annotations	GATE
Date Normalizer	Provides normalized values for all known dates	GATE
Delete Member PR	Deletes one member document from a compound doc.	GATE
Document Reset PR	Remove named annotation sets or reset the default annotation set	GATE
Document normalizer	Normalize document content to remove "smart quotes" etc.	GATE
DocumentFrequencyBank	Document frequency counter derived from corpora and other DFBs	GATE
EDT Monitor	Warns whenever an AWT component is updated from anywhere other than the event dispatch thread	GATE
Flexible Gazetteer	A more flexible list lookup component.	GATE
French IE System	Ready-made French IE application	GATE
GATE Composite document	GATE Composite document.	GATE
GATE Morphological analyser	Morphological Analyzer for the English Language.	GATE
GATE Ontology Editor	Ontology editing tool.	GATE
GATE Unicode Tokeniser	A customisable Unicode tokeniser.	GATE
GAZE	Gazetteer viewer and editor	GATE
Gazetteer Editor	Gazetteer viewer and editor.	GATE
Gazetteer List Collector	Gazetteer lists collector.	GATE
GenericTagger	The Generic Tagger is Generic!	GATE
German IE System	Ready-made German IE application	GATE
Groovy scripting PR	Runs a Groovy script as a processing resource	GATE
Groovy support for GATE	No description	GATE
Hash Gazetteer	A list lookup component implemented by OntoText Lab.	GATE
Hashtag Tokenizer	Tokenizes Multi-Word Hashtags	GATE
Hindi Gazetteer	A list lookup component.	GATE
Hindi Main Grammar	A module for executing Jape grammars	GATE
Hindi OrthoMatcher	Hindi Orthomatcher	GATE
Hindi Splitter	A Sentence Splitter.	GATE
Hindi Tokeniser	A customisable Hindi tokeniser.	GATE
Hindi Tokeniser Gazetteer	A list lookup component.	GATE
Hindi Tokeniser Postprocessor	A module for executing Jape grammars	GATE
IAA Computation PR	Compute inter-annotator agreement (IAA).	GATE
Inflectional gazetteer	Gazetteer with support for inflectional morphology	GATE
JAPE Transducer	A module for executing Jape grammars.	GATE
JAPE-Plus Transducer	An optimised, JAPE-compatible transducer.	GATE
JAPE-Plus Viewer	A JAPE grammar file viewer	GATE
Jape Viewer	A JAPE grammar file viewer	GATE
Java Heap Dumper	Dumps the Java heap to the specified file	GATE
Large KB Gazetteer	KIM KB based alias-lookup component	GATE
Linguistic Simplifier	A processing resource that takes document and corpus parameters	GATE
Linguistic Simplifier	Example application for the linguistic simplifier	GATE
Log4J Level: ALL	Allows the Log4J log level to be set to ALL from within the GUI	GATE
Machine Learning PR	Trains a machine learning algorithm from a corpus.	GATE
Majority-vote consensus builder (annotation)	Process results of a crowd annotation task to find where annotators agree and disagree.	GATE
Majority-vote consensus builder (classification)	Process results of a crowd annotation task to find where annotators agree and disagree.	GATE
Measurement Tagger	A measurement tagger based upon GNU Units	GATE
Measurements	Ready-made application for measurement annotator	GATE
MetaMap Annotator	This plugin uses the MetaMap Java API to send GATE document content to MetaMap skrmedpostctl server and PrologBeans mmserver instances running on the given machine/port	GATE
Noun Phrase Chunker	Ready-made NP chunking application	GATE
Noun Phrase Chunker	Implementation of the Ramshaw and Marcus base noun phrase chunker	GATE
Numbers Tagger	Finds numbers (in both words and digits) and annotates them with their numeric value	GATE
OAT	Ontology Annotation Tool.	GATE
OWLIM Ontology	Ontology created as a temporary OWLIM3 in-memory repository	GATE
OWLIM Ontology DEPRECATED	Ontology created as a temporary OWLIM3 in-memory repository, for backwards compatibility only	GATE
Onto Root Gazetteer	An ontology lookup component	GATE
OntoGazetteer	A list lookup component based on mapping between ontology classes and gazetteer lists.	GATE
OrthoRef	An orthographic coreferencer	GATE
PMI Bank	Pointwise Mutual Information from corpora	GATE
PMI Example (English)	Example application for the PMI (pointwise mutual information) tool	GATE
POS Mapper	Map complex Russian morphology tags into simpler POS categories	GATE
Quality Assurance PR	The Quality Assurance PR provides the functionality of the Corpus QA Tool in GATE Developer	GATE
RAT-C	Relation Annotation Tool Class view.	GATE
RAT-I	Relation Annotation Tool Instance view.	GATE
RegEx Sentence Splitter	A sentence splitter based on regular expressions.	GATE
Roman Numerals Tagger	Finds and annotates Roman numerals	GATE
Romanian Gazetteer	A list lookup component.	GATE
Romanian IE System	Ready-made Romanian IE application	GATE
Romanian Tokeniser	A customisable Romanian tokeniser.	GATE
Romanian Transducer	A module for executing Jape grammars	GATE
RussIE	Basic version of the RussIE application	GATE
RussIE + Inflectional Gazetteer & OrthoMatcher	RussIE application with orthomatcher and inflexional gazetteer	GATE
RussIE + Inflectional Gazetter	RussIE application with inflexional gazetteer	GATE
RussIE + OrthoMatcher	RussIE application with orthomatcher	GATE
Russian Gazetteer	Customised version of the hash gazetteer	GATE
Russian POS Tagger	Part-of-speech tagger for Russian	GATE
Schema Annotations Editor	An annotation editor restricted by schemas.	GATE
Schema Enforcer	Produces an annotation set whose content is restricted by the specified set of schemas	GATE
Script Editor	Editor for the Groovy script behind this PR	GATE
Scriptable Controller	A controller whose execution strategy is controlled by a Groovy script	GATE
Search Results	Viewer for IR search results	GATE
SearchPR	Provides IR functionality.	GATE
Segment Processing PR	Processes individual segments as separate documents	GATE
Semantic Enrichment PR	The Semantic Enrichment PR allows adding new data to semantic annotations by querying external RDF (Linked Data) repositories.	GATE
Sharable Gazetteer	A list lookup component.	GATE
Show/Hide Resources	Show resources that would otherwise be hidden, e.g. resources created for internal use by other resources	GATE
Simple Schema Viewer	A Simple Annotation Schema Viewer	GATE
Switch Member PR	Sets the focus of a compound document to a specified member document.	GATE
Syntax tree viewer	Viewer for syntax trees generated by a parser.	GATE
Termbank Score Copier	Copy scores from Termbanks back to their source annotations	GATE
Text Categorization PR	Classify text based on a semantic space	GATE
The Duplicator	Duplicate any resource with a right click menu option	GATE
Tweet Normaliser	Normalise texts in tweets (convert spelling mistakes, colloquialisms, typing variations and so on into standard English)	GATE
TwitIE (EN)	English TwitIE application	GATE
Twitter Tokenizer (EN)	Tokenizer tuned for Tweets	GATE
UIMA Analysis Engine	Wrapper for a Text Analysis Engine from UIMA.	GATE
Unload Unused Plugins	Unloads all plugins for which we cannot find any loaded instances	GATE

Component

Description

Framework

ANNIE English Tokeniser

A customisable English tokeniser.

GATE

ANNIE Gazetteer

A list lookup component.

GATE

ANNIE NE Transducer

ANNIE named entity grammar.

GATE

ANNIE Nominal Coreferencer

Nominal Coreference resolution component

GATE

ANNIE OrthoMatcher

ANNIE orthographical coreference component.

GATE

ANNIE Pronominal Coreferencer

Pronominal Coreference resolution component.

GATE

ANNIE Sentence Splitter

ANNIE sentence splitter.

GATE

ANNIE VP Chunker

ANNIE VP Chunker component.

GATE

ANNIE+Measurements

Ready-made application for ANNIE plus the measurement tagger

GATE

Annotation Merging PR

Merge Annotations from different annotators.

GATE

Annotation Set Transfer

Annotation set transfer component.

GATE

Arabic Gazetteer

A list lookup component.

GATE

Arabic Gazetteer Collector

No description

GATE

Arabic IE System

Ready-made Arabic IE application

GATE

Arabic Infered Gazetteer

A list lookup component.

GATE

Arabic Main Grammar

A module for executing Jape grammars.

GATE

Arabic OrthoMatcher

ANNIE orthographical coreference component.

GATE

Arabic Tokeniser

A customisable English tokeniser.

GATE

BDM Computation PR

Compute BDM score for each pair of concepts in the given ontology.

GATE

Batch Learning PR

Supports training, application and evaluation of machine learning models for NLP tasks

GATE

Boilerpipe Content Detection

Uses boilerpipe to determine which sections of a document are interesting content and which are just boilerplate

GATE

CSV Corpus Populater

Populate a corpus from CSV files

GATE

Cebuano Gazetteer

A list lookup component.

GATE

Cebuano Gazetteer Tokeniser

A list lookup component.

GATE

Cebuano IE System

Ready-made Cebuano IE application

GATE

Cebuano Tokeniser

A customisable English tokeniser.

GATE

Cebuano Transducer

A module for executing Jape grammars.

GATE

Cebuano Transducer Postprocessor

A module for executing Jape grammars.

GATE

Chemistry Tagger

A tagger for chemical names.

GATE

Chinese IE System

Ready-made Chinese IE application

GATE

Chinese Segmenter PR

Segment the Chinese text into words, based on the PAUM learning algorithm.

GATE

Combine Members PR

Combines documents in a composite document.

GATE

Compound Document

GATE Compound Document.

GATE

Compound Document Editor

Editor for compound documents.

GATE

Compound Document From Xml

GATE Compound Document.

GATE

ConnectSesameOntology

Connect to a repository containing and ontology

GATE

Control Script

Editor for the Groovy script controlling a scriptable controller

GATE

Copy Anns to Another Doc PR

Copy the annotations from one document to another document.

GATE

Corpus Indexing Support

No description

GATE

Crawler PR

GATE implementation of the Websphinx crawling API

GATE

CreateSesameOntology

Create a ontology from a Sesame configuration file for a repository

GATE

Date Annotation Normalizer

provides normalized values for all existing date annotations

GATE

Date Normalizer

provides normalized values for all known dates

GATE

Delete Member PR

Deletes one member document from a compound doc.

GATE

Document Reset PR

Remove named annotation sets or reset the default annotation set

GATE

Document normalizer

Normalize document content to remove "smart quotes" etc.

GATE

DocumentFrequencyBank

Document frequency counter derived from corpora and other DFBs

GATE

EDT Monitor

Warns whenever an AWT component is updated from anywhere other than the event dispatch thread

GATE

Flexible Gazetteer

A more flexible list lookup component.

GATE

French IE System

Ready-made French IE application

GATE

GATE Composite document

GATE Composite document.

GATE

GATE Morphological analyser

Morphological Analyzer for the English Language.

GATE

GATE Ontology Editor

Ontology editing tool.

GATE

GATE Unicode Tokeniser

A customisable Unicode tokeniser.

GATE

GAZE

Gazetteer viewer and editor

GATE

Gazetteer Editor

Gazetteer viewer and editor.

GATE

Gazetteer List Collector

Gazetteer lists collector.

GATE

GenericTagger

The Generic Tagger is Generic!

GATE

German IE System

Ready-made German IE application

GATE

Groovy scripting PR

Runs a Groovy script as a processing resource

GATE

Groovy support for GATE

No description

GATE

Hash Gazetteer

A list lookup component implemented by OntoText Lab.

GATE

Hashtag Tokenizer

Tokenizes Multi-Word Hashtags

GATE

Hindi Gazetteer

A list lookup component.

GATE

Hindi Main Grammar

A module for executing Jape grammars

GATE

Hindi OrthoMatcher

Hindi Orthomatcher

GATE

Hindi Splitter

A Sentence Splitter.

GATE

Hindi Tokeniser

A customisable Hindi tokeniser.

GATE

Hindi Tokeniser Gazetteer

A list lookup component.

GATE

Hindi Tokeniser Postprocessor

A module for executing Jape grammars

GATE

IAA Computation PR

Compute inter-annotator agreement (IAA).

GATE

Inflectional gazetteer

Gazetteer with support for inflectional morphology

GATE

JAPE Transducer

A module for executing Jape grammars.

GATE

JAPE-Plus Transducer

An optimised, JAPE-compatible transducer.

GATE

JAPE-Plus Viewer

A JAPE grammar file viewer

GATE

Jape Viewer

A JAPE grammar file viewer

GATE

Java Heap Dumper

Dumps the Java heap to the specified file

GATE

Large KB Gazetteer

KIM KB based alias-lookup commponent

GATE

A processing resource that takes document and corpus parameters

GATE

Majority-vote consensus builder (annotation)

Example application for the linguistic simplifier

GATE

Log4J Level: ALL

Allows the Log4J log level to be set to ALL from within the GUI

GATE

Machine Learning PR

Trains a machine learning algorithm from a corpus.

GATE

Process results of a crowd annotation task to find where annotators agree and disagree.

GATE

Majority-vote consensus builder (classification)

Process results of a crowd annotation task to find where annotators agree and disagree.

GATE

Measurement Tagger

A measurement tagger based upon GNU Units

GATE

Measurements

Ready-made application for measurement annotator

GATE

MetaMap Annotator

This plugin uses the MetaMap Java API to send GATE document content to MetaMap skrmedpostctl server and PrologBeans mmserver instances running on the given machine/port

GATE

Ready-made NP chunking application

GATE

OWLIM Ontology DEPRECATED

Implementation of the Ramshaw and Marcus base noun phrase chunker

GATE

Numbers Tagger

Finds numbers in (both words and digits) and annotates them with their numeric value

GATE

OAT

Ontology Annotation Tool.

GATE

OWLIM Ontology

Ontology created as a temporary OWLIM3 in-memory repository

GATE

Ontology created as a temporary OWLIM3 in-memory repository, for backwards compatibility only

GATE

Onto Root Gazetteer

An ontology lookup component

GATE

OntoGazetteer

A list lookup component based on mapping between ontology classes and gazetteer lists.

GATE

OrthoRef

An orthographic coreferencer

GATE

PMI Bank

Pointwise Mutual Information from corpora

GATE

PMI Example (English)

Example application for the PMI (pointwise mutual information) tool

GATE

POS Mapper

Map complex Russian morphology tags into simpler POS categories

GATE

Quality Assurance PR

The Quality Assurance PR provides the functionality of the Corpus QA Tool in GATE Developer

GATE

RAT-C

Relation Annotation Tool Class view.

GATE

RAT-I

Relation Annotation Tool Instance view.

GATE

RegEx Sentence Splitter

A sentence splitter based on regular expressions.

GATE

Roman Numerals Tagger

Finds and annotates Roman numerals

GATE

Romanian Gazetteer

A list lookup component.

GATE

Romanian IE System

Ready-made Romanian IE application

GATE

Romanian Tokeniser

A customisable Romanian tokeniser.

GATE

Romanian Transducer

A module for executing JAPE grammars

GATE

RussIE

Basic version of the RussIE application

GATE

RussIE + Inflectional Gazetteer & OrthoMatcher

RussIE application with orthomatcher and inflexional gazetteer

GATE

RussIE + Inflectional Gazetteer

RussIE application with inflexional gazetteer

GATE

RussIE + OrthoMatcher

RussIE application with orthomatcher

GATE

Russian Gazetteer

Customised version of the hash gazetteer

GATE

Russian POS Tagger

Part-of-speech tagger for Russian

GATE

Schema Annotations Editor

An annotation editor restricted by schemas.

GATE

Schema Enforcer

Produces an annotation set whose content is restricted by the specified set of schemas

GATE

Script Editor

Editor for the Groovy script behind this PR

GATE

Scriptable Controller

A controller whose execution strategy is controlled by a Groovy script

GATE

Search Results

Viewer for IR search results

GATE

SearchPR

Provides IR functionality.

GATE

Segment Processing PR

Processes individual segments as separate documents

GATE

Semantic Enrichment PR

The Semantic Enrichment PR allows adding new data to semantic annotations by querying external RDF (Linked Data) repositories.

GATE

Sharable Gazetteer

A list lookup component.

GATE

Show/Hide Resources

Show resources that would otherwise be hidden, e.g. resources created for internal use by other resources

GATE

Simple Schema Viewer

A Simple Annotation Schema Viewer

GATE

Switch Member PR

Sets the focus of a compound document to a specified member document.

GATE

Syntax tree viewer

Viewer for syntax trees generated by a parser.

GATE

Termbank Score Copier

Copy scores from Termbanks back to their source annotations

GATE

Text Categorization PR

Classify text based on a semantic space

GATE

The Duplicator

Duplicate any resource with a right click menu option

GATE

Tweet Normaliser

Normalise text in tweets (convert spelling mistakes, colloquialisms, typing variations and so on into standard English)

GATE

TwitIE (EN)

English TwitIE application

GATE

Twitter Tokenizer (EN)

Tokenizer tuned for Tweets

GATE

UIMA Analysis Engine

Wrapper for a Text Analysis Engine from UIMA.

GATE

Unload Unused Plugins

Unloads all plugins for which we cannot find any loaded instances

GATE

(original) ILSP (UIMA) (5)

The components listed here could not be associated with a known third-party tool collection and are assumed to be original components.

Component	Description	Framework
ILSP Chunker	No description	ILSP (UIMA)
ILSP FBT Tagger	ILSP FBT Tagger is an adaptation of the Brill tagger trained on Greek text.	ILSP (UIMA)
ILSP Lemmatizer	ILSP Lemmatizer assigns lemmas to tokens from Greek texts.	ILSP (UIMA)
ILSP NERC	This module uses a Maximum Entropy NER engine focusing on EL or EN textual newsy data.	ILSP (UIMA)
ILSP Paragraph, Sentence and Token Segmentor	This module is a regex- and abbreviation-based segmentor targeting texts written in Greek.	ILSP (UIMA)

(original) NaCTeM (UIMA) (18)

The components listed here could not be associated with a known third-party tool collection and are assumed to be original components.

Component	Description	Framework
Agreement Evaluator	Reports agreement on annotations coming from different views (sofas).	NaCTeM (UIMA)
Anatomical Entity Tagger	Tags anatomical entities using Brown, UMLS and OBO Anatomy dictionary features	NaCTeM (UIMA)
Annotation Remover	Removes span-of-text annotations.	NaCTeM (UIMA)
Cafetiere Sentence Splitter	Uses a set of heuristics and patterns to find sentence boundaries.	NaCTeM (UIMA)
Dictionary Pluggable Soft TF/IDF Matcher	Tests whether input tokens belong to an entry in the specified dictionary using SecondString Soft TF/IDF.	NaCTeM (UIMA)
Feature Generator	Generates a list of user-defined observations for each token.	NaCTeM (UIMA)
Kleio Search	Uses the Kleio service to fetch MEDLINE abstracts matching a specified query.	NaCTeM (UIMA)
Medical Condition Tagger	A tagger that recognises mentions of medical conditions.	NaCTeM (UIMA)
NeMine	No description	NaCTeM (UIMA)
OSCAR 4 Tokeniser	Segments text into tokens.	NaCTeM (UIMA)
OscarMER	Runs Oscar 3 with maximum entropy based recogniser with syntactic tokens as input	NaCTeM (UIMA)
RO_FDGBank	This reader performs the transformation of the CONLL tab separated text format to the CAS ConllDependency format.	NaCTeM (UIMA)
Reference Evaluator	Reports annotation performance comparing views (sofas) to one selected reference view.	NaCTeM (UIMA)
Regex Annotator	Annotates spans of text based on a custom regular expression.	NaCTeM (UIMA)
SFTP BioNLP Shared Task Data Provider	Reads a corpus in BioNLP Shared Task format from a remote directory on a user-specified server via SFTP.	NaCTeM (UIMA)
Type Mapper	No description	NaCTeM (UIMA)
UMLS Full Dictionary Feature Extractor	Extracts Dictionary features from a UMLS-sourced dictionary	NaCTeM (UIMA)
Yeast Metabliner	Annotates yeast metabolites using a supervised CRF-based NER system.	NaCTeM (UIMA)

(service) AlchemyAPI (2)

Component	Description	Framework
AlchemyAPI: Entity Extraction	Runs the AlchemyAPI Entity Extraction service on a GATE document	GATE
AlchemyAPI: Keyword Extraction	Runs the AlchemyAPI Keyword Extraction service on a GATE document	GATE

(service) CrowdFlower (3)

Component	Description	Framework
Entity Annotation Job Builder	Build a CrowdFlower job asking users to annotate entities within a snippet of text	GATE
Entity Classification Job Builder	Build a CrowdFlower job asking users to select the right label for entities	GATE
Entity Classification Results Importer	Import judgments from a CrowdFlower job created by the Entity Classification Job Builder as GATE annotations.	GATE

(service) Lupedia (1)

Component	Description	Framework
Lupedia Service PR	Runs a lupedia annotation service on a GATE document	GATE

(service) TextRazor (1)

Component	Description	Framework
TextRazor Service PR	Runs the TextRazor annotation service (http://textrazor.com) on a GATE document	GATE

(service) Textalytics (6)

Component	Description	Framework
Textalytics Language Identification	Textalytics Language Identification	GATE
Textalytics Lemmatization, PoS and Parsing	Textalytics Lemmatization, PoS and Parsing	GATE
Textalytics Sentiment Analysis	Textalytics Sentiment Analysis	GATE
Textalytics Spell, Grammar and Style Proofreading	Textalytics Spell, Grammar and Style Proofreading	GATE
Textalytics Text Classification	Textalytics Text Classification	GATE
Textalytics Topics Extraction	Textalytics Topics Extraction	GATE

(service) UAIC (6)

Component	Description	Framework
UAICDiacriticsDescriptor	No description	NaCTeM (UIMA)
UAICLemmav1	Assigns base forms to tokenised text.	NaCTeM (UIMA)
UAICLemmav2	Assigns base forms in Romanian text, given POS-tagged text.	NaCTeM (UIMA)
UAICSegV1	Splits texts into fragments	NaCTeM (UIMA)
UAICTokenizerDescriptor	No description	NaCTeM (UIMA)
UaicPosTagger	Carries out sentence splitting, tokenisation, POS tagging and lemmatisation on plain text.	NaCTeM (UIMA)

ABNER (2)

Component	Description	Framework
ABNER	Wraps the ABNER entity identification system into the UIMA framework.	NaCTeM (UIMA)
ABNER Tagger	GATE wrapper over ABNER	GATE

Arktweet (2)

Component	Description	Framework
ArktweetPosTagger	Wrapper for Twitter Tokenizer and POS Tagger.	DKPro Core (UIMA)
ArktweetTokenizer	ArkTweet tokenizer.	DKPro Core (UIMA)

BANNER (5)

Component	Description	Framework
BANNER CRF Tagger	A UIMA wrapper for BANNER entity tagger.	NaCTeM (UIMA)
Banner Base Tokenizer	Tokens returned by this class consist primarily of contiguous alphanumeric characters or single punctuation marks; however, certain constructs such as real numbers and percentages are recognized and returned as a single token.	NaCTeM (UIMA)
Banner Simple Tokenizer	Tokens output by this tokenizer consist of a contiguous block of alphanumeric characters or a single punctuation mark.	NaCTeM (UIMA)
Banner Whitespace Tokenizer	Instances of this class tokenize sentences only at whitespace characters.	NaCTeM (UIMA)
EngLemmatiser	English lemmatiser which is adapted from WordNet.	NaCTeM (UIMA)

BioCreative (2)

Component	Description	Framework
BioCreative Gene Mention Tagger	Tags Gene mentions using a model trained on BioCreative GM task data, with Entrez Gene and UMLS dictionary features.	NaCTeM (UIMA)
Chemical Entity Recogniser	A named entity recogniser capable of annotating names of chemicals, drugs and metabolites.	NaCTeM (UIMA)

BioLG (1)

Component	Description	Framework
BioLG	Applies BioLG and lp2lp to sentences.	AlvisNLP

BulStem (1)

Component	Description	Framework
BulStem	This plugin is an implementation of the BulStem stemmer algorithm for Bulgarian developed by Preslav Nakov.	GATE

CCG (2)

Component	Description	Framework
CCGParser	Syntax parsing with CCG Parser.	AlvisNLP
CCGPosTagger	Applies the CCG POS tagger on annotations.	AlvisNLP

CRF++ (2)

Component

Description

Framework

CRF++ Tagger

Uses Conditional Random Fields model for labeling.

NaCTeM (UIMA)

CRF++ Trainer

Produces a Conditional Random Fields model.

NaCTeM (UIMA)

Cjf (1)

Component

Description

Framework

CjfNormalizer

Converts traditional Chinese to simplified Chinese or vice-versa.

DKPro Core (UIMA)

ClearNLP (5)

Component

Description

Framework

ClearNlpLemmatizer

Lemmatizer using Clear NLP.

DKPro Core (UIMA)

ClearNlpParser

Clear parser annotator.

DKPro Core (UIMA)

ClearNlpPosTagger

Part-of-Speech annotator using Clear NLP.

DKPro Core (UIMA)

ClearNlpSegmenter

Tokenizer using Clear NLP.

DKPro Core (UIMA)

ClearNlpSemanticRoleLabeler

ClearNLP semantic role labeller.

DKPro Core (UIMA)

EnjuParser (3)

Component

Description

Framework

Enju Parser

A syntactic parser for English.

NaCTeM (UIMA)

EnjuParser

Parses sentences with the ENJU dependency parser.

AlvisNLP

EnjuParser2

synopsis

AlvisNLP

FreeLing (5)

Component

Description

Framework

Freeling Sentence Splitter

Performs tokenisation.

NaCTeM (UIMA)

FreelingMorpho

Performs tokenisation, and determines possible lemmas and POS tags for each token, with confidence scores.

NaCTeM (UIMA)

FreelingShallowParser

Performs tokenisation, lemmatisation, POS tagging and shallow parsing (chunking).

NaCTeM (UIMA)

FreelingTagger

Performs tokenisation, lemmatisation and POS tagging.

NaCTeM (UIMA)

FreelingTokenizer

Performs tokenisation.

NaCTeM (UIMA)

GATE Hepple (5)

Component

Description

Framework

ANNIE POS Tagger

Mark Hepple's Brill-style POS tagger

GATE

Cebuano POS Tagger

Mark Hepple's Brill-style POS tagger, adapted for languages where entries are multiword

GATE

Hepple POS Tagger

Mark Hepple's POS tagger, from dragontools/Banner toolkit.

NaCTeM (UIMA)

HepplePosTagger

GATE Hepple part-of-speech tagger.

DKPro Core (UIMA)

Hindi POS Tagger

Mark Hepple's Brill-style POS tagger, adapted for languages where entries are multiword

GATE

GENIA (5)

Component

Description

Framework

GENIA Dependency Parser

A dependency parser for biomedical text.

NaCTeM (UIMA)

A processing resource that takes document and corpus parameters

GATE

LingPipe Language Identifier PR

Machine learning-based sentence splitter optimized for biomedical texts.

NaCTeM (UIMA)

GENIA Tagger

Tags biological named entities: proteins, cell lines, cell types, DNAs, and RNAs.

NaCTeM (UIMA)

GeniaTagger

Runs Genia Tagger on annotations.

AlvisNLP

HunPos (1)

Component

Description

Framework

HunPosTagger

Part-of-Speech annotator using HunPos.

DKPro Core (UIMA)

IULA (2)

Component

Description

Framework

IULATagger

Performs paragraph splitting, sentence splitting, tokenisation and POS tagging.

NaCTeM (UIMA)

IULATokenizer

Performs paragraph splitting, sentence splitting, and tokenisation.

NaCTeM (UIMA)

JTok (1)

Component

Description

Framework

JTokSegmenter

JTok segmenter.

DKPro Core (UIMA)

Java BreakIterator (2)

Component

Description

Framework

Naive lexicon-based lemmatizer.

DKPro Core (UIMA)

LanguageToolSegmenter

Segmenter using LanguageTool to do the heavy lifting.

DKPro Core (UIMA)

LingPipe (6)

Component

Description

Framework

GATE PR for language identification using LingPipe

GATE

LingPipe NER PR

LingPipe Named Entity Recognizer

GATE

LingPipe POS Tagger PR

Provides a LingPipe part of speech tagger.

GATE

LingPipe Sentence Splitter

Sentence splitter based on LingPipe models.

NaCTeM (UIMA)

LingPipe Sentence Splitter PR

Provides an interface to LingPipe sentence splitter API.

GATE

LingPipe Tokenizer PR

Provides a LingPipe tokenizer.

GATE

Lucene/Solr (1)

Component

Description

Framework

Lucene IR Engine

No description

GATE

MLRS (3)

Component

Description

Framework

MLRS Maltese Tokeniser

Tokenises Maltese text

NaCTeM (UIMA)

MLRS Paragraph Splitter

Identifies the paragraphs in the text, creating a Paragraph annotation for each one

NaCTeM (UIMA)

MLRS Sentence Splitter

Identifies the sentences in the text, creating a Sentence annotation for each

NaCTeM (UIMA)

Mallet (1)

Component

Description

Framework

MalletTopicModelInferencer

Infers the topic distribution over documents using a Mallet ParallelTopicModel.

DKPro Core (UIMA)

MaltParser (2)

Component

Description

Framework

ILSP Dependency Parser

ILSP Dependency Parser is a tool trained on the Greek Dependency Treebank (Prokopidis et al., 2005), a resource which comprises data annotated at several linguistic levels.

ILSP (UIMA)

MaltParser

Dependency parsing using MaltParser.

DKPro Core (UIMA)

Mate Tools (4)

Component

Description

Framework

MateLemmatizer

DKPro Annotator for the MateToolsLemmatizer.

DKPro Core (UIMA)

MateMorphTagger

DKPro Annotator for the MateToolsMorphTagger.

DKPro Core (UIMA)

Component

Description

Framework

OpenNLP Chunker

Chunker using an OpenNLP maxent model

GATE

OpenNLP NER

NER PR using a set of OpenNLP maxent models

GATE

OpenNLP POS Tagger

POS Tagger using an OpenNLP maxent model

GATE

OpenNLP Parser

Syntactic parser from Apache OpenNLP

GATE

OpenNLP Sentence Splitter

Sentence splitter using an OpenNLP maxent model

GATE

OpenNLP Tokenizer

Tokenizer using an OpenNLP maxent model

GATE

OpenNLPNEDetector

Detects named entities in text and creates corresponding entity annotations that span the found entities.

NaCTeM (UIMA)

OpenNLPParser

Parse the document and create phrasal and clausal annotations over the text.

NaCTeM (UIMA)

OpenNLPSentenceDetector

Detect sentence boundaries and create sentence annotations that span these boundaries.

NaCTeM (UIMA)

OpenNLPTokenizer

Tokenize the text and create token annotations that span the tokens.

NaCTeM (UIMA)

OpenNlpChunker

Chunk annotator using OpenNLP.

DKPro Core (UIMA)

OpenNlpNamedEntityRecognizer

OpenNLP name finder wrapper.

DKPro Core (UIMA)

OpenNlpParser

OpenNLP parser.

DKPro Core (UIMA)

OpenNlpPosTagger

Part-of-Speech annotator using OpenNLP.

DKPro Core (UIMA)

OpenNlpSegmenter

Tokenizer and sentence splitter using OpenNLP.

DKPro Core (UIMA)

Penn Bio-Tools (5)

Component

Description

Framework

Penn BioTagger

Ready-made application for the Penn BioTagger

GATE

Penn BioTagger: Genes

Penn BioTagger for Genes

GATE

Penn BioTagger: Malignancy

Penn BioTagger for malignancy types

GATE

Penn BioTagger: Variation

Penn BioTagger for variations

GATE

Penn BioTokenizer

Tokenizer for biomedical text

GATE

Porter Stemmer (1)

Component

Description

Framework

PorterStemmer

synopsis

AlvisNLP

RASP (5)

Component

Description

Framework

RASP POS Converter

Converts from PennTreebank POS tags to the C2 tagset used by RASP.

GATE

RASP2 Morphological Analyser

RASP morphological analyser, which adds lemma and suffix to the WordForm annotations produced by the RASP POS tagger (or the ANNIE POS tagger plus the RASP converter)

GATE

RASP2 POS Tagger

RASP part-of-speech tagger, creating WordForm annotations

GATE

RASP2 Parser

RASP dependency parser

GATE

RASP2 Tokenizer

RASP2 Tokenizer.

GATE

RfTagger (1)

Component

Description

Framework

RfTagger

Rftagger morphological analyzer.

DKPro Core (UIMA)

SPECIES (2)

Component

Description

Framework

Species

Calls the Species taxon tagger.

AlvisNLP

Species Tagger

Tags species

NaCTeM (UIMA)

STePP (1)

Component

Description

Framework

Stepp Tagger

No description

NaCTeM (UIMA)

SVMLight (2)

Component

Description

Framework

SVMLight Tagger

Applies an SVMLight-trained model on instances.

NaCTeM (UIMA)

SVMLight Trainer

Produces an SVMLight model based on user-specified learning parameters.

NaCTeM (UIMA)

Sfst (1)

Component

Description

Framework

SfstAnnotator

Sfst morphological analyzer.

DKPro Core (UIMA)

Snowball (2)

Component

Description

Framework

SnowballStemmer

UIMA wrapper for the Snowball stemmer.

DKPro Core (UIMA)

Stemmer PR

Wrapper for the Snowball stemmer.

GATE

Stanford (17)

Component

Description

Framework

English Dependency Parser

Ready-made application for Stanford English parser

GATE

English POS Tagger and Dependency Parser

Ready-made application for Stanford English POS tagger and parser

GATE

Stanford Dependency Parser

Generates Stanford-style dependencies together with POS tokens for English.

NaCTeM (UIMA)

Stanford NER

Stanford Named Entity Recogniser

GATE

Stanford POS Tagger

Stanford Part-of-Speech Tagger

GATE

Stanford PTB Tokenizer

Stanford Penn Treebank v3 Tokenizer, for English

GATE

StanfordCoreferenceResolver

No description

DKPro Core (UIMA)

StanfordDependencyConverter

Converts a constituency structure into a dependency structure.

DKPro Core (UIMA)

StanfordLemmatizer

Stanford Lemmatizer component.

DKPro Core (UIMA)

StanfordNER

synopsis

AlvisNLP

StanfordNamedEntityRecognizer

Stanford Named Entity Recognizer component.

DKPro Core (UIMA)

Stanford parser wrapper

GATE

TermRaider English Term Extraction

Stanford Parser component.

DKPro Core (UIMA)

StanfordPosTagger

Stanford Part-of-Speech tagger component.

DKPro Core (UIMA)

StanfordPtbTransformer

Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style.

DKPro Core (UIMA)

StanfordSegmenter

No description

DKPro Core (UIMA)

Twitter POS Tagger (EN)

Stanford POS tagger trained on Tweets

GATE

TermRaider (6)

Component

Description

Framework

AnnotationTermbank

TermRaider Termbank derived from document annotations

GATE

HyponymyTermbank

TermRaider Termbank derived from head/string hyponymy

GATE

Pairbank Viewer

Viewer for the TermRaider Pairbank

GATE

Example application showing typical set-up for the TermRaider tools

GATE

Termbank Viewer

Viewer for the TermRaider Termbank

GATE

TfIdfTermbank

TermRaider Termbank derived from vectors in document features

GATE

TextCat (3)

Component

Description

Framework

LanguageIdentifier

Detection based on character n-grams.

DKPro Core (UIMA)

TextCat Fingerprint Generator

Generate language fingerprints for use with the TextCat Language Identification PR

GATE

TextCat Language Identification

Recognizes the document language using TextCat

GATE

TreeTagger (3)

Component

Description

Framework

TreeTagger

Runs tree-tagger.

AlvisNLP

TreeTaggerChunker

Chunk annotator using TreeTagger.

DKPro Core (UIMA)

TreeTaggerPosTagger

Part-of-Speech and lemmatizer annotator using TreeTagger.

DKPro Core (UIMA)

Web1T (1)

Component

Description

Framework

LanguageDetectorWeb1T

Language detector based on n-gram frequency counts, e.g. as provided by Web1T

DKPro Core (UIMA)

WordNet (4)

Component

Description

Framework

SemanticFieldAnnotator

This Analysis Engine annotates English single words with semantic field information retrieved from an ExternalResource.

DKPro Core (UIMA)

WordNet

GATE

WordNet 1.6

Princeton WordNet 1.6.

GATE

WordNet Viewer

WordNet viewer

GATE

Yatea (2)

Component

Description

Framework

YateaExtractor

Extract terms from the corpus using the YaTeA term extractor.

AlvisNLP

YateaProjector

synopsis

AlvisNLP

Zemanta (1)

Component

Description

Framework

Zemanta Service PR

Runs a zemanta annotation service on a GATE document

GATE

I/O components by format

Uncategorized (47)

Component

Description

Framework

ACE Corpus Reader

Reads ...

NaCTeM (UIMA)

ADBWriter

synopsis

AlvisNLP

AlvisAEReader

reads documents and annotations from an AlvisAE campaign.

AlvisNLP

AlvisAEReader2

reads documents and annotations from an AlvisAE campaign.

AlvisNLP

AlvisDBIndexer

synopsis

AlvisNLP

AlvisIRIndexer

synopsis

AlvisNLP

AnimalReader

Project-specific file reader.

AlvisNLP

BIO Format Collection Reader

Reads BIO format files from specified directory.

NaCTeM (UIMA)

BIO Format Writer Cas Consumer

Writes specified types of annotations to the specified directory in the BIO format.

NaCTeM (UIMA)

BioC Reader

Reads a file in BioC format.

NaCTeM (UIMA)

BioC Writer

Writes BioC annotations to files.

NaCTeM (UIMA)

BioCreative CHEMDNER Reader

Reads data prepared specifically for the BioCreative IV's CHEMDNER track.

NaCTeM (UIMA)

BioNLP ST Data Reader

Reads files formatted for the BioNLP Shared Task series and outputs documents with named entity, relation and event annotations.

NaCTeM (UIMA)

BioNLP ST Data Writer

Writes BioNLP entity and event annotations to files.

NaCTeM (UIMA)

BlikiWikipediaReader

Bliki-based Wikipedia reader.

DKPro Core (UIMA)

CombinationReader

Combines multiple readers into a single reader.

DKPro Core (UIMA)

Configurable Exporter

Allows annotations to be exported according to a specified format.

GATE

Entity Annotation Results Importer

Import judgments from a CrowdFlower job created by the Entity Annotation Job Builder as GATE annotations.

GATE

ExpressionExtract

Write elements in a tab separated file.

AlvisNLP

FillDB

Stores the corpus into a SQL database.

AlvisNLP

Flexible Exporter

Exports a document with GATE annotations to its original format.

GATE

HtmlReader

Reads the contents of a given URL and strips the HTML.

DKPro Core (UIMA)

ILSP File System Collection Reader

Reads files from the filesystem.

ILSP (UIMA)

LIBSVMReader

Reads a dataset in LIBSVM format

NaCTeM (UIMA)

Legacy Coref Data Writer

A simple PR that converts co-reference data from the Relations-based model to the legacy format (based on 'matches' annotation and document features).

GATE

MalletTopicProportionsWriter

Writes topic proportions to a file. The output depends on the TopicDistribution annotation, which should have been created by MalletTopicModelInferencer beforehand.

DKPro Core (UIMA)

MalletTopicsProportionsSortedWriter

Write the topic proportions according to an LDA topic model to an output file.

DKPro Core (UIMA)

PubTatorReader

synopsis

AlvisNLP

Shared Task 2004 Reader

Reads training or evaluation data from the BioNLP/NLPBA 2004 Bio-Entity Recognition Task

NaCTeM (UIMA)

TGrepWriter

TGrep2 corpus file writer.

DKPro Core (UIMA)

TSV Reader

No description

NaCTeM (UIMA)

TSV Writer

Saves annotations of a selected type to a file in tab-separated-value format.

NaCTeM (UIMA)

TabularExport

Writes the corpus data structure in files in tabular format.

AlvisNLP

TabularReader

synopsis

AlvisNLP

TfidfConsumer

This consumer builds a DfModel.

DKPro Core (UIMA)

TreeTaggerReader

Reads files in tree-tagger output format and creates a document for each file read.

AlvisNLP

Twitter Collection Reader

No description

NaCTeM (UIMA)

Twitter Corpus Populator

Populate a corpus from Twitter JSON containing multiple Tweets

GATE

TwitterDatabaseConsumer

No description

NaCTeM (UIMA)

WebOfKnowledgeReader

Reads Web of Knowledge search result import files.

AlvisNLP

WhatsWrongExport

Writes files in What's Wrong with my NLP format.

AlvisNLP

WikipediaArticleInfoReader

Reads all general article infos without retrieving the whole Page objects

DKPro Core (UIMA)

WikipediaDiscussionReader

Reads all discussion pages.

DKPro Core (UIMA)

WikipediaLinkReader

Read links from Wikipedia.

DKPro Core (UIMA)

WikipediaQueryReader

Reads all article pages that match a query created by the numerous parameters of this class.

DKPro Core (UIMA)

WikipediaRevisionPairReader

Reads pairs of adjacent revisions of all articles.

DKPro Core (UIMA)

WikipediaRevisionReader

Reads Wikipedia page revisions.

DKPro Core (UIMA)

AclAnthology (1)

Component

Description

Framework

AclAnthologyReader

Reads the ACL Anthology corpus and outputs CASes with plain text documents.

DKPro Core (UIMA)

Alvis Enriched Document (1)

Component

Description

Framework

EnrichedDocumentWriter

Writes the corpus in the infamous Alvis Enriched Document Format suitable for indexation with Zebra-Alvis.

AlvisNLP

BNC (1)

Component

Description

Framework

BncReader

Reader for the British National Corpus (XML version).

DKPro Core (UIMA)

BioNLP Shared Task (2)

Component

Description

Framework

GeniaReader

Reads text files and their associated annotation files in BioNLP Shared Task format.

AlvisNLP

GeniaWriter

Component

Description

Framework

ExportCadixeJSON

Writes each document in a file in the AlvisAE protocol format.

AlvisNLP

CoNLL 2000 (2)

Component

Description

Framework

Conll2000Reader

Reads the CoNLL 2000 chunking format.

DKPro Core (UIMA)

Conll2000Writer

Writes the CoNLL 2000 chunking format.

DKPro Core (UIMA)

CoNLL 2002 (2)

Component

Description

Framework

Conll2002Reader

Reads the CoNLL 2002 named entity format.

DKPro Core (UIMA)

Conll2002Writer

Writes the CoNLL 2002 named entity format.

DKPro Core (UIMA)

CoNLL 2006 (2)

Component

Description

Framework

Conll2006Reader

Reads a file in the CoNLL-2006 format (aka CoNLL-X).

DKPro Core (UIMA)

Conll2006Writer

Component

Description

Framework

GATE .cochrane.txt document format

Load this to allow the opening of Cochrane text documents, and choose the mime type "text/x-cochrane", or use the correct file extension.

GATE

DataSift JSON (1)

Component

Description

Framework

GATE DataSift JSON Document Format

Format parser for DataSift JSON files

GATE

Factored Tag Lem (1)

Component

Description

Framework

Factored Tag Lem Consumer

Writes sentences from the CAS in the Factored Tag Lem format

ILSP (UIMA)

Fast Infoset (2)

Component

Description

Framework

Fast Infoset Document Format

Format parser for GATE XML stored in the binary Fast Infoset format

GATE

Fast Infoset Exporter

Export GATE documents to GATE XML stored in the binary Fast Infoset format

GATE

GATE JSON (1)

Component

Description

Framework

GATE JSON Exporter

Export documents and corpora in JSON format

GATE

GATE XML (2)

Component

Description

Framework

Component

Description

Framework

Penn Treebank combined format writer.

DKPro Core (UIMA)

Prague Markup Language (1)

Component

Description

Framework

ILSP PML Cas Consumer

Writes sentences from the CAS in the Prague Markup Language format for editing dependency structures in TrEd

ILSP (UIMA)

PubMed (2)

Component

Description

Framework

GATE .pubMed.txt document format

Load this to allow the opening of PubMed text documents, and choose the mime type "text/x-pubmed", or use the correct file extension.

GATE

PubMed Abstract Reader

Fetches PubMed abstracts from NaCTeM's Kleio service.

NaCTeM (UIMA)

RDF (3)

Component

Description

Framework

RDF Reader

Reads Common Annotation Structures (CASes) from RDF-encoded files.

NaCTeM (UIMA)

RDF Writer

Saves Common Annotation Structures into RDF files.

NaCTeM (UIMA)

RDFExport

synopsis

AlvisNLP

RTF (1)

Component

Description

Framework

RTFReader

Reads RTF (Rich Text Format) files.

DKPro Core (UIMA)

Relp (1)

Component

Description

Framework

RelpWriter

Writes the corpus in relp format.

AlvisNLP

Reuters-21578 (2)

Component

Description

Framework

Reuters21578SgmlReader

Read a Reuters-21578 corpus in SGML format.

DKPro Core (UIMA)

Reuters21578TxtReader

Read a Reuters-21578 corpus that has been transformed into text format using ExtractReuters in the lucene-benchmarks project.

DKPro Core (UIMA)

Solr (1)

Component

Description

Framework

SolrWriter

A simple implementation of SolrWriter_ImplBase

DKPro Core (UIMA)

TEI-XML (4)

Component

Description

Framework

Aimed Collection Reader

Reads the Aimed corpus (225 abstracts from MEDLINE) with gold-standard sentence, protein, and protein-protein interaction annotations.

NaCTeM (UIMA)

TeiReader

Reader for the TEI XML.

DKPro Core (UIMA)

TeiWriter

UIMA CAS consumer writing the CAS document text in TEI format.

DKPro Core (UIMA)

WikipediaTemplateFilteredArticleReader

Reads all pages that contain or do not contain the templates specified in the template whitelist and template blacklist.

DKPro Core (UIMA)

TIGER-XML (2)

Component

Description

Framework

TigerXmlReader

UIMA collection reader for TIGER-XML files.

DKPro Core (UIMA)

TigerXmlWriter

UIMA CAS consumer writing the CAS document text in the TIGER-XML format.

DKPro Core (UIMA)

Text (14)

Component

Description

Framework

AssertAnnotations$InternalStringReader

Descriptor automatically generated by uimaFIT

DKPro Core (UIMA)

EuropePMC Open Access Reader

Reads open-access full-text articles from the Europe PMC web service

NaCTeM (UIMA)

FSOVFileReader

Project-specific text file reader.

AlvisNLP

Input Text Reader

Reads text supplied in a parameter.

NaCTeM (UIMA)

Merge GENIA-coref with -term Collection Reader

Reads GENIA-coref files and GENIA-event/-term files and merges each pair into one CAS.

NaCTeM (UIMA)

SFTP Document Reader

Reads plain-text documents from a remote directory on a user-specified server via SFTP.

NaCTeM (UIMA)

Simplified Text Exporter

Simplified text exporter (plain text output)

GATE

StringReader

Simple reader that generates a CAS from a String.

DKPro Core (UIMA)

TextFileReader

Reads files and adds a document in the corpus for each file.

AlvisNLP

TextReader

UIMA collection reader for plain text files.

DKPro Core (UIMA)

TextWriter

UIMA CAS consumer writing the CAS document text as a plain text file.

DKPro Core (UIMA)

TokenizedTextWriter

Writes a set of pre-processed documents into one large text file containing one sentence per line, with tokens separated by whitespace.

DKPro Core (UIMA)

WikipediaArticleReader

Reads all article pages.

DKPro Core (UIMA)

WikipediaPageReader

Reads all Wikipedia pages in the database (articles, discussions, etc).

DKPro Core (UIMA)

TüPP-D/Z (1)

Component

Description

Framework

TueppReader

UIMA collection reader for Tübingen Partially Parsed Corpus of Written German (TüPP-D/Z) XML files.

DKPro Core (UIMA)

Twitter JSON (1)

Component

Description

Framework

GATE JSON Tweet Document Format

Format parser for Twitter JSON files

GATE

UIMA Binary CAS (4)

Component

Description

Framework

BinaryCasReader

UIMA Binary CAS formats reader.

DKPro Core (UIMA)

BinaryCasWriter

Write CAS in one of the UIMA binary formats.

DKPro Core (UIMA)

SerializedCasReader

No description

DKPro Core (UIMA)

SerializedCasWriter

No description

DKPro Core (UIMA)

UIMA CAS Dump (1)

Component

Description

Framework

CasDumpWriter

Dumps CAS content to a text file.

DKPro Core (UIMA)

UIMA JSON (1)

Component

Description

Framework

JsonWriter

UIMA JSON format writer.

DKPro Core (UIMA)

Web1T (1)

Component

Description

Framework

Web1TWriter

Web1T n-gram index format writer.

DKPro Core (UIMA)

XCES (2)

Component

Description

Framework

ILSP XCES Consumer

Writes sentences from the CAS to the XCES format

ILSP (UIMA)

XcesReaderDescriptor

Reads XCES XML files.

ILSP (UIMA)

XMI (7)

Component

Description

Framework

ILSP Xmi Writer CAS Consumer

Serializes the CAS to XMI.

ILSP (UIMA)

SFTP XMI Reader

Reads an XMI-formatted corpus from an SFTP-enabled server.

NaCTeM (UIMA)

SFTP XMI Writer

Saves Common Annotation Structures to an SFTP server

NaCTeM (UIMA)

XMI Reader

Reads common annotation structures (CAS) from files in XMI format.

NaCTeM (UIMA)

XMI Writer

Serialises entire common annotation structures (CAS) to XMI format.

NaCTeM (UIMA)

XmiReader

Reader for UIMA XMI files.

DKPro Core (UIMA)

XmiWriter

UIMA XMI format writer.

DKPro Core (UIMA)

XML (12)

Component

Description

Framework

ExportAlignmentPR

A PR to export alignment information to an XML file.

GATE

InlineXmlWriter

Writes an approximation of the content of a textual CAS as an inline XML file.

DKPro Core (UIMA)

MediaWiki Corpus Populater

Populate a corpus from a MediaWiki XML dump

GATE

MediaWiki XML Document Format

Deprecated MediaWiki importer

GATE

XMLReader

Reads a corpus in XML files.

AlvisNLP

XMLReader2

Reads XML files and creates elements.

AlvisNLP

XMLWriter

Writes an XML serialization of the corpus into a file.

AlvisNLP

XMLWriter2

Writes the corpus data structure into a file via an XSLT stylesheet.

AlvisNLP

XMLWriter2ForINIST

synopsis

AlvisNLP

XmlReader

Reader for XML files.

DKPro Core (UIMA)

XmlTextReader

No description

DKPro Core (UIMA)

XmlXPathReader

A component reader for XML files implemented with XPath.

DKPro Core (UIMA)

Component details

Uncategorized (132)

ANNIE NE Transducer

Category: Uncategorized
Framework: GATE
Version: unknown

ANNIE named entity grammar.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
annotationAccessors	—	java.util.List	—	—	—	—
corpus	—	gate.Corpus	—	—	—	true
document	—	gate.Document	—	—	—	true
enableDebugging	—	java.lang.Boolean	—	false	—	true
encoding	—	java.lang.String	—	UTF-8	—	—
grammarURL	—	java.net.URL	—	resources/NE/main.jape	—	—
inputASName	—	java.lang.String	—	—	—	true
operators	—	java.util.List	—	—	—	—
outputASName	—	java.lang.String	—	—	—	true

ANNIE OrthoMatcher

Category: Uncategorized
Framework: GATE
Version: unknown

ANNIE orthographical coreference component.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
annotationSetName	—	java.lang.String	—	—	—	true
annotationTypes	—	java.util.List	—	Organization;Person;Location;Date	—	true
caseSensitive	—	java.lang.Boolean	—	false	—	—
corpus	—	gate.Corpus	—	—	—	true
definitionFileURL	—	java.net.URL	—	resources/othomatcher/listsNM.def	—	—
document	—	gate.Document	—	—	—	true
encoding	—	java.lang.String	—	UTF-8	—	—
extLists	—	java.lang.Boolean	—	true	—	—
highPrecisionOrgs	—	java.lang.Boolean	—	false	—	—
minimumNicknameLikelihood	—	java.lang.Double	—	0.50	—	—
organizationType	—	java.lang.String	—	Organization	—	—
personType	—	java.lang.String	—	Person	—	—
processUnknown	—	java.lang.Boolean	—	true	—	—

ANNIE+Measurements

Category: Uncategorized
Framework: GATE
Version: unknown

Ready-made application for ANNIE plus the measurement tagger

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

—

java.util.List

—

pipelineURL

—

java.net.URL

—

Ab3P

Category: Uncategorized
Framework: AlvisNLP
Version:

synopsis

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
active	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
constantAnnotationFeatures	—	alvisnlp.module.types.Mapping	False	—	—	—
constantRelationFeatures	—	alvisnlp.module.types.Mapping	False	—	—	—
constantTupleFeatures	—	alvisnlp.module.types.Mapping	False	—	—	—
documentFilter	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
installDir	—	org.bibliome.util.files.InputDirectory	True	—	—	—
longFormFeature	—	java.lang.String	True	—	—	—
longFormRole	—	java.lang.String	True	—	—	—
longFormsLayerName	—	java.lang.String	True	—	—	—
relationName	—	java.lang.String	True	—	—	—
sectionFilter	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
shortFormRole	—	java.lang.String	True	—	—	—
shortFormsLayerName	—	java.lang.String	True	—	—	—

Action

Category: Uncategorized
Framework: AlvisNLP
Version: 2012-04-30

Applies action expressions on selected elements.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
action	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
active	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
addToLayer	—	java.lang.Boolean	False	—	—	—
constantAnnotationFeatures	—	alvisnlp.module.types.Mapping	False	—	—	—
constantDocumentFeatures	—	alvisnlp.module.types.Mapping	False	—	—	—
constantRelationFeatures	—	alvisnlp.module.types.Mapping	False	—	—	—
constantSectionFeatures	—	alvisnlp.module.types.Mapping	False	—	—	—
constantTupleFeatures	—	alvisnlp.module.types.Mapping	False	—	—	—
createAnnotations	—	java.lang.Boolean	False	—	—	—
createDocuments	—	java.lang.Boolean	False	—	—	—
createRelations	—	java.lang.Boolean	False	—	—	—
createSections	—	java.lang.Boolean	False	—	—	—
createTuples	—	java.lang.Boolean	False	—	—	—
deleteElements	—	java.lang.Boolean	False	—	—	—
removeFromLayer	—	java.lang.Boolean	False	—	—	—
setArguments	—	java.lang.Boolean	False	—	—	—
setFeatures	—	java.lang.Boolean	False	—	—	—
target	—	alvisnlp.corpus.expressions.Expression	True	—	—	—

AggregateValues

Category: Uncategorized
Framework: AlvisNLP
Version:

synopsis

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
active	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
aggregators	—	org.bibliome.alvisnlp.modules.aggregate.Aggregator[]	True	—	—	—
entries	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
key	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
outFile	—	org.bibliome.util.streams.TargetStream	True	—	—	—
separator	—	java.lang.Character	True	—	—	—

Agreement Evaluator

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 1.0

Reports agreement on annotations coming from different views (sofas).

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
OutputFile	—	String	True	—	false	—

AlchemyAPI: Entity Extraction

Category: Uncategorized
Framework: GATE
Version: unknown

Runs the AlchemyAPI Entity Extraction service on a GATE document

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
annotationType	—	java.lang.String	—	Mention	—	true
apiKey	—	java.lang.String	—	—	—	—
corpus	—	gate.Corpus	—	—	—	true
document	—	gate.Document	—	—	—	true
inputASName	—	java.lang.String	—	—	—	true
numberOfSentencesInBatch	—	java.lang.Integer	—	—	—	true
numberOfSentencesInContext	—	java.lang.Integer	—	—	—	true
outputASName	—	java.lang.String	—	—	—	true

AlchemyAPI: Keyword Extraction

Category: Uncategorized
Framework: GATE
Version: unknown

Runs the AlchemyAPI Keyword Extraction service on a GATE document

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
annotationType	—	java.lang.String	—	Keyword	—	true
apiKey	—	java.lang.String	—	—	—	—
corpus	—	gate.Corpus	—	—	—	true
document	—	gate.Document	—	—	—	true
inputASName	—	java.lang.String	—	—	—	true
numberOfSentencesInBatch	—	java.lang.Integer	—	—	—	true
numberOfSentencesInContext	—	java.lang.Integer	—	—	—	true
outputASName	—	java.lang.String	—	—	—	true

AlvisREPrepareCrossValidation

Category: Uncategorized
Framework: AlvisNLP
Version:

synopsis

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
active	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
cParameter	—	java.lang.Double	True	—	—	—
dependencies	—	org.bibliome.alvisnlp.modules.alvisre.AlvisRERelations	True	—	—	—
documentFilter	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
folds	—	java.lang.Integer	True	—	—	—
outDir	—	org.bibliome.util.files.OutputDirectory	True	—	—	—
relations	—	org.bibliome.alvisnlp.modules.alvisre.AlvisRERelations[]	True	—	—	—
schema	—	org.w3c.dom.DocumentFragment	True	—	—	—
sectionFilter	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
sectionSeparator	—	java.lang.String	True	—	—	—
sentences	—	org.bibliome.alvisnlp.modules.alvisre.AlvisRETokens	True	—	—	—
similarityFunction	—	org.w3c.dom.DocumentFragment	True	—	—	—
terms	—	org.bibliome.alvisnlp.modules.alvisre.AlvisRETokens[]	True	—	—	—
words	—	org.bibliome.alvisnlp.modules.alvisre.AlvisRETokens	True	—	—	—

AnchorTuples

Category: Uncategorized
Framework: AlvisNLP
Version: 2012-04-30

Creates tuples with a common argument.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
active	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
anchor	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
anchorRole	—	java.lang.String	True	—	—	—
arguments	—	alvisnlp.module.types.ExpressionMapping	True	—	—	—
constantRelationFeatures	—	alvisnlp.module.types.Mapping	False	—	—	—
constantTupleFeatures	—	alvisnlp.module.types.Mapping	False	—	—	—
documentFilter	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
relationName	—	java.lang.String	True	—	—	—
sectionFilter	—	alvisnlp.corpus.expressions.Expression	True	—	—	—

Annotation Remover

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 1.0

Removes span-of-text annotations.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
mode	Set to 'remove' if you wish to remove annotations of the types given in 'types'. Set to 'retain' if you wish to retain only the annotations of the types given in 'types'.	String	True	—	false	—
types	List of annotation types.	String	True	—	true	—
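
The two values of 'mode' amount to a simple filter over annotation types. The following is a minimal sketch of that logic in plain Java. It assumes a hypothetical Span class in place of real UIMA annotations, and the class and method names are illustrative; it is not the component's own code and does not use the UIMA API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AnnotationFilterSketch {
    /** Hypothetical stand-in for a span-of-text annotation. */
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /**
     * Returns the annotations that survive the filter.
     * mode "remove": drop annotations whose type is listed in types.
     * mode "retain": keep only annotations whose type is listed in types.
     */
    static List<Span> apply(List<Span> annotations, String mode, Set<String> types) {
        boolean removeListed;
        if ("remove".equals(mode)) {
            removeListed = true;
        } else if ("retain".equals(mode)) {
            removeListed = false;
        } else {
            throw new IllegalArgumentException("mode must be 'remove' or 'retain': " + mode);
        }
        List<Span> kept = new ArrayList<>();
        for (Span a : annotations) {
            boolean listed = types.contains(a.type);
            // Keep unlisted types when removing, listed types when retaining.
            if (listed != removeListed) {
                kept.add(a);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Span> doc = Arrays.asList(
                new Span("Token", 0, 3), new Span("Protein", 0, 3), new Span("Sentence", 0, 20));
        Set<String> types = new HashSet<>(Arrays.asList("Token", "Sentence"));
        System.out.println(apply(doc, "remove", types).size()); // only the Protein span is left
        System.out.println(apply(doc, "retain", types).size()); // the Token and Sentence spans are left
    }
}
```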

AnnotationTermbank

Category: Uncategorized
Framework: GATE
Version: unknown

TermRaider Termbank derived from document annotations

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
corpora	—	java.util.Set	—	—	—	—
debugMode	—	java.lang.Boolean	—	false	—	—
idDocumentFeature	—	java.lang.String	—	—	—	—
inputASName	—	java.lang.String	—	—	—	—
inputAnnotationFeature	—	java.lang.String	—	canonical	—	—
inputAnnotationTypes	—	java.util.Set	—	SingleWord;MultiWord	—	—
inputScoreFeature	—	java.lang.String	—	localAugTfIdf	—	—
languageFeature	—	java.lang.String	—	lang	—	—
mergingMode	—	gate.termraider.modes.MergingMode	—	MAXIMUM	—	—
normalization	—	gate.termraider.modes.Normalization	—	Sigmoid	—	—
scoreProperty	—	java.lang.String	—	tfIdfAug	—	—

AntecedentChoice

Category: Uncategorized
Framework: AlvisNLP
Version: 2012-04-30

Biotopes-specific module: chooses an antecedent.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
active	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
documentFilter	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
sectionFilter	—	alvisnlp.corpus.expressions.Expression	True	—	—	—

Arabic Gazetteer Collector

Category: Uncategorized
Framework: GATE
Version: unknown

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

—

java.util.List

—

Arabic

—

pipelineURL

—

java.net.URL

—

resources/arabic_lists_collector.gapp

—

Arabic Main Grammar

Category: Uncategorized
Framework: GATE
Version: unknown

A module for executing Jape grammars.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
annotationAccessors	—	java.util.List	—	—	—	—
binaryGrammarURL	—	java.net.URL	—	—	—	—
corpus	—	gate.Corpus	—	—	—	true
document	—	gate.Document	—	—	—	true
enableDebugging	—	java.lang.Boolean	—	false	—	true
encoding	—	java.lang.String	—	UTF-8	—	—
grammarURL	—	java.net.URL	—	resources/grammar/main.jape	—	—
inputASName	—	java.lang.String	—	—	—	true
ontology	—	gate.creole.ontology.Ontology	—	—	—	true
operators	—	java.util.List	—	—	—	—
outputASName	—	java.lang.String	—	—	—	true

Arabic OrthoMatcher

Category: Uncategorized
Framework: GATE
Version: unknown

ANNIE orthographical coreference component.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
annotationSetName	—	java.lang.String	—	—	—	true
annotationTypes	—	java.util.List	—	Organization;Person;Location;Date	—	true
caseSensitive	—	java.lang.Boolean	—	false	—	—
corpus	—	gate.Corpus	—	—	—	true
definitionFileURL	—	java.net.URL	—	resources/orthomatcher/listsNM.def	—	—
document	—	gate.Document	—	—	—	true
encoding	—	java.lang.String	—	UTF-8	—	—
extLists	—	java.lang.Boolean	—	true	—	—
highPrecisionOrgs	—	java.lang.Boolean	—	false	—	—
minimumNicknameLikelihood	—	java.lang.Double	—	0.50	—	—
organizationType	—	java.lang.String	—	Organization	—	—
personType	—	java.lang.String	—	Person	—	—
processUnknown	—	java.lang.Boolean	—	true	—	—

Assert

Category: Uncategorized
Framework: AlvisNLP
Version:

Tests an assertion on specified elements.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
active	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
assertion	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
severe	—	java.lang.Boolean	True	—	—	—
stopAt	—	java.lang.Integer	False	—	—	—
target	—	alvisnlp.corpus.expressions.Expression	True	—	—	—

AssertAnnotations$InternalJCasHolder

Category: Uncategorized
Framework: DKPro Core (UIMA)
Version: 1.8.0

Descriptor automatically generated by uimaFIT

AttestedTermsProjector

Category: Uncategorized
Framework: AlvisNLP
Version: 2010-10-28

Projects a list of terms given in tree-tagger format.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
active	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
constantAnnotationFeatures	—	alvisnlp.module.types.Mapping	False	—	—	—
documentFilter	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
errorDuplicateValues	—	java.lang.Boolean	False	—	—	—
ignoreCase	—	java.lang.Boolean	False	—	—	—
ignoreDiacritics	—	java.lang.Boolean	False	—	—	—
ignoreWhitespace	—	java.lang.Boolean	False	—	—	—
lemmaFeatureName	—	java.lang.String	True	—	—	—
lemmaKeys	—	java.lang.Boolean	True	—	—	—
multipleValueAction	—	org.bibliome.alvisnlp.modules.projectors.MultipleValueAction	True	—	—	—
normalizeSpace	—	java.lang.Boolean	False	—	—	—
posFeatureName	—	java.lang.String	True	—	—	—
sectionFilter	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
subject	—	org.bibliome.alvisnlp.modules.projectors.Subject	True	—	—	—
targetLayerName	—	java.lang.String	True	—	—	—
termFeatureName	—	java.lang.String	False	—	—	—
termsFile	—	org.bibliome.util.streams.SourceStream	True	—	—	—

BDM Computation PR

Category: Uncategorized
Framework: GATE
Version: unknown

Compute BDM score for each pair of concepts in the given ontology.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
ontology	—	gate.creole.ontology.Ontology	—	—	—	true
outputBDMFile	—	java.net.URL	—	—	—	true

Banner Sentence Breaker

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 1.0

Sentence breaker using the Sun Java API "BreakIterator".
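
For orientation, the JDK class the description refers to is java.text.BreakIterator. The following is a minimal sketch of sentence splitting with it. The class name, method name and sample text are illustrative, and nothing here is taken from the component's own source.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceBreakSketch {
    /** Splits text into sentences using the JDK's locale-aware sentence BreakIterator. */
    static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Walk consecutive boundaries; each [start, end) range is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Proteins bind DNA. This regulates transcription.";
        for (String sentence : split(text, Locale.ENGLISH)) {
            System.out.println(sentence);
        }
    }
}
```

In a UIMA component the [start, end) offsets would typically become sentence annotations rather than substrings.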

BioLG

Category: Uncategorized
Framework: AlvisNLP
Version: 2012-04-30

Applies BioLG and lp2lp to sentences.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
active	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
constantRelationFeatures	—	alvisnlp.module.types.Mapping	False	—	—	—
constantTupleFeatures	—	alvisnlp.module.types.Mapping	False	—	—	—
dependencyLabelFeature	—	java.lang.String	True	—	—	—
dependencyRelation	—	java.lang.String	True	—	—	—
dependentRole	—	java.lang.String	True	—	—	—
documentFilter	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
headRole	—	java.lang.String	True	—	—	—
linkageNumberFeature	—	java.lang.String	True	—	—	—
lp2lpConf	—	org.bibliome.util.files.InputFile	True	—	—	—
lp2lpExecutable	—	org.bibliome.util.files.ExecutableFile	True	—	—	—
maxLinkages	—	java.lang.Integer	False	—	—	—
parserPath	—	org.bibliome.util.files.WorkingDirectory	True	—	—	—
posFeature	—	java.lang.String	True	—	—	—
sectionFilter	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
sentenceFilter	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
sentenceLayer	—	java.lang.String	True	—	—	—
sentenceRole	—	java.lang.String	True	—	—	—
timeout	—	java.lang.Integer	True	—	—	—
union	—	java.lang.Boolean	True	—	—	—
wordLayer	—	java.lang.String	True	—	—	—
wordNumberLimit	—	java.lang.Integer	True	—	—	—

CSV Corpus Populater

Category: Uncategorized
Framework: GATE
Version: unknown

Populate a corpus from CSV files

CartesianProductTuples

Category: Uncategorized
Framework: AlvisNLP
Version: 2012-04-30

Creates tuples for each element of a Cartesian product.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
active	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
anchor	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
arguments	—	alvisnlp.module.types.ExpressionMapping	True	—	—	—
constantRelationFeatures	—	alvisnlp.module.types.Mapping	False	—	—	—
constantTupleFeatures	—	alvisnlp.module.types.Mapping	False	—	—	—
documentFilter	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
relationName	—	java.lang.String	True	—	—	—
sectionFilter	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
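
As a rough illustration of what a Cartesian product of tuple arguments means, the sketch below builds one tuple per combination of candidate values, one value per role. The class, method and role names are hypothetical, and the way AlvisNLP evaluates the arguments expressions is not modelled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CartesianTuplesSketch {
    /** Builds one role -> value tuple for every combination of the candidate values. */
    static List<Map<String, String>> product(Map<String, List<String>> candidatesByRole) {
        List<Map<String, String>> tuples = new ArrayList<>();
        tuples.add(new LinkedHashMap<>()); // start from a single empty tuple
        for (Map.Entry<String, List<String>> role : candidatesByRole.entrySet()) {
            List<Map<String, String>> extended = new ArrayList<>();
            for (Map<String, String> partial : tuples) {
                for (String value : role.getValue()) {
                    // Copy the partial tuple and bind the current role to one candidate.
                    Map<String, String> tuple = new LinkedHashMap<>(partial);
                    tuple.put(role.getKey(), value);
                    extended.add(tuple);
                }
            }
            tuples = extended;
        }
        return tuples;
    }

    public static void main(String[] args) {
        Map<String, List<String>> candidates = new LinkedHashMap<>();
        candidates.put("bacterium", Arrays.asList("E. coli", "B. subtilis"));
        candidates.put("habitat", Arrays.asList("soil", "gut", "water"));
        for (Map<String, String> tuple : product(candidates)) {
            System.out.println(tuple);
        }
    }
}
```

With two roles holding two and three candidates, the result has six tuples; a role with no candidates yields no tuples at all.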

Cebuano Transducer

Category: Uncategorized
Framework: GATE
Version: unknown

A module for executing Jape grammars.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
annotationAccessors	—	java.util.List	—	—	—	—
binaryGrammarURL	—	java.net.URL	—	—	—	—
corpus	—	gate.Corpus	—	—	—	true
document	—	gate.Document	—	—	—	true
enableDebugging	—	java.lang.Boolean	—	false	—	true
encoding	—	java.lang.String	—	UTF-8	—	—
grammarURL	—	java.net.URL	—	resources/grammar/main.jape	—	—
inputASName	—	java.lang.String	—	—	—	true
ontology	—	gate.creole.ontology.Ontology	—	—	—	true
operators	—	java.util.List	—	—	—	—
outputASName	—	java.lang.String	—	—	—	true

Cebuano Transducer Postprocessor

Category: Uncategorized
Framework: GATE
Version: unknown

A module for executing Jape grammars.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
annotationAccessors	—	java.util.List	—	—	—	—
binaryGrammarURL	—	java.net.URL	—	—	—	—
corpus	—	gate.Corpus	—	—	—	true
document	—	gate.Document	—	—	—	true
enableDebugging	—	java.lang.Boolean	—	false	—	true
encoding	—	java.lang.String	—	UTF-8	—	—
grammarURL	—	java.net.URL	—	resources/tokeniser/join.jape	—	—
inputASName	—	java.lang.String	—	—	—	true
ontology	—	gate.creole.ontology.Ontology	—	—	—	true
operators	—	java.util.List	—	—	—	—
outputASName	—	java.lang.String	—	—	—	true

Chemical Entity Recogniser

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 0.1

A named entity recogniser capable of annotating names of chemicals, drugs and metabolites. Built on top of the NERsuite package [1].

Available models:
Chemical: trained on the BioCreative IV CHEMDNER Track training and development corpora [2]
Drug: trained on the DDI training corpus [3]
Metabolite: trained on NaCTeM's Metabolite corpus [4]

Dictionaries used:
Chemical: ChEBI [5], DrugBank [6], CTD Chemicals [7], PubChem Compound [8], Jochem [9]
Drug: DrugBank [6]
Metabolite: ChEBI [5], Human Metabolome Database [10]

Links:
[1] http://nersuite.nlplab.org
[2] http://www.biocreative.org/resources/corpora/bc-iv-chemdner-corpus
[3] http://labda.inf.uc3m.es/doku.php?id=en:labda_ddicorpus
[4] http://www.nactem.ac.uk/metabolite-corpus
[5] http://www.ebi.ac.uk/chebi
[6] http://www.drugbank.ca
[7] http://ctdbase.org
[8] http://pubchem.ncbi.nlm.nih.gov
[9] http://www.biosemantics.org/new/index.php?page=Jochem
[10] http://www.hmdb.ca

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
model	The model to use	String	True	—	false	—
performAbbreviationRecognition	Additionally perform abbreviation recognition	Boolean	False	—	false	—
performTokenRelabelling	Additionally perform relabelling based on token chemical composition	Boolean	False	—	false	—

ColognePhoneticTranscriptor

Category: Uncategorized
Framework: DKPro Core (UIMA)
Version: 1.8.0

Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec. Works for German.

Compound Document

Category: Uncategorized
Framework: GATE
Version: unknown

GATE Compound Document.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
collectRepositioningInfo	—	java.lang.Boolean	—	false	—	—
documentIDs	—	java.util.ArrayList	—	—	—	—
encoding	—	java.lang.String	—	UTF-8	—	—
markupAware	—	java.lang.Boolean	—	true	—	—
preserveOriginalContent	—	java.lang.Boolean	—	false	—	—
sourceUrl	—	java.net.URL	—	—	—	—

Compound Document From Xml

Category: Uncategorized
Framework: GATE
Version: unknown

GATE Compound Document.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
compoundDocumentUrl	—	java.net.URL	—	—	—	—
encoding	—	java.lang.String	—	UTF-8	—	—

ConnectSesameOntology

Category: Uncategorized
Framework: GATE
Version: unknown

Connect to a repository containing an ontology

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
repositoryID	—	java.lang.String	—	—	—	—
repositoryLocation	—	java.net.URL	—	—	—	—

Control Script

Category: Uncategorized
Framework: GATE
Version: unknown

Editor for the Groovy script controlling a scriptable controller

Copy Anns to Another Doc PR

Category: Uncategorized
Framework: GATE
Version: unknown

Copy the annotations from one document to another document.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
annotationTypes	—	java.util.List	—	—	—	true
inputASName	—	java.lang.String	—	—	—	true
outputASName	—	java.lang.String	—	—	—	true
sourceFilesURL	—	java.net.URL	—	—	—	true

Corpus Indexing Support

Category: Uncategorized
Framework: GATE
Version: unknown

Crawler PR

Category: Uncategorized
Framework: GATE
Version: unknown

GATE implementation of the Websphinx crawling API

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
convertXmlTypes	—	java.lang.Boolean	—	true	—	true
depth	—	java.lang.Integer	—	3	—	true
dfs	—	java.lang.Boolean	—	true	—	true
domain	—	crawl.DomainMode	—	SUBTREE	—	true
keywords	—	java.util.List	—	—	—	true
keywordsCaseSensitive	—	java.lang.Boolean	—	true	—	true
max	—	java.lang.Integer	—	-1	—	true
maxPageSize	—	java.lang.Integer	—	100	—	true
outputCorpus	—	gate.Corpus	—	—	—	true
root	—	java.lang.String	—	—	—	true
source	—	gate.Corpus	—	—	—	true
stopAfter	—	java.lang.Integer	—	-1	—	true
userAgent	—	java.lang.String	—	—	—	true

CreateSesameOntology

Category: Uncategorized
Framework: GATE
Version: unknown

Create an ontology from a Sesame configuration file for a repository

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
configFile	—	java.net.URL	—	—	—	—
repositoryID	—	java.lang.String	—	—	—	—
repositoryLocation	—	java.net.URL	—	—	—	—

Dictionary Pluggable Soft TF/IDF Matcher

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 1.0

Tests whether input tokens belong to an entry in the specified dictionary, using SecondString Soft TF/IDF. The dictionary file name should end in .list, and each line should hold a key followed by its aliases, separated by tabs (Format: key1 TAB alias11 TAB alias12 ... NEWLINE key2 TAB alias21 TAB alias22 ...)

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
DictionaryFile	File which contains the dictionary (Format: key1 TAB alias11 TAB alias12 … NEWLINE key2 TAB alias21 TAB alias22 …)	String	True	—	false	—
MaxTokenCombination	—	Integer	False	—	false	—
MinMatchingScore	—	Float	False	—	false	—
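
The dictionary layout described for DictionaryFile (one key per line, followed by tab-separated aliases) can be read with a few lines of Java. This is a minimal sketch of the file format only: the class and method names are hypothetical, the sample entries are invented for illustration, and the Soft TF/IDF matching itself is not implemented here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DictionaryListSketch {
    /** Parses "key TAB alias TAB alias ..." lines into a key -> aliases map. */
    static Map<String, List<String>> parse(String content) {
        Map<String, List<String>> entries = new LinkedHashMap<>();
        for (String line : content.split("\\R")) { // \R matches any line break
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.split("\t");
            // The first field is the key; the remaining fields, if any, are its aliases.
            List<String> aliases =
                    new ArrayList<>(Arrays.asList(fields).subList(1, fields.length));
            entries.put(fields[0], aliases);
        }
        return entries;
    }

    public static void main(String[] args) {
        String content = "p53\tTP53\ttumor protein 53\nBRCA1\tbreast cancer 1\n";
        for (Map.Entry<String, List<String>> entry : parse(content).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```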

DisambiguateAlternatives

Category: Uncategorized
Framework: AlvisNLP
Version: 2012-04-30

Disambiguate features that have multiple values.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
active	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
ambiguousFeature	—	java.lang.String	True	—	—	—
documentFilter	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
target	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
warnIfAmbiguous	—	java.lang.Boolean	False	—	—	—

DocumentFrequencyBank

Category: Uncategorized
Framework: GATE
Version: unknown

Document frequency counter derived from corpora and other DFBs

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
corpora	—	java.util.Set	—	—	—	—
debugMode	—	java.lang.Boolean	—	false	—	—
idDocumentFeature	—	java.lang.String	—	—	—	—
inputASName	—	java.lang.String	—	—	—	—
inputAnnotationFeature	—	java.lang.String	—	canonical	—	—
inputAnnotationTypes	—	java.util.Set	—	SingleWord;MultiWord	—	—
inputBanks	—	java.util.Set	—	—	—	—
languageFeature	—	java.lang.String	—	lang	—	—
scoreProperty	—	java.lang.String	—	documentFrequency	—	—
segmentAnnotationType	—	java.lang.String	—	—	—	—
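
Document frequency is the number of documents in which a term occurs at least once. The following is a minimal sketch of that count in plain Java, assuming each document has already been reduced to a list of term strings. The class and method names are illustrative; this is not the TermRaider implementation and does not use the GATE API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DocumentFrequencySketch {
    /** Counts, for each term, how many documents contain it at least once. */
    static Map<String, Integer> documentFrequency(List<List<String>> documents) {
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> document : documents) {
            // De-duplicate within a document so repeated terms count only once.
            for (String term : new HashSet<>(document)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("gene", "gene", "protein"),
                Arrays.asList("protein", "cell"));
        System.out.println(documentFrequency(corpus));
    }
}
```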

DoubleMetaphonePhoneticTranscriptor

Category: Uncategorized
Framework: DKPro Core (UIMA)
Version: 1.8.0

Double-Metaphone phonetic transcription based on Apache Commons Codec. Works for English.

ElementMapper

Category: Uncategorized
Framework: AlvisNLP
Version:

Maps elements according to a collection of mapping elements.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
active	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
entries	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
form	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
ignoreCase	—	java.lang.Boolean	False	—	—	—
key	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
operator	—	org.bibliome.alvisnlp.modules.mapper.MappingOperator	True	—	—	—
target	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
targetFeatures	—	java.lang.String[]	True	—	—	—
values	—	alvisnlp.corpus.expressions.Expression[]	True	—	—	—

ElementProjector

Category: Uncategorized
Framework: AlvisNLP
Version:

Searches for entries in a dictionary generated by an expression.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
active	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
constantAnnotationFeatures	—	alvisnlp.module.types.Mapping	False	—	—	—
documentFilter	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
errorDuplicateValues	—	java.lang.Boolean	False	—	—	—
features	—	alvisnlp.module.types.ExpressionMapping	True	—	—	—
ignoreCase	—	java.lang.Boolean	False	—	—	—
ignoreDiacritics	—	java.lang.Boolean	False	—	—	—
ignoreWhitespace	—	java.lang.Boolean	False	—	—	—
key	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
multipleValueAction	—	org.bibliome.alvisnlp.modules.projectors.MultipleValueAction	True	—	—	—
normalizeSpace	—	java.lang.Boolean	False	—	—	—
sectionFilter	—	alvisnlp.corpus.expressions.Expression	True	—	—	—
subject	—	org.bibliome.alvisnlp.modules.projectors.Subject	True	—	—	—
targetLayerName	—	java.lang.String	True	—	—	—
values	—	alvisnlp.corpus.expressions.Expression	True	—	—	—

ElementProjector2

Category: Uncategorized
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

action

—

alvisnlp.corpus.expressions.Expression

True

—

active

—

alvisnlp.corpus.expressions.Expression

True

—

addToLayer

—

java.lang.Boolean

False

—

allUpperCaseInsensitive

—

java.lang.Boolean

False

—

allowJoined

—

java.lang.Boolean

False

—

caseInsensitive

—

java.lang.Boolean

False

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

createAnnotations

—

java.lang.Boolean

False

—

createDocuments

—

java.lang.Boolean

False

—

createRelations

—

java.lang.Boolean

False

—

createSections

—

java.lang.Boolean

False

—

createTuples

—

java.lang.Boolean

False

—

deleteElements

—

java.lang.Boolean

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

entries

—

alvisnlp.corpus.expressions.Expression

True

—

ignoreDiacritics

—

java.lang.Boolean

False

—

joinDash

—

java.lang.Boolean

False

—

key

—

alvisnlp.corpus.expressions.Expression

True

—

matchStartCaseInsensitive

—

java.lang.Boolean

False

—

multipleEntryBehaviour

—

org.bibliome.alvisnlp.modules.trie.MultipleEntryBehaviour

True

—

removeFromLayer

—

java.lang.Boolean

False

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

setArguments

—

java.lang.Boolean

False

—

setFeatures

—

java.lang.Boolean

False

—

skipConsecutiveWhitespaces

—

java.lang.Boolean

False

—

skipWhitespace

—

java.lang.Boolean

False

—

subject

—

org.bibliome.alvisnlp.modules.trie.Subject

True

—

targetLayerName

—

java.lang.String

True

—

trieSink

—

org.bibliome.util.files.OutputFile

False

—

trieSource

—

org.bibliome.util.files.InputFile

False

—

wordStartCaseInsensitive

—

java.lang.Boolean

False

—

EngLemmatiser

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 1.0

English lemmatiser which is adapted from WordNet. From dragontools/Banner toolkit.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
DisableVerbAdjective	—	Boolean	True	—	false	—
IndexLookup	—	Boolean	True	—	false	—

Feature Generator

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 1.0

Generates a list of user-defined observations for each token. Token and sequence boundaries are also parametrised. The output of this component is useful for machine learning components.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
FeatureDefinitions	—	String	True	—	true	—
SequenceAnnotationType	—	String	True	—	false	—
TokenAnnotationType	—	String	True	—	false	—

FileMapper

Category: Uncategorized
Framework: AlvisNLP
Version: 2010-10-28

Maps the value of an annotation feature according to a mapping file.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

ignoreCase

—

java.lang.Boolean

False

—

mappedLayerName

—

java.lang.String

True

—

mappingFile

—

org.bibliome.util.streams.SourceStream

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

separator

—

java.lang.Character

True

—

sourceFeature

—

java.lang.String

True

—

targetFeatures

—

java.lang.String[]

True

—

FileMapper2

Category: Uncategorized
Framework: AlvisNLP
Version:

Maps elements according to a tab-separated mapping file.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

form

—

alvisnlp.corpus.expressions.Expression

True

—

ignoreCase

—

java.lang.Boolean

False

—

keyColumn

—

java.lang.Integer

True

—

mappingFile

—

org.bibliome.util.streams.SourceStream

True

—

operator

—

org.bibliome.alvisnlp.modules.mapper.MappingOperator

True

—

separator

—

java.lang.Character

True

—

target

—

alvisnlp.corpus.expressions.Expression

True

—

targetFeatures

—

java.lang.String[]

True

—
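
To make the key-column idea concrete, here is a minimal sketch of loading a separator-delimited mapping file into a lookup table. It is not AlvisNLP code: the function name `load_mapping` and its arguments are hypothetical and only loosely mirror the `keyColumn`, `separator` and `ignoreCase` parameters listed above. The assumption that all non-key columns become the mapped values is mine.

```python
def load_mapping(lines, key_column=0, separator="\t", ignore_case=False):
    """Build {key: [values, ...]} from separator-delimited lines (illustrative only)."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        columns = line.split(separator)
        key = columns[key_column]
        if ignore_case:
            key = key.lower()
        # Assumption: every column except the key column is a mapped value.
        values = columns[:key_column] + columns[key_column + 1:]
        mapping.setdefault(key, []).append(values)
    return mapping


if __name__ == "__main__":
    rows = ["BRCA1\tgene\thuman\n", "TP53\tgene\thuman\n"]
    print(load_mapping(rows, ignore_case=True))
```

Keeping a list of value rows per key leaves the decision about duplicate keys to the caller.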

FreelingMorpho

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 1.0

Performs tokenisation and determines possible lemmas and POS tags for each token, with confidence scores. Operates on English (en), Spanish (es), Catalan (ca), Welsh (cy), Galician (gl), Italian (it) and Portuguese (pt); the language is selected with the "language" parameter (default is English).

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
language	—	String	True	—	false	—

GATE Composite document

Category: Uncategorized
Framework: GATE
Version: unknown

GATE Composite document.

Gazetteer List Collector

Category: Uncategorized
Framework: GATE
Version: unknown

Gazetteer lists collector.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationTypes

—

java.util.ArrayList

—

Organization;Person;Location;Date

—

true

document

—

gate.Document

—

true

gazetteer

—

gate.creole.gazetteer.Gazetteer

—

true

markupASName

—

java.lang.String

—

Key

—

true

theLanguage

—

java.lang.String

—

true

GermanSeparatedParticleAnnotator

Category: Uncategorized
Framework: DKPro Core (UIMA)
Version: 1.8.0

Annotator for post-processing German corpora that have been lemmatized and POS-tagged with the TreeTagger using the STTS tagset. This annotator deals with German particle verbs. Particle verbs consist of a particle and a stem, e.g. anfangen = an + fangen. In many usages of German particle verbs the stem and the particle are separated, e.g. "Wir fangen gleich an." The TreeTagger lemmatizes the verb stem as "fangen" and the separated particle as "an", so the proper verb lemma "anfangen" is not available as an annotation. The GermanSeparatedParticleAnnotator replaces the lemma of the stem of a particle verb (e.g. fangen) with the proper verb lemma (e.g. anfangen) and leaves the lemma of the separated particle unchanged.
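
The lemma repair described above can be sketched in a few lines. This is not the DKPro Core implementation: the `Token` class and `rejoin_particle_lemmas` function are hypothetical, and attaching the particle to the nearest preceding finite or imperative full verb is a simplifying assumption. The STTS tags used are `PTKVZ` (separated verb particle), `VVFIN` and `VVIMP`.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    lemma: str
    pos: str  # STTS part-of-speech tag


def rejoin_particle_lemmas(sentence):
    """Return the sentence's lemmas with particle-verb stems repaired."""
    lemmas = [token.lemma for token in sentence]
    for i, token in enumerate(sentence):
        if token.pos != "PTKVZ":
            continue
        # Assumption: the particle belongs to the nearest preceding full verb.
        for j in range(i - 1, -1, -1):
            if sentence[j].pos in ("VVFIN", "VVIMP"):
                lemmas[j] = token.lemma + sentence[j].lemma
                break
    return lemmas


if __name__ == "__main__":
    sentence = [
        Token("Wir", "wir", "PPER"),
        Token("fangen", "fangen", "VVFIN"),
        Token("gleich", "gleich", "ADV"),
        Token("an", "an", "PTKVZ"),
    ]
    print(rejoin_particle_lemmas(sentence))
```

As in the description, only the stem's lemma changes; the particle keeps its own lemma.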

Groovy support for GATE

Category: Uncategorized
Framework: GATE
Version: unknown

Hindi Main Grammar

Category: Uncategorized
Framework: GATE
Version: unknown

A module for executing Jape grammars

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

grammarURL

—

java.net.URL

—

resources/grammar/main.jape

—

inputASName

—

java.lang.String

—

true

outputASName

—

java.lang.String

—

true

Hindi OrthoMatcher

Category: Uncategorized
Framework: GATE
Version: unknown

Hindi Orthomatcher

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
annotationSetName	—	java.lang.String	—	—	—	true
annotationTypes	—	java.util.ArrayList	—	Organization;Person;Location;Date	—	true
caseSensitive	—	java.lang.Boolean	—	false	—	—
definitionFileURL	—	java.net.URL	—	resources/orthomatcher/listsNM.def	—	—
document	—	gate.Document	—	—	—	true
encoding	—	java.lang.String	—	UTF-8	—	—
extLists	—	java.lang.Boolean	—	true	—	—
highPrecisionOrgs	—	java.lang.Boolean	—	false	—	—
minimumNicknameLikelihood	—	java.lang.Double	—	0.50	—	—
organizationType	—	java.lang.String	—	Organization	—	—
personType	—	java.lang.String	—	Person	—	—
processUnknown	—	java.lang.Boolean	—	true	—	—

Hindi Tokeniser Postprocessor

Category: Uncategorized
Framework: GATE
Version: unknown

A module for executing Jape grammars

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

grammarURL

—

java.net.URL

—

resources/tokeniser/join.jape

—

inputASName

—

java.lang.String

—

true

outputASName

—

java.lang.String

—

true

HyponymyTermbank

Category: Uncategorized
Framework: GATE
Version: unknown

TermRaider Termbank derived from head/string hyponymy

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

corpora

—

java.util.Set

—

debugMode

—

java.lang.Boolean

—

false

—

idDocumentFeature

—

java.lang.String

—

inputASName

—

java.lang.String

—

inputAnnotationFeature

—

java.lang.String

—

canonical

—

inputAnnotationTypes

—

java.util.Set

—

SingleWord;MultiWord

—

inputHeadFeatures

—

java.util.List

—

languageFeature

—

java.lang.String

—

lang

—

normalization

—

gate.termraider.modes.Normalization

—

Sigmoid

—

scoreProperty

—

java.lang.String

—

kyotoDomainRelevance

—

IOTestRunner$Validator

Category: Uncategorized
Framework: DKPro Core (UIMA)
Version: 1.8.0

Descriptor automatically generated by uimaFIT

InsertContents

Category: Uncategorized
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

insert

—

alvisnlp.corpus.expressions.Expression

True

—

offset

—

alvisnlp.corpus.expressions.Expression

True

—

points

—

alvisnlp.corpus.expressions.Expression

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

userFunctions

—

org.bibliome.alvisnlp.library.UserFunction[]

True

—

Kleio Search

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 0.3

Uses the Kleio service to fetch MEDLINE abstracts matching a specified query. Kleio is available at http://www.nactem.ac.uk/Kleio/

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
query	Kleio query	String	True	—	false	—
recentFirst	If true, results will be sorted by the date of publication in decreasing order. Otherwise, they will be sorted by relevance.	Boolean	False	—	false	—

LBJ Named Entity Recognizer

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 1.0

A wrapper for the Illinois Named Entity Tagger

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
BeamSize	—	Integer	False	—	false	—
BrownClusterFiles	Set of resource files	String	True	—	true	—
BrownClusterThresholds	Settings per cluster resource file	Integer	True	—	true	—
BrownIsLowercase	Setting per cluster resource	String	True	—	true	—
ChunkScheme	Whether BIO, BILOU, IOB2, etc.	String	True	—	false	—
EmbeddingDimensionalities	—	Integer	False	—	false	—
Features	Which features to use	String	True	—	true	—
ForceNewSentenceOnLineBreaks	—	Boolean	False	—	false	—
InferenceMethod	—	String	False	—	false	—
IsLowercaseWordEmbeddings	—	Boolean	False	—	false	—
KeepOriginalFileTokenizationAndSentenceSplitting	—	Boolean	False	—	false	—
Labels	Which labels to output	String	True	—	true	—
LinkScoreThreshold	—	Float	False	—	false	—
MinWordAppThresholdsForEmbeddings	—	Integer	False	—	false	—
NormalizationConstantsForEmbeddings	—	Float	False	—	false	—
NormalizationMethodsForEmbeddings	—	String	False	—	false	—
NormalizeTitleText	—	Boolean	True	—	false	—
PredictionConfidenceThreshold	—	Integer	False	—	false	—
ThresholdPrediction	—	Boolean	False	—	false	—
TokenizationScheme	—	String	True	—	false	—

LayerComparator

Category: Uncategorized
Framework: AlvisNLP
Version: 2010-10-28

Compares annotations in two different layers.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

outFile

—

org.bibliome.util.streams.TargetStream

True

—

predictedLayerName

—

java.lang.String

True

—

referenceLayerName

—

java.lang.String

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

Linguistic Simplifier

Category: Uncategorized
Framework: GATE
Version: unknown

A processing resource that takes document and corpus parameters

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

gazetteerURL

—

java.net.URL

—

resources/gazetteer/lists.def

—

japeURL

—

java.net.URL

—

resources/jape/main.jape

—

nounVerbMapURL

—

java.net.URL

—

resources/noun_verb.csv

—

wordNet

—

gate.wordnet.WordNet

—

true

Linguistic Simplifier

Category: Uncategorized
Framework: GATE
Version: unknown

Example application for the linguistic simplifier

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

—

java.util.List

—

pipelineURL

—

java.net.URL

—

Lucene IR Engine

Category: Uncategorized
Framework: GATE
Version: unknown

Lupedia Service PR

Category: Uncategorized
Framework: GATE
Version: unknown

Runs a lupedia annotation service on a GATE document

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
caseSensitive	—	java.lang.Boolean	—	true	—	true
corpus	—	gate.Corpus	—	—	—	true
datasets	—	java.util.List	—	Person;Event;Place;Organisation;Work	—	true
document	—	gate.Document	—	—	—	true
keepFirstAndLongestMatch	—	java.lang.Boolean	—	true	—	true
keepHighest	—	java.lang.Boolean	—	true	—	true
keepSpecific	—	java.lang.Boolean	—	true	—	true
lang	—	gate.lupedia.Language	—	en	—	true
outputASName	—	java.lang.String	—	—	—	true
singleGreedyMatch	—	java.lang.Boolean	—	false	—	true
skipShortWords	—	java.lang.Boolean	—	true	—	true
skipStopWords	—	java.lang.Boolean	—	true	—	true
threshold	—	java.lang.Double	—	0.70	—	true

Majority-vote consensus builder (annotation)

Category: Uncategorized
Framework: GATE
Version: unknown

Process results of a crowd annotation task to find where annotators agree and disagree.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

consensusASName

—

java.lang.String

—

true

corpus

—

gate.Corpus

—

true

disputeASName

—

java.lang.String

—

crowdDisputed

—

true

document

—

gate.Document

—

true

minimumAgreement

—

java.lang.Integer

—

true

resultASName

—

java.lang.String

—

crowdResults

—

true

resultAnnotationType

—

java.lang.String

—

true
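
As an illustration of what a majority vote with a minimum agreement threshold might look like, here is a small sketch. It is not the GATE plugin's algorithm: the function `build_consensus`, its input shape, and the rule that ties count as disputes are all assumptions. The `minimum_agreement` argument only loosely mirrors the `minimumAgreement` parameter above.

```python
from collections import Counter


def build_consensus(judgements, minimum_agreement):
    """Split items into agreed and disputed sets.

    judgements maps an item id to the list of labels given by annotators.
    """
    consensus, disputed = {}, {}
    for item, labels in judgements.items():
        if not labels:
            disputed[item] = {}
            continue
        counts = Counter(labels)
        ranked = counts.most_common(2)
        top_label, top_votes = ranked[0]
        # Assumption: a tie for first place is treated as a dispute.
        tied = len(ranked) > 1 and ranked[1][1] == top_votes
        if top_votes >= minimum_agreement and not tied:
            consensus[item] = top_label
        else:
            disputed[item] = dict(counts)
    return consensus, disputed


if __name__ == "__main__":
    votes = {
        "t1": ["Person", "Person", "Location"],
        "t2": ["Person", "Location"],
    }
    print(build_consensus(votes, minimum_agreement=2))
```

Returning the vote counts for disputed items keeps enough information for a later adjudication step.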

MergeLayers

Category: Uncategorized
Framework: AlvisNLP
Version: 2010-10-28

Creates a new layer in each section containing all annotations in source layers.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sourceLayerNames

—

java.lang.String[]

True

—

targetLayerName

—

java.lang.String

True

—

MergeSections

Category: Uncategorized
Framework: AlvisNLP
Version:

Merge several sections into a single one.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

fragmentLayerName

—

java.lang.String

False

—

fragmentSelection

—

org.bibliome.alvisnlp.modules.clone.FragmentSelection

True

—

fragmentSeparator

—

java.lang.String

True

—

removeSections

—

java.lang.Boolean

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sectionSeparator

—

java.lang.String

True

—

sectionsLayerName

—

java.lang.String

False

—

targetSectionName

—

java.lang.String

True

—

MetaMap Annotator

Category: Uncategorized
Framework: GATE
Version: unknown

This plugin uses the MetaMap Java API to send GATE document content to MetaMap skrmedpostctl server and PrologBeans mmserver instances running on the given machine/port

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotNormalize

—

gate.metamap.AnnotNormalizeMode

—

None

—

true

annotateNegEx

—

java.lang.Boolean

—

false

—

true

annotatePhrases

—

java.lang.Boolean

—

false

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

excludeIfContains

—

java.util.ArrayList

—

true

excludeIfWithin

—

java.util.ArrayList

—

true

inputASName

—

java.lang.String

—

true

inputASTypeFeature

—

java.lang.String

—

true

inputASTypes

—

java.util.ArrayList

—

true

metaMapOptions

—

java.lang.String

—

-Xy

—

true

outputASName

—

java.lang.String

—

true

outputASType

—

java.lang.String

—

MetaMap

—

true

outputMode

—

gate.metamap.OutputMode

—

HighestMappingOnly

—

true

taggerMode

—

gate.metamap.TaggerMode

—

CoReference

—

true

MetaphonePhoneticTranscriptor

Category: Uncategorized
Framework: DKPro Core (UIMA)
Version: 1.8.0

Metaphone phonetic transcription based on Apache Commons Codec. Works for English.

MutationFinder

Category: Uncategorized
Framework: GATE
Version: unknown

GATE MutationFinder Wrapper

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

regexURL

—

java.net.URL

—

resources/regex.txt

—

NGramAnnotator

Category: Uncategorized
Framework: DKPro Core (UIMA)
Version: 1.8.0

N-gram annotator.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

The length of the n-grams to generate (the "n" in n-gram).

Integer

True

—

false

—

NGrams

Category: Uncategorized
Framework: AlvisNLP
Version: 2012-04-30

Computes annotation n-grams.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

keepAnnotations

—

java.lang.String[]

True

—

maxNGramSize

—

java.lang.Integer

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sentenceLayerName

—

java.lang.String

False

—

targetLayerName

—

java.lang.String

True

—

tokenLayerName

—

java.lang.String

True

—
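
For readers unfamiliar with n-grams, the following sketch enumerates every contiguous token sequence up to a maximum size. It is illustrative only: the function `ngrams_up_to` is hypothetical, and whether the AlvisNLP module emits all sizes from 1 to `maxNGramSize` or restricts n-grams to sentence boundaries is not stated in this catalog.

```python
def ngrams_up_to(tokens, max_size):
    """Return all contiguous n-grams of length 1..max_size as tuples."""
    result = []
    for n in range(1, max_size + 1):
        # A window of length n can start at positions 0..len(tokens) - n.
        for start in range(len(tokens) - n + 1):
            result.append(tuple(tokens[start:start + n]))
    return result


if __name__ == "__main__":
    print(ngrams_up_to(["gene", "expression", "level"], 2))
```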

NeMine

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 0.0.1-SNAPSHOT

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
threshold	—	Float	True	—	false	—

NewCount

Category: Uncategorized
Framework: AlvisNLP
Version: 2012-04-30

Counts element occurrences and writes the results in a file, including tfidf.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

countFile

—

org.bibliome.util.streams.TargetStream

False

—

documents

—

alvisnlp.corpus.expressions.Expression

True

—

featureKey

—

java.lang.String

True

—

headers

—

java.lang.Boolean

False

—

target

—

alvisnlp.corpus.expressions.Expression

True

—

tfidfFile

—

org.bibliome.util.streams.TargetStream

False

—
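
The tf-idf figure mentioned in the description can be sketched as follows. This uses one common textbook weighting (raw term count times the natural log of the inverse document frequency); the catalog does not say which variant NewCount computes, so the formula and the function name `tf_idf` are assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents maps a document id to its list of terms.

    Returns {doc_id: {term: score}} with score = tf * ln(N / df).
    """
    n_docs = len(documents)
    document_frequency = Counter()
    for terms in documents.values():
        document_frequency.update(set(terms))  # count each term once per document
    scores = {}
    for doc_id, terms in documents.items():
        term_frequency = Counter(terms)
        scores[doc_id] = {
            term: count * math.log(n_docs / document_frequency[term])
            for term, count in term_frequency.items()
        }
    return scores


if __name__ == "__main__":
    corpus = {"d1": ["gene", "gene", "cell"], "d2": ["cell"]}
    print(tf_idf(corpus))
```

A term that occurs in every document scores zero under this weighting, which is the usual motivation for the inverse document frequency factor.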

OBOMapper

Category: Uncategorized
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

ancestorsFeature

—

java.lang.String

False

—

childrenFeature

—

java.lang.String

False

—

form

—

alvisnlp.corpus.expressions.Expression

True

—

idFeature

—

java.lang.String

False

—

idKeys

—

java.lang.Boolean

False

—

ignoreCase

—

java.lang.Boolean

False

—

keepDBXref

—

java.lang.Boolean

False

—

nameFeature

—

java.lang.String

False

—

oboFiles

—

java.lang.String[]

True

—

operator

—

org.bibliome.alvisnlp.modules.mapper.MappingOperator

True

—

parentsFeature

—

java.lang.String

False

—

pathFeature

—

java.lang.String

False

—

target

—

alvisnlp.corpus.expressions.Expression

True

—

versionFeature

—

java.lang.String

False

—

OBOProjector

Category: Uncategorized
Framework: AlvisNLP
Version:

Projects OBO terms and synonyms on sections.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

allUpperCaseInsensitive

—

java.lang.Boolean

False

—

allowJoined

—

java.lang.Boolean

False

—

ancestorsFeature

—

java.lang.String

False

—

caseInsensitive

—

java.lang.Boolean

False

—

childrenFeature

—

java.lang.String

False

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

idFeature

—

java.lang.String

False

—

ignoreDiacritics

—

java.lang.Boolean

False

—

joinDash

—

java.lang.Boolean

False

—

keepDBXref

—

java.lang.Boolean

False

—

matchStartCaseInsensitive

—

java.lang.Boolean

False

—

multipleEntryBehaviour

—

org.bibliome.alvisnlp.modules.trie.MultipleEntryBehaviour

True

—

nameFeature

—

java.lang.String

False

—

oboFiles

—

java.lang.String[]

True

—

parentsFeature

—

java.lang.String

False

—

pathFeature

—

java.lang.String

False

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

skipConsecutiveWhitespaces

—

java.lang.Boolean

False

—

skipWhitespace

—

java.lang.Boolean

False

—

subject

—

org.bibliome.alvisnlp.modules.trie.Subject

True

—

targetLayerName

—

java.lang.String

True

—

trieSink

—

org.bibliome.util.files.OutputFile

False

—

trieSource

—

org.bibliome.util.files.InputFile

False

—

versionFeature

—

java.lang.String

False

—

wordStartCaseInsensitive

—

java.lang.Boolean

False

—

OWLIM Ontology

Category: Uncategorized
Framework: GATE
Version: unknown

Ontology created as a temporary OWLIM3 in-memory repository

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

baseURI

—

java.lang.String

—

dataDirectoryURL

—

java.net.URL

—

loadImports

—

java.lang.Boolean

—

true

—

mappingsURL

—

java.net.URL

—

n3URL

—

java.net.URL

—

ntriplesURL

—

java.net.URL

—

persistent

—

java.lang.Boolean

—

false

—

rdfXmlURL

—

java.net.URL

—

turtleURL

—

java.net.URL

—

OWLIM Ontology DEPRECATED

Category: Uncategorized
Framework: GATE
Version: unknown

Ontology created as a temporary OWLIM3 in-memory repository, for backwards compatibility only

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

baseURI

—

java.lang.String

—

dataDirectoryURL

—

java.net.URL

—

defaultNameSpace

—

java.lang.String

—

loadImports

—

java.lang.Boolean

—

true

—

mappingsURL

—

java.net.URL

—

n3URL

—

java.net.URL

—

ntriplesURL

—

java.net.URL

—

persistent

—

java.lang.Boolean

—

false

—

rdfXmlURL

—

java.net.URL

—

turtleURL

—

java.net.URL

—

OntoReif

Category: Uncategorized
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

OpenNLPNEDetector

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 1.0

Detects named entities in text and creates corresponding entity annotations that span the found entities. Uses the OpenNLP MaxEnt named entity detector. Each entity class has a separate MaxEnt model file. All model files must be stored in a single model file directory and use the following naming convention: "class.bin.gz", where "class" is the entity class name and ".bin.gz" must appear as shown, e.g., "person.bin.gz". This analysis engine takes a parameter called "EntityTypeMappings" which maps each entity class name to an entity annotation type. The entity class name must match a model file in the model file directory, and the entity annotation type must be defined in the type system and have a corresponding JCas Java class.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
EntityTypeMappings	Mapping from entity names (obtained from the model filename) to the JCas class for the corresponding annotation. Each mapping string is of the form "name,class", i.e., the entity type name followed by a comma followed by the annotation class.	String	False	—	true	—
ModelDirectory	—	String	True	—	false	—
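
The "name,class" mapping strings and the "class.bin.gz" naming convention described above can be illustrated with a short sketch. The helper names `parse_entity_type_mappings` and `model_filename` are hypothetical, as is the example class name `org.example.types.Person`; the component itself is a Java UIMA analysis engine and may validate its input differently.

```python
def parse_entity_type_mappings(mappings):
    """Turn ["name,class", ...] into {name: class}."""
    result = {}
    for entry in mappings:
        # Split on the first comma only: entity name, then annotation class.
        name, _, annotation_class = entry.partition(",")
        name, annotation_class = name.strip(), annotation_class.strip()
        if not name or not annotation_class:
            raise ValueError(f"expected 'name,class', got {entry!r}")
        result[name] = annotation_class
    return result


def model_filename(entity_class):
    """Model file name expected in the model directory for an entity class."""
    return f"{entity_class}.bin.gz"


if __name__ == "__main__":
    mapping = parse_entity_type_mappings(["person,org.example.types.Person"])
    for entity_class, annotation_class in mapping.items():
        print(model_filename(entity_class), "->", annotation_class)
```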

OpenNLPSentenceDetector

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 1.0

Detect sentence boundaries and create sentence annotations that span these boundaries. Uses the OpenNLP MaxEnt Sentence Detector.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
ModelFile	Filename of the model file.	String	True	—	false	—

OrthoRef

Category: Uncategorized
Framework: GATE
Version: unknown

An orthographic coreferencer

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
annotationSetName	—	java.lang.String	—	—	—	true
configFileUrl	—	java.net.URL	—	resources/default-config.coref.xml	—	—
corpus	—	gate.Corpus	—	—	—	true
document	—	gate.Document	—	—	—	true
maxLookBehind	—	java.lang.Integer	—	10	—	true

OscarMER

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 1.0

Runs Oscar 3 with maximum entropy based recogniser with syntactic tokens as input

PMI Bank

Category: Uncategorized
Framework: GATE
Version: unknown

Pointwise Mutual Information from corpora

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
allowOverlapCollocations	—	java.lang.Boolean	—	false	—	—
corpora	—	java.util.Set	—	—	—	—
debugMode	—	java.lang.Boolean	—	false	—	—
innerAnnotationTypes	—	java.util.Set	—	Entity	—	—
inputASName	—	java.lang.String	—	—	—	—
inputAnnotationFeature	—	java.lang.String	—	canonical	—	—
languageFeature	—	java.lang.String	—	lang	—	—
outerAnnotationType	—	java.lang.String	—	Sentence	—	—
outerAnnotationWindow	—	java.lang.Integer	—	2	—	—
requireTypeDifference	—	java.lang.Boolean	—	false	—	—
scoreProperty	—	java.lang.String	—	pmiScore	—	—
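
For reference, pointwise mutual information compares how often two items co-occur with how often they would co-occur if they were independent. The sketch below uses the standard definition with a base-2 logarithm; the function name `pmi` and the choice of counts as inputs are assumptions, and the PMI Bank may apply additional windowing or normalisation not shown here.

```python
import math


def pmi(pair_count, x_count, y_count, total):
    """PMI = log2( p(x, y) / (p(x) * p(y)) ), computed from raw counts."""
    if min(pair_count, x_count, y_count, total) <= 0:
        raise ValueError("all counts must be positive")
    # p(x,y) / (p(x) p(y)) simplifies to pair_count * total / (x_count * y_count).
    return math.log2(pair_count * total / (x_count * y_count))


if __name__ == "__main__":
    # Two entities seen together in 20 of 100 windows, alone in 20 and 50.
    print(pmi(pair_count=20, x_count=20, y_count=50, total=100))
```

A score of zero means the pair co-occurs exactly as often as chance would predict; positive scores indicate association.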

PMI Example (English)

Category: Uncategorized
Framework: GATE
Version: unknown

Example application for the PMI (pointwise mutual information) tool

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

—

java.util.List

—

pipelineURL

—

java.net.URL

—

PatternMatcher

Category: Uncategorized
Framework: AlvisNLP
Version: 2010-10-28

Matches a regular expression-like pattern on the sequence of annotations in a given layer.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

actions

—

org.bibliome.alvisnlp.modules.pattern.action.MatchAction[]

True

—

active

—

alvisnlp.corpus.expressions.Expression

True

—

annotationComparator

—

alvisnlp.corpus.AnnotationComparator

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

layerName

—

java.lang.String

True

—

overlappingBehaviour

—

org.bibliome.alvisnlp.modules.pattern.OverlappingBehaviour

True

—

pattern

—

org.bibliome.alvisnlp.modules.pattern.ElementPattern

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

ProminentConceptReporter

Category: Uncategorized
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

conceptAnnotations

—

alvisnlp.corpus.expressions.Expression

True

—

conceptId

—

alvisnlp.corpus.expressions.Expression

True

—

documents

—

alvisnlp.corpus.expressions.Expression

True

—

sectionName

—

java.lang.String

True

—

Quality Assurance PR

Category: Uncategorized
Framework: GATE
Version: unknown

The Quality Assurance PR provides the functionality of the Corpus QA Tool in GATE Developer

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationTypes

—

java.util.List

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

featureNames

—

java.util.List

—

true

keyASName

—

java.lang.String

—

Key

—

true

measure

—

gate.qa.Measure

—

true

outputFolderUrl

—

java.net.URL

—

true

responseASName

—

java.lang.String

—

true

QuickHTML

Category: Uncategorized
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

classFeature

—

java.lang.String

True

—

colors

—

java.lang.String[]

True

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

features

—

java.lang.String[]

False

—

layers

—

java.lang.String[]

False

—

outDir

—

org.bibliome.util.files.OutputDirectory

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

tagFeature

—

java.lang.String

False

—

RO_FDGBank

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 1.1

This reader performs the transformation of the CONLL tab separated text format to the CAS ConllDependency format.

Reference Evaluator

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 1.0

Reports annotation performance comparing views (sofas) to one selected reference view.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
OutputFile	—	String	True	—	false	—

RegExp

Category: Uncategorized
Framework: AlvisNLP
Version: 2010-09-27

Matches a regular expression on section contents and creates an annotation for each match.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

pattern

—

java.util.regex.Pattern

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

targetLayerName

—

java.lang.String

True

—

Regex Annotator

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 1.1

Annotates spans of text based on a custom regular expression.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
annotationType	Fully qualified type of annotations to be produced. The type must extend uima.tcas.Annotation or be uima.tcas.Annotation.	String	True	—	false	—
caseSensitive	—	Boolean	False	—	false	—
findFirstOnly	If true, matching will stop after encountering the first match.	Boolean	False	—	false	—
multilineMatching	If true then the "^" and "$" symbols match the beginnings and ends of lines. Otherwise, they match the beginning and end of the entire text.	Boolean	False	—	false	—
regularExpression	A valid regular expression.	String	True	—	false	—
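
The effect of `caseSensitive`, `multilineMatching`, and `findFirstOnly` can be pictured with standard `java.util.regex` flags. The sketch below is one plausible mapping and is an assumption: the component's actual implementation is not shown in this catalog, and `RegexOptionsSketch.match` is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexOptionsSketch {
    // Hypothetical helper: maps the three boolean options onto java.util.regex behaviour.
    static List<String> match(String regex, String text, boolean caseSensitive,
                              boolean multiline, boolean findFirstOnly) {
        int flags = 0;
        if (!caseSensitive) {
            flags |= Pattern.CASE_INSENSITIVE;
        }
        if (multiline) {
            // With MULTILINE, "^" and "$" match at line boundaries,
            // not only at the start and end of the whole input.
            flags |= Pattern.MULTILINE;
        }
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group());
            if (findFirstOnly) {
                break; // stop after the first match
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "alpha\nbeta\ngamma";
        System.out.println(match("^[a-z]+$", text, true, false, false)); // no line-level matches
        System.out.println(match("^[a-z]+$", text, true, true, false));  // one match per line
        System.out.println(match("^[a-z]+$", text, true, true, true));   // first line only
    }
}
```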

RemoveContents

Category: Uncategorized
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

stripLayerName

—

java.lang.String

True

—

userFunctions

—

org.bibliome.alvisnlp.library.UserFunction[]

True

—

RemoveEquivalent

Category: Uncategorized
Framework: AlvisNLP
Version:

Removes duplicate elements.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

equivalency

—

alvisnlp.corpus.expressions.Expression

True

—

priority

—

alvisnlp.corpus.expressions.Expression

False

—

target

—

alvisnlp.corpus.expressions.Expression

True

—

RemoveOverlaps

Category: Uncategorized
Framework: AlvisNLP
Version: 2010-10-28

Removes overlapping annotations from a given layer.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

annotationComparator

—

alvisnlp.corpus.AnnotationComparator

True

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

layerName

—

java.lang.String

True

—

removeEqual

—

java.lang.Boolean

True

—

removeIncluded

—

java.lang.Boolean

True

—

removeOverlapping

—

java.lang.Boolean

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—
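
The three flags `removeEqual`, `removeIncluded`, and `removeOverlapping` suggest that the module distinguishes three ways two annotations can collide. The sketch below classifies a pair of character spans accordingly. It is a guess at the distinction, not the module's code: the names `SpanRelationSketch`, `Relation`, and `relate` are hypothetical, and spans are assumed to be half-open `[start, end)` intervals.

```java
class SpanRelationSketch {
    enum Relation { DISJOINT, EQUAL, INCLUDED, OVERLAPPING }

    // Classifies how the half-open span [s2, e2) relates to [s1, e1).
    static Relation relate(int s1, int e1, int s2, int e2) {
        if (e1 <= s2 || e2 <= s1) {
            return Relation.DISJOINT;    // no shared characters
        }
        if (s1 == s2 && e1 == e2) {
            return Relation.EQUAL;       // identical boundaries
        }
        if ((s1 <= s2 && e2 <= e1) || (s2 <= s1 && e1 <= e2)) {
            return Relation.INCLUDED;    // one span lies inside the other
        }
        return Relation.OVERLAPPING;     // partial overlap, neither contains the other
    }

    public static void main(String[] args) {
        System.out.println(relate(0, 5, 5, 8));
        System.out.println(relate(0, 5, 0, 5));
        System.out.println(relate(0, 10, 2, 4));
        System.out.println(relate(0, 5, 3, 8));
    }
}
```

Which annotation of a colliding pair is kept would presumably be decided by `annotationComparator`; that part is not modelled here.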

Romanian Transducer

Category: Uncategorized
Framework: GATE
Version: unknown

A module for executing Jape grammars

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

grammarURL

—

java.net.URL

—

resources/Grammar/main.jape

—

inputASName

—

java.lang.String

—

true

outputASName

—

java.lang.String

—

true

SFTP BioNLP Shared Task Data Provider

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 1.0

Reads a corpus in BioNLP Shared Task format from a remote directory on a user-specified server via SFTP.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

Password

—

String

True

—

false

—

RemoteDirectory

—

String

True

—

false

—

ServerURL

—

String

True

—

false

—

Username

—

String

True

—

false

—

SQLImport

Category: Uncategorized
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

action

—

alvisnlp.corpus.expressions.Expression

True

—

active

—

alvisnlp.corpus.expressions.Expression

True

—

addToLayer

—

java.lang.Boolean

False

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

createAnnotations

—

java.lang.Boolean

False

—

createDocuments

—

java.lang.Boolean

False

—

createRelations

—

java.lang.Boolean

False

—

createSections

—

java.lang.Boolean

False

—

createTuples

—

java.lang.Boolean

False

—

deleteElements

—

java.lang.Boolean

False

—

parameters

—

org.bibliome.alvisnlp.modules.sql.SQLParameter[]

True

—

password

—

java.lang.String

True

—

query

—

java.lang.String

True

—

removeFromLayer

—

java.lang.Boolean

False

—

setArguments

—

java.lang.Boolean

False

—

setFeatures

—

java.lang.Boolean

False

—

target

—

alvisnlp.corpus.expressions.Expression

True

—

url

—

java.lang.String

True

—

username

—

java.lang.String

True

—

SeSMig

Category: Uncategorized
Framework: AlvisNLP
Version: 2010-10-28

Detects sentence boundaries and creates one annotation for each sentence. This module assumes WoSMig processed the same sections.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

eosStatusFeature

—

java.lang.String

True

—

formFeature

—

java.lang.String

True

—

noBreakLayerName

—

java.lang.String

False

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

strongPunctuations

—

java.lang.String

True

—

targetLayerName

—

java.lang.String

True

—

typeFeature

—

java.lang.String

True

—

wordLayerName

—

java.lang.String

True

—

Search Results

Category: Uncategorized
Framework: GATE
Version: unknown

Viewer for IR search results

SearchPR

Category: Uncategorized
Framework: GATE
Version: unknown

Provides IR functionality.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
corpus	—	gate.creole.ir.IndexedCorpus	—	—	—	true
fieldNames	—	java.util.ArrayList	—	*	—	true
limit	—	java.lang.Integer	—	20	—	true
query	—	java.lang.String	—	—	—	true
searcherClassName	—	java.lang.String	—	gate.creole.ir.lucene.LuceneSearch	—	true

Sequence_Impl

Category: Uncategorized
Framework: AlvisNLP
Version:

Sequence of modules.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

Show/Hide Resources

Category: Uncategorized
Framework: GATE
Version: unknown

Show resources that would otherwise be hidden, e.g. resources created for internal use by other resources

SimpleProjector

Category: Uncategorized
Framework: AlvisNLP
Version: 2010-10-28

Projects a simple dictionary on sections.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

dictFile

—

org.bibliome.util.streams.SourceStream

True

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

entryFeatureNames

—

java.lang.String[]

True

—

errorDuplicateValues

—

java.lang.Boolean

False

—

ignoreCase

—

java.lang.Boolean

False

—

ignoreDiacritics

—

java.lang.Boolean

False

—

ignoreWhitespace

—

java.lang.Boolean

False

—

multipleValueAction

—

org.bibliome.alvisnlp.modules.projectors.MultipleValueAction

True

—

normalizeSpace

—

java.lang.Boolean

False

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

separator

—

java.lang.Character

True

—

skipBlankLines

—

java.lang.Boolean

True

—

strictColumnNumber

—

java.lang.Boolean

False

—

subject

—

org.bibliome.alvisnlp.modules.projectors.Subject

True

—

targetLayerName

—

java.lang.String

True

—

trimColumns

—

java.lang.Boolean

True

—

SimpleProjector2

Category: Uncategorized
Framework: AlvisNLP
Version:

Projects a simple dictionary on sections.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

allUpperCaseInsensitive

—

java.lang.Boolean

False

—

allowJoined

—

java.lang.Boolean

False

—

caseInsensitive

—

java.lang.Boolean

False

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

dictFile

—

org.bibliome.util.streams.SourceStream

True

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

ignoreDiacritics

—

java.lang.Boolean

False

—

joinDash

—

java.lang.Boolean

False

—

keyIndex

—

java.lang.Integer[]

True

—

matchStartCaseInsensitive

—

java.lang.Boolean

False

—

multipleEntryBehaviour

—

org.bibliome.alvisnlp.modules.trie.MultipleEntryBehaviour

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

separator

—

java.lang.Character

True

—

skipBlank

—

java.lang.Boolean

False

—

skipConsecutiveWhitespaces

—

java.lang.Boolean

False

—

skipEmpty

—

java.lang.Boolean

False

—

skipWhitespace

—

java.lang.Boolean

False

—

strictColumnNumber

—

java.lang.Boolean

True

—

subject

—

org.bibliome.alvisnlp.modules.trie.Subject

True

—

targetLayerName

—

java.lang.String

True

—

trieSink

—

org.bibliome.util.files.OutputFile

False

—

trieSource

—

org.bibliome.util.files.InputFile

False

—

trimColumns

—

java.lang.Boolean

False

—

valueFeatures

—

java.lang.String[]

True

—

wordStartCaseInsensitive

—

java.lang.Boolean

False

—

SoundexPhoneticTranscriptor

Category: Uncategorized
Framework: DKPro Core (UIMA)
Version: 1.8.0

Soundex phonetic transcription based on Apache Commons Codec. Works for English.

Species

Category: Uncategorized
Framework: AlvisNLP
Version:

Calls the Species taxon tagger.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

speciesDir

—

org.bibliome.util.files.InputDirectory

True

—

targetLayerName

—

java.lang.String

True

—

taxidFeature

—

java.lang.String

False

—

SplitOverlaps

Category: Uncategorized
Framework: AlvisNLP
Version:

Splits overlapping annotations.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

checkedlayerNames

—

java.lang.String[]

True

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

indexFeatureName

—

java.lang.String

False

—

modifiedlayerName

—

java.lang.String

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

TermRaider English Term Extraction

Category: Uncategorized
Framework: GATE
Version: unknown

Example application showing typical set-up for the TermRaider tools

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

—

java.util.List

—

pipelineURL

—

java.net.URL

—

Termbank Score Copier

Category: Uncategorized
Framework: GATE
Version: unknown

Copy scores from Termbanks back to their source annotations

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

corpus

—

gate.Corpus

—

true

docFrequencyFeature

—

java.lang.String

—

docFrequency

—

true

document

—

gate.Document

—

true

frequencyFeature

—

java.lang.String

—

frequency

—

true

termbank

—

gate.termraider.bank.AbstractTermbank

—

true

TextRazor Service PR

Category: Uncategorized
Framework: GATE
Version: unknown

Runs the TextRazor annotation service (http://textrazor.com) on a GATE document

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

apiKey

—

java.lang.String

—

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

outputASName

—

java.lang.String

—

true

TfIdfTermbank

Category: Uncategorized
Framework: GATE
Version: unknown

TermRaider Termbank derived from vectors in document features

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

corpora

—

java.util.Set

—

debugMode

—

java.lang.Boolean

—

false

—

docFreqSource

—

gate.termraider.bank.DocumentFrequencyBank

—

idDocumentFeature

—

java.lang.String

—

idfCalculation

—

gate.termraider.modes.IdfCalculation

—

LogarithmicScaled

—

inputASName

—

java.lang.String

—

inputAnnotationFeature

—

java.lang.String

—

canonical

—

inputAnnotationTypes

—

java.util.Set

—

SingleWord;MultiWord

—

languageFeature

—

java.lang.String

—

lang

—

normalization

—

gate.termraider.modes.Normalization

—

Sigmoid

—

scoreProperty

—

java.lang.String

—

tfIdf

—

tfCalculation

—

gate.termraider.modes.TfCalculation

—

Logarithmic

—

TfidfAnnotator

Category: Uncategorized
Framework: DKPro Core (UIMA)
Version: 1.8.0

This component adds Tfidf annotations consisting of a term and a tfidf weight.
The annotator is type agnostic concerning the input annotation, so you have to specify the annotation type and string representation. It uses a pre-serialized DfStore, which can be created using the TfidfConsumer.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

featurePath

This annotator is type agnostic, so it is mandatory to specify the type of the working annotation and how to obtain the string representation with the feature path.

String

True

—

false

—

lowercase

If set to true, the whole text is handled in lower case.

Boolean

False

—

false

—

tfdfPath

Provide the path to the Df-Model. When a shared SharedDfModel is bound to this annotator, this is ignored.

String

False

—

false

—

weightingModeIdf

The model for inverse document frequency weighting. Invoke toString() on an enum of WeightingModeIdf for setup. Default value is "NORMAL" yielding an unweighted idf.

String

False

—

false

—

weightingModeTf

The model for term frequency weighting. Invoke toString() on an enum of WeightingModeTf for setup. Default value is "NORMAL" yielding an unweighted tf.

String

False

—

false

—
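
For readers unfamiliar with the weight itself, the sketch below computes a textbook tf-idf score: raw term frequency multiplied by the natural logarithm of (number of documents / document frequency). This is an assumption chosen for illustration. It does not reproduce the exact formulas behind the `WeightingModeTf` and `WeightingModeIdf` values, and `TfIdfSketch` is a hypothetical class, not a DKPro Core type.

```java
import java.util.Arrays;
import java.util.List;

class TfIdfSketch {
    // Raw term frequency: number of occurrences of the term in one document.
    static int tf(String term, List<String> doc) {
        int count = 0;
        for (String token : doc) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Document frequency: number of documents containing the term at least once.
    static int df(String term, List<List<String>> corpus) {
        int count = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                count++;
            }
        }
        return count;
    }

    // Textbook tf-idf: tf * ln(N / df); 0 when the term occurs in no document.
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        int df = df(term, corpus);
        if (df == 0) {
            return 0.0;
        }
        return tf(term, doc) * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        List<String> d0 = Arrays.asList("gene", "protein", "gene");
        List<String> d1 = Arrays.asList("protein", "cell");
        List<List<String>> corpus = Arrays.asList(d0, d1);
        System.out.println(tfIdf("gene", d0, corpus));    // frequent here, rare elsewhere
        System.out.println(tfIdf("protein", d0, corpus)); // occurs in every document
    }
}
```

A term that appears in every document gets a weight of zero under this formula, which is why the document-frequency store has to be built from the corpus before the weights can be computed.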

TomapProjector

Category: Uncategorized
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

allUpperCaseInsensitive

—

java.lang.Boolean

False

—

allowJoined

—

java.lang.Boolean

False

—

caseInsensitive

—

java.lang.Boolean

False

—

conceptFeature

—

java.lang.String

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

explanationFeaturePrefix

—

java.lang.String

False

—

ignoreDiacritics

—

java.lang.Boolean

False

—

joinDash

—

java.lang.Boolean

False

—

lemmaKeys

—

java.lang.Boolean

False

—

matchStartCaseInsensitive

—

java.lang.Boolean

False

—

multipleEntryBehaviour

—

org.bibliome.alvisnlp.modules.trie.MultipleEntryBehaviour

True

—

onlyMNP

—

java.lang.Boolean

False

—

scoreFeature

—

java.lang.String

False

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

skipConsecutiveWhitespaces

—

java.lang.Boolean

False

—

skipWhitespace

—

java.lang.Boolean

False

—

subject

—

org.bibliome.alvisnlp.modules.trie.Subject

True

—

targetLayerName

—

java.lang.String

True

—

tomapClassifier

—

org.bibliome.alvisnlp.modules.tomap.TomapClassifier

True

—

trieSink

—

org.bibliome.util.files.OutputFile

False

—

trieSource

—

org.bibliome.util.files.InputFile

False

—

wordStartCaseInsensitive

—

java.lang.Boolean

False

—

yateaFile

—

org.bibliome.util.streams.SourceStream

True

—

TomapTrain

Category: Uncategorized
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

bioYatea

—

java.lang.Boolean

False

—

conceptIdentifier

—

alvisnlp.corpus.expressions.Expression

True

—

configDir

—

org.bibliome.util.files.InputDirectory

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

formFeature

—

java.lang.String

True

—

language

—

java.lang.String

False

—

lemmaFeature

—

java.lang.String

True

—

localeDir

—

org.bibliome.util.files.InputDirectory

False

—

outFile

—

org.bibliome.util.streams.TargetStream

True

—

outputDir

—

org.bibliome.util.files.OutputDirectory

False

—

perlLib

—

java.lang.String

False

—

posFeature

—

java.lang.String

True

—

postProcessingConfig

—

org.bibliome.util.files.InputFile

False

—

postProcessingOutput

—

org.bibliome.util.files.OutputFile

False

—

rcFile

—

org.bibliome.util.streams.SourceStream

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sentenceLayerName

—

java.lang.String

True

—

suffix

—

java.lang.String

False

—

wordLayerName

—

java.lang.String

True

—

workingDir

—

org.bibliome.util.files.WorkingDirectory

True

—

yateaDefaultConfig

—

alvisnlp.module.types.Mapping

True

—

yateaExecutable

—

org.bibliome.util.files.ExecutableFile

True

—

yateaOptions

—

alvisnlp.module.types.Mapping

True

—

TyDIProjector

Category: Uncategorized
Framework: AlvisNLP
Version: 2010-10-28

Projects terms from a TyDI export.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

acronymsFile

—

org.bibliome.util.streams.SourceStream

False

—

active

—

alvisnlp.corpus.expressions.Expression

True

—

canonicalFormFeature

—

java.lang.String

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

errorDuplicateValues

—

java.lang.Boolean

False

—

ignoreCase

—

java.lang.Boolean

False

—

ignoreDiacritics

—

java.lang.Boolean

False

—

ignoreWhitespace

—

java.lang.Boolean

False

—

lemmaFile

—

org.bibliome.util.streams.SourceStream

True

—

mergeFile

—

org.bibliome.util.streams.SourceStream

True

—

multipleValueAction

—

org.bibliome.alvisnlp.modules.projectors.MultipleValueAction

True

—

normalizeSpace

—

java.lang.Boolean

False

—

quasiSynonymsFile

—

org.bibliome.util.streams.SourceStream

True

—

saveDictFile

—

org.bibliome.util.streams.TargetStream

False

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

subject

—

org.bibliome.alvisnlp.modules.projectors.Subject

True

—

synonymsFile

—

org.bibliome.util.streams.SourceStream

True

—

targetLayerName

—

java.lang.String

True

—

typographicVariationsFile

—

org.bibliome.util.streams.SourceStream

False

—

Type Mapper

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 0.1

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
ignoreMissingSourceType	—	Boolean	False	—	false	—
ignoreMissingTargetType	—	Boolean	False	—	false	—
mappingDefinition	Definition of mappings from source types to target types.	String	False	—	false	—

UAICDiacriticsDescriptor

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 1.0

UAICLemmav1

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 1.0

Assigns base forms to tokenised text. Also assigns certain parts of speech.

UAICLemmav2

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 1.0

Assigns base forms in Romanian text, given POS-tagged text.

UAICSegV1

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 1.0

Splits texts into fragments

UMLS Full Dictionary Feature Extractor

Category: Uncategorized
Framework: NaCTeM (UIMA)
Version: 0.0.1-SNAPSHOT

Extracts Dictionary features from a UMLS-sourced dictionary

WapitiLabel

Category: Uncategorized
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

commandLineOptions

—

java.lang.String[]

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

features

—

alvisnlp.corpus.expressions.Expression[]

True

—

labelFeature

—

java.lang.String

True

—

modelFile

—

org.bibliome.util.files.InputFile

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sentenceLayerName

—

java.lang.String

False

—

tokenLayerName

—

java.lang.String

True

—

wapitiExecutable

—

org.bibliome.util.files.ExecutableFile

True

—

WapitiTrain

Category: Uncategorized
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

commandLineOptions

—

java.lang.String[]

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

features

—

alvisnlp.corpus.expressions.Expression[]

True

—

modelFile

—

org.bibliome.util.files.OutputFile

True

—

modelType

—

java.lang.String

False

—

patternFile

—

org.bibliome.util.files.InputFile

False

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sentenceLayerName

—

java.lang.String

False

—

tokenLayerName

—

java.lang.String

True

—

trainAlgorithm

—

java.lang.String

False

—

wapitiExecutable

—

org.bibliome.util.files.ExecutableFile

True

—

WoSMig

Category: Uncategorized
Framework: AlvisNLP
Version: 2010-10-28

Performs word segmentation on section contents.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

annotationComparator

—

alvisnlp.corpus.AnnotationComparator

True

—

annotationTypeFeature

—

java.lang.String

True

—

balancedPunctuations

—

java.lang.String

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

fixedFormLayerName

—

java.lang.String

False

—

fixedType

—

java.lang.String

True

—

punctuationType

—

java.lang.String

True

—

punctuations

—

java.lang.String

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

targetLayerName

—

java.lang.String

True

—

wordType

—

java.lang.String

True

—

WordNet

Category: Uncategorized
Framework: GATE
Version: unknown

WordNet

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

propertyUrl

—

java.net.URL

—

WordNet 1.6

Category: Uncategorized
Framework: GATE
Version: unknown

Princeton WordNet 1.6.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

propertyUrl

—

java.net.URL

—

YateaProjector

Category: Uncategorized
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

errorDuplicateValues

—

java.lang.Boolean

False

—

head

—

java.lang.String

True

—

ignoreCase

—

java.lang.Boolean

False

—

ignoreDiacritics

—

java.lang.Boolean

False

—

ignoreWhitespace

—

java.lang.Boolean

False

—

mnpOnly

—

java.lang.Boolean

False

—

modifier

—

java.lang.String

True

—

monoHeadId

—

java.lang.String

True

—

multipleValueAction

—

org.bibliome.alvisnlp.modules.projectors.MultipleValueAction

True

—

normalizeSpace

—

java.lang.Boolean

False

—

projectLemmas

—

java.lang.Boolean

False

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

subject

—

org.bibliome.alvisnlp.modules.projectors.Subject

True

—

targetLayerName

—

java.lang.String

True

—

termId

—

java.lang.String

True

—

termLemma

—

java.lang.String

False

—

termPOS

—

java.lang.String

False

—

yateaFile

—

org.bibliome.util.streams.SourceStream

True

—

Zemanta Service PR

Category: Uncategorized
Framework: GATE
Version: unknown

Runs a zemanta annotation service on a GATE document

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

apiKey

—

java.lang.String

—

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

numberOfSentencesInBatch

—

java.lang.Integer

—

true

numberOfSentencesInContext

—

java.lang.Integer

—

true

outputASName

—

java.lang.String

—

true

Chunker (7)

ANNIE VP Chunker

Category: Chunker
Framework: GATE
Version: unknown

ANNIE VP Chunker component.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

enableDebugging

—

java.lang.Boolean

—

false

—

true

encoding

—

java.lang.String

—

UTF-8

—

grammarURL

—

java.net.URL

—

../ANNIE/resources/VP/VerbGroups.jape

—

inputASName

—

java.lang.String

—

true

outputASName

—

java.lang.String

—

true

ILSP Chunker

Category: Chunker
Framework: ILSP (UIMA)
Version: 0.9

Noun Phrase Chunker

Category: Chunker
Framework: GATE
Version: unknown

Ready-made NP chunking application

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

—

java.util.List

—

pipelineURL

—

java.net.URL

—

Noun Phrase Chunker

Category: Chunker
Framework: GATE
Version: unknown

Implementation of the Ramshaw and Marcus base noun phrase chunker

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationName

—

java.lang.String

—

NounChunk

—

true

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

outputASName

—

java.lang.String

—

true

posFeature

—

java.lang.String

—

OpenNLP Chunker

Category: Chunker
Framework: GATE
Version: unknown

Chunker using an OpenNLP maxent model

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

chunkFeature

—

java.lang.String

—

chunk

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

model

—

java.net.URL

—

models/english/en-chunker.bin

—

outputASName

—

java.lang.String

—

true

OpenNlpChunker

Category: Chunker
Framework: DKPro Core (UIMA)
Version: 1.8.0

Chunk annotator using OpenNLP.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

ChunkMappingLocation

Load the chunk tag to UIMA type mapping from this location instead of locating the mapping automatically.

String

False

—

false

—

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Default: true

Boolean

False

—

false

—

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

modelLocation

Load the model from this location instead of locating the model automatically.

String

False

—

false

—

modelVariant

Override the default variant used to locate the model.

String

False

—

false

—

printTagSet

Log the tag set(s) when a model is loaded.

Default: false

Boolean

True

—

false

—
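
The reasoning behind `internTags` can be seen with plain Java strings. In the sketch below, `freshTag` is a hypothetical helper that simulates a tagger producing a new `String` object for every token; the example does not use any DKPro Core or OpenNLP class.

```java
class InternSketch {
    // Builds the tag from a char array so that every call yields a new heap object,
    // as a tagger emitting one string per token might.
    static String freshTag(String tag) {
        return new String(tag.toCharArray());
    }

    public static void main(String[] args) {
        String a = freshTag("NP");
        String b = freshTag("NP");
        // Same content, but two separate objects on the heap.
        System.out.println(a == b);                   // false
        // intern() returns the single canonical instance from the string pool.
        System.out.println(a.intern() == b.intern()); // true
    }
}
```

With only a few distinct tags but one tag per token, keeping the interned instance lets the duplicate objects be garbage-collected.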

TreeTaggerChunker

Category: Chunker
Framework: DKPro Core (UIMA)
Version: 1.8.0

Chunk annotator using TreeTagger.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

ChunkMappingLocation

Location of the mapping file for chunk tags to UIMA types.

String

False

—

false

—

executablePath

Use this TreeTagger executable instead of trying to locate the executable automatically.

String

False

—

false

—

flushSequence

A sequence that flushes the internal TreeTagger buffer and forces it to output the rest of the completed analysis. This is typically just a sequence of 5-10 full stops (".") separated by new line characters. However, some models may require a different flush sequence, e.g. a short sentence in the respective language. For chunker models, mind that the sentence must also be POS tagged, e.g. Nous-PRO:PER\n….

String

False

—

false

—

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Default: true

Boolean

False

—

false

—

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

modelLocation

Load the model from this location instead of locating the model automatically.

String

False

—

false

—

modelVariant

Override the default variant used to locate the model.

String

False

—

false

—

performanceMode

TT4J setting: Disable some sanity checks, e.g. whether tokens contain line breaks (which is not allowed). Turning this on will increase your performance, but the wrapper may throw exceptions if illegal data is provided.

Boolean

True

—

false

—

printTagSet

Log the tag set(s) when a model is loaded.

Default: false

Boolean

True

—

false

—

Classifier (8)

Entity Classification Job Builder

Category: Classifier
Framework: GATE
Version: unknown

Build a CrowdFlower job asking users to select the right label for entities

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

apiKey

—

java.lang.String

—

contextASName

—

java.lang.String

—

true

contextAnnotationType

—

java.lang.String

—

Sentence

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

entityASName

—

java.lang.String

—

true

entityAnnotationType

—

java.lang.String

—

Mention

—

true

jobId

—

java.lang.Long

—

true

skipExisting

—

java.lang.Boolean

—

true

—

true

Entity Classification Results Importer

Category: Classifier
Framework: GATE
Version: unknown

Import judgments from a CrowdFlower job created by the Entity Classification Job Builder as GATE annotations.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

answerFeatureName

—

java.lang.String

—

answer

—

true

apiKey

—

java.lang.String

—

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

entityASName

—

java.lang.String

—

true

entityAnnotationType

—

java.lang.String

—

Mention

—

true

jobId

—

java.lang.Long

—

true

resultASName

—

java.lang.String

—

crowdResults

—

true

resultAnnotationType

—

java.lang.String

—

Mention

—

true

Majority-vote consensus builder (classification)

Category: Classifier
Framework: GATE
Version: unknown

Process results of a crowd annotation task to find where annotators agree and disagree.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

answerFeatureName

—

java.lang.String

—

answer

—

true

consensusASName

—

java.lang.String

—

crowdConsensus

—

true

corpus

—

gate.Corpus

—

true

disputeASName

—

java.lang.String

—

crowdDisputed

—

true

document

—

gate.Document

—

true

entityAnnotationType

—

java.lang.String

—

Mention

—

true

minimumAgreement

—

java.lang.Integer

—

true

noAgreementAction

—

gate.crowdsource.classification.MajorityVoteClassificationConsensus$Action

—

resolveLocally

—

true

originalEntityASName

—

java.lang.String

—

true

resultASName

—

java.lang.String

—

crowdResults

—

true

resultAnnotationType

—

java.lang.String

—

Mention

—

true
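
A minimal sketch of majority voting with a minimum-agreement threshold is given below. It is an illustration under stated assumptions, not the GATE plugin's logic: `MajorityVoteSketch.consensus` is a hypothetical method, a `null` result stands for "disputed", and the handling of `noAgreementAction` is not modelled.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MajorityVoteSketch {
    // Returns the answer chosen by the most annotators if it is unique and was
    // chosen at least minimumAgreement times; returns null otherwise (disputed).
    static String consensus(List<String> answers, int minimumAgreement) {
        Map<String, Integer> counts = new HashMap<>();
        for (String answer : answers) {
            counts.merge(answer, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        boolean tie = false;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int count = e.getValue();
            if (count > bestCount) {
                best = e.getKey();
                bestCount = count;
                tie = false;
            } else if (count == bestCount) {
                tie = true; // another answer has just as many votes
            }
        }
        return (!tie && bestCount >= minimumAgreement) ? best : null;
    }

    public static void main(String[] args) {
        System.out.println(consensus(Arrays.asList("Person", "Person", "Location"), 2));
        System.out.println(consensus(Arrays.asList("Person", "Location"), 1));
    }
}
```

In terms of the parameters above, agreed items would go to the set named by `consensusASName` and the rest to the set named by `disputeASName`.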

SelectingElementClassifier

Category: Classifier
Framework: AlvisNLP
Version: 2012-04-30

Searches for discriminating attributes with Weka.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

evaluationFile

—

org.bibliome.util.streams.TargetStream

True

—

evaluator

—

java.lang.String

True

—

evaluatorOptions

—

java.lang.String[]

False

—

examples

—

alvisnlp.corpus.expressions.Expression

True

—

relationDefinition

—

org.bibliome.alvisnlp.modules.classifiers.RelationDefinition

True

—

java.lang.String

False

—

searchOptions

—

java.lang.String[]

False

—

TaggingElementClassifier

Category: Classifier
Framework: AlvisNLP
Version: 2012-04-30

Classifies elements with a Weka classifier.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

classifierFile

—

java.io.File

True

—

evaluationFile

—

org.bibliome.util.streams.TargetStream

False

—

examples

—

alvisnlp.corpus.expressions.Expression

True

—

predictedClassFeatureKey

—

java.lang.String

True

—

relationDefinition

—

org.bibliome.alvisnlp.modules.classifiers.RelationDefinition

True

—

Text Categorization PR

Category: Classifier
Framework: GATE
Version: unknown

Classify text based on a semantic space

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

categoriesURL

—

java.net.URL

—

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

inputAnnotationType

—

java.lang.String

—

Sentence

—

true

inputFeatureName

—

java.lang.String

—

root

—

true

modelURL

—

java.net.URL

—

outputASName

—

java.lang.String

—

true

outputAnnotationType

—

java.lang.String

—

Sentence

—

true

outputFeatureName

—

java.lang.String

—

Textalytics Text Classification

http://textalytics.com/core/class-1.1

Category: Classifier
Framework: GATE
Version: unknown

Textalytics Text Classification

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

apiURL

—

java.lang.String

—

—

true

TrainingElementClassifier

Category: Classifier
Framework: AlvisNLP
Version: 2012-04-30

Trains a Weka classifier where examples are elements.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

algorithm

—

java.lang.String

True

—

arffFile

—

org.bibliome.util.streams.TargetStream

False

—

classifierFile

—

java.io.File

True

—

classifierInfoFile

—

org.bibliome.util.streams.TargetStream

False

—

classifierOptions

—

java.lang.String[]

False

—

crossFolds

—

java.lang.Integer

False

—

evaluationFile

—

org.bibliome.util.streams.TargetStream

False

—

examples

—

alvisnlp.corpus.expressions.Expression

True

—

foldFeatureKey

—

java.lang.String

False

—

predictedClassFeatureKey

—

java.lang.String

False

—

randomSeed

—

java.lang.Long

True

—

relationDefinition

—

org.bibliome.alvisnlp.modules.classifiers.RelationDefinition

True

—

Coreference (3)

ANNIE Nominal Coreferencer

Category: Coreference
Framework: GATE
Version: unknown

Nominal Coreference resolution component

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

ANNIE Pronominal Coreferencer

Category: Coreference
Framework: GATE
Version: unknown

Pronominal Coreference resolution component.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

inanimatedEntityTypes

—

java.lang.String

—

Organization;Location

—

true

resolveIt

—

java.lang.Boolean

—

false

—

true

StanfordCoreferenceResolver

Category: Coreference
Framework: DKPro Core (UIMA)
Version: 1.8.0

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
maxDist	DCoRef parameter: Maximum sentence distance between two mentions for resolution (-1: no constraint on the distance)	Integer	True	—	false	—
postprocessing	DCoRef parameter: Do post processing	Boolean	True	—	false	—
score	DCoRef parameter: Scoring the output of the system	Boolean	True	—	false	—
sieves	DCoRef parameter: Sieve passes - each class is defined in dcoref/sievepasses/.	String	True	—	false	—
singleton	DCoRef parameter: setting singleton predictor	Boolean	True	—	false	—

CrowdSourcing (1)

Entity Annotation Job Builder

Category: CrowdSourcing
Framework: GATE
Version: unknown

Build a CrowdFlower job asking users to annotate entities within a snippet of text

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

apiKey

—

java.lang.String

—

corpus

—

gate.Corpus

—

true

detailFeatureName

—

java.lang.String

—

detail

—

true

document

—

gate.Document

—

true

entityASName

—

java.lang.String

—

true

entityAnnotationType

—

java.lang.String

—

true

goldFeatureName

—

java.lang.String

—

gold

—

true

goldFeatureValue

—

java.lang.String

—

yes

—

true

goldReasonFeatureName

—

java.lang.String

—

reason

—

true

jobId

—

java.lang.Long

—

true

skipExisting

—

java.lang.Boolean

—

true

—

true

snippetASName

—

java.lang.String

—

true

snippetAnnotationType

—

java.lang.String

—

Sentence

—

true

tokenASName

—

java.lang.String

—

true

tokenAnnotationType

—

java.lang.String

—

Token

—

true

Developers/Debugging (9)

DependencyDumper

Category: Developers/Debugging
Framework: DKPro Core (UIMA)
Version: 1.8.0

Dump dependencies to screen.

DocumentMetaDataStripper

Category: Developers/Debugging
Framework: DKPro Core (UIMA)
Version: 1.8.0

Removes fields from the document meta data which may be different depending on the machine a test is run on.

EDT Monitor

Category: Developers/Debugging
Framework: GATE
Version: unknown

Warns whenever an AWT component is updated from anywhere other than the event dispatch thread

JCasHolder

Category: Developers/Debugging
Framework: DKPro Core (UIMA)
Version: 1.8.0

Utility analysis engine for use with CAS multipliers in uimaFIT pipelines.

Java Heap Dumper

Category: Developers/Debugging
Framework: GATE
Version: unknown

Dumps the Java heap to the specified file

Log4J Level: ALL

Category: Developers/Debugging
Framework: GATE
Version: unknown

Allows the Log4J log level to be set to ALL from within the GUI

Stopwatch

Category: Developers/Debugging
Framework: DKPro Core (UIMA)
Version: 1.8.0

Can be used to measure how long the processing between two points in a pipeline takes. For that purpose, the AE needs to be added two times, before and after the part of the pipeline that should be measured.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
timerName	Name of the timer pair. Upstream and downstream timer need to use the same name.	String	True	—	false	—
timerOutputFile	Name of the timer pair. Upstream and downstream timer need to use the same name.	String	False	—	false	—
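
The pairing described for Stopwatch can be pictured with a minimal Python sketch. It is not the DKPro Core implementation: the `Stopwatch` class, its `process` method and the shared `_starts` dictionary are hypothetical names chosen for illustration. Only the idea that two instances sharing a timer name bracket the measured part of the pipeline comes from the description.

```python
import time


class Stopwatch:
    """Toy stand-in for a timer component that is added twice to a pipeline."""

    _starts = {}  # timer name -> start time, shared by all instances

    def __init__(self, timer_name):
        self.timer_name = timer_name

    def process(self):
        now = time.perf_counter()
        if self.timer_name not in Stopwatch._starts:
            # First (upstream) instance: remember when the section started.
            Stopwatch._starts[self.timer_name] = now
            return None
        # Second (downstream) instance: report the elapsed time and reset.
        return now - Stopwatch._starts.pop(self.timer_name)


upstream = Stopwatch("tagging")
downstream = Stopwatch("tagging")

started = upstream.process()          # placed before the measured section
total = sum(i * i for i in range(10_000))  # stands in for the measured components
elapsed = downstream.process()        # placed after the measured section
```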

TagsetDescriptionStripper

Category: Developers/Debugging
Framework: DKPro Core (UIMA)
Version: 1.8.0

Unload Unused Plugins

Category: Developers/Debugging
Framework: GATE
Version: unknown

Unloads all plugins for which we cannot find any loaded instances

Evaluation (2)

CompareElements

Category: Evaluation
Framework: AlvisNLP
Version: 2012-04-30

Compares two sets of elements.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

face

—

alvisnlp.corpus.expressions.Expression

True

—

outFile

—

org.bibliome.util.streams.TargetStream

True

—

predicted

—

alvisnlp.corpus.expressions.Expression

True

—

reference

—

alvisnlp.corpus.expressions.Expression

True

—

sections

—

alvisnlp.corpus.expressions.Expression

True

—

showFullMatches

—

java.lang.Boolean

True

—

showPrecision

—

java.lang.Boolean

True

—

showRecall

—

java.lang.Boolean

True

—

similarity

—

org.bibliome.alvisnlp.modules.compare.ElementSimilarity

True

—

IAA Computation PR

Category: Evaluation
Framework: GATE
Version: unknown

Compute inter-annotator agreement (IAA).

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
annSetsForIaa	—	java.lang.String	—	—	—	true
annTypesAndFeats	—	java.lang.String	—	—	—	true
bdmScoreFile	—	java.net.URL	—	—	—	true
measureType	—	gate.iaaplugin.MeasureType	—	FMEASURE	—	true
verbosity	—	java.lang.String	—	1	—	true

Filtering (6)

AnnotationByLengthFilter

Category: Filtering
Framework: DKPro Core (UIMA)
Version: 1.8.0

Removes annotations that do not conform to minimum or maximum length constraints. (This was previously called TokenFilter).

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
FilterTypes	A set of annotation types that should be filtered.	String	True	—	true	—
MaxLengthFilter	Any annotation in filterTypes longer than this value will be removed.	Integer	True	—	false	—
MinLengthFilter	Any annotation in filterTypes shorter than this value will be removed.	Integer	True	—	false	—
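
As a rough illustration of the length constraints, the Python sketch below drops annotations of the listed types whose text falls outside a minimum/maximum range. The function `filter_by_length`, the `(type, text)` tuple representation and the choice to measure length in characters are assumptions for this example, not the DKPro Core API.

```python
def filter_by_length(annotations, filter_types, min_length, max_length):
    """Keep annotations unless their type is filtered and their length is out of range.

    `annotations` is a list of (type_name, covered_text) pairs (hypothetical format).
    """
    kept = []
    for ann_type, text in annotations:
        out_of_range = not (min_length <= len(text) <= max_length)
        if ann_type in filter_types and out_of_range:
            continue  # too short or too long: remove
        kept.append((ann_type, text))
    return kept


annotations = [
    ("Token", "a"),
    ("Token", "house"),
    ("Token", "antidisestablishment"),
    ("Sentence", "x"),
]
kept = filter_by_length(annotations, {"Token"}, min_length=2, max_length=10)
```

Annotations whose type is not listed, such as the `Sentence` entry here, pass through untouched regardless of length.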

AnnotationByTextFilter

Category: Filtering
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
ignoreCase	If true, annotation texts are filtered case-independently. Default: true, i.e. words that occur in the list with different casing are not filtered out.	Boolean	True	—	false	—
modelEncoding	—	String	True	—	false	—
modelLocation	—	String	True	—	false	—
typeName	Annotation type to filter. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token.	String	True	—	false	—
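
The effect of `ignoreCase` on a word-list filter can be sketched in a few lines of Python. This is an illustrative approximation only: `retain_listed` is a hypothetical helper working on plain strings, whereas the real component reads the list from `modelLocation` and operates on CAS annotations.

```python
def retain_listed(tokens, word_list, ignore_case=True):
    """Return only the tokens that occur in `word_list`."""
    if ignore_case:
        allowed = {word.lower() for word in word_list}
        return [token for token in tokens if token.lower() in allowed]
    allowed = set(word_list)
    return [token for token in tokens if token in allowed]


tokens = ["The", "cat", "sat"]
word_list = ["the", "sat"]

case_insensitive = retain_listed(tokens, word_list)
case_sensitive = retain_listed(tokens, word_list, ignore_case=False)
```

With case-independent matching "The" survives because "the" is in the list; with case-sensitive matching it is filtered out.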

Boilerpipe Content Detection

Category: Filtering
Framework: GATE
Version: unknown

Uses boilerpipe to determine which sections of a document are interesting content and which are just boilerplate

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

allContent

—

gate.creole.boilerpipe.Behaviour

—

NOT_LISTED

—

true

annotateBoilerplate

—

java.lang.Boolean

—

false

—

true

annotateContent

—

java.lang.Boolean

—

true

—

true

boilerplateAnnotationName

—

java.lang.String

—

Boilerplate

—

true

contentAnnotationName

—

java.lang.String

—

Content

—

true

corpus

—

gate.Corpus

—

true

debug

—

java.lang.Boolean

—

false

—

true

document

—

gate.Document

—

true

extractor

—

gate.creole.boilerpipe.Extractor

—

DEFAULT

—

true

failOnMissingInputAnnotations

—

java.lang.Boolean

—

true

—

true

inputASName

—

java.lang.String

—

true

mimeTypes

—

java.util.Set

—

text/html

—

true

outputASName

—

java.lang.String

—

true

useHintsFromOriginalMarkups

—

java.lang.Boolean

—

true

—

true

PosFilter

Category: Filtering
Framework: DKPro Core (UIMA)
Version: 1.8.0

Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

Verbs

Keep/remove verbs (true: keep, false: remove)

Boolean

True

—

false

—

adj

Keep/remove adjectives (true: keep, false: remove)

Boolean

True

—

false

—

adv

Keep/remove adverbs (true: keep, false: remove)

Boolean

True

—

false

—

art

Keep/remove articles (true: keep, false: remove)

Boolean

True

—

false

—

card

Keep/remove cardinal numbers (true: keep, false: remove)

Boolean

True

—

false

—

conj

Keep/remove conjunctions (true: keep, false: remove)

Boolean

True

—

false

—

Keep/remove nouns (true: keep, false: remove)

Boolean

True

—

false

—

Keep/remove "others" (true: keep, false: remove)

Boolean

True

—

false

—

Keep/remove prepositions (true: keep, false: remove)

Boolean

True

—

false

—

Keep/remove pronouns (true: keep, false: remove)

Boolean

True

—

false

—

punc

Keep/remove punctuation (true: keep, false: remove)

Boolean

True

—

false

—

typeToRemove

The fully qualified name of the type that should be filtered.

String

True

—

false

—

RegexTokenFilter

Category: Filtering
Framework: DKPro Core (UIMA)
Version: 1.8.0

Remove every token that does or does not match a given regular expression.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
mustMatch	If this parameter is set to true (default), retain only tokens that match the regex given in #PARAM_REGEX. If set to false, all tokens that match the given regex are removed.	Boolean	True	—	false	—
regex	Every token that does or does not match this regular expression will be removed.	String	True	—	false	—
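
The two modes of `mustMatch` can be illustrated with a short Python sketch. The helper `regex_token_filter` is hypothetical, and the use of a full match (rather than a partial search) is an assumption, since the description does not say which kind of match the component performs.

```python
import re


def regex_token_filter(tokens, regex, must_match=True):
    """Keep matching tokens when must_match is True, non-matching ones otherwise."""
    pattern = re.compile(regex)
    return [
        token
        for token in tokens
        if (pattern.fullmatch(token) is not None) == must_match
    ]


tokens = ["abc", "123", "a1"]
only_letters = regex_token_filter(tokens, r"[a-z]+", must_match=True)
without_letters = regex_token_filter(tokens, r"[a-z]+", must_match=False)
```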

StopWordRemover

Category: Filtering
Framework: DKPro Core (UIMA)
Version: 1.8.0

Remove all of the specified types from the CAS if their covered text is in the stop word dictionary. Also remove any other of the specified types that is covered by a matching instance.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
Paths	Feature paths for annotations that should be matched/removed. The default is StopWord.class.getName(), Token.class.getName(), Lemma.class.getName()+"/value"	String	False	—	true	—
StopWordType	Anything annotated with this type will be removed even if it does not match any word in the lists.	String	False	—	false	—
modelEncoding	The character encoding used by the model.	String	True	—	false	—
modelLocation	A list of URLs from which to load the stop word lists. If an URL is prefixed with a language code in square brackets, the stop word list is only used for documents in that language. Using no prefix or the prefix "[*]" causes the list to be used for every document. Example: "[de]classpath:/stopwords/en_articles.txt"	String	True	—	true	—
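
The language-prefix convention for `modelLocation` can be sketched as follows. This is a minimal Python illustration of the rule stated in the description, not the parser used by DKPro Core; `parse_model_location` and `applies_to` are hypothetical helper names.

```python
import re

# Optional "[lang]" prefix followed by the actual location.
_PREFIX = re.compile(r"^\[([^\]]+)\](.*)$")


def parse_model_location(location):
    """Split a location into (language, url); no prefix is treated like "[*]"."""
    match = _PREFIX.match(location)
    if match is None:
        return ("*", location)
    return (match.group(1), match.group(2))


def applies_to(location, document_language):
    """True if the stop word list at `location` is used for this document language."""
    language, _ = parse_model_location(location)
    return language == "*" or language == document_language


parsed = parse_model_location("[de]classpath:/stopwords/en_articles.txt")
```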

Flow (8)

Annotation Merging PR

Category: Flow
Framework: GATE
Version: unknown

Merge Annotations from different annotators.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
annSetOutput	—	java.lang.String	—	—	—	true
annSetsForMerging	—	java.lang.String	—	—	—	true
annTypesAndFeats	—	java.lang.String	—	—	—	true
document	—	gate.Document	—	—	—	true
keepSourceForMergedAnnotations	—	java.lang.Boolean	—	true	—	true
mergingMethod	—	gate.merger.MergingMethodsEnum	—	MajorityVoting	—	true
minimalAnnNum	—	java.lang.String	—	1	—	true

Annotation Set Transfer

Category: Flow
Framework: GATE
Version: unknown

Annotation set transfer component.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationTypes

—

java.util.ArrayList

—

true

copyAnnotations

—

java.lang.Boolean

—

false

—

true

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

outputASName

—

java.lang.String

—

true

tagASName

—

java.lang.String

—

true

textTagName

—

java.lang.String

—

true

transferAllUnlessFound

—

java.lang.Boolean

—

true

—

true

Combine Members PR

Category: Flow
Framework: GATE
Version: unknown

Combines documents in a composite document.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

combiningMethod

—

java.lang.String

—

gate.composite.impl.DefaultCombiningMethod

—

false

document

—

gate.Document

—

true

parameters

—

java.lang.String

—

unitAnnotationType=Sentence;inputASName=;copyUnderlyingAnnotations=true;

—

true
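
The default value of `parameters` appears to be a semicolon-separated list of `key=value` pairs. Assuming that reading is correct, a string of this shape could be split as in the Python sketch below; `parse_parameters` is a hypothetical helper and does not reflect how the GATE plugin itself reads the value.

```python
def parse_parameters(spec):
    """Turn "key=value;key=value;" into a dict; empty values are kept as ""."""
    params = {}
    for entry in spec.split(";"):
        if not entry:
            continue  # skip the empty piece after a trailing ";"
        key, _, value = entry.partition("=")
        params[key] = value
    return params


defaults = parse_parameters(
    "unitAnnotationType=Sentence;inputASName=;copyUnderlyingAnnotations=true;"
)
```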

Delete Member PR

Category: Flow
Framework: GATE
Version: unknown

Deletes one member document from a compound doc.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

document

—

gate.Document

—

true

documentID

—

java.lang.String

—

true

Document Reset PR

Category: Flow
Framework: GATE
Version: unknown

Remove named annotation sets or reset the default annotation set

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationTypes

—

java.util.List

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

keepOriginalMarkupsAS

—

java.lang.Boolean

—

true

—

true

setsToKeep

—

java.util.List

—

Key

—

true

setsToRemove

—

java.util.List

—

true

Scriptable Controller

Category: Flow
Framework: GATE
Version: unknown

A controller whose execution strategy is controlled by a Groovy script

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

Segment Processing PR

Category: Flow
Framework: GATE
Version: unknown

Processes individual segments as separate documents

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

analyser

—

gate.LanguageAnalyser

—

true

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

segmentAnnotationFeatureName

—

java.lang.String

—

true

segmentAnnotationFeatureValue

—

java.lang.String

—

true

segmentAnnotationType

—

java.lang.String

—

Section

—

true

Switch Member PR

Category: Flow
Framework: GATE
Version: unknown

Sets the focus of a compound document to a specified member document.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

document

—

gate.Document

—

true

documentID

—

java.lang.String

—

true

Gazetteer (16)

ANNIE Gazetteer

Category: Gazetteer
Framework: GATE
Version: unknown

A list lookup component.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

caseSensitive

—

java.lang.Boolean

—

true

—

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

gazetteerFeatureSeparator

—

java.lang.String

—

listsURL

—

java.net.URL

—

resources/gazetteer/lists.def

—

longestMatchOnly

—

java.lang.Boolean

—

true

—

true

wholeWordsOnly

—

java.lang.Boolean

—

true

—

true

Arabic Gazetteer

Category: Gazetteer
Framework: GATE
Version: unknown

A list lookup component.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

caseSensitive

—

java.lang.Boolean

—

true

—

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

gazetteerFeatureSeparator

—

java.lang.String

—

listsURL

—

java.net.URL

—

resources/gazetteer/lists.def

—

longestMatchOnly

—

java.lang.Boolean

—

true

—

true

wholeWordsOnly

—

java.lang.Boolean

—

true

—

true

Arabic Infered Gazetteer

Category: Gazetteer
Framework: GATE
Version: unknown

A list lookup component.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

caseSensitive

—

java.lang.Boolean

—

true

—

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

gazetteerFeatureSeparator

—

java.lang.String

—

listsURL

—

java.net.URL

—

resources/inferred-gazetteer/lists.def

—

longestMatchOnly

—

java.lang.Boolean

—

true

—

true

wholeWordsOnly

—

java.lang.Boolean

—

true

—

true

Cebuano Gazetteer

Category: Gazetteer
Framework: GATE
Version: unknown

A list lookup component.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

caseSensitive

—

java.lang.Boolean

—

true

—

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

gazetteerFeatureSeparator

—

java.lang.String

—

listsURL

—

java.net.URL

—

resources/gazetteer/cebuano/lists.def

—

longestMatchOnly

—

java.lang.Boolean

—

true

—

true

wholeWordsOnly

—

java.lang.Boolean

—

true

—

true

DictionaryAnnotator

Category: Gazetteer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Takes a plain text file with phrases as input and annotates the phrases in the CAS file. The annotation type defaults to NGram, but can be changed. The component requires that Tokens and Sentences are annotated in the CAS. The format of the phrase file is one phrase per line, with tokens separated by a space:

this is a phrase
another phrase

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
annotationType	The annotation to create on matching phrases. If nothing is specified, this defaults to NGram.	String	False	—	false	—
modelEncoding	The character encoding used by the model.	String	True	—	false	—
modelLocation	The file must contain one phrase per line - phrases will be split at " "	String	True	—	false	—
value	The value to set the feature configured in #PARAM_VALUE_FEATURE to.	String	False	—	false	—
valueFeature	Set this feature on the created annotations.	String	False	—	false	—
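
The phrase-file format (one phrase per line, split at spaces) lends itself to a small Python sketch of phrase lookup over a token sequence. It is a simplified illustration under stated assumptions: `load_phrases` and `find_phrases` are hypothetical helpers, tokens are plain strings, and the sentence boundaries that the real component respects are ignored.

```python
def load_phrases(lines):
    """Split each non-empty line at single spaces, one phrase per line."""
    return [line.strip().split(" ") for line in lines if line.strip()]


def find_phrases(tokens, phrases):
    """Return (start, end) token offsets of every phrase occurrence."""
    matches = []
    for start in range(len(tokens)):
        for phrase in phrases:
            end = start + len(phrase)
            if tokens[start:end] == phrase:
                matches.append((start, end))
    return matches


phrases = load_phrases(["this is a phrase", "another phrase"])
tokens = "well this is a phrase and another phrase".split(" ")
matches = find_phrases(tokens, phrases)
```

Each `(start, end)` pair marks the span on which an annotation (NGram unless configured otherwise) would be created.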

Flexible Gazetteer

Category: Gazetteer
Framework: GATE
Version: unknown

A more flexible list lookup component.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

document

—

gate.Document

—

true

gazetteerInst

—

gate.creole.gazetteer.Gazetteer

—

inputASName

—

java.lang.String

—

true

inputFeatureNames

—

java.util.List

—

outputASName

—

java.lang.String

—

true

Hash Gazetteer

Category: Gazetteer
Framework: GATE
Version: unknown

A list lookup component implemented by OntoText Lab. The licence information is also available in licence.ontotext.html in the lib folder of GATE

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

caseSensitive

—

java.lang.Boolean

—

true

—

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

listsURL

—

java.net.URL

—

resources/gazetteer/lists.def

—

Hindi Gazetteer

Category: Gazetteer
Framework: GATE
Version: unknown

A list lookup component.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

caseSensitive

—

java.lang.Boolean

—

true

—

document

—

gate.corpora.DocumentImpl

—

true

encoding

—

java.lang.String

—

UTF-8

—

listsURL

—

java.net.URL

—

resources/gazetteer/lists.def

—

wholeWordsOnly

—

java.lang.Boolean

—

true

—

Hindi Tokeniser Gazetteer

Category: Gazetteer
Framework: GATE
Version: unknown

A list lookup component.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

caseSensitive

—

java.lang.Boolean

—

true

—

document

—

gate.corpora.DocumentImpl

—

true

encoding

—

java.lang.String

—

UTF-8

—

listsURL

—

java.net.URL

—

resources/tokeniser/lists.def

—

wholeWordsOnly

—

java.lang.Boolean

—

true

—

Inflectional gazetteer

Category: Gazetteer
Framework: GATE
Version: unknown

Gazetteer with support for inflectional morphology

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

caseSensitive

—

java.lang.Boolean

—

true

—

config

—

java.net.URL

—

resources/inflection_gaz/main.conf

—

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

importOnlyTheseTypes

—

java.util.List

—

person_first;person_full;surname

—

Large KB Gazetteer

Category: Gazetteer
Framework: GATE
Version: unknown

KIM KB based alias-lookup component

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationLimit

—

java.lang.Integer

—

true

annotationSetName

—

java.lang.String

—

true

dictionaryPath

—

java.net.URL

—

dictionary

—

false

document

—

gate.Document

—

true

forceCaseSensitive

—

java.lang.Boolean

—

false

Onto Root Gazetteer

Category: Gazetteer
Framework: GATE
Version: unknown

An ontology lookup component

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

caseSensitive

—

java.lang.Boolean

—

true

—

considerHeuristicRules

—

java.lang.Boolean

—

false

—

considerProperties

—

java.lang.Boolean

—

true

—

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

longestMatchOnly

—

java.lang.Boolean

—

true

—

true

ontology

—

gate.creole.ontology.Ontology

—

propertiesToExclude

—

java.lang.String

—

propertiesToInclude

—

java.lang.String

—

rootFinderApplication

—

gate.CorpusController

—

separateCamelCasedWords

—

java.lang.Boolean

—

true

—

typesToConsider

—

java.util.Set

—

class;instance;property

—

true

useResourceUri

—

java.lang.Boolean

—

true

—

wholeWordsOnly

—

java.lang.Boolean

—

true

—

true

OntoGazetteer

Category: Gazetteer
Framework: GATE
Version: unknown

A list lookup component based on mapping between ontology classes and gazetteer lists.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

caseSensitive

—

java.lang.Boolean

—

true

—

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

gazetteerName

—

java.lang.String

—

com.ontotext.gate.gazetteer.HashGazetteer

—

listsURL

—

java.net.URL

—

../ANNIE/resources/gazetteer/lists.def

—

mappingURL

—

java.net.URL

—

../ANNIE/resources/gazetteer/mapping.def

—

Romanian Gazetteer

Category: Gazetteer
Framework: GATE
Version: unknown

A list lookup component.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

caseSensitive

—

java.lang.Boolean

—

true

—

document

—

gate.corpora.DocumentImpl

—

true

encoding

—

java.lang.String

—

UTF-8

—

listsURL

—

java.net.URL

—

resources/Gazeteer/list.lst

—

wholeWordsOnly

—

java.lang.Boolean

—

true

—

Russian Gazetteer

Category: Gazetteer
Framework: GATE
Version: unknown

Customised version of the hash gazetteer

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

listsURL

—

java.net.URL

—

resources/gazetteer/lists.def

—

Sharable Gazetteer

Category: Gazetteer
Framework: GATE
Version: unknown

A list lookup component.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

bootstrapGazetteer

—

gate.creole.gazetteer.DefaultGazetteer

—

caseSensitive

—

java.lang.Boolean

—

true

—

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

gazetteerFeatureSeparator

—

java.lang.String

—

listsURL

—

java.net.URL

—

resources/gazetteer/lists.def

—

longestMatchOnly

—

java.lang.Boolean

—

true

—

true

wholeWordsOnly

—

java.lang.Boolean

—

true

—

true

Irrelevant (1)

The Duplicator

Category: Irrelevant
Framework: GATE
Version: unknown

Duplicate any resource with a right click menu option

Keywords/Terms (3)

KEA Keyphrase Extractor

Category: Keywords/Terms
Framework: GATE
Version: unknown

A Keyphrase Extractor by Eibe Frank.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
disallowInternalPeriods	—	java.lang.Boolean	—	true	—	true
document	—	gate.Document	—	—	—	true
inputAS	—	java.lang.String	—	—	—	true
keyphraseAnnotationType	—	java.lang.String	—	Keyphrase	—	true
maxPhraseLength	—	java.lang.Integer	—	3	—	true
minNumOccur	—	java.lang.Integer	—	2	—	true
minPhraseLength	—	java.lang.Integer	—	1	—	true
outputAS	—	java.lang.String	—	—	—	true
phrasesToExtract	—	java.lang.Integer	—	5	—	true
trainingMode	—	java.lang.Boolean	—	true	—	true
useKFrequency	—	java.lang.Boolean	—	true	—	true

KeywordsSelector

Category: Keywords/Terms
Framework: AlvisNLP
Version:

Selects most relevant keywords in documents.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

charset

—

java.lang.String

True

—

documentId

—

alvisnlp.corpus.expressions.Expression

True

—

documents

—

alvisnlp.corpus.expressions.Expression

True

—

keywordCount

—

java.lang.Integer

True

—

keywordFeature

—

java.lang.String

False

—

keywordForm

—

alvisnlp.corpus.expressions.Expression

True

—

keywords

—

alvisnlp.corpus.expressions.Expression

True

—

outFile

—

org.bibliome.util.streams.TargetStream

False

—

scoreFeature

—

java.lang.String

False

—

scoreFunction

—

org.bibliome.alvisnlp.modules.keyword.KeywordScoreFunction

True

—

scoreThreshold

—

java.lang.Double

True

—

separator

—

java.lang.Character

True

—

YateaExtractor

Category: Keywords/Terms
Framework: AlvisNLP
Version: 2010-10-28

Extract terms from the corpus using the YaTeA term extractor.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

bioYatea

—

java.lang.Boolean

False

—

configDir

—

org.bibliome.util.files.InputDirectory

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

documentTokens

—

java.lang.Boolean

True

—

formFeature

—

java.lang.String

True

—

language

—

java.lang.String

False

—

lemmaFeature

—

java.lang.String

True

—

localeDir

—

org.bibliome.util.files.InputDirectory

False

—

outputDir

—

org.bibliome.util.files.OutputDirectory

False

—

perlLib

—

java.lang.String

False

—

posFeature

—

java.lang.String

True

—

postProcessingConfig

—

org.bibliome.util.files.InputFile

False

—

postProcessingOutput

—

org.bibliome.util.files.OutputFile

False

—

rcFile

—

org.bibliome.util.streams.SourceStream

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sentenceLayerName

—

java.lang.String

True

—

suffix

—

java.lang.String

False

—

testifiedTerminology

—

org.bibliome.alvisnlp.modules.yatea.TestifiedTerminology

False

—

wordLayerName

—

java.lang.String

True

—

workingDir

—

org.bibliome.util.files.WorkingDirectory

True

—

yateaDefaultConfig

—

alvisnlp.module.types.Mapping

True

—

yateaExecutable

—

org.bibliome.util.files.ExecutableFile

True

—

yateaOptions

—

alvisnlp.module.types.Mapping

True

—

Language Identifier (7)

LangDetectLanguageIdentifier

Category: Language Identifier
Framework: DKPro Core (UIMA)
Version: 1.8.0

Langdetect language identifier based on character n-grams.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
modelLocation	Location from which the model is read.	String	False	—	false	—
modelVariant	Variant of the model. Used to address a specific model if there are multiple models for one language.	String	False	—	false	—

LanguageDetectorWeb1T

Category: Language Identifier
Framework: DKPro Core (UIMA)
Version: 1.8.0

Language detector based on n-gram frequency counts, e.g. as provided by Web1T

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
maxNGramSize	The maximum n-gram size that should be considered. Default is 3.	Integer	True	—	false	—
minNGramSize	The minimum n-gram size that should be considered. Default is 1.	Integer	True	—	false	—

LanguageIdentifier

Category: Language Identifier
Framework: DKPro Core (UIMA)
Version: 1.8.0

Detection based on character n-grams. Uses the Java Text Categorizing Library based on a technique by Cavnar and Trenkle.

References:

Cavnar, W. B. and J. M. Trenkle (1994). N-Gram-Based Text Categorization. In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994.
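
The Cavnar and Trenkle technique ranks character n-grams by frequency and compares rank profiles with an "out-of-place" distance. The Python sketch below illustrates that idea in a simplified form; it is not the Java Text Categorizing Library, and the function names, the underscore padding, the profile size and the tiny training samples are assumptions chosen for illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank."""
    counts = Counter()
    padded = "_" + "_".join(text.lower().split()) + "_"
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language get a fixed penalty."""
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, training):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    profiles = {lang: ngram_profile(sample) for lang, sample in training.items()}
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


training = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze",
}
guess = identify("the lazy dog", training)
```

Real language profiles are built from much larger corpora; the two one-sentence samples here only serve to make the example self-contained.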

LingPipe Language Identifier PR

Category: Language Identifier
Framework: GATE
Version: unknown

GATE PR for language identification using LingPipe

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

annotationType

—

java.lang.String

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

languageIdFeatureName

—

java.lang.String

—

lang

—

true

modelFileUrl

—

java.net.URL

—

resources/models/langid-leipzig.classifier

—

TextCat Fingerprint Generator

Category: Language Identifier
Framework: GATE
Version: unknown

Generate language fingerprints for use with the TextCat Language Identification PR

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

annotationType

—

java.lang.String

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

fingerprintURL

—

java.net.URL

—

true

TextCat Language Identification

Category: Language Identifier
Framework: GATE
Version: unknown

Recognizes the document language using TextCat

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

annotationType

—

java.lang.String

—

true

configURL

—

java.net.URL

—

resources/default-names.conf

—

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

languageFeatureName

—

java.lang.String

—

lang

—

true

Textalytics Language Identification

http://textalytics.com/core/lang-1.1

Category: Language Identifier
Framework: GATE
Version: unknown

Textalytics Language Identification

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

apiURL

—

java.lang.String

—

—

true

corpus

—

gate.Corpus

—

true

debug

—

java.lang.Boolean

—

true

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

inputASTypes

—

java.util.List

—

true

key

—

java.lang.String

—

true

outputASName

—

java.lang.String

—

Textalytics

—

true

Lemmatizer (7)

ClearNlpLemmatizer

Category: Lemmatizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Lemmatizer using Clear NLP.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

modelLocation

Load the model from this location instead of locating the model automatically.

String

False

—

false

—

modelVariant

Override the default variant used to locate the model.

String

False

—

false

—

GateLemmatizer

Category: Lemmatizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Wrapper for the GATE rule based lemmatizer. Based on code by Asher Stern from the BIUTEE textual entailment tool.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

modelLocation

Load the model from this location instead of locating the model automatically.

String

False

—

false

—

modelVariant

Override the default variant used to locate the model.

String

False

—

false

—

ILSP Lemmatizer

Category: Lemmatizer
Framework: ILSP (UIMA)
Version: 1.1

ILSP Lemmatizer assigns lemmas to tokens from Greek texts. It consults the ILSP Morphological Lexicon to assign lemmas to tokens. The AE uses POS tags (if they exist in the input) to select between lemmas when the lexicon returns more than one result for a token.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
LexicaDir	The directory containing the Berkeley DB lexical resources. Default is /opt/ilsp-nlp/lexica/fbt.	String	False	—	false	—

LanguageToolLemmatizer

Category: Lemmatizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Naive lexicon-based lemmatizer. The words are looked up using the wordform lexicons of LanguageTool. Multiple readings are produced. The annotator simply takes the most frequent lemma from those readings. If no readings could be found, the original text is assigned as lemma.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

sanitize

—

Boolean

True

—

false

—

sanitizeChars

—

String

True

—

true

—
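The selection rule described for LanguageToolLemmatizer (take the most frequent lemma among the readings, fall back to the original text when there are none) can be sketched in a few lines of Python. The function name `pick_lemma` and the representation of readings as a plain list of lemma strings are assumptions for illustration; this is not the LanguageTool or DKPro Core API.

```python
from collections import Counter

def pick_lemma(token, readings):
    """Return the most frequent lemma among the readings, or the token itself if there are none."""
    if not readings:
        # No reading found: the original text is used as the lemma.
        return token
    # most_common(1) yields the lemma with the highest count.
    return Counter(readings).most_common(1)[0][0]
```

For a token with the readings `["see", "saw", "see"]` the sketch picks `see`; for a token with no readings it returns the token unchanged.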

MateLemmatizer

Category: Lemmatizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

DKPro Annotator for the MateToolsLemmatizer.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

modelLocation

Load the model from this location instead of locating the model automatically.

String

False

—

false

—

uppercase

Try reconstructing proper casing for lemmata. This is useful for German, but e.g. for English creates odd results.

Boolean

True

—

false

—

variant

Override the default variant used to locate the model.

String

False

—

false

—

MorphaLemmatizer

Category: Lemmatizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Lemmatize based on a finite-state machine. Uses the Java port of Morpha.

References:

Minnen, G., J. Carroll and D. Pearce (2001). Applied morphological processing of English, Natural Language Engineering, 7(3). 207-223.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

readPOS

Pass part-of-speech information on to Morpha. Since we currently do not know in which format the part-of-speech tags are expected by Morpha, we just pass on the actual pos tag value we get from the token. This may produce worse results than not passing on pos tags at all, so this is disabled by default.

Boolean

True

—

false

—

StanfordLemmatizer

Category: Lemmatizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Stanford Lemmatizer component. The Stanford Morphology-class computes the base form of English words, by removing just inflections (not derivational morphology). That is, it only does noun plurals, pronoun case, and verb endings, and not things like comparative adjectives or derived nominals. It is based on a finite-state transducer implemented by John Carroll et al., written in flex and publicly available. See: http://www.informatics.susx.ac.uk/research/nlp/carroll/morph.html

This only works for ENGLISH.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

ptb3Escaping

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

Boolean

True

—

false

—

quoteBegin

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

String

False

—

true

—

quoteEnd

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

String

False

—

true

—

Machine Learning (2)

Batch Learning PR

Category: Machine Learning
Framework: GATE
Version: unknown

Supports training, application and evaluation of machine learning models for NLP tasks

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

configFileURL

—

java.net.URL

—

false

corpus

—

gate.Corpus

—

true

inputASName

—

java.lang.String

—

true

learningMode

—

gate.learning.RunMode

—

TRAINING

—

true

outputASName

—

java.lang.String

—

true

runProtocolDir

—

java.net.URL

—

true

Machine Learning PR

Category: Machine Learning
Framework: GATE
Version: unknown

Trains a machine learning algorithm from a corpus. For new code, consider using the "learning" plugin instead.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

configFileURL

—

java.net.URL

—

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

training

—

java.lang.Boolean

—

true

—

true

MorphTagger (3)

GATE Morphological analyser

Category: MorphTagger
Framework: GATE
Version: unknown

Morphological Analyzer for the English Language.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

affixFeatureName

—

java.lang.String

—

affix

—

true

annotationSetName

—

java.lang.String

—

true

caseSensitive

—

java.lang.Boolean

—

false

—

false

considerPOSTag

—

java.lang.Boolean

—

true

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

failOnMissingInputAnnotations

—

java.lang.Boolean

—

true

—

true

rootFeatureName

—

java.lang.String

—

root

—

true

rulesFile

—

java.net.URL

—

resources/morph/default.rul

—

false

RASP2 Morphological Analyser

Category: MorphTagger
Framework: GATE
Version: unknown

RASP morphological analyser, which adds lemma and suffix to the WordForm annotations produced by the RASP POS tagger (or the ANNIE POS tagger plus the RASP converter)

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

charset

—

java.lang.String

—

ISO-8859-1

—

true

debug

—

java.lang.Boolean

—

false

—

true

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

raspHome

—

java.net.URL

—

file:/usr/local/bin/RASP

—

false

SfstAnnotator

Category: MorphTagger
Framework: DKPro Core (UIMA)
Version: 1.8.0

Sfst morphological analyzer.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

MorphMappingLocation

—

String

False

—

false

—

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

mode

—

String

True

—

false

—

modelEncoding

Specifies the model encoding.

String

True

—

false

—

modelLocation

Load the model from this location instead of locating the model automatically.

String

False

—

false

—

modelVariant

Override the default variant used to locate the model.

String

False

—

false

—

printTagSet

Write the tag set(s) to the log when a model is loaded.

Boolean

True

—

false

—

writeLemma

Write lemma information.

Default: true

Boolean

True

—

false

—

writePOS

Write part-of-speech information.

Default: true

Boolean

True

—

false

—

Named Entity Recognizer (11)

ABNER

Category: Named Entity Recognizer
Framework: NaCTeM (UIMA)
Version: 1.0

Wraps the ABNER entity identification system into the UIMA framework. ABNER was developed by Burr Settles and is available here: http://pages.cs.wisc.edu/~bsettles/abner/

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

mode

0=NLPBA Corpus, 1=BioCreative Corpus, 2=Custom

String

False

—

false

—

model

Custom model file (if mode == 2)

String

False

—

false

—

types

Custom type mapping; each string is <entity>=<class>

String

False

—

true

—
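The `types` parameter takes strings of the form `<entity>=<class>`. A minimal Python sketch of how such entries could be validated and turned into a lookup table is shown below. The function name `parse_type_mapping` and the class names in the usage note are hypothetical; how ABNER's UIMA wrapper actually parses the parameter is not stated in this catalog.

```python
def parse_type_mapping(entries):
    """Parse '<entity>=<class>' strings into a dict; reject malformed entries."""
    mapping = {}
    for entry in entries:
        entity, sep, cls = entry.partition("=")
        if not sep or not entity.strip() or not cls.strip():
            raise ValueError(f"expected <entity>=<class>, got {entry!r}")
        mapping[entity.strip()] = cls.strip()
    return mapping
```

For example, `parse_type_mapping(["PROTEIN=org.example.Protein"])` maps the entity label `PROTEIN` to the (hypothetical) type name `org.example.Protein`.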

CRF++ Trainer

Category: Named Entity Recognizer
Framework: NaCTeM (UIMA)
Version: 1.0

Produces a Conditional Random Fields model. Based on CRF++, an implementation of CRF for labeling sequential data (http://crfpp.googlecode.com/svn/trunk/doc/index.html).

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

FeatureFrequencyThreshold

CRF++ uses features that occur no less than this value. Default value is 1.

Integer

False

—

false

—

LabelAnnotationTypes

Fully qualified names of annotation types which will serve as labels during the training of the CRF.

String

True

—

true

—

ModelFileName

Specifies the filename to store the model in.

String

True

—

false

—

OverfittingBalance

Default value is 1

Float

False

—

false

—

RegularizationAlgorithm

Default value is CRF-L2. You can use CRF-L1.

String

False

—

false

—
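The three tuning parameters above read like the `-f`, `-c` and `-a` options of the CRF++ `crf_learn` command. The Python sketch below builds such a command line under that assumption; whether the NaCTeM wrapper passes the values this way is not stated here, and the function name and file names are hypothetical.

```python
def crf_learn_command(template, train_file, model_file,
                      freq_threshold=1, overfitting_balance=1.0,
                      algorithm="CRF-L2"):
    """Build an argument list for CRF++'s crf_learn (assumed parameter mapping)."""
    if algorithm not in ("CRF-L2", "CRF-L1"):
        raise ValueError(f"unsupported regularization algorithm: {algorithm!r}")
    return [
        "crf_learn",
        "-f", str(freq_threshold),       # FeatureFrequencyThreshold
        "-c", str(overfitting_balance),  # OverfittingBalance
        "-a", algorithm,                 # RegularizationAlgorithm
        template, train_file, model_file,
    ]
```

The returned list can be handed to `subprocess.run` on a machine where CRF++ is installed; the sketch itself only assembles the arguments.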

ILSP NERC

Category: Named Entity Recognizer
Framework: ILSP (UIMA)
Version: 1.2

This module uses a Maximum Entropy NER engine that targets Greek (EL) or English (EN) news-style text.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

DatabaseDriverClass

The JDBC driver to use to connect to this DB. Default is org.postgresql.Driver

String

False

—

false

—

DatabaseHost

The host where the server resides. Default is localhost.

String

False

—

false

—

DatabaseName

The name of the database.

String

False

—

false

—

DatabasePass

Use this password for read-only access to the database

String

False

—

false

—

DatabasePort

The port the server listens to. Default is 5432.

Integer

False

—

false

—

DatabaseServer

The type of server the AE connects to. Default is postgresql.

String

False

—

false

—

DatabaseUser

Use this user name for read-only access to the database

String

False

—

false

—

Language

ISO language code for text language

String

False

—

false

—

ModelDir

—

String

False

—

false

—

NercEngine

The NercEngine to be used. The default value is "mener".

String

False

—

false

—

LingPipe NER PR

Category: Named Entity Recognizer
Framework: GATE
Version: unknown

LingPipe Named Entity Recognizer

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

document

—

gate.Document

—

true

modelFileUrl

—

java.net.URL

—

false

outputASName

—

java.lang.String

—

true

OpenNLP NER

Category: Named Entity Recognizer
Framework: GATE
Version: unknown

NER PR using a set of OpenNLP maxent models

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

config

—

java.net.URL

—

models/english/en-ner.conf

—

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

outputASName

—

java.lang.String

—

true

OpenNlpNamedEntityRecognizer

Category: Named Entity Recognizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

OpenNLP name finder wrapper.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
NamedEntityMappingLocation	Location of the mapping file for named entity tags to UIMA types.	String	False	—	false	—
language	Use this language instead of the document language to resolve the model.	String	False	—	false	—
modelLocation	Location from which the model is read.	String	False	—	false	—
modelVariant	Variant of the model. Used to select a specific model if there are multiple models for one language.	String	True	—	false	—
printTagSet	Log the tag set(s) when a model is loaded.	Boolean	True	—	false	—

SVMLight Trainer

Category: Named Entity Recognizer
Framework: NaCTeM (UIMA)
Version: 1.0

Produces an SVMLight model based on user-specified learning parameters.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

ModelFile

A file where the model will be written to.

String

True

—

false

—

NormFile

A file where the value of the norm for normalising continuous-valued features will be written to.

String

True

—

false

—

ParameterString

A string with the desired learning parameters.

String

False

—

false

—

Stanford NER

Category: Named Entity Recognizer
Framework: GATE
Version: unknown

Stanford Named Entity Recogniser

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

baseSentenceAnnotationType

—

java.lang.String

—

Sentence

—

true

baseTokenAnnotationType

—

java.lang.String

—

Token

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

failOnMissingInputAnnotations

—

java.lang.Boolean

—

true

—

true

inputASName

—

java.lang.String

—

true

modelFile

—

java.net.URL

—

resources/english.all.3class.distsim.crf.ser.gz

—

outputASName

—

java.lang.String

—

true

outsideLabel

—

java.lang.String

—

true

StanfordNER

Category: Named Entity Recognizer
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

classifierFile

—

org.bibliome.util.files.InputFile

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

formFeatureName

—

java.lang.String

True

—

labelFeatureName

—

java.lang.String

True

—

searchInContents

—

java.lang.Boolean

False

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sentenceLayerName

—

java.lang.String

True

—

targetLayerName

—

java.lang.String

True

—

wordLayerName

—

java.lang.String

True

—

StanfordNamedEntityRecognizer

Category: Named Entity Recognizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Stanford Named Entity Recognizer component.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
NamedEntityMappingLocation	Location of the mapping file for named entity tags to UIMA types.	String	False	—	false	—
language	Use this language instead of the document language to resolve the model.	String	False	—	false	—
modelLocation	Location from which the model is read.	String	False	—	false	—
modelVariant	Variant of the model. Used to select a specific model if there are multiple models for one language.	String	False	—	false	—
printTagSet	Log the tag set(s) when a model is loaded.	Boolean	True	—	false	—
ptb3Escaping	Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).	Boolean	True	—	false	—
quoteBegin	List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.	String	False	—	true	—
quoteEnd	List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.	String	False	—	true	—

Yeast Metabliner

Category: Named Entity Recognizer
Framework: NaCTeM (UIMA)
Version: 1.0

This service annotates yeast metabolites with a supervised NER system based on CRF. It receives an input string and a user id and returns a list of recognised yeast metabolites with offset information and the CRF score. The dictionary used in this system is based on a consensus reconstruction of yeast metabolism (http://www.comp-sys-bio.org/yeastnet/).

Normalizer (19)

ApplyChangesAnnotator

Category: Normalizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Applies changes annotated using a SofaChangeAnnotation.

Backmapper

Category: Normalizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

After processing a file with the ApplyChangesAnnotator this annotator can be used to map the annotations created in the cleaned view back to the original view.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
Chain	Chain of views for backmapping. This should be the reverse of the chain of views that the ApplyChangesAnnotator has used. For example, if view A has been mapped to B using ApplyChangesAnnotator, then this parameter should be set using an array containing [B, A].	String	False	—	true	—

CapitalizationNormalizer

Category: Normalizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Takes a text and replaces wrong capitalization

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

String

True

—

true

—

CjfNormalizer

Category: Normalizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Converts traditional Chinese to simplified Chinese or vice-versa.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

direction

—

String

True

—

false

—

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

String

True

—

true

—

Date Annotation Normalizer

Category: Normalizer
Framework: GATE
Version: unknown

Provides normalized values for all existing date annotations.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationFeature

—

java.lang.String

—

string

—

true

annotationName

—

java.lang.String

—

Date

—

true

corpus

—

gate.Corpus

—

true

dateFormat

—

java.lang.String

—

dd/MM/yyyy

—

true

document

—

gate.Document

—

true

failOnMissingInputAnnotations

—

java.lang.Boolean

—

true

—

true

inputASName

—

java.lang.String

—

true

locale

—

java.lang.String

—

normalizedDocumentFeature

—

java.lang.String

—

normalized-date

—

true

numericOutput

—

java.lang.Boolean

—

false

—

true

outputASName

—

java.lang.String

—

true

sourceOfDocumentDate

—

java.util.List

—

true

wholeMatchOnly

—

java.lang.Boolean

—

true

—

true

Date Normalizer

Category: Normalizer
Framework: GATE
Version: unknown

Provides normalized values for all known dates.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationName

—

java.lang.String

—

Date

—

true

corpus

—

gate.Corpus

—

true

dateFormat

—

java.lang.String

—

dd/MM/yyyy

—

true

document

—

gate.Document

—

true

failOnMissingInputAnnotations

—

java.lang.Boolean

—

true

—

true

inputASName

—

java.lang.String

—

true

locale

—

java.lang.String

—

normalizedDocumentFeature

—

java.lang.String

—

normalized-date

—

true

numericOutput

—

java.lang.Boolean

—

false

—

true

outputASName

—

java.lang.String

—

true

sourceOfDocumentDate

—

java.util.List

—

true

DictionaryBasedTokenTransformer

Category: Normalizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reads a tab-separated file containing mappings from one token to another. All tokens that match an entry in the first column are changed to the corresponding token in the second column.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

commentMarker

Lines starting with this character (or String) are ignored. Default: '#'

String

True

—

false

—

modelEncoding

—

String

True

—

false

—

modelLocation

—

String

True

—

false

—

separator

Separator for mappings file. Default: "\t" (TAB).

String

True

—

false

—

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

String

True

—

true

—
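The mapping-file behaviour described for DictionaryBasedTokenTransformer (two columns, a configurable separator, comment lines skipped) can be sketched as follows. This is an illustrative Python version with hypothetical function names, not the DKPro Core implementation; the sample mapping lines are invented.

```python
def load_mappings(lines, separator="\t", comment_marker="#"):
    """Read 'source<separator>target' lines, skipping blank lines and comments."""
    mappings = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith(comment_marker):
            continue
        source, sep, target = line.partition(separator)
        if sep:  # ignore lines that do not contain the separator
            mappings[source] = target
    return mappings

def transform_tokens(tokens, mappings):
    """Replace every token found in the first column by its second-column value."""
    return [mappings.get(token, token) for token in tokens]
```

With the mapping lines `colour<TAB>color` and `organise<TAB>organize`, the token list `["The", "colour", "red"]` becomes `["The", "color", "red"]`; tokens without an entry pass through unchanged.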

Document normalizer

Category: Normalizer
Framework: GATE
Version: unknown

Normalize document content to remove "smart quotes" etc.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

replacementsURL

—

java.net.URL

—

resources/replacements.lst

—

ExpressiveLengtheningNormalizer

Category: Normalizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Takes a text and shortens extra long words

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

String

True

—

true

—

FileBasedTokenTransformer

Category: Normalizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

ignoreCase

—

Boolean

True

—

false

—

modelLocation

—

String

True

—

false

—

replacement

—

String

True

—

false

—

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

String

True

—

true

—
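A minimal Python sketch of the FileBasedTokenTransformer behaviour follows: every token that appears in a list is replaced by one fixed string, optionally ignoring case. The function name and the in-memory list (standing in for the file at `modelLocation`) are assumptions for illustration.

```python
def replace_listed_tokens(tokens, listed, replacement, ignore_case=False):
    """Replace each token that occurs in `listed` by `replacement`."""
    if ignore_case:
        listed_set = {entry.lower() for entry in listed}
        return [replacement if tok.lower() in listed_set else tok for tok in tokens]
    listed_set = set(listed)
    return [replacement if tok in listed_set else tok for tok in tokens]
```

With `listed=["damn"]` and `replacement="***"`, the token `Damn` is only replaced when `ignore_case=True`, which mirrors what the `ignoreCase` parameter name suggests.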

HyphenationRemover

Category: Normalizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Simple dictionary-based hyphenation remover.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

modelEncoding

—

String

True

—

false

—

modelLocation

—

String

True

—

false

—

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

String

True

—

true

—

RegexBasedTokenTransformer

Category: Normalizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on a regular expression.

The parameter #PARAM_REGEX defines the regular expression to search for; #PARAM_REPLACEMENT defines the string with which matching patterns are replaced.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
regex	Define the regular expression to be replaced	String	True	—	false	—
replacement	Define the string to replace matching tokens with	String	True	—	false	—
typesToCopy	A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.	String	True	—	true	—
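The regex-plus-replacement idea can be illustrated with a short Python sketch. It assumes the replacement is applied to the matching part of each token's text; whether the DKPro Core component replaces the match or the whole token is not stated here. The function name is hypothetical, and Python's replacement syntax (`\1`) differs from Java's (`$1`).

```python
import re

def regex_transform(tokens, regex, replacement):
    """Apply a regular-expression substitution to each token's text."""
    pattern = re.compile(regex)
    return [pattern.sub(replacement, token) for token in tokens]
```

For instance, with `regex=r"\d"` and `replacement="0"`, every digit inside a token is masked while tokens without digits are left untouched.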

ReplacementFileNormalizer

Category: Normalizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Takes a text and replaces desired expressions. This class should not work on tokens, as some expressions might span several tokens.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

modelLocation

Location of a file which contains all replacing characters

String

True

—

false

—

srcExpressionSurroundings

—

String

True

—

false

—

targetExpressionSurroundings

—

String

True

—

false

—

SharpSNormalizer

Category: Normalizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Takes a text and replaces sharp s

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

MinFrequencyThreshold

—

Integer

True

—

false

—

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

String

True

—

true

—

SpellingNormalizer

Category: Normalizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

String

True

—

true

—

StanfordPtbTransformer

Category: Normalizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style. This component operates directly on the text and does not require prior segmentation.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

String

True

—

true

—
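PTB-style escaping replaces bracket characters with symbolic names such as -LRB- and -RRB-. The Python sketch below covers only the six bracket escapes and works directly on the text; the Stanford tokenizer applies further normalizations (for example to quotes) that are not reproduced here. The function name is hypothetical.

```python
# Standard Penn Treebank bracket escapes.
PTB_BRACKETS = {
    "(": "-LRB-", ")": "-RRB-",
    "[": "-LSB-", "]": "-RSB-",
    "{": "-LCB-", "}": "-RCB-",
}

def ptb_escape(text):
    """Replace bracket characters in the text with their PTB escape names."""
    return "".join(PTB_BRACKETS.get(ch, ch) for ch in text)
```

Applied to `f(x) [sic]`, the sketch yields `f-LRB-x-RRB- -LSB-sic-RSB-`.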

TokenCaseTransformer

Category: Normalizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
tokenCase	The case to convert tokens to: UPPERCASE: uppercase everything; LOWERCASE: lowercase everything; NORMALCASE: retain first letter in word and after hyphens, lowercase everything else.	String	True	—	false	—
typesToCopy	A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.	String	True	—	true	—
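The three casing modes can be sketched in Python as follows. The mode names come from the parameter description above; the function name `transform_case` is hypothetical and the code is an illustration of the described rule, not the DKPro Core source.

```python
def transform_case(token, token_case):
    """Convert a token according to UPPERCASE, LOWERCASE or NORMALCASE."""
    if token_case == "UPPERCASE":
        return token.upper()
    if token_case == "LOWERCASE":
        return token.lower()
    if token_case == "NORMALCASE":
        # Keep the first character and any character right after a hyphen
        # as they are; lowercase everything else.
        return "".join(
            ch if i == 0 or token[i - 1] == "-" else ch.lower()
            for i, ch in enumerate(token)
        )
    raise ValueError(f"unknown token case: {token_case!r}")
```

Note that NORMALCASE retains the existing case of the first letter rather than capitalizing it, so `hello-world` stays as it is while `McDonald-SMITH` becomes `Mcdonald-Smith`.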

Tweet Normaliser

Category: Normalizer
Framework: GATE
Version: unknown

Normalises text in tweets (converts spelling mistakes, colloquialisms, typing variations and so on into standard English)

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
corpus	—	gate.Corpus	—	—	—	true
dictURL	—	java.net.URL	—	resources/normaliser/english.jaspell	—	—
document	—	gate.Document	—	—	—	true
initialTextFeature	—	java.lang.String	—	string	—	true
inputASName	—	java.lang.String	—	—	—	true
maxDistance	—	java.lang.String	—	2.0	—	true
normTextFeature	—	java.lang.String	—	string	—	true
origTextFeature	—	java.lang.String	—	origString	—	true
orthURL	—	java.net.URL	—	resources/normaliser/orth.en.csv	—	—
outputASName	—	java.lang.String	—	—	—	true

UmlautNormalizer

Category: Normalizer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Takes a text and checks for umlauts written as "ae", "oe", or "ue" and normalizes them if they really are umlauts depending on a frequency model.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

MinFrequencyThreshold

—

Integer

True

—

false

—

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

String

True

—

true

—
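The frequency-based decision described for UmlautNormalizer can be sketched in Python. The sketch assumes a plain word-frequency dictionary and assumes that `MinFrequencyThreshold` is the minimum frequency the umlaut form must reach; the catalog does not document the parameter, so that reading, the function name, and the sample frequencies are all hypothetical.

```python
UMLAUTS = {"ae": "ä", "oe": "ö", "ue": "ü", "Ae": "Ä", "Oe": "Ö", "Ue": "Ü"}

def normalize_umlauts(token, freq, min_frequency=1):
    """Rewrite ae/oe/ue as umlauts only if the umlaut form is the more plausible word."""
    candidate = token
    for digraph, umlaut in UMLAUTS.items():
        candidate = candidate.replace(digraph, umlaut)
    if candidate == token:
        return token  # nothing to normalize
    candidate_freq = freq.get(candidate, 0)
    # Accept the umlaut form only if it is frequent enough and more
    # frequent than the original spelling.
    if candidate_freq >= min_frequency and candidate_freq > freq.get(token, 0):
        return candidate
    return token
```

With sample frequencies in which `Müller` is more common than `Mueller`, the token `Mueller` is rewritten, whereas `Frauen` is kept because its umlaut form `Fraün` does not occur in the frequency data.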

Parser (24)

BerkeleyParser

Category: Parser
Framework: DKPro Core (UIMA)
Version: 1.8.0

Berkeley Parser annotator. Requires sentences to be annotated beforehand.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

ConstituentMappingLocation

Location of the mapping file for constituent tags to UIMA types.

String

False

—

false

—

POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

String

False

—

false

—

accurate

Set thresholds for accuracy. Default: false (set thresholds for efficiency)

Boolean

True

—

false

—

binarize

Output binarized trees. Default: false

Boolean

True

—

false

—

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Default: true

Boolean

False

—

false

—

keepFunctionLabels

Retain predicted function labels. Model must have been trained with function labels. Default: false

Boolean

True

—

false

—

language

Use this language instead of the language set in the CAS to locate the model.

String

False

—

false

—

modelLocation

Load the model from this location instead of locating the model automatically.

String

False

—

false

—

modelVariant

Override the default variant used to locate the model.

String

False

—

false

—

printTagSet

Log the tag set(s) when a model is loaded.

Default: false

Boolean

True

—

false

—

readPOS

Sets whether to use or not to use already existing POS tags from another annotator for the parsing process. Default: false

Boolean

True

—

false

—

scores

Output inside scores (only for binarized viterbi trees). Default: false

Boolean

True

—

false

—

substates

Output sub-categories (only for binarized Viterbi trees). Default: false

Boolean

True

—

false

—

variational

Use variational rule score approximation instead of max-rule. Default: false

Boolean

True

—

false

—

viterbi

Compute Viterbi derivation instead of max-rule tree. Default: false (max-rule)

Boolean

True

—

false

—

writePOS

Sets whether to create or not to create POS tags. The creation of constituent tags must be turned on for this to work. Default: true

Boolean

True

—

false

—

writePennTree

If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format. Default: false

Boolean

True

—

false

—
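For readers unfamiliar with the Penn Treebank bracket notation, the sketch below serializes a small tree into that style. The nested-tuple tree representation and the function name `to_penn` are assumptions made for this example, not the component's data model.

```python
def to_penn(tree):
    """Serialize a tree to Penn Treebank style brackets.

    A tree is (label, [children]) for a constituent or (pos, word) for a leaf.
    """
    label, rest = tree
    if isinstance(rest, str):
        # Leaf: part-of-speech tag followed by the word.
        return f"({label} {rest})"
    return f"({label} " + " ".join(to_penn(child) for child in rest) + ")"

sentence = ("S", [("NP", [("NNS", "Dogs")]), ("VP", [("VBP", "bark")])])
print(to_penn(sentence))  # (S (NP (NNS Dogs)) (VP (VBP bark)))
```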

CCGParser

Category: Parser
Framework: AlvisNLP
Version: 2012-04-30

Syntax parsing with CCG Parser.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

dependentRole

—

java.lang.String

True

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

executable

—

org.bibliome.util.files.ExecutableFile

True

—

formFeatureName

—

java.lang.String

True

—

headRole

—

java.lang.String

True

—

internalEncoding

—

java.lang.String

True

—

labelFeatureName

—

java.lang.String

True

—

lpTransformation

—

java.lang.Boolean

False

—

maxRuns

—

java.lang.Integer

True

—

maxSuperCats

—

java.lang.Integer

True

—

parserModel

—

org.bibliome.util.files.InputDirectory

True

—

posFeatureName

—

java.lang.String

True

—

relationName

—

java.lang.String

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sentenceFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sentenceLayerName

—

java.lang.String

True

—

sentenceRole

—

java.lang.String

True

—

stanfordMarkedUpScript

—

org.bibliome.util.files.InputFile

False

—

stanfordScript

—

org.bibliome.util.files.ExecutableFile

False

—

superModel

—

org.bibliome.util.files.InputDirectory

True

—

wordLayerName

—

java.lang.String

True

—

ClearNlpParser

Category: Parser
Framework: DKPro Core (UIMA)
Version: 1.8.0

Clear parser annotator.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

modelLocation

Location from which the model is read.

String

False

—

false

—

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

String

False

—

false

—

printTagSet

Write the tag set(s) to the log when a model is loaded.

Boolean

True

—

false

—

English Dependency Parser

Category: Parser
Framework: GATE
Version: unknown

Ready-made application for Stanford English parser

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

—

java.util.List

—

Stanford Parser

—

pipelineURL

—

java.net.URL

—

sample_parser_en.gapp

—

English POS Tagger and Dependency Parser

Category: Parser
Framework: GATE
Version: unknown

Ready-made application for Stanford English POS tagger and parser

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

—

java.util.List

—

Stanford Parser

—

pipelineURL

—

java.net.URL

—

sample_pos+parser_en.gapp

—

Enju Parser

Category: Parser
Framework: NaCTeM (UIMA)
Version: 1.1

A syntactic parser for English. With a wide-coverage probabilistic HPSG grammar and an efficient parsing algorithm, this parser can effectively analyze syntactic/semantic structures of English sentences and provide a user with phrase structures and predicate-argument structures. Main features: Accurate deep analysis - the parser can output both phrase structures and predicate-argument structures. The accuracy of predicate-argument relations is around 90% for newswire articles and biomedical papers. High speed - parsing speed is less than 500 msec. per sentence by default (faster than most Penn Treebank parsers), and less than 50 msec. when using the high-speed setting ("mogura"). Enju website: http://www.nactem.ac.uk/enju

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

DisablePOSTagging

Take tokens and their corresponding POS tags as input from a preceding component.

Boolean

True

—

false

—

DisableTokenisation

Take tokens as input from a preceding component.

Boolean

True

—

false

—

UseBiomedicalModel

Use the biomedical model trained on the GENIA corpus.

Boolean

True

—

false

—

UseHighSpeedParser

Use the high speed parser "mogura".

Boolean

True

—

false

—

EnjuParser

Category: Parser
Framework: AlvisNLP
Version:

Parses sentences with the ENJU dependency parser.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

biology

—

java.lang.Boolean

False

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

dependenciesRelationName

—

java.lang.String

True

—

dependencyHeadRole

—

java.lang.String

True

—

dependencyLabelFeatureName

—

java.lang.String

True

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

enjuEncoding

—

java.lang.String

True

—

enjuExecutable

—

org.bibliome.util.files.ExecutableFile

True

—

nBest

—

java.lang.Integer

True

—

parseNumberFeatureName

—

java.lang.String

True

—

parseStatusFeature

—

java.lang.String

True

—

posFeatureName

—

java.lang.String

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sentenceFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sentenceLayerName

—

java.lang.String

True

—

sentenceRole

—

java.lang.String

True

—

wordFormFeatureName

—

java.lang.String

True

—

wordLayerName

—

java.lang.String

True

—

EnjuParser2

Category: Parser
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

biology

—

java.lang.Boolean

False

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

dependenciesRelationName

—

java.lang.String

True

—

dependencyDependentRole

—

java.lang.String

True

—

dependencyHeadRole

—

java.lang.String

True

—

dependencyLabelFeatureName

—

java.lang.String

True

—

dependentTypeFeatureName

—

java.lang.String

True

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

enjuEncoding

—

java.lang.String

True

—

enjuExecutable

—

org.bibliome.util.files.ExecutableFile

True

—

nBest

—

java.lang.Integer

True

—

parseNumberFeatureName

—

java.lang.String

True

—

parseStatusFeatureName

—

java.lang.String

True

—

posFeatureName

—

java.lang.String

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sentenceFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sentenceLayerName

—

java.lang.String

True

—

sentenceRole

—

java.lang.String

True

—

wordFormFeatureName

—

java.lang.String

True

—

wordLayerName

—

java.lang.String

True

—

FreelingShallowParser

Category: Parser
Framework: NaCTeM (UIMA)
Version: 1.0

Performs tokenisation, lemmatisation, POS tagging and shallow parsing (chunking). Operates on different languages by setting the "language" parameter. Default language is English (en). Also operates on Spanish (es), Catalan (ca), Galician (gl), and Asturian (ast).

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

language

—

String

True

—

false

—

GENIA Dependency Parser

Category: Parser
Framework: NaCTeM (UIMA)
Version: 1.0

A dependency parser for biomedical text. The model was trained on the GENIA Treebank. Original software developed by Tsujii Lab (University of Tokyo) and the Institute for Creative Technologies (University of Southern California). Website: http://people.ict.usc.edu/~sagae/parser/gdep/

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

DisableTokenisation

Take tokens as input from a preceding component.

Boolean

True

—

false

—

ILSP Dependency Parser

Category: Parser
Framework: ILSP (UIMA)
Version: 1.15

ILSP Dependency Parser is a tool trained on the Greek Dependency Treebank (Prokopidis et al., 2005), a resource which comprises data annotated at several linguistic levels. Training data at the level of syntax consisted of over 150K words annotated using a dependency-based syntactic scheme that includes 25 main relations. Different types of parsers (transition-based, graph-based, MaltParser, MateParser) are used during training and application of learned models. ILSP Dependency Parser is used for parsing POS-tagged and lemmatized Greek (EL) sentences.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

useDepParser

The dependency parser to use

String

True

—

false

—

MaltParser

Category: Parser
Framework: DKPro Core (UIMA)
Version: 1.8.0

Dependency parsing using MaltParser.

Required annotations:

Token
Sentence
POS

Generated annotations:

Dependency (annotated over sentence-span)

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

ignoreMissingFeatures

Process anyway, even if the model relies on features that are not supported by this component.

Default: false

Boolean

True

—

false

—

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

modelLocation

Load the model from this location instead of locating the model automatically.

String

False

—

false

—

modelVariant

Override the default variant used to locate the model.

String

False

—

false

—

printTagSet

Log the tag set(s) when a model is loaded.

Default: false

Boolean

True

—

false

—

MateParser

Category: Parser
Framework: DKPro Core (UIMA)
Version: 1.8.0

DKPro Annotator for the MateToolsParser.

Please cite the following paper, if you use the parser: Bernd Bohnet. 2010. Top Accuracy and Fast Dependency Parsing is not a Contradiction. The 23rd International Conference on Computational Linguistics (COLING 2010), Beijing, China.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

DependencyMappingLocation

Load the dependency to UIMA type mapping from this location instead of locating the mapping automatically.

String

False

—

false

—

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

modelLocation

Load the model from this location instead of locating the model automatically.

String

False

—

false

—

modelVariant

Override the default variant used to locate the model.

String

False

—

false

—

printTagSet

Log the tag set(s) when a model is loaded.

Default: false

Boolean

True

—

false

—

Minipar Wrapper

Category: Parser
Framework: GATE
Version: unknown

MiniPar is a shallow parser. It determines the dependency relationships between the words of a sentence.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationInputSetName

—

java.lang.String

—

true

annotationOutputSetName

—

java.lang.String

—

true

annotationTypeName

—

java.lang.String

—

DepTreeNode

—

false

document

—

gate.Document

—

true

miniparBinary

—

java.net.URL

—

true

miniparDataDir

—

java.net.URL

—

true

MstParser

Category: Parser
Framework: DKPro Core (UIMA)
Version: 1.8.0

Dependency parsing using MSTParser.

Wrapper for the MSTParser (high memory requirements). More information about the parser can be found here.

The MSTParser models tend to be very large, e.g. the Eisner model is about 600 MB uncompressed. With this model, parsing a simple sentence with MSTParser requires about 3 GB heap memory.

This component feeds MSTParser only with the FORM (token) and POS (part-of-speech) fields. LEMMA, CPOS, and other columns from the CONLL 2006 format are not generated (cf. mstparser.DependencyInstance).
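As an illustration of what such sparse input looks like, the sketch below builds CoNLL 2006 (CoNLL-X) rows in which only the ID, FORM, and POS columns carry values and every other column holds the `_` placeholder. The function name and the sample sentence are made up; the actual component passes these fields through the parser's Java API rather than through text rows.

```python
# The ten columns of the CoNLL 2006 (CoNLL-X) format, in order.
CONLL_COLUMNS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
                 "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]

def to_conll_rows(tokens):
    """tokens: list of (form, pos) pairs for one sentence."""
    rows = []
    for i, (form, pos) in enumerate(tokens, start=1):
        row = {name: "_" for name in CONLL_COLUMNS}  # "_" marks an unfilled column
        row["ID"] = str(i)
        row["FORM"] = form
        row["POSTAG"] = pos
        rows.append("\t".join(row[name] for name in CONLL_COLUMNS))
    return rows

for line in to_conll_rows([("Dogs", "NNS"), ("bark", "VBP")]):
    print(line)
```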

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

DependencyMappingLocation

Load the dependency to UIMA type mapping from this location instead of locating the mapping automatically.

String

False

—

false

—

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

modelLocation

Load the model from this location instead of locating the model automatically.

String

False

—

false

—

modelVariant

Override the default variant used to locate the model.

String

False

—

false

—

order

Specifies the order/scope of features. 1 only has features over single edges and 2 has features over pairs of adjacent edges in the tree. The model must have been trained with the respective order set here.

Integer

False

—

false

—
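The difference between the two orders can be made concrete by listing the parts of a tree that each one scores. This is a sketch only: it interprets "pairs of adjacent edges" as two neighbouring dependents on the same side of a shared head, which is an assumption about the factorization, and the function names are made up.

```python
def first_order_parts(heads):
    """heads[i] is the head of token i+1 (0 is the artificial root).

    Returns one (head, dependent) part per edge.
    """
    return [(h, d) for d, h in enumerate(heads, start=1)]

def second_order_parts(heads):
    """Return (head, dependent, next dependent) parts for adjacent sibling edges."""
    children = {}
    for d, h in enumerate(heads, start=1):
        children.setdefault(h, []).append(d)
    parts = []
    for h, deps in children.items():
        left = sorted((d for d in deps if d < h), reverse=True)  # nearest to head first
        right = sorted(d for d in deps if d > h)
        for side in (left, right):
            for a, b in zip(side, side[1:]):
                parts.append((h, a, b))
    return parts

heads = [2, 0, 5, 5, 2]  # token 2 is the root; tokens 3 and 4 both attach to token 5
print(first_order_parts(heads))
print(second_order_parts(heads))
```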

printTagSet

Log the tag set(s) when a model is loaded.

Default: false

Boolean

True

—

false

—

OpenNLP Parser

Category: Parser
Framework: GATE
Version: unknown

Syntactic parser from Apache OpenNLP

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

model

—

java.net.URL

—

models/english/en-parser-chunking.bin

—

OpenNLPParser

Category: Parser
Framework: NaCTeM (UIMA)
Version: 1.0

Parse the document and create phrasal and clausal annotations over the text. Uses the OpenNLP MaxEnt parser. This analysis engine takes a parameter called "ParseTagMappings" which maps each parse tag to a syntax annotation type. The parse tags come from the standard Penn Treebank phrase and clause tags (produced by the OpenNLP parser), and each syntax annotation type must be defined in the type system and have a corresponding JCas Java class.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

AdvancePercentage

—

Float

False

—

false

—

BeamSize

—

Integer

False

—

false

—

CaseSensitiveTagDictionary

—

Boolean

False

—

false

—

ModelDirectory

—

String

True

—

false

—

ParseTagMappings

—

String

True

—

true

—

UseTagDictionary

—

Boolean

False

—

false

—

OpenNlpParser

Category: Parser
Framework: DKPro Core (UIMA)
Version: 1.8.0

OpenNLP parser. The parser ignores existing POS tags and internally creates new ones. However, these tags are only added as annotation if explicitly requested via #PARAM_WRITE_POS.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

ConstituentMappingLocation

Location of the mapping file for constituent tags to UIMA types.

String

False

—

false

—

POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

String

False

—

false

—

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Default: true

Boolean

False

—

false

—

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

modelLocation

Load the model from this location instead of locating the model automatically.

String

False

—

false

—

modelVariant

Override the default variant used to locate the model.

String

False

—

false

—

printTagSet

Log the tag set(s) when a model is loaded.

Default: false

Boolean

True

—

false

—

writePOS

Whether to create POS tags. The creation of constituent tags must be turned on for this to work.

Default: true

Boolean

True

—

false

—

writePennTree

If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format.

Default: false

Boolean

True

—

false

—

RASP2 Parser

Category: Parser
Framework: GATE
Version: unknown

RASP dependency parser

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

charset

—

java.lang.String

—

ISO-8859-1

—

true

debug

—

java.lang.Boolean

—

false

—

true

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

outputASName

—

java.lang.String

—

true

outputFormat

—

java.lang.String

—

-og

—

true

phrasalVerbs

—

java.lang.Boolean

—

true

—

true

raspHome

—

java.net.URL

—

file:/usr/local/bin/RASP

—

false

subcategorisation

—

java.lang.Boolean

—

true

—

true

Stanford Dependency Parser

Category: Parser
Framework: NaCTeM (UIMA)
Version: 1.6.1

Generates Stanford-style dependencies together with POS tokens for English. It wraps parts of the Stanford Parser version 1.6.1. The project's website: http://www-nlp.stanford.edu/downloads/lex-parser.shtml.

StanfordDependencyConverter

Category: Parser
Framework: DKPro Core (UIMA)
Version: 1.8.0

Converts a constituency structure into a dependency structure.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

language

Use this language instead of the document language to resolve the model and tag set mapping.

String

False

—

false

—

mode

Sets the kind of dependencies being created.

Default: DependenciesMode#COLLAPSED TREE

String

False

—

false

—

originalDependencies

Create original dependencies. If this is disabled, universal dependencies are created. The default is to create the original dependencies.

Boolean

True

—

false

—

StanfordParser

Category: Parser
Framework: GATE
Version: unknown

Stanford parser wrapper

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

addConstituentAnnotations

—

java.lang.Boolean

—

true

—

true

addDependencyAnnotations

—

java.lang.Boolean

—

true

—

true

addDependencyFeatures

—

java.lang.Boolean

—

true

—

true

addPosTags

—

java.lang.Boolean

—

false

—

true

annotationSetName

—

java.lang.String

—

true

corpus

—

gate.Corpus

—

true

debug

—

java.lang.Boolean

—

false

—

true

dependencyMode

—

gate.stanford.DependencyMode

—

Typed

—

true

document

—

gate.Document

—

true

includeExtraDependencies

—

java.lang.Boolean

—

false

—

true

inputSentenceType

—

java.lang.String

—

Sentence

—

true

inputTokenType

—

java.lang.String

—

Token

—

true

mappingFile

—

java.net.URL

—

parserFile

—

java.net.URL

—

resources/englishRNN.ser.gz

—

reusePosTags

—

java.lang.Boolean

—

false

—

true

tlppClass

—

java.lang.String

—

edu.stanford.nlp.parser.lexparser.EnglishTreebankParserParams

—

useMapping

—

java.lang.Boolean

—

false

—

true

StanfordParser

Category: Parser
Framework: DKPro Core (UIMA)
Version: 1.8.0

Stanford Parser component.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

ConstituentMappingLocation

Location of the mapping file for constituent tags to UIMA types.

String

False

—

false

—

POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

String

False

—

false

—

annotationTypeToParse

This parameter can be used to override the standard behavior which uses the Sentence annotation as the basic unit for parsing. If the parameter is set with the name of an annotation type x, the parser will no longer parse Sentence-annotations, but x-Annotations. Default: null

String

False

—

false

—

language

Use this language instead of the document language to resolve the model and tag set mapping.

String

False

—

false

—

maxItems

Controls when the factored parser considers a sentence to be too complex and falls back to the PCFG parser. Default: 200000

Integer

True

—

false

—

maxSentenceLength

Maximum number of tokens in a sentence. Longer sentences are not parsed. This is to avoid out of memory exceptions. Default: 130

Integer

True

—

false

—

mode

Sets the kind of dependencies being created.

Default: DependenciesMode#COLLAPSED TREE

String

False

—

false

—

modelLocation

Location from which the model is read.

String

False

—

false

—

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

String

False

—

false

—

printTagSet

Write the tag set(s) to the log when a model is loaded.

Boolean

True

—

false

—

ptb3Escaping

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

Boolean

True

—

false

—
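The bracket tokens named in the description follow the Penn Treebank convention of replacing literal brackets with symbolic tokens. The sketch below covers only the bracket transforms; the full set of PTB3 transforms is larger, and the function name is made up for this example.

```python
# Penn Treebank replacements for bracket tokens.
PTB3_BRACKETS = {
    "(": "-LRB-", ")": "-RRB-",
    "[": "-LSB-", "]": "-RSB-",
    "{": "-LCB-", "}": "-RCB-",
}

def ptb3_escape(tokens):
    """Replace bracket tokens with their PTB3 names; other tokens pass through."""
    return [PTB3_BRACKETS.get(token, token) for token in tokens]

print(ptb3_escape(["(", "see", "[", "1", "]", ")"]))
```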

quoteBegin

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

String

False

—

true

—

quoteEnd

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

String

False

—

true

—
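A sketch of how `quoteBegin` and `quoteEnd` might be applied before parsing, assuming that "escaped accordingly" means mapping the listed tokens to the Penn Treebank quote tokens (two backticks for opening, two apostrophes for closing). That mapping and the function name are assumptions, not confirmed behaviour of the component.

```python
def escape_quotes(tokens, quote_begin=(), quote_end=()):
    """Replace configured quote tokens with Penn Treebank quote tokens."""
    begin, end = set(quote_begin), set(quote_end)
    escaped = []
    for token in tokens:
        if token in begin:
            escaped.append("``")   # PTB opening quote
        elif token in end:
            escaped.append("''")   # PTB closing quote
        else:
            escaped.append(token)
    return escaped

print(escape_quotes(["«", "oui", "»"], quote_begin=["«"], quote_end=["»"]))
```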

readPOS

Whether to reuse POS tags already created by another annotator during parsing. Default: true

Boolean

True

—

false

—

writeConstituent

Whether to create constituent tags. This is required for POS tagging and lemmatization. Default: true

Boolean

True

—

false

—

writeDependency

Whether to create dependency annotations.

Default: true

Boolean

True

—

false

—

writePOS

Whether to create POS tags. The creation of constituent tags must be turned on for this to work. Default: false

Boolean

True

—

false

—

writePennTree

If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format. Default: false

Boolean

True

—

false

—

Textalytics Lemmatization, PoS and Parsing

http://textalytics.com/core/parser-1.2

Category: Parser
Framework: GATE
Version: unknown

Textalytics Lemmatization, PoS and Parsing

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

apiURL

—

java.lang.String

—

—

true

corpus

—

gate.Corpus

—

true

debug

—

java.lang.Boolean

—

true

dictionary

—

java.lang.String

—

true

disambiguationLevel

—

daedalus.textalytics.gate.param.DisambiguationLevel

—

strong_disambiguation

—

true

document

—

gate.Document

—

true

inputASTypes

—

java.util.List

—

true

inputASname

—

java.lang.String

—

true

key

—

java.lang.String

—

true

lang

—

java.lang.String

—

true

outputASname

—

java.lang.String

—

Textalytics

—

true

relaxedTypography

—

java.lang.Boolean

—

true

—

java.lang.String

—

true

unknownWords

—

java.lang.Boolean

—

true

Pre-built Workflows (12)

Arabic IE System

Category: Pre-built Workflows
Framework: GATE
Version: unknown

Ready-made Arabic IE application

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

—

java.util.List

—

pipelineURL

—

java.net.URL

—

Cebuano IE System

Category: Pre-built Workflows
Framework: GATE
Version: unknown

Ready-made Cebuano IE application

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

—

java.util.List

—

pipelineURL

—

java.net.URL

—

Chinese IE System

Category: Pre-built Workflows
Framework: GATE
Version: unknown

Ready-made Chinese IE application

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

—

java.util.List

—

pipelineURL

—

java.net.URL

—

French IE System

Category: Pre-built Workflows
Framework: GATE
Version: unknown

Ready-made French IE application

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

—

java.util.List

—

pipelineURL

—

java.net.URL

—

German IE System

Category: Pre-built Workflows
Framework: GATE
Version: unknown

Ready-made German IE application

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

—

java.util.List

—

pipelineURL

—

java.net.URL

—

Measurements

Category: Pre-built Workflows
Framework: GATE
Version: unknown

Ready-made application for measurement annotator

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

—

java.util.List

—

pipelineURL

—

java.net.URL

—

Romanian IE System

Category: Pre-built Workflows
Framework: GATE
Version: unknown

Ready-made Romanian IE application

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

—

java.util.List

—

pipelineURL

—

java.net.URL

—

RussIE

Category: Pre-built Workflows
Framework: GATE
Version: unknown

Basic version of the RussIE application

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

—

java.util.List

—

pipelineURL

—

java.net.URL

—

RussIE + Inflectional Gazetteer & OrthoMatcher

Category: Pre-built Workflows
Framework: GATE
Version: unknown

RussIE application with orthomatcher and inflexional gazetteer

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

—

java.util.List

—

pipelineURL

—

java.net.URL

—

RussIE + Inflectional Gazetteer

Category: Pre-built Workflows
Framework: GATE
Version: unknown

RussIE application with inflexional gazetteer

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

—

java.util.List

—

pipelineURL

—

java.net.URL

—

RussIE + OrthoMatcher

Category: Pre-built Workflows
Framework: GATE
Version: unknown

RussIE application with orthomatcher

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

—

java.util.List

—

pipelineURL

—

java.net.URL

—

TwitIE (EN)

Category: Pre-built Workflows
Framework: GATE
Version: unknown

English TwitIE application

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

—

java.util.List

—

pipelineURL

—

java.net.URL

—

Readability (1)

ReadabilityAnnotator

Category: Readability
Framework: DKPro Core (UIMA)
Version: 1.8.0

Assign a set of popular readability scores to the text.

Reader (91)

ACE Corpus Reader

Category: Reader
Framework: NaCTeM (UIMA)
Version: 1.0

Reads ...

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

folders

A list of folders containing ACE 2005 corpus files. The folders must contain pairs of *.sgm and *.apf.xml files.

String

True

—

true

—
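The pairing requirement on the `folders` parameter can be checked before running the reader. A minimal sketch, assuming that the two files of a pair share the same base name; the function name is made up and this is not the reader's own code.

```python
from pathlib import Path

def find_ace_pairs(folder):
    """Return (sgm, apf) path pairs for documents that have both files."""
    folder = Path(folder)
    pairs = []
    for sgm in sorted(folder.glob("*.sgm")):
        # Assumed naming: doc.sgm is paired with doc.apf.xml.
        apf = sgm.with_name(sgm.stem + ".apf.xml")
        if apf.exists():
            pairs.append((sgm, apf))
    return pairs
```

A `.sgm` file without its `.apf.xml` counterpart is left out of the result, which makes incomplete folders easy to spot by comparing counts.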

AclAnthologyReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reads the ACL Anthology corpus and outputs CASes with plain text documents.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

Encoding

Name of configuration parameter that contains the character encoding used by the input files. If not specified, the default system encoding will be used.

String

True

—

false

—

includeHidden

Include hidden files and directories.

Boolean

True

—

false

—

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

String

False

—

false

—

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

String

False

—

true

—
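The include/exclude semantics can be sketched with a small matcher. This is a simplified illustration, not the reader's implementation: it handles only the `[+]` and `[-]` prefixes, the `**/` directory wildcard, and `*` within a name, and real Ant pattern matching covers more cases. The function names and sample paths are made up.

```python
import re

def _to_regex(pattern):
    """Translate a simplified Ant-like glob into a regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")   # any number of sub-directories, including none
            i += 3
        elif pattern[i] == "*":
            out.append("[^/]*")      # part of a single file or directory name
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")

def select(paths, patterns):
    """Keep paths matched by an include pattern and by no exclude pattern."""
    includes = [_to_regex(p[3:]) for p in patterns if p.startswith("[+]")]
    excludes = [_to_regex(p[3:]) for p in patterns if p.startswith("[-]")]
    return [path for path in paths
            if any(r.match(path) for r in includes)
            and not any(r.match(path) for r in excludes)]

paths = ["a.txt", "sub/b.txt", "sub/deep/c.txt", "sub/draft/d.txt", "e.pdf"]
print(select(paths, ["[+]**/*.txt", "[-]**/draft/*"]))
```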

sourceLocation

Location from which the input is read.

String

False

—

false

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—

Aimed Collection Reader

Category: Reader
Framework: NaCTeM (UIMA)
Version: 1.0

Reads the Aimed corpus (225 abstracts from MEDLINE) with the gold standard sentence, protein, and protein-protein interaction annotations.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

GeneratePpiAnnotations

—

Boolean

True

—

false

—

GenerateProteinAnnotations

—

Boolean

True

—

false

—

GenerateSentenceAnnotations

—

Boolean

True

—

false

—

NumberOfArticles

—

Integer

False

—

false

—

PubmedIds

Specifies pubmedIDs to pick articles. This parameter has the highest priority.

Integer

False

—

true

—

StartingFromArticle

—

Integer

False

—

false

—

AlvisAEReader

Category: Reader
Framework: AlvisNLP
Version:

Reads documents and annotations from an AlvisAE campaign.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

campaignId

—

java.lang.Integer

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

groupItemRolePrefix

—

java.lang.String

True

—

htmlLayerName

—

java.lang.String

True

—

linkToAnnotation

—

java.lang.Boolean

True

—

maxDate

—

java.lang.String

False

—

password

—

java.lang.String

True

—

schema

—

java.lang.String

True

—

sectionName

—

java.lang.String

True

—

taskId

—

java.lang.Integer

False

—

textBoundFragmentRolePrefix

—

java.lang.String

True

—

textBoundRelationName

—

java.lang.String

True

—

typeFeature

—

java.lang.String

True

—

url

—

java.lang.String

True

—

userId

—

java.lang.Integer

False

—

userLayerName

—

java.lang.String

True

—

username

—

java.lang.String

True

—

AlvisAEReader2

Category: Reader
Framework: AlvisNLP
Version:

Reads documents and annotations from an AlvisAE campaign.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

adjudicate

—

java.lang.Boolean

False

—

annotationIdFeature

—

java.lang.String

True

—

annotationSetIdFeature

—

java.lang.String

True

—

campaignId

—

java.lang.Integer

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

createdFeature

—

java.lang.String

True

—

descriptionFeature

—

java.lang.String

True

—

docDescriptions

—

java.lang.String[]

False

—

docExternalIds

—

java.lang.String[]

False

—

docIds

—

java.lang.Integer[]

False

—

externalIdFeature

—

java.lang.String

True

—

fragmentRolePrefix

—

java.lang.String

True

—

fragmentTypeFeature

—

java.lang.String

True

—

fragmentsLayerName

—

java.lang.String

True

—

head

—

java.lang.Boolean

True

—

itemRolePrefix

—

java.lang.String

True

—

kindFeature

—

java.lang.String

True

—

loadDependencies

—

java.lang.Boolean

False

—

loadGroups

—

java.lang.Boolean

True

—

loadRelations

—

java.lang.Boolean

True

—

loadTextBound

—

java.lang.Boolean

True

—

oldModel

—

java.lang.Boolean

False

—

password

—

java.lang.String

True

—

referentFeature

—

java.lang.String

True

—

schema

—

java.lang.String

True

—

sectionName

—

java.lang.String

True

—

sourceRolePrefix

—

java.lang.String

True

—

taskFeature

—

java.lang.String

False

—

taskId

—

java.lang.Integer

False

—

taskName

—

java.lang.String

False

—

typeFeature

—

java.lang.String

True

—

url

—

java.lang.String

True

—

userFeature

—

java.lang.String

False

—

userIds

—

java.lang.Integer[]

False

—

userNames

—

java.lang.String[]

False

—

username

—

java.lang.String

True

—

AnimalReader

Category: Reader
Framework: AlvisNLP
Version: 2012-04-30

Project-specific file reader.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

bodySectionName

—

java.lang.String

True

—

charset

—

java.lang.String

True

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

linesLimit

—

java.lang.Integer

False

—

sizeLimit

—

java.lang.Integer

False

—

sourcePath

—

org.bibliome.util.streams.SourceStream

True

—

titleSectionName

—

java.lang.String

True

—

xmlDir

—

org.bibliome.util.files.InputDirectory

True

—

AssertAnnotations$InternalStringReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Descriptor automatically generated by uimaFIT

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

documentText

—

String

True

—

false

—

language

—

String

True

—

false

—

BIO Format Collection Reader

Category: Reader
Framework: NaCTeM (UIMA)
Version: 1.0

Reads BIO format files from specified directory.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

BinaryCasReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

UIMA Binary CAS formats reader.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

includeHidden

Include hidden files and directories.

Boolean

True

—

false

—

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

String

False

—

false

—

patterns

String

False

—

true

—

sourceLocation

Location from which the input is read.

String

False

—

false

—

typeSystemLocation

The location from which to obtain the type system when the CAS is stored in form 0.

String

False

—

false

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—

BioC Reader

Category: Reader
Framework: NaCTeM (UIMA)
Version: 1.0

Reads a file in BioC format. A BioC file contains a collection of documents with annotations. BioC website: http://bioc.sourceforge.net/

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

inputFile

A path to a BioC file.

String

True

—

false

—

BioCreative CHEMDNER Reader

Category: Reader
Framework: NaCTeM (UIMA)
Version: 0.1

Reads data prepared specifically for the BioCreative IV's CHEMDNER track. This component transcribes annotations into the BioC type system.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

abstractsFile

A file with a set of abstracts

String

True

—

false

—

annotationsFile

A file with standoff annotations

String

True

—

false

—

BioNLP ST Data Reader

Category: Reader
Framework: NaCTeM (UIMA)
Version: 1.1

Reads files formatted for the BioNLP Shared Task series and outputs documents with named entity, relation and event annotations. The file syntax is described at http://2013.bionlp-st.org/.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

folders

A list of folders containing BioNLP Shared Task-format files. The folders must contain at least ".txt" files and optionally ".a1" and ".a2" files.

String

True

—

true

—

BioNLPSTReader

Category: Reader
Framework: AlvisNLP
Version:

Reads documents and annotations in the BioNLP-ST 2013 a1/a2 format.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

a1Dir

—

org.bibliome.util.files.InputDirectory

False

—

a2Dir

—

org.bibliome.util.files.InputDirectory

False

—

active

—

alvisnlp.corpus.expressions.Expression

True

—

charset

—

java.lang.String

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

equivalenceItemPrefix

—

java.lang.String

True

—

equivalenceRelationName

—

java.lang.String

True

—

eventKind

—

java.lang.String

True

—

fragmentCountFeatureName

—

java.lang.String

True

—

idFeatureName

—

java.lang.String

True

—

kindFeatureName

—

java.lang.String

True

—

relationKind

—

java.lang.String

True

—

schema

—

org.bibliome.util.bionlpst.schema.DocumentSchema

False

—

sectionName

—

java.lang.String

True

—

textBoundAsAnnotations

—

java.lang.Boolean

False

—

textBoundFragmentRolePrefix

—

java.lang.String

True

—

textDir

—

org.bibliome.util.files.InputDirectory

True

—

textKind

—

java.lang.String

True

—

triggerRole

—

java.lang.String

True

—

typeFeatureName

—

java.lang.String

True

—

BlikiWikipediaReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Bliki-based Wikipedia reader.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

language

The language of the wiki installation.

String

True

—

false

—

outputPlainText

Whether the reader outputs plain text or wiki markup.

Boolean

True

—

false

—

pageTitles

Which page titles should be retrieved.

String

True

—

true

—

sourceLocation

URL of the wiki API, e.g. for the English Wikipedia it should be http://en.wikipedia.org/w/api.php

String

True

—

false

—

BncReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reader for the British National Corpus (XML version).

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

String

False

—

false

—

POSTagSet

Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data.

String

False

—

false

—

includeHidden

Include hidden files and directories.

Boolean

True

—

false

—

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

String

False

—

false

—

patterns

String

False

—

true

—

sourceLocation

Location from which the input is read.

String

False

—

false

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—

BratReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reader for the brat format.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

includeHidden

Include hidden files and directories.

Boolean

True

—

false

—

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

String

False

—

false

—

patterns

String

False

—

true

—

relationTypes

Types that are relations. It is mandatory to provide the type name followed by two feature names that represent Arg1 and Arg2 separated by colons, e.g. de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency:Governor:Dependent{A}. Additionally, a subcategorization feature may be specified.

String

True

—

true

—

sourceEncoding

Name of configuration parameter that contains the character encoding used by the input files.

String

True

—

false

—

sourceLocation

Location from which the input is read.

String

False

—

false

—

textAnnotationTypes

Types that are text annotations. It is mandatory to provide the type name which can optionally be followed by a subcategorization feature. Using this parameter is only necessary to specify a subcategorization feature. Otherwise, text annotation types are automatically detected.

String

True

—

true

—

typeMappings

—

String

False

—

true

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—

CombinationReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Combines multiple readers into a single reader.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

readers

—

String

True

—

true

—

Conll2000Reader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reads the CoNLL 2000 chunking format.


He        PRP  B-NP
reckons   VBZ  B-VP
the       DT   B-NP
current   JJ   I-NP
account   NN   I-NP
deficit   NN   I-NP
will      MD   B-VP
narrow    VB   I-VP
to        TO   B-PP
only      RB   B-NP
#         #    I-NP
1.8       CD   I-NP
billion   CD   I-NP
in        IN   B-PP
September NNP  B-NP
.         .    O

FORM - token
POSTAG - part-of-speech tag
CHUNK - chunk (BIO encoded)

Sentences are separated by a blank new line.
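The three-column layout above can be illustrated with a minimal parsing sketch. It is not the reader's actual implementation; the function names `read_conll2000` and `chunks` and the `SAMPLE` text are hypothetical and exist only for this illustration. The sketch assumes whitespace-separated columns and treats an `I-` tag that does not continue an open chunk as outside any chunk.

```python
# Illustrative sketch only: names and sample data are assumptions, not DKPro Core API.
SAMPLE = """He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP

. . O
"""


def read_conll2000(text):
    """Split text into sentences, each a list of (form, postag, chunk) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        form, postag, chunk = line.split()
        current.append((form, postag, chunk))
    if current:
        sentences.append(current)
    return sentences


def chunks(sentence):
    """Group BIO-encoded chunk tags into (label, [tokens]) pairs."""
    result, open_label = [], None
    for form, _postag, tag in sentence:
        if tag.startswith("B-"):
            # "B-" always opens a new chunk.
            open_label = tag[2:]
            result.append((open_label, [form]))
        elif tag.startswith("I-") and tag[2:] == open_label:
            # "I-" continues the chunk that is currently open.
            result[-1][1].append(form)
        else:
            # "O" (or a stray "I-") closes any open chunk.
            open_label = None
    return result
```

Calling `chunks(read_conll2000(SAMPLE)[0])` groups the first sentence into one single-token NP, one VP and one four-token NP.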

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

ChunkMappingLocation

Load the chunk tag to UIMA type mapping from this location instead of locating the mapping automatically.

String

False

—

false

—

ChunkTagSet

Use this chunk tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.

String

False

—

false

—

POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

String

False

—

false

—

POSTagSet

Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.

String

False

—

false

—

includeHidden

Include hidden files and directories.

Boolean

True

—

false

—

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Default: true

Boolean

False

—

false

—

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

String

False

—

false

—

patterns

String

False

—

true

—

readChunk

Read chunk information.

Default: true

Boolean

True

—

false

—

readPOS

Read part-of-speech information.

Default: true

Boolean

True

—

false

—

sourceEncoding

Character encoding of the input data.

String

True

—

false

—

sourceLocation

Location from which the input is read.

String

False

—

false

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—

Conll2002Reader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reads the CoNLL 2002 named entity format. The columns are separated by a single space, as illustrated below.


Wolff      B-PER
,          O
currently  O
a          O
journalist O
in         O
Argentina  B-LOC
,          O
played     O
with       O
Del        B-PER
Bosque     I-PER
in         O
the        O
final      O
years      O
of         O
the        O
seventies  O
in         O
Real       B-ORG
Madrid     I-ORG
.          O

FORM - token
NER - named entity (BIO encoded)

Sentences are separated by a blank new line.
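As a rough illustration of the BIO encoding in the NER column, the following sketch collects the named entities of each sentence. The function name `read_entities` and the `SAMPLE` text are hypothetical; this is not how the reader itself is implemented, and it assumes each line holds exactly a token and a tag.

```python
# Illustrative sketch only: names and sample data are assumptions, not DKPro Core API.
SAMPLE = """Wolff B-PER
, O
played O
with O
Del B-PER
Bosque I-PER
in O
Real B-ORG
Madrid I-ORG
. O
"""


def read_entities(text):
    """Return one list of (label, entity text) pairs per sentence."""
    sentences, entities, open_label, has_tokens = [], [], None, False
    # The extra "" makes sure the last sentence is flushed.
    for line in text.splitlines() + [""]:
        if not line.strip():
            if has_tokens:
                sentences.append([(label, " ".join(toks)) for label, toks in entities])
            entities, open_label, has_tokens = [], None, False
            continue
        form, tag = line.split()
        has_tokens = True
        if tag.startswith("B-"):
            # "B-" opens a new entity.
            open_label = tag[2:]
            entities.append((open_label, [form]))
        elif tag.startswith("I-") and tag[2:] == open_label:
            # "I-" extends the entity that is currently open.
            entities[-1][1].append(form)
        else:
            # "O" closes any open entity.
            open_label = None
    return sentences
```

For the sample above, `read_entities(SAMPLE)` yields a single sentence containing the entities Wolff (PER), Del Bosque (PER) and Real Madrid (ORG).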

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

includeHidden

Include hidden files and directories.

Boolean

True

—

false

—

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Default: true

Boolean

False

—

false

—

language

The language.

String

False

—

false

—

patterns

String

False

—

true

—

readNamedEntity

Read named entity information.

Default: true

Boolean

True

—

false

—

sourceEncoding

Character encoding of the input data.

String

True

—

false

—

sourceLocation

Location from which the input is read.

String

False

—

false

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—

Conll2006Reader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reads a file in the CoNLL-2006 format (aka CoNLL-X).


Heutzutage heutzutage ADV _ _ ADV _ _

ID - (ignored) Token counter, starting at 1 for each new sentence.
FORM - (Token) Word form or punctuation symbol.
LEMMA - (Lemma) Lemma or stem of the word form (depending on the particular data set), or an underscore if not available.
CPOSTAG - (unused)
POSTAG - (POS) Fine-grained part-of-speech tag, where the tagset depends on the language, or identical to the coarse-grained part-of-speech tag if not available.
FEATS - (MorphologicalFeatures) Unordered set of syntactic and/or morphological features (depending on the particular language), separated by a vertical bar (|), or an underscore if not available.
HEAD - (Dependency) Head of the current token, which is either a value of ID or zero ('0'). Note that depending on the original treebank annotation, there may be multiple tokens with an ID of zero.
DEPREL - (Dependency) Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'.
PHEAD - (ignored) Projective head of current token, which is either a value of ID or zero ('0'), or an underscore if not available. Note that depending on the original treebank annotation, there may be multiple tokens with an ID of zero. The dependency structure resulting from the PHEAD column is guaranteed to be projective (but is not available for all languages), whereas the structures resulting from the HEAD column will be non-projective for some sentences of some languages (but is always available).
PDEPREL - (ignored) Dependency relation to the PHEAD, or an underscore if not available. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'.

Sentences are separated by a blank new line.
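The ten columns listed above can be mapped onto a record as in the following sketch. It assumes tab-separated columns, as in the CoNLL-X shared task data; the function name `parse_token` and the sample line are made up for illustration and are not part of the reader.

```python
# Illustrative sketch only: names and sample data are assumptions, not DKPro Core API.
COLUMNS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
           "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]
# Columns in which an underscore means "not available".
OPTIONAL = {"LEMMA", "FEATS", "PHEAD", "PDEPREL"}

SAMPLE_LINE = "2\tgeht\tgehen\tV\tVVFIN\t3|Sg|Pres|Ind\t0\tROOT\t_\t_"


def parse_token(line):
    """Turn one tab-separated CoNLL-X line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    token = dict(zip(COLUMNS, values))
    for name in OPTIONAL:
        if token[name] == "_":
            token[name] = None
    if token["FEATS"] is not None:
        # FEATS is an unordered set separated by vertical bars.
        token["FEATS"] = token["FEATS"].split("|")
    # HEAD is a token ID, or 0 for a root.
    token["HEAD"] = int(token["HEAD"])
    return token
```

With the sample line, `parse_token(SAMPLE_LINE)` returns a record whose `HEAD` is the integer 0, whose `FEATS` is a list of four features, and whose `PHEAD` and `PDEPREL` are `None`.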

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

String

False

—

false

—

POSTagSet

String

False

—

false

—

includeHidden

Include hidden files and directories.

Boolean

True

—

false

—

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

String

False

—

false

—

patterns

String

False

—

true

—

readDependency

—

Boolean

True

—

false

—

readLemma

—

Boolean

True

—

false

—

readMorph

—

Boolean

True

—

false

—

readPOS

—

Boolean

True

—

false

—

sourceEncoding

—

String

True

—

false

—

sourceLocation

Location from which the input is read.

String

False

—

false

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—

Conll2009Reader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reads a file in the CoNLL-2009 format.

ID - (ignored) Token counter, starting at 1 for each new sentence.
FORM - (Token) Word form or punctuation symbol.
LEMMA - (Lemma) Gold-standard lemma of FORM
PLEMMA - (ignored) Automatically predicted lemma of FORM
POS - (POS) Fine-grained part-of-speech tag, where the tagset depends on the language, or identical to the coarse-grained part-of-speech tag if not available.
PPOS - (ignored) Automatically predicted major POS by a language-specific tagger
FEAT - (MorphologicalFeatures) Unordered set of syntactic and/or morphological features (depending on the particular language), separated by a vertical bar (|), or an underscore if not available.
PFEAT - (ignored) Automatically predicted morphological features (if applicable)
HEAD - (Dependency) Head of the current token, which is either a value of ID or zero ('0'). Note that depending on the original treebank annotation, there may be multiple tokens with an ID of zero.
PHEAD - (ignored) Automatically predicted syntactic head
DEPREL - (Dependency) Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'.
PDEPREL - (ignored) Automatically predicted dependency relation to PHEAD
FILLPRED - (ignored) Contains 'Y' for argument-bearing tokens
PRED - (SemanticPredicate) (sense) identifier of a semantic 'predicate' coming from a current token
APREDs - (SemanticArgument) Columns with argument labels for each semantic predicate (in the ID order)

Sentences are separated by a blank new line.
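The relationship between the PRED column and the variable number of APRED columns can be sketched as follows: the k-th token with a PRED value owns the k-th APRED column. The function name `semantic_frames` and the three-token sample are hypothetical, and the sketch splits on any whitespace for readability, whereas real files are typically tab-separated.

```python
# Illustrative sketch only: names and sample data are assumptions, not DKPro Core API.
FIXED = ["ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS", "FEAT", "PFEAT",
         "HEAD", "PHEAD", "DEPREL", "PDEPREL", "FILLPRED", "PRED"]

SAMPLE = [
    "1 Mary  mary mary NNP NNP _ _ 2 2 SBJ  SBJ  _ _       A0",
    "2 sold  sell sell VBD VBD _ _ 0 0 ROOT ROOT Y sell.01 _",
    "3 books book book NNS NNS _ _ 2 2 OBJ  OBJ  _ _       A1",
]


def semantic_frames(lines):
    """Return (predicate sense, [(argument label, form), ...]) per predicate."""
    rows = [line.split() for line in lines]
    # The first 14 columns are fixed; everything after them is an APRED column.
    tokens = [dict(zip(FIXED, row[:len(FIXED)]), APREDS=row[len(FIXED):])
              for row in rows]
    predicates = [t for t in tokens if t["PRED"] != "_"]
    frames = []
    for k, pred in enumerate(predicates):
        # APRED column k holds the arguments of the k-th predicate.
        args = [(t["APREDS"][k], t["FORM"]) for t in tokens if t["APREDS"][k] != "_"]
        frames.append((pred["PRED"], args))
    return frames
```

For the sample sentence, `semantic_frames(SAMPLE)` produces one frame for `sell.01` with the arguments A0 (Mary) and A1 (books).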

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

String

False

—

false

—

POSTagSet

String

False

—

false

—

includeHidden

Include hidden files and directories.

Boolean

True

—

false

—

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

String

False

—

false

—

patterns

String

False

—

true

—

readDependency

—

Boolean

True

—

false

—

readLemma

—

Boolean

True

—

false

—

readMorph

—

Boolean

True

—

false

—

readPOS

—

Boolean

True

—

false

—

readSemanticPredicate

—

Boolean

True

—

false

—

sourceEncoding

—

String

True

—

false

—

sourceLocation

Location from which the input is read.

String

False

—

false

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—

Conll2012Reader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reads a file in the CoNLL-2012 format.

Document ID - (ignored) This is a variation on the document filename.
Part number - (ignored) Some files are divided into multiple parts numbered as 000, 001, 002, ... etc.
Word number - (ignored)
Word itself - (document text) This is the token as segmented/tokenized in the Treebank. Initially the *_skel files contain the placeholder [WORD], which gets replaced by the actual token from the Treebank which is part of the OntoNotes release.
Part-of-Speech - (POS)
Parse bit - (Constituent) This is the bracketed structure broken before the first open parenthesis in the parse, and the word/part-of-speech leaf replaced with a *. The full parse can be created by substituting the asterisk with the "([pos] [word])" string (or leaf) and concatenating the items in the rows of that column.
Predicate lemma - (Lemma) The predicate lemma is mentioned for the rows for which we have semantic role information. All other rows are marked with a "-"
Predicate Frameset ID - (SemanticPredicate) This is the PropBank frameset ID of the predicate in Column 7.
Word sense - (ignored) This is the word sense of the word in Column 3.
Speaker/Author - (ignored) This is the speaker or author name where available. Mostly in Broadcast Conversation and Web Log data.
Named Entities - (NamedEntity) These columns identify the spans representing various named entities.
Predicate Arguments - (SemanticPredicate) There is one column each of predicate argument structure information for the predicate mentioned in Column 7.
Coreference - (CoreferenceChain) Coreference chain information encoded in a parenthesis structure.

Sentences are separated by a blank new line.
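The reconstruction of the full parse from the parse bit column, as described above, can be sketched in a few lines. The function name `rebuild_parse` and the three-token sample are hypothetical and serve only to illustrate the asterisk substitution; this is not the reader's own code.

```python
# Illustrative sketch only: names and sample data are assumptions, not DKPro Core API.
SAMPLE = [
    # (word, part-of-speech, parse bit)
    ("He", "PRP", "(TOP(S(NP*)"),
    ("runs", "VBZ", "(VP*)"),
    (".", ".", "*))"),
]


def rebuild_parse(rows):
    """Substitute each asterisk with "(pos word)" and concatenate the rows."""
    return "".join(
        # Each parse bit contains exactly one asterisk standing for the leaf.
        bit.replace("*", f"({pos} {word})", 1)
        for word, pos, bit in rows
    )
```

For the sample rows, `rebuild_parse(SAMPLE)` returns `(TOP(S(NP(PRP He))(VP(VBZ runs))(. .)))`.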

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

ConstituentMappingLocation

Load the constituent tag to UIMA type mapping from this location instead of locating the mapping automatically.

String

False

—

false

—

ConstituentTagSet

Use this constituent tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.

String

False

—

false

—

POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

String

False

—

false

—

POSTagSet

String

False

—

false

—

includeHidden

Include hidden files and directories.

Boolean

True

—

false

—

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Default: true

Boolean

False

—

false

—

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

String

False

—

false

—

patterns

String

False

—

true

—

readConstituent

—

Boolean

True

—

false

—

readCoreference

—

Boolean

True

—

false

—

readLemma

Disabled by default because CoNLL 2012 format does not include lemmata for all words, only for predicates.

Boolean

True

—

false

—

readNamedEntity

—

Boolean

True

—

false

—

readPOS

—

Boolean

True

—

false

—

readSemanticPredicate

—

Boolean

True

—

false

—

readWordSense

—

Boolean

True

—

false

—

sourceEncoding

—

String

True

—

false

—

sourceLocation

Location from which the input is read.

String

False

—

false

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—

useHeaderMetadata

Use the document ID declared in the file header instead of using the filename.

Boolean

True

—

false

—

writeTracesToText

—

Boolean

False

—

false

—

Entity Annotation Results Importer

Category: Reader
Framework: GATE
Version: unknown

Import judgments from a CrowdFlower job created by the Entity Annotation Job Builder as GATE annotations.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotateSpans

—

java.lang.Boolean

—

true

—

true

apiKey

—

java.lang.String

—

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

jobId

—

java.lang.Long

—

true

resultASName

—

java.lang.String

—

crowdResults

—

true

resultAnnotationType

—

java.lang.String

—

true

snippetASName

—

java.lang.String

—

true

snippetAnnotationType

—

java.lang.String

—

Sentence

—

true

tokenASName

—

java.lang.String

—

true

tokenAnnotationType

—

java.lang.String

—

Token

—

true

EuropePMC Open Access Reader

Category: Reader
Framework: NaCTeM (UIMA)
Version: 1.0

Reads open-access full-text articles from the Europe PMC web service

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

cacheSize

Size of the queue to store articles loaded preemptively.

Integer

False

—

false

—

ids

List of article ids (e.g. PMC4489390) from which to retrieve the full text. NOTE: This or 'query' must be set, but not both.

String

False

—

true

—

includeAbstract

—

Boolean

True

—

false

—

includeSubArticles

—

Boolean

True

—

false

—

includeTitle

—

Boolean

True

—

false

—

limit

Maximum number of full text articles to retrieve. NOTE: Only applies when 'query' is set.

Integer

False

—

false

—

numRetries

—

Integer

False

—

false

—

query

Query term used to retrieve full text articles. NOTE: This or 'ids' must be set, but not both.

String

False

—

false

—

recorderEnabled

—

Boolean

True

—

false

—

recorderJdbcUrl

—

String

False

—

false

—

recorderPassword

—

String

False

—

false

—

recorderUsername

—

String

False

—

false

—

retryOnError

—

Boolean

True

—

false

—

retrySeconds

—

Integer

False

—

false

—

sortByPublicationDate

Retrieve the most recently published articles first. NOTE: Only applies when 'query' is set.

Boolean

False

—

false

—

FSOVFileReader

Category: Reader
Framework: AlvisNLP
Version: 2012-04-30

Project-specific text file reader.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

bodySectionName

—

java.lang.String

True

—

charset

—

java.lang.String

True

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

linesLimit

—

java.lang.Integer

False

—

sizeLimit

—

java.lang.Integer

False

—

sourcePath

—

org.bibliome.util.streams.SourceStream

True

—

titleSectionName

—

java.lang.String

True

—

xmlDir

—

org.bibliome.util.files.InputDirectory

True

—

Fast Infoset Document Format

Category: Reader
Framework: GATE
Version: unknown

Format parser for GATE XML stored in the binary Fast Infoset format

GATE .cochrane.txt document format

Category: Reader
Framework: GATE
Version: unknown

Load this to allow the opening of Cochrane text documents, and choose the mime type "text/x-cochrane", or use the correct file extension.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

excludeFromFeatures

—

java.util.List

—

TI;AB

—

fieldPattern

—

java.lang.String

—

(?<CODE>[A-Z]+): (?<VALUE>.*)

—

fieldsForText

—

java.util.List

—

TI=title;ID=id;AU=authors;AB=abstract

—

ignorePattern

—

java.lang.String

—

GATE .pubMed.txt document format

Category: Reader
Framework: GATE
Version: unknown

Load this to allow the opening of PubMed text documents, and choose the mime type "text/x-pubmed", or use the correct file extension.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

excludeFromFeatures

—

java.util.List

—

TI;AB

—

fieldPattern

—

java.lang.String

—

(?<CODE>....)- (?<VALUE>.*)

—

fieldsForText

—

java.util.List

—

TI=title;PMID=id;AU=authors;AB=abstract

—

ignorePattern

—

java.lang.String

—

GATE DataSift JSON Document Format

Category: Reader
Framework: GATE
Version: unknown

Format parser for DataSift JSON files

GATE JSON Tweet Document Format

Category: Reader
Framework: GATE
Version: unknown

Format parser for Twitter JSON files

GateXMLReaderDescriptor

Category: Reader
Framework: ILSP (UIMA)
Version: 0.9

Reads GATE documents created with ILSP tools

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

InputDirectory

Directory of xml files to read in

String

False

—

false

—

InputEncoding

Character encoding for the documents. If not specified, the default system encoding will be used. Note that this parameter only applies if there is no CAS Initializer provided; otherwise, it is the CAS Initializer’s responsibility to deal with character encoding issues.

String

False

—

false

—

InputFile

Single file to be processed

String

False

—

false

—

StripExt

The file extension to strip from the original filenames. Only files with this extension will be processed by the reader.

String

False

—

false

—

GeniaJSONReader

Category: Reader
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

annotationsLayerName

—

java.lang.String

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

instanceIdFeature

—

java.lang.String

True

—

source

—

org.bibliome.util.streams.SourceStream

True

—

GeniaReader

Category: Reader
Framework: AlvisNLP
Version: 2012-04-30

Reads text files and their associated annotation files in BioNLP Shared Task format.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

aDir

—

org.bibliome.util.files.InputDirectory

False

—

active

—

alvisnlp.corpus.expressions.Expression

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

dependencyLabelFeatureName

—

java.lang.String

True

—

dependencyRelationName

—

java.lang.String

True

—

dependentRoleName

—

java.lang.String

True

—

entitiesLayerName

—

java.lang.String

False

—

equivalenceRelationName

—

java.lang.String

True

—

equivalenceRolePrefix

—

java.lang.String

True

—

headRoleName

—

java.lang.String

True

—

idFeatureKey

—

java.lang.String

False

—

layerNames

—

alvisnlp.module.types.Mapping

True

—

readA1

—

java.lang.Boolean

True

—

readA2

—

java.lang.Boolean

False

—

sectionName

—

java.lang.String

True

—

sourcePath

—

org.bibliome.util.streams.SourceStream

True

—

typeFeatureKey

—

java.lang.String

False

—

wordLayerName

—

java.lang.String

True

—

HtmlReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reads the contents of a given URL and strips the HTML. Returns only the textual contents.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

language

Set this as the language of the produced documents.

String

False

—

false

—

sourceEncoding

Name of configuration parameter that contains the character encoding used by the input files.

String

True

—

false

—

sourceLocation

URL from which the input is read.

String

True

—

false

—

I2B2Reader

Category: Reader
Framework: AlvisNLP
Version:

Reads files in the format of the I2B2 challenge.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

assertionFeature

—

java.lang.String

True

—

assertionsDir

—

org.bibliome.util.files.InputDirectory

False

—

conceptTypeFeature

—

java.lang.String

True

—

conceptsDir

—

org.bibliome.util.files.InputDirectory

False

—

conceptsLayerName

—

java.lang.String

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

leftRole

—

java.lang.String

True

—

linenoFeature

—

java.lang.String

True

—

linesLayerName

—

java.lang.String

True

—

relationsDir

—

org.bibliome.util.files.InputDirectory

False

—

rightRole

—

java.lang.String

True

—

sectionName

—

java.lang.String

True

—

textDir

—

org.bibliome.util.files.InputDirectory

True

—

tokenNumberFeature

—

java.lang.String

True

—

tokensLayerName

—

java.lang.String

True

—

ILSP File System Collection Reader

Category: Reader
Framework: ILSP (UIMA)
Version: 1.0

Reads files from the filesystem. This CollectionReader may be used with or without a CAS Initializer. If a CAS Initializer is supplied, it will be passed an InputStream to the file and must populate the CAS from that InputStream. If no CAS Initializer is supplied, this CollectionReader will read the file itself and treat the entire contents of the file as the document to be inserted into the CAS. Uses code from the Apache UIMA framework licensed under the ASF License.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

InputDirectory

Directory containing input files

String

True

—

false

—

InputEncoding

String

False

—

false

—

InputFile

Single file to be processed

String

False

—

false

—

InputLanguage

ISO language code for the documents

String

False

—

false

—

MaxSize

Input file allowed max size in KB.

Integer

False

—

false

—

ProcessParameters

Process parameters to be passed to an AE.

String

False

—

true

—

StripExt

The file extension to strip from the original filenames. Only files with this extension will be processed by the reader.

String

False

—

false

—

ImsCwbReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reads a tab-separated format including pseudo-XML tags.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

String

False

—

false

—

POSTagSet

Specify which tag set should be used to locate the mapping file.

String

False

—

false

—

generateNewIds

If true, the unit IDs are used only to detect if a new document (CAS) needs to be created, but for the purpose of setting the document ID, a new ID is generated. (Default: false)

Boolean

True

—

false

—

idIsUrl

If true, the unit text ID encoded in the corpus file is stored as the URI in the document meta data. This setting is not affected by #PARAM_GENERATE_NEW_IDS (Default: false)

Boolean

True

—

false

—

includeHidden

Include hidden files and directories.

Boolean

True

—

false

—

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

String

False

—

false

—

patterns

String

False

—

true

—

readLemma

Read lemmas.

Default: true

Boolean

True

—

false

—

readPOS

Read part-of-speech tags and generate POS annotations or subclasses if a #PARAM_POS_TAG_SET tag set or #PARAM_POS_MAPPING_LOCATION mapping file is used.

Default: true

Boolean

True

—

false

—

readSentence

Read sentences.

Default: true

Boolean

True

—

false

—

readToken

Read tokens and generate Token annotations.

Default: true

Boolean

True

—

false

—

replaceNonXml

Replace non-XML characters with spaces. (Default: true)

Boolean

True

—

false

—

sourceEncoding

—

String

True

—

false

—

sourceLocation

Location from which the input is read.

String

False

—

false

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—
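
The replaceNonXml option can be pictured roughly as follows. This is a minimal sketch, assuming that "non-XML characters" means code points outside the XML 1.0 Char production; the class and method names are hypothetical and this is not necessarily how DKPro Core implements the option.

```java
// Hypothetical helper illustrating one reading of replaceNonXml.
final class NonXmlSketch {
    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces every code point that is not a valid XML character with a space.
    static String replaceNonXml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (isXmlChar(cp)) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(' ');
            }
        });
        return sb.toString();
    }
}
```

Tabs, line feeds and carriage returns are valid XML characters and pass through unchanged; other control characters become spaces.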

Input Text Reader

Category: Reader
Framework: NaCTeM (UIMA)
Version: 1.0

Reads text supplied in a parameter. This component is useful if you want to quickly process a single document by simply copy-pasting its content.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

inputText

The text to be processed.

String

True

—

false

—

JdbcReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Collection reader for JDBC databases. The obtained data will be written into the CAS document text as well as into fields of the DocumentMetaData annotation.

The field names are available as constants and begin with CAS_. Please specify the mapping of the columns and the field names in the query. For example,

SELECT text AS cas_text, title AS cas_metadata_title FROM test_table

will create a CAS for each record, write the content of the "text" column into the CAS document text, and write that of the "title" column into the document title field of the DocumentMetaData annotation.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

connection

Specifies the URL to the database. If used with uimaFIT and the value is not given, jdbc:mysql://127.0.0.1/ will be taken.

String

True

—

false

—

database

Specifies name of the database to be accessed.

String

True

—

false

—

driver

Specifies the class name of the JDBC driver. If used with uimaFIT and the value is not given, com.mysql.jdbc.Driver will be taken.

String

True

—

false

—

language

Specifies the language.

String

False

—

false

—

password

Specifies the password for database access.

String

True

—

false

—

query

Specifies the query.

String

True

—

false

—

user

Specifies the user name for database access.

String

True

—

false

—
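
The column-to-field mapping expected in the JdbcReader query parameter can be assembled programmatically. The sketch below only builds the SQL string from the example in the description; the class and method names are hypothetical, the aliases cas_text and cas_metadata_title are taken from that example, and identifiers are not escaped, so it is not suitable for untrusted input.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper that builds a query with "column AS cas_field" aliases.
final class JdbcQuerySketch {
    // columnToCasField maps a table column to the alias the reader looks for.
    static String buildQuery(String table, Map<String, String> columnToCasField) {
        StringJoiner cols = new StringJoiner(", ");
        for (Map.Entry<String, String> e : columnToCasField.entrySet()) {
            cols.add(e.getKey() + " AS " + e.getValue());
        }
        return "SELECT " + cols + " FROM " + table;
    }
}
```

Passing an insertion-ordered map (for example a LinkedHashMap) with text mapped to cas_text and title mapped to cas_metadata_title reproduces the query shown in the description.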

KEA Corpus Importer

Category: Reader
Framework: GATE
Version: unknown

Imports a KEA-style corpus into GATE

LIBSVMReader

Category: Reader
Framework: NaCTeM (UIMA)
Version: 1.0

Reads a dataset in LIBSVM format

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

InputLIBSVMDataset

Folder containing SVM datasets (this can also be a single file)

String

True

—

false

—

LLLReader

Category: Reader
Framework: AlvisNLP
Version:

Read files and annotations in LLL format.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

agentFeatureName

—

java.lang.String

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

dependenciesRelationName

—

java.lang.String

True

—

dependencyLabelFeatureName

—

java.lang.String

True

—

dependentRole

—

java.lang.String

True

—

genicAgentRole

—

java.lang.String

True

—

genicInteractionRelationName

—

java.lang.String

True

—

genicTargetRole

—

java.lang.String

True

—

headRole

—

java.lang.String

True

—

idFeatureName

—

java.lang.String

True

—

lemmaFeatureName

—

java.lang.String

True

—

sectionName

—

java.lang.String

True

—

sentenceLayerName

—

java.lang.String

True

—

source

—

org.bibliome.util.streams.SourceStream

True

—

targetFeatureName

—

java.lang.String

True

—

wordLayerName

—

java.lang.String

True

—

MediaWiki Corpus Populater

Category: Reader
Framework: GATE
Version: unknown

Populate a corpus from a MediaWiki XML dump

MediaWiki Document Format

Category: Reader
Framework: GATE
Version: unknown

Document format for parsing MediaWiki markup

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

ignorableTags

—

java.util.Set

—

script;style

—

MediaWiki XML Document Format

Category: Reader
Framework: GATE
Version: unknown

Deprecated MediaWiki importer

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

ignorableTags

—

java.util.Set

—

script;style

—

Merge GENIA-coref with -term Collection Reader

Category: Reader
Framework: NaCTeM (UIMA)
Version: 1.0

Reads GENIA-coref files and GENIA-event/-term files and merges each pair into one CAS. Pre-conditions: - The two input directories must contain the same number of files, and the file names must be the same. - The texts in the two corresponding files must be the same.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

InputDirectory1

Directory containing input files

String

True

—

false

—

InputDirectory2

—

String

True

—

false

—

OutputLogFile

—

String

True

—

false

—

NegraExportReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

This CollectionReader reads a file formatted in the NEGRA export format. The texts and additional information, such as constituent structure, are reproduced in CASes, one CAS per text (article).

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

String

False

—

false

—

POSTagSet

String

False

—

false

—

collectionId

The collection ID to be written to the document meta data. (Default: none)

String

False

—

false

—

documentUnit

Indicates when a new CAS should be started. E.g., if set to ORIGIN_NAME, a new CAS is generated whenever the origin name of the current sentence differs from the origin name of the last sentence. (Default: ORIGIN_NAME)

String

True

—

false

—

generateNewIds

If true, the unit IDs are used only to detect if a new document (CAS) needs to be created, but for the purpose of setting the document ID, a new ID is generated. (Default: false)

Boolean

True

—

false

—

language

The language.

String

False

—

false

—

readLemma

Write lemma information.

Default: true

Boolean

True

—

false

—

readPOS

Write part-of-speech information.

Default: true

Boolean

True

—

false

—

readPennTree

Write Penn Treebank bracketed structure information. Note that this may not work with all tag sets, in particular not with those that contain "(" or ")" in their tags. The tree is generated using the original tag set in the corpus, not the mapped tag set!

Default: false

Boolean

True

—

false

—

sourceEncoding

Character encoding of the input data.

String

True

—

false

—

sourceLocation

Location from which the input is read.

String

True

—

false

—
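
The documentUnit rule for ORIGIN_NAME (start a new CAS whenever the origin name changes between consecutive sentences) can be sketched as below. This is an illustration under simplifying assumptions, not the reader's code: the class and method names are hypothetical, and each sentence is reduced to a two-element array holding its origin name and its text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; each sentence is {originName, text}.
final class DocumentUnitSketch {
    // Groups consecutive sentences into documents, starting a new document
    // whenever the origin name differs from that of the previous sentence.
    static List<List<String>> splitByOrigin(List<String[]> sentences) {
        List<List<String>> docs = new ArrayList<>();
        String lastOrigin = null;
        List<String> current = null;
        for (String[] s : sentences) {
            if (current == null || !s[0].equals(lastOrigin)) {
                current = new ArrayList<>();
                docs.add(current);
                lastOrigin = s[0];
            }
            current.add(s[1]);
        }
        return docs;
    }
}
```

Because only consecutive sentences are compared, an origin name that reappears later in the file starts another document in this sketch.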

OBOReader

Category: Reader
Framework: AlvisNLP
Version:

Reads terms in OBO files as documents.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

ancestorsFeature

—

java.lang.String

False

—

childrenFeature

—

java.lang.String

False

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

excludeOBOBuiltins

—

java.lang.Boolean

True

—

idPrefix

—

java.lang.String

True

—

nameSectionName

—

java.lang.String

True

—

oboFiles

—

java.lang.String[]

True

—

parentFeature

—

java.lang.String

True

—

pathFeature

—

java.lang.String

True

—

synonymSectionName

—

java.lang.String

True

—

PdfReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Collection reader for PDF files. Uses simple heuristics to detect headings and paragraphs.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

endPage

The last page to be extracted from the PDF.

Integer

False

—

false

—

headingType

The type used to annotate headings.

String

False

—

false

—

includeHidden

Include hidden files and directories.

Boolean

True

—

false

—

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

String

False

—

false

—

paragraphType

The type used to annotate paragraphs.

String

False

—

false

—

patterns

String

False

—

true

—

sourceLocation

Location from which the input is read.

String

False

—

false

—

startPage

The first page to be extracted from the PDF.

Integer

False

—

false

—

substitutionTableLocation

The location of the substitution table used to post-process the text extracted from the PDF, e.g. to convert ligatures to separate characters.

String

False

—

false

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—

PennTreebankChunkedReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Penn Treebank chunked format reader.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

String

False

—

false

—

POSTagSet

String

False

—

false

—

includeHidden

Include hidden files and directories.

Boolean

True

—

false

—

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

String

False

—

false

—

patterns

String

False

—

true

—

readChunk

Write chunk annotations to the CAS.

Boolean

True

—

false

—

readPOS

Write part-of-speech annotations to the CAS.

Boolean

True

—

false

—

readSentence

Write sentence annotations to the CAS.

Boolean

True

—

false

—

readToken

Write token annotations to the CAS.

Boolean

True

—

false

—

sourceEncoding

Character encoding of the input data.

String

True

—

false

—

sourceLocation

Location from which the input is read.

String

False

—

false

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—

PennTreebankCombinedReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Penn Treebank combined format reader.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

ConstituentMappingLocation

Load the constituent tag to UIMA type mapping from this location instead of locating the mapping automatically.

String

False

—

false

—

ConstituentTagSet

String

False

—

false

—

POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

String

False

—

false

—

POSTagSet

String

False

—

false

—

includeHidden

Include hidden files and directories.

Boolean

True

—

false

—

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Default: true

Boolean

False

—

false

—

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

String

False

—

false

—

patterns

String

False

—

true

—

readPOS

Sets whether to create or not to create POS tags. The creation of constituent tags must be turned on for this to work.

Default: true

Boolean

True

—

false

—

removeTraces

—

Boolean

False

—

false

—

sourceEncoding

Name of configuration parameter that contains the character encoding used by the input files.

String

True

—

false

—

sourceLocation

Location from which the input is read.

String

False

—

false

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—

writeTracesToText

—

Boolean

False

—

false

—
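
The effect that internTags relies on can be seen with plain Java strings. The sketch below counts how many distinct String objects a list of tags holds, with and without String#intern(); the class and method names are hypothetical and the code is independent of DKPro Core.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical helper demonstrating what interning does to duplicate tags.
final class InternTagsSketch {
    // Counts distinct String objects (by identity, not by equals).
    static int distinctInstances(List<String> tags, boolean intern) {
        Set<String> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (String t : tags) {
            // intern() returns the single canonical instance for equal strings.
            seen.add(intern ? t.intern() : t);
        }
        return seen.size();
    }
}
```

For a list holding three separately allocated "NN" strings and one "VB" string, the count is 4 without interning and 2 with it.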

PubMed Abstract Reader

Category: Reader
Framework: NaCTeM (UIMA)
Version: 1.0

Fetches PubMed abstracts from NaCTeM's Kleio service.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

PubMedIDs

A list of PubMed IDs. Any format is accepted as long as IDs are separated by non-numerical characters.

String

True

—

false

—
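
The PubMedIDs format rule (any format, as long as IDs are separated by non-numerical characters) amounts to splitting on runs of non-digits. A minimal sketch, with hypothetical class and method names and not taken from the component itself:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that extracts IDs from a free-form list.
final class PubMedIdsSketch {
    // Splits on runs of non-digit characters and drops empty fragments.
    static List<String> parse(String raw) {
        List<String> ids = new ArrayList<>();
        for (String part : raw.split("\\D+")) {
            if (!part.isEmpty()) {
                ids.add(part);
            }
        }
        return ids;
    }
}
```

Commas, spaces, semicolons, line breaks or labels such as "PMID:" all act as separators in this sketch.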

PubTatorReader

Category: Reader
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

classFeature

—

java.lang.String

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

offsetFeature

—

java.lang.String

True

—

sourcePath

—

org.bibliome.util.streams.SourceStream

True

—

typeFeature

—

java.lang.String

True

—

RDF Reader

Category: Reader
Framework: NaCTeM (UIMA)
Version: 1.0

Reads Common Annotation Structures (CASes) from RDF-encoded files. The files have presumably been written with RDF Writer.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

inputFileOrFolder

A file or folder where RDF-encoded Common Annotation Structures will be read from.

String

True

—

false

—

RTFReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Read RTF (Rich Text Format) files. Uses RTFEditorKit for parsing RTF.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

includeHidden

Include hidden files and directories.

Boolean

True

—

false

—

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

String

False

—

false

—

patterns

String

False

—

true

—

sourceLocation

Location from which the input is read.

String

False

—

false

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—

Reuters21578SgmlReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Read a Reuters-21578 corpus in SGML format.

Set the directory that contains the SGML files with #PARAM_SOURCE_LOCATION.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

sourceLocation

The directory that contains the Reuters-21578 SGML files.

String

True

—

false

—

Reuters21578TxtReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Read a Reuters-21578 corpus that has been transformed into text format using ExtractReuters in the lucene-benchmarks project.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

sourceLocation

The directory that contains the Reuters-21578 text files, named according to the pattern #FILE_PATTERN.

String

True

—

false

—

SFTP Document Reader

Category: Reader
Framework: NaCTeM (UIMA)
Version: 1.0

Reads plain-text documents from a remote directory on a user-specified server via SFTP.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

Password

—

String

True

—

false

—

RemoteDirectory

—

String

True

—

false

—

ServerURL

—

String

True

—

false

—

Username

—

String

True

—

false

—

SFTP XMI Reader

Category: Reader
Framework: NaCTeM (UIMA)
Version: 1.0

Reads an XMI-formatted corpus from an SFTP-enabled server.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

Password

—

String

True

—

false

—

RemoteDirectory

—

String

True

—

false

—

ServerURL

—

String

True

—

false

—

Username

—

String

True

—

false

—

SerializedCasReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

includeHidden

Include hidden files and directories.

Boolean

True

—

false

—

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

String

False

—

false

—

patterns

String

False

—

true

—

sourceLocation

Location from which the input is read.

String

False

—

false

—

typeSystemLocation

The file from which to obtain the type system if it is not embedded in the serialized CAS.

String

False

—

false

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—

Shared Task 2004 Reader

Category: Reader
Framework: NaCTeM (UIMA)
Version: 0.0.1-SNAPSHOT

Reads training or evaluation data from the BioNLP/NLPBA 2004 Bio-Entity Recognition Task

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

limit

—

Integer

False

—

false

—

readTrainingData

True if training data is to be read, otherwise evaluation data will be read

Boolean

True

—

false

—

StringReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Simple reader that generates a CAS from a String. This can be useful in situations where a reader is preferred over manually crafting a CAS using JCasFactory#createJCas().

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

collectionId

The collection ID to set in the DocumentMetaData.

String

True

—

false

—

documentBaseUri

The document base URI to set in the DocumentMetaData.

String

False

—

false

—

documentId

The document ID to set in the DocumentMetaData.

String

True

—

false

—

documentText

The document text.

String

True

—

false

—

documentUri

The document URI to set in the DocumentMetaData.

String

True

—

false

—

language

Set this as the language of the produced documents.

String

True

—

false

—

TSV Reader

Category: Reader
Framework: NaCTeM (UIMA)
Version: 1.0

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

InputFile

A tab-separated-value file containing the columns "#URI", "#type", and feature names appropriate for the types.

String

True

—

false

—

TabularReader

Category: Reader
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

addToLayer

—

java.lang.Boolean

False

—

checkNumColumns

—

java.lang.Integer

False

—

commitLines

—

java.lang.Boolean

False

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

createAnnotations

—

java.lang.Boolean

False

—

createDocuments

—

java.lang.Boolean

False

—

createRelations

—

java.lang.Boolean

False

—

createSections

—

java.lang.Boolean

False

—

createTuples

—

java.lang.Boolean

False

—

deleteElements

—

java.lang.Boolean

False

—

lineActions

—

alvisnlp.corpus.expressions.Expression[]

True

—

removeFromLayer

—

java.lang.Boolean

False

—

setArguments

—

java.lang.Boolean

False

—

setFeatures

—

java.lang.Boolean

False

—

skipBlank

—

java.lang.Boolean

False

—

source

—

org.bibliome.util.streams.SourceStream

True

—

sourceElement

—

alvisnlp.corpus.expressions.Expression

True

—

trimColumns

—

java.lang.Boolean

True

—

TcfReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reader for the WebLicht TCF format. It reads all the available annotation layers from the TCF file and converts them to CAS annotations. The TCF data does not have begin/end offsets for all of its annotations, which are required for CAS annotations. Hence, offsets are calculated per token and stored in a map (token_id, token (CAS object)), from which the offsets can later be retrieved via the token.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

includeHidden

Include hidden files and directories.

Boolean

True

—

false

—

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

String

False

—

false

—

patterns

String

False

—

true

—

sourceLocation

Location from which the input is read.

String

False

—

false

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—
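
The offset bookkeeping mentioned in the TcfReader description (offsets calculated per token and stored in a map keyed by token ID) can be sketched as follows. This is only an illustration of the idea, not the reader's actual algorithm: the class and method names are hypothetical, and it assumes that the tokens occur in the text in document order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that derives begin/end offsets for tokens without them.
final class TokenOffsetSketch {
    // tokensById must iterate in document order (e.g. a LinkedHashMap).
    // Returns token ID -> {begin, end} offsets into text.
    static Map<String, int[]> offsets(Map<String, String> tokensById, String text) {
        Map<String, int[]> result = new LinkedHashMap<>();
        int cursor = 0;
        for (Map.Entry<String, String> e : tokensById.entrySet()) {
            // Search forward from the end of the previous token.
            int begin = text.indexOf(e.getValue(), cursor);
            if (begin < 0) {
                throw new IllegalArgumentException("Token not found: " + e.getValue());
            }
            int end = begin + e.getValue().length();
            result.put(e.getKey(), new int[] {begin, end});
            cursor = end;
        }
        return result;
    }
}
```

Annotations on other layers that refer to a token ID can then look up their offsets in the returned map.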

TeiReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reader for the TEI XML.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

String

False

—

false

—

POSTagSet

String

False

—

false

—

includeHidden

Include hidden files and directories.

Boolean

True

—

false

—

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

String

False

—

false

—

omitIgnorableWhitespace

Do not write ignorable whitespace from the XML file to the CAS.

Boolean

True

—

false

—

patterns

String

False

—

true

—

readConstituent

Write constituent annotations to the CAS.

Boolean

True

—

false

—

readLemma

Write lemma annotations to the CAS.

Boolean

True

—

false

—

readNamedEntity

Write named entity annotations to the CAS.

Boolean

True

—

false

—

readPOS

Write part-of-speech annotations to the CAS.

Boolean

True

—

false

—

readParagraph

Write paragraph annotations to the CAS.

Boolean

True

—

false

—

readSentence

Write sentence annotations to the CAS.

Boolean

True

—

false

—

readToken

Write token annotations to the CAS.

Boolean

True

—

false

—

sourceLocation

Location from which the input is read.

String

False

—

false

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—

useFilenameId

When not using the XML ID, use only the filename instead of the whole URL as ID. Mind that the filenames should be unique in this case.

Boolean

True

—

false

—

useXmlId

Use the xml:id attribute on the TEI elements as document ID. Mind that many TEI files may not have this attribute on all TEI elements and you may end up with no document ID at all. Also mind that the IDs should be unique.

Boolean

True

—

false

—

utterancesAsSentences

Interpret utterances "u" as sentences "s". (EXPERIMENTAL)

Boolean

True

—

false

—
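
The difference that useFilenameId makes (only the filename instead of the whole URL as document ID) can be illustrated with a short sketch. The class and method names are hypothetical, and whether the real reader keeps the file extension in the ID is an assumption here.

```java
// Hypothetical helper that derives a document ID from a document URL.
final class FilenameIdSketch {
    // Returns the part after the last '/', or the input if it has no '/'.
    static String filenameId(String url) {
        int slash = url.lastIndexOf('/');
        return slash >= 0 ? url.substring(slash + 1) : url;
    }
}
```

Two files with the same name in different directories would get the same ID this way, which is why the filenames should be unique.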

TextFileReader

Category: Reader
Framework: AlvisNLP
Version: 2010-10-28

Reads files and adds a document in the corpus for each file.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

baseNameId

—

java.lang.Boolean

False

—

charset

—

java.lang.String

True

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

linesLimit

—

java.lang.Integer

False

—

sectionName

—

java.lang.String

True

—

sizeLimit

—

java.lang.Integer

False

—

sourcePath

—

org.bibliome.util.streams.SourceStream

True

—

TextReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

UIMA collection reader for plain text files.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

includeHidden

Include hidden files and directories.

Boolean

True

—

false

—

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

String

False

—

false

—

patterns

String

False

—

true

—

sourceEncoding

Name of configuration parameter that contains the character encoding used by the input files.

String

True

—

false

—

sourceLocation

Location from which the input is read.

String

False

—

false

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—

TigerXmlReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

UIMA collection reader for TIGER-XML files. Also supports the augmented format used in the Semeval 2010 task which includes semantic role data.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

String

False

—

false

—

POSTagSet

String

False

—

false

—

ignoreIllegalSentences

If a sentence has an illegal structure (e.g. TIGER 2.0 has non-terminal nodes that do not have child nodes), then just ignore these sentences.

Default: false

Boolean

True

—

false

—

includeHidden

Include hidden files and directories.

Boolean

True

—

false

—

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

String

False

—

false

—

patterns

String

False

—

true

—

readPennTree

Default: false

Boolean

True

—

false

—

sourceLocation

Location from which the input is read.

String

False

—

false

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—

TreeTaggerReader

Category: Reader
Framework: AlvisNLP
Version: 2010-10-28

Reads files in tree-tagger output format and creates a document for each file read.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

charset

—

java.lang.String

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

lemmaFeatureKey

—

java.lang.String

False

—

posFeatureKey

—

java.lang.String

False

—

sectionName

—

java.lang.String

True

—

sentenceLayerName

—

java.lang.String

True

—

sourcePath

—

org.bibliome.util.streams.SourceStream

True

—

wordLayerName

—

java.lang.String

True

—

TueppReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

UIMA collection reader for Tübingen Partially Parsed Corpus of Written German (TüPP-D/Z) XML files.

Only the part-of-speech with the best rank (rank 1) is read; if there is a tie between multiple tags, the first one from the XML file is read.
Only the first lemma (baseform) from the XML file is read.
Tokens are read, but not the specific kind of token (e.g. TEL, AREA, etc.).
Article boundaries are not read.
Paragraph boundaries are not read.
Lemma information is read, but morphological information is not read.
Chunk, field, and clause information is not read.
Meta data headers are not read.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

String

False

—

false

—

POSTagSet

String

False

—

false

—

includeHidden

Include hidden files and directories.

Boolean

True

—

false

—

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

String

False

—

false

—

patterns

String

False

—

true

—

sourceEncoding

Character encoding of the input data.

String

True

—

false

—

sourceLocation

Location from which the input is read.

String

False

—

false

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—
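
The tag selection rule from the TueppReader description (best rank wins, and the first tag in file order wins a tie) can be sketched as below. The class and method names are hypothetical, and the candidates are reduced to two parallel lists in file order; this is not the reader's own code.

```java
import java.util.List;

// Hypothetical helper that picks one tag from ranked candidates.
final class BestRankSketch {
    // tags and ranks are parallel lists in file order; rank 1 is the best rank.
    static String bestTag(List<String> tags, List<Integer> ranks) {
        String best = null;
        int bestRank = Integer.MAX_VALUE;
        for (int i = 0; i < tags.size(); i++) {
            // The strict '<' keeps the earlier tag when two ranks are equal.
            if (ranks.get(i) < bestRank) {
                bestRank = ranks.get(i);
                best = tags.get(i);
            }
        }
        return best;
    }
}
```

For the candidates ADJA (rank 2), NN (rank 1) and NE (rank 1), the sketch returns NN.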

Twitter Collection Reader

Category: Reader
Framework: NaCTeM (UIMA)
Version:

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

count

The number of tweets to return per page, up to a maximum of 100. Defaults to 15. Example Values: 100

Integer

False

—

false

—

debugEnabled

—

Boolean

False

—

false

—

geoCode

Returns tweets by users located within a given radius of the given latitude/longitude. The location is preferentially taken from the Geotagging API, but will fall back to their Twitter profile. The parameter value is specified by 'latitude,longitude,radius', where radius units must be specified as either 'mi' (miles) or 'km' (kilometers). Note that you cannot use the near operator via the API to geocode arbitrary locations; however, you can use this geocode parameter to search near geocodes directly. A maximum of 1,000 distinct 'sub-regions' will be considered when using the radius modifier. Example Values: 37.781157,-122.398720,1mi

Float

False

—

false

—

lang

Restricts tweets to the given language, given by an ISO 639-1 code. Language detection is best-effort. Example Values: eu

String

False

—

false

—

locale

Specify the language of the query you are sending (only ja is currently effective). This is intended for language-specific consumers and the default should work in the majority of cases. Example Values: ja

String

False

—

false

—

oAuthAccessToken

—

String

False

—

false

—

oAuthAccessTokenSecret

—

String

False

—

false

—

oAuthConsumerKey

—

String

False

—

false

—

oAuthConsumerSecret

—

String

False

—

false

—

query

A UTF-8, URL-encoded search query of 1,000 characters maximum, including operators. Queries may additionally be limited by complexity. Example Values: @noradio

String

True

—

false

—

resultType

Specifies what type of search results you would prefer to receive. The current default is 'mixed'. Valid values include: mixed (include both popular and real-time results in the response), recent (return only the most recent results in the response), popular (return only the most popular results in the response). Example Values: mixed, recent, popular

String

False

—

false

—

sinceId

Returns results with an ID greater than (that is, more recent than) the specified ID. There are limits to the number of Tweets which can be accessed through the API. If the limit of Tweets has occurred since the since_id, the since_id will be forced to the oldest ID available. Example Values: 12345

String

False

—

false

—

totalCount

The total number of tweets to return. Defaults to 1000. Example Values: 500

Integer

False

—

false

—
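
The 'latitude,longitude,radius' value expected by geoCode can be assembled as in the sketch below. The class and method names are hypothetical; the unit check follows the description ('mi' or 'km'), and the use of six decimal places is an assumption modelled on the example value.

```java
import java.util.Locale;

// Hypothetical helper that formats a geoCode parameter value.
final class GeoCodeSketch {
    // Builds "latitude,longitude,radius" with the radius unit appended.
    static String geoCode(double latitude, double longitude, int radius, String unit) {
        if (!unit.equals("mi") && !unit.equals("km")) {
            throw new IllegalArgumentException("Radius unit must be 'mi' or 'km': " + unit);
        }
        // Locale.ROOT keeps '.' as the decimal separator on every system.
        return String.format(Locale.ROOT, "%.6f,%.6f,%d%s", latitude, longitude, radius, unit);
    }
}
```

Calling geoCode(37.781157, -122.398720, 1, "mi") yields the example value from the description.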

Twitter Corpus Populator

Category: Reader
Framework: GATE
Version: unknown

Populate a corpus from Twitter JSON containing multiple Tweets

WebOfKnowledgeReader

Category: Reader
Framework: AlvisNLP
Version:

Reads Web of Knowledge search result import files.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

source

—

org.bibliome.util.streams.SourceStream

True

—

tabularFormat

—

java.lang.Boolean

False

—

WikipediaArticleInfoReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reads all general article infos without retrieving the whole Page objects

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

CreateDBAnno

Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.

Boolean

True

—

false

—

Database

The name of the database.

String

True

—

false

—

Host

The host server.

String

True

—

false

—

Language

The language of the Wikipedia that should be connected to.

String

True

—

false

—

Password

The password of the database account.

String

True

—

false

—

User

The username of the database account.

String

True

—

false

—

WikipediaArticleReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reads all article pages. A parameter controls whether the full article or only the first paragraph is set as the document text. Redirects, disambiguation pages, and discussion pages are not considered.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

CreateDBAnno

Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.

Boolean

True

—

false

—

Database

The name of the database.

String

True

—

false

—

Host

The host server.

String

True

—

false

—

Language

The language of the Wikipedia that should be connected to.

String

True

—

false

—

OnlyFirstParagraph

If set to true, only the first paragraph instead of the whole article is used.

Boolean

True

—

false

—

OutputPlainText

Whether the reader outputs plain text or wiki markup.

Boolean

True

—

false

—

PageBuffer

The page buffer size (#pages) of the page iterator.

Integer

True

—

false

—

PageIdFromArray

Defines an array of page ids of the pages that should be retrieved. (Optional)

String

False

—

true

—

PageIdsFromFile

Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. (Optional)

String

False

—

false

—

PageTitleFromFile

Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. (Optional)

String

False

—

false

—

PageTitlesFromArray

Defines an array of page titles of the pages that should be retrieved. (Optional)

String

False

—

true

—

Password

The password of the database account.

String

True

—

false

—

User

The username of the database account.

String

True

—

false

—

WikipediaDiscussionReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reads all discussion pages.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

CreateDBAnno

Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.

Boolean

True

—

false

—

Database

The name of the database.

String

True

—

false

—

Host

The host server.

String

True

—

false

—

Language

The language of the Wikipedia that should be connected to.

String

True

—

false

—

OutputPlainText

Whether the reader outputs plain text or wiki markup.

Boolean

True

—

false

—

PageBuffer

The page buffer size (#pages) of the page iterator.

Integer

True

—

false

—

PageIdFromArray

Defines an array of page ids of the pages that should be retrieved. (Optional)

String

False

—

true

—

PageIdsFromFile

Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. (Optional)

String

False

—

false

—

PageTitleFromFile

Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. (Optional)

String

False

—

false

—

PageTitlesFromArray

Defines an array of page titles of the pages that should be retrieved. (Optional)

String

False

—

true

—

Password

The password of the database account.

String

True

—

false

—

User

The username of the database account.

String

True

—

false

—

WikipediaLinkReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Read links from Wikipedia.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

AllowedLinkTypes

Which types of links are allowed?

String

True

—

true

—

CreateDBAnno

Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.

Boolean

True

—

false

—

Database

The name of the database.

String

True

—

false

—

Host

The host server.

String

True

—

false

—

Language

The language of the Wikipedia that should be connected to.

String

True

—

false

—

OutputPlainText

Whether the reader outputs plain text or wiki markup.

Boolean

True

—

false

—

PageBuffer

The page buffer size (#pages) of the page iterator.

Integer

True

—

false

—

PageIdFromArray

Defines an array of page ids of the pages that should be retrieved. (Optional)

String

False

—

true

—

PageIdsFromFile

Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. (Optional)

String

False

—

false

—

PageTitleFromFile

Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. (Optional)

String

False

—

false

—

PageTitlesFromArray

Defines an array of page titles of the pages that should be retrieved. (Optional)

String

False

—

true

—

Password

The password of the database account.

String

True

—

false

—

User

The username of the database account.

String

True

—

false

—

WikipediaPageReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reads all Wikipedia pages in the database (articles, discussions, etc.). A parameter controls whether the full article or only the first paragraph is set as the document text. Redirects and disambiguation pages are not considered.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

CreateDBAnno

Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.

Boolean

True

—

false

—

Database

The name of the database.

String

True

—

false

—

Host

The host server.

String

True

—

false

—

Language

The language of the Wikipedia that should be connected to.

String

True

—

false

—

OnlyFirstParagraph

If set to true, only the first paragraph instead of the whole article is used.

Boolean

True

—

false

—

OutputPlainText

Whether the reader outputs plain text or wiki markup.

Boolean

True

—

false

—

PageBuffer

The page buffer size (#pages) of the page iterator.

Integer

True

—

false

—

PageIdFromArray

Defines an array of page ids of the pages that should be retrieved. (Optional)

String

False

—

true

—

PageIdsFromFile

Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. (Optional)

String

False

—

false

—

PageTitleFromFile

Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. (Optional)

String

False

—

false

—

PageTitlesFromArray

Defines an array of page titles of the pages that should be retrieved. (Optional)

String

False

—

true

—

Password

The password of the database account.

String

True

—

false

—

User

The username of the database account.

String

True

—

false

—

WikipediaQueryReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reads all article pages that match a query created by the numerous parameters of this class.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

CreateDBAnno

Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.

Boolean

True

—

false

—

Database

The name of the database.

String

True

—

false

—

Host

The host server.

String

True

—

false

—

Language

The language of the Wikipedia that should be connected to.

String

True

—

false

—

MaxCategories

Maximum number of categories. Articles with a higher number of categories will not be returned by the query.

Integer

False

—

false

—

MaxInlinks

Maximum number of incoming links. Articles with a higher number of incoming links will not be returned by the query.

Integer

False

—

false

—

MaxOutlinks

Maximum number of outgoing links. Articles with a higher number of outgoing links will not be returned by the query.

Integer

False

—

false

—

MaxRedirects

Maximum number of redirects. Articles with a higher number of redirects will not be returned by the query.

Integer

False

—

false

—

MaxTokens

Maximum number of tokens. Articles with a higher number of tokens will not be returned by the query.

Integer

False

—

false

—

MinCategories

Minimum number of categories. Articles with a lower number of categories will not be returned by the query.

Integer

False

—

false

—

MinInlinks

Minimum number of incoming links. Articles with a lower number of incoming links will not be returned by the query.

Integer

False

—

false

—

MinOutlinks

Minimum number of outgoing links. Articles with a lower number of outgoing links will not be returned by the query.

Integer

False

—

false

—

MinRedirects

Minimum number of redirects. Articles with a lower number of redirects will not be returned by the query.

Integer

False

—

false

—

MinTokens

Minimum number of tokens. Articles with a lower number of tokens will not be returned by the query.

Integer

False

—

false

—

OnlyFirstParagraph

If set to true, only the first paragraph instead of the whole article is used.

Boolean

True

—

false

—

OutputPlainText

Whether the reader outputs plain text or wiki markup.

Boolean

True

—

false

—

PageBuffer

The page buffer size (#pages) of the page iterator.

Integer

True

—

false

—

PageIdFromArray

Defines an array of page ids of the pages that should be retrieved. (Optional)

String

False

—

true

—

PageIdsFromFile

Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. (Optional)

String

False

—

false

—

PageTitleFromFile

Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. (Optional)

String

False

—

false

—

PageTitlesFromArray

Defines an array of page titles of the pages that should be retrieved. (Optional)

String

False

—

true

—

Password

The password of the database account.

String

True

—

false

—

TitlePattern

SQL-style title pattern. Only articles that match the pattern will be returned by the query.

String

False

—

false

—

User

The username of the database account.

String

True

—

false

—
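
The Min/Max parameters and `TitlePattern` above act as filters on the set of articles. The following is a rough sketch of that filtering logic in plain Python, not the reader's actual implementation: the function names and the in-memory article records are hypothetical, only the token bounds are shown, and the translation of the SQL-style pattern assumes the usual `LIKE` wildcards (`%` for any run of characters, `_` for a single character).

```python
import re


def like_to_regex(like_pattern):
    """Translate an SQL LIKE pattern into a compiled regular expression."""
    pieces = []
    for ch in like_pattern:
        if ch == "%":
            pieces.append(".*")   # any run of characters
        elif ch == "_":
            pieces.append(".")    # exactly one character
        else:
            pieces.append(re.escape(ch))
    return re.compile("".join(pieces), re.DOTALL)


def within(value, minimum=None, maximum=None):
    """True if value respects both bounds; a bound of None is not applied."""
    return (minimum is None or value >= minimum) and (
        maximum is None or value <= maximum
    )


def select_articles(articles, title_pattern=None, min_tokens=None, max_tokens=None):
    """Return the titles of articles passing the title and token-count filters."""
    regex = like_to_regex(title_pattern) if title_pattern is not None else None
    return [
        a["title"]
        for a in articles
        if (regex is None or regex.fullmatch(a["title"]))
        and within(a["tokens"], min_tokens, max_tokens)
    ]


# Hypothetical records standing in for database rows.
articles = [
    {"title": "Parsing", "tokens": 1200},
    {"title": "Parser combinator", "tokens": 300},
    {"title": "Tokenization", "tokens": 800},
]
print(select_articles(articles, title_pattern="Pars%", min_tokens=500))
```

The same `within` check would apply to categories, inlinks, outlinks and redirects. Case sensitivity of the title match depends on the database and is not modelled here.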

WikipediaRevisionPairReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reads pairs of adjacent revisions of all articles.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

CreateDBAnno

Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.

Boolean

True

—

false

—

Database

The name of the database.

String

True

—

false

—

Host

The host server.

String

True

—

false

—

Language

The language of the Wikipedia that should be connected to.

String

True

—

false

—

MaxChange

Restrict revision pairs to cases where the lengths of the two revisions do not differ by more than this value (counted in characters).

Integer

True

—

false

—

MinChange

Restrict revision pairs to cases where the lengths of the two revisions differ by more than this value (counted in characters).

Integer

True

—

false

—

OutputPlainText

Whether the reader outputs plain text or wiki markup.

Boolean

True

—

false

—

PageBuffer

The page buffer size (#pages) of the page iterator.

Integer

True

—

false

—

Password

The password of the database account.

String

True

—

false

—

RevisionIdFromArray

Defines an array of revision ids of the revisions that should be retrieved. (Optional)

String

False

—

true

—

RevisionIdsFromFile

Defines the path to a file containing a line-separated list of revision ids of the revisions that should be retrieved. (Optional)

String

False

—

false

—

SkipFirstNPairs

The number of revision pairs that should be skipped in the beginning.

Integer

False

—

false

—

User

The username of the database account.

String

True

—

false

—
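
`MinChange` and `MaxChange` above restrict revision pairs by how much the text length changes between two adjacent revisions. A minimal sketch of that rule follows, assuming revisions are available as plain strings in chronological order. The function names are hypothetical, and reading the two descriptions as a strict lower bound and an inclusive upper bound is an interpretation.

```python
def adjacent_pairs(revisions):
    """Pair each revision with its immediate successor."""
    return list(zip(revisions, revisions[1:]))


def filter_pairs(revisions, min_change, max_change):
    """Keep pairs whose length difference d satisfies min_change < d <= max_change."""
    kept = []
    for older, newer in adjacent_pairs(revisions):
        difference = abs(len(newer) - len(older))  # counted in characters
        if min_change < difference <= max_change:
            kept.append((older, newer))
    return kept


# Hypothetical revision texts with length differences of 1, 6 and 1.
revisions = ["abc", "abcd", "abcdefghij", "abcdefghijk"]
print(filter_pairs(revisions, min_change=0, max_change=5))
```

With these bounds, only the two pairs that differ by a single character are kept and the larger edit in the middle is dropped.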

WikipediaRevisionReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reads Wikipedia page revisions.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

CreateDBAnno

Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.

Boolean

True

—

false

—

Database

The name of the database.

String

True

—

false

—

Host

The host server.

String

True

—

false

—

Language

The language of the Wikipedia that should be connected to.

String

True

—

false

—

OutputPlainText

Whether the reader outputs plain text or wiki markup.

Boolean

True

—

false

—

PageBuffer

The page buffer size (#pages) of the page iterator.

Integer

True

—

false

—

Password

The password of the database account.

String

True

—

false

—

RevisionIdFromArray

Defines an array of revision ids of the revisions that should be retrieved. (Optional)

String

False

—

true

—

RevisionIdsFromFile

Defines the path to a file containing a line-separated list of revision ids of the revisions that should be retrieved. (Optional)

String

False

—

false

—

User

The username of the database account.

String

True

—

false

—

WikipediaTemplateFilteredArticleReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reads all pages that contain or do not contain the templates specified in the template whitelist and template blacklist.

It is possible to define only a whitelist or only a blacklist. If both a whitelist and a blacklist are provided, the articles chosen are those that DO contain the templates from the whitelist and at the same time DO NOT contain the templates from the blacklist (= the intersection of the "whitelist page set" and the "blacklist page set").

This reader only works if template tables have been generated for the JWPL database using the WikipediaTemplateInfoGenerator.

NOTE: This reader directly extends the WikipediaReaderBase and not the WikipediaStandardReaderBase
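
The whitelist/blacklist selection described above can be sketched as a small set filter. This is only an illustration over an in-memory mapping, not the reader's database-backed implementation: `select_pages` and its arguments are hypothetical names, and the `exact` flag mirrors the reader's `ExactTemplateMatching` parameter (exact name match versus prefix match).

```python
def select_pages(page_templates, whitelist=None, blacklist=None, exact=True):
    """Return pages that contain a whitelisted template and no blacklisted one.

    page_templates maps a page title to the set of template names it uses.
    An empty or missing list is simply not applied.
    """

    def has_any(templates, wanted):
        if exact:
            return any(t in wanted for t in templates)
        # Prefix matching: a template counts if it starts with a listed string.
        return any(t.startswith(w) for t in templates for w in wanted)

    selected = set()
    for page, templates in page_templates.items():
        if whitelist and not has_any(templates, whitelist):
            continue
        if blacklist and has_any(templates, blacklist):
            continue
        selected.add(page)
    return selected


# Hypothetical pages and template names.
pages = {
    "A": {"Infobox person", "Cleanup"},
    "B": {"Infobox person"},
    "C": {"Stub"},
}
print(sorted(select_pages(pages, whitelist=["Infobox person"], blacklist=["Cleanup"])))
```

Here page A is rejected because it carries a blacklisted template, and page C because it carries no whitelisted one.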

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

CreateDBAnno

Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.

Boolean

True

—

false

—

Database

The name of the database.

String

True

—

false

—

DoubleCheckAssociatedPages

If this option is set, discussion pages are rejected that are associated with a blacklisted article. Analogously, articles are rejected that are associated with a blacklisted discussion page. This check is rather expensive and could take a long time. This option is not active if only a whitelist is used. Default Value: false

Boolean

True

—

false

—

ExactTemplateMatching

Defines whether to match the templates exactly or whether to match all templates that start with the String given in the respective parameter list. Default Value: true

Boolean

True

—

false

—

Host

The host server.

String

True

—

false

—

IncludeDiscussions

Whether the reader should also include talk pages.

Boolean

True

—

false

—

Language

The language of the Wikipedia that should be connected to.

String

True

—

false

—

LimitNUmberOfArticlesToRead

Optional parameter that defines the maximum number of articles that should be delivered by the reader. This avoids unnecessary filtering if only a small number of articles is needed.

Integer

False

—

false

—

OnlyFirstParagraph

If set to true, only the first paragraph instead of the whole article is used.

Boolean

True

—

false

—

OutputPlainText

Whether the reader outputs plain text or wiki markup.

Boolean

True

—

false

—

PageBuffer

The page buffer size (#pages) of the page iterator.

Integer

True

—

false

—

Password

The password of the database account.

String

True

—

false

—

TemplateBlacklist

Defines templates that the articles MUST NOT contain. If you also define a whitelist, the intersection of both sets is used. (= pages that DO contain templates from the whitelist, but DO NOT contain templates from the blacklist)

String

False

—

true

—

TemplateWhitelist

Defines templates that the articles MUST contain. If you also define a blacklist, the intersection of both sets is used. (= pages that DO contain templates from the whitelist, but DO NOT contain templates from the blacklist)

String

False

—

true

—

User

The username of the database account.

String

True

—

false

—

XMI Reader

Category: Reader
Framework: NaCTeM (UIMA)
Version: 1.1

Reads Common Analysis Structures (CAS) from files in XMI format. Files must have the .xmi extension.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

files

The files to read

String

True

—

true

—

ignoreUnknownTypes

If true, unknown types are ignored. If false, unknown types cause an exception. Default is true.

Boolean

False

—

false

—

XMLReader

Category: Reader
Framework: AlvisNLP
Version: 2010-10-28

Reads a corpus in XML files.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

sourcePath

—

org.bibliome.util.streams.SourceStream

True

—

stringParams

—

alvisnlp.module.types.Mapping

False

—

xslTransform

—

org.bibliome.util.streams.SourceStream

False

—

XMLReader2

Category: Reader
Framework: AlvisNLP
Version: 2012-04-30

Reads XML files and creates elements.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

html

—

java.lang.Boolean

False

—

rawTagNames

—

java.lang.Boolean

False

—

sourcePath

—

org.bibliome.util.streams.SourceStream

True

—

stringParams

—

alvisnlp.module.types.Mapping

False

—

xslTransform

—

org.bibliome.util.streams.SourceStream

True

—

XcesReaderDescriptor

Category: Reader
Framework: ILSP (UIMA)
Version: 1.7

Reads XCES XML files.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

InputDirectory

Directory of xml files to read in

String

False

—

false

—

InputEncoding

—

String

False

—

false

—

InputFile

Single file to be processed

String

False

—

false

—

ProcessBoilerplate

—

Boolean

False

—

false

—

StripExt

The file extension to strip from the original filenames. Only files with this extension will be processed by the reader.

String

False

—

false

—

XcesType

The type of XCES files: basic (with paragraph segmentation only) or annot (with sentence boundaries and token annotations up to lemma).

String

False

—

false

—

XmiReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reader for UIMA XMI files.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

includeHidden

Include hidden files and directories.

Boolean

True

—

false

—

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

String

False

—

false

—

lenient

In lenient mode, unknown types are ignored and do not cause an exception to be thrown.

Boolean

True

—

false

—

patterns

—

String

False

—

true

—

sourceLocation

Location from which the input is read.

String

False

—

false

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—

XmlReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Reader for XML files.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

DocIdTag

tag which contains the docId

String

False

—

false

—

ExcludeTag

Optional. Tags that should not be processed. No text is extracted from them and no annotations are produced for them.

String

True

—

true

—

IncludeTag

Optional. Tags that should be processed (if empty, all tags except those listed in ExcludeTag are processed).

String

True

—

true

—

collectionId

The collection ID to set in the DocumentMetaData.

String

False

—

false

—

language

Set this as the language of the produced documents.

String

False

—

false

—

sourceLocation

Location from which the input is read.

String

True

—

false

—

XmlTextReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

includeHidden

Include hidden files and directories.

Boolean

True

—

false

—

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

String

False

—

false

—

patterns

—

String

False

—

true

—

sourceLocation

Location from which the input is read.

String

False

—

false

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—

XmlXPathReader

Category: Reader
Framework: DKPro Core (UIMA)
Version: 1.8.0

A component reader for XML files implemented with XPath.

This is currently optimized for the TREC format, that is, the style in which TREC topics are presented. You should provide an XPath expression that selects the parent nodes; the child nodes of each parent node will then be stored separately in their own CAS.

If your expression evaluates to leaf nodes, empty CASes will be created.
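
The splitting behaviour described above can be illustrated with Python's `xml.etree.ElementTree`, which supports a limited XPath subset. This is a sketch under assumptions, not the reader itself: the sample XML, the function name `split_by_root_xpath` and the choice of one record per matched parent node are all hypothetical, and a plain dictionary stands in for a CAS.

```python
import xml.etree.ElementTree as ET

# Made-up topic file in a TREC-like layout.
TOPICS_XML = """<topics>
  <top><num>1</num><title>first example topic</title></top>
  <top><num>2</num><title>second example topic</title></top>
</topics>"""


def split_by_root_xpath(xml_text, root_xpath):
    """Build one record per node matched by root_xpath from its child elements."""
    root = ET.fromstring(xml_text)
    records = []
    for parent in root.findall(root_xpath):
        # Each child element contributes one field; a node without
        # children therefore yields an empty record.
        records.append({child.tag: (child.text or "").strip() for child in parent})
    return records


print(split_by_root_xpath(TOPICS_XML, "./top"))
print(split_by_root_xpath(TOPICS_XML, "./top/num"))
```

The second call selects leaf nodes, so every record comes back empty, which mirrors the warning about empty CASes.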

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

caseSensitive

States whether matching is case-sensitive. (default: true)

Boolean

False

—

false

—

docIdTag

Tag which contains the docId. If it is given, it will be ensured that within the same document there is only one id tag and it is not empty

String

False

—

false

—

excludeTags

Tags which should be ignored. If empty then all tags will be processed.

If this and PARAM_INCLUDE_TAGS are both provided, tags in set PARAM_INCLUDE_TAGS - PARAM_EXCLUDE_TAGS will be processed.

String

True

—

true

—

includeTags

Tags which should be worked on. If empty then all tags will be processed.

If this and PARAM_EXCLUDE_TAGS are both provided, tags in set PARAM_INCLUDE_TAGS - PARAM_EXCLUDE_TAGS will be processed.

String

True

—

true

—

language

Language of the documents. If given, it will be set in each CAS.

String

False

—

false

—

patterns

—

String

True

—

true

—

rootXPath

Specifies the XPath expression to all nodes to be processed. Different segments will be separated via PARAM_ID_TAG, and each segment will be stored in a separate CAS.

String

True

—

false

—

sourceLocation

Location from which the input is read.

String

False

—

false

—

useDefaultExcludes

Use the default excludes.

Boolean

True

—

false

—

workingDir

Specify to substitute tag names in CAS. Please give the substitutions each in before - after order. For example to substitute "foo" with "bar", and "hey" with "ho", you can provide { "foo", "bar", "hey", "ho" }.

String

False

—

true

—
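
Two of the rules in the table above are simple enough to sketch directly: the include/exclude tag sets combine as a set difference, and the substitution list is a flat sequence of before/after pairs. The helper names below are hypothetical and the code only illustrates the described semantics.

```python
def tags_to_process(all_tags, include_tags=(), exclude_tags=()):
    """Apply 'include minus exclude'; an empty include list means all tags."""
    candidates = set(include_tags) if include_tags else set(all_tags)
    return candidates - set(exclude_tags)


def substitution_map(flat_pairs):
    """Turn ["foo", "bar", "hey", "ho"] into {"foo": "bar", "hey": "ho"}."""
    if len(flat_pairs) % 2 != 0:
        raise ValueError("substitutions must come in before/after pairs")
    return dict(zip(flat_pairs[0::2], flat_pairs[1::2]))


print(sorted(tags_to_process({"num", "title", "desc"}, ["num", "title"], ["title"])))
print(substitution_map(["foo", "bar", "hey", "ho"]))
```

Rejecting odd-length lists up front is a design choice of this sketch; it surfaces a dangling tag name early instead of silently dropping it.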

SRL (2)

ClearNlpSemanticRoleLabeler

Category: SRL
Framework: DKPro Core (UIMA)
Version: 1.8.0

ClearNLP semantic role labeller.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

expandArguments

Normally the arguments point only to the head words of arguments in the dependency tree. With this option enabled, they are expanded to the text covered by the minimal and maximal token offsets of all descendants (or self) of the head word.

Warning: this parameter should be used with caution! For one, if the descendants of a head word cover a non-continuous region of the text, this information is lost. The arguments will appear to be spanning a continuous region. For another, the arguments may overlap with each other. E.g. if a sentence contains a relative clause with a verb, the subject of the main clause may be recognized as a dependent of the verb and may cause the whole main clause to be recorded in the argument.

Boolean

True

—

false

—

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

String

False

—

false

—

predModelLocation

Location from which the predicate identifier model is read.

String

False

—

false

—

printTagSet

Write the tag set(s) to the log when a model is loaded.

Boolean

True

—

false

—

roleModelLocation

Location from which the roleset classification model is read.

String

False

—

false

—

srlModelLocation

Location from which the semantic role labeling model is read.

String

False

—

false

—
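
The `expandArguments` behaviour described above (expanding an argument from its head word to the span covered by all of the head's descendants) can be sketched as follows. The data layout is an assumption made for illustration: `heads[i]` is the index of token i's parent (-1 for the root), `offsets[i]` is that token's (begin, end) character span, and `expand_argument` is a hypothetical name.

```python
def expand_argument(head, heads, offsets):
    """Return (begin, end) covering the head token and all of its descendants."""
    # Build a parent -> children index from the head array.
    children = {}
    for token, parent in enumerate(heads):
        children.setdefault(parent, []).append(token)

    # Collect the head itself plus every descendant.
    stack, subtree = [head], []
    while stack:
        token = stack.pop()
        subtree.append(token)
        stack.extend(children.get(token, []))

    begin = min(offsets[t][0] for t in subtree)
    end = max(offsets[t][1] for t in subtree)
    return begin, end


# "The old dog barked": The -> dog, old -> dog, dog -> barked, barked = root
heads = [2, 2, 3, -1]
offsets = [(0, 3), (4, 7), (8, 11), (12, 18)]
print(expand_argument(2, heads, offsets))
```

Because only the minimum and maximum offsets are kept, any gap between descendants disappears in the result, which is exactly the loss of information the warning describes.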

MateSemanticRoleLabeler

Category: SRL
Framework: DKPro Core (UIMA)
Version: 1.8.0

DKPro Annotator for the MateTools Semantic Role Labeler.

Please cite the following paper if you use the semantic role labeler: Anders Björkelund, Love Hafdell, and Pierre Nugues. Multilingual semantic role labeling. In Proceedings of The Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 43-48, Boulder, June 4-5 2009.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

modelLocation

Load the model from this location instead of locating the model automatically.

String

False

—

false

—

modelVariant

Override the default variant used to locate the model.

String

False

—

false

—

Scripted analytics (6)

Groovy scripting PR

Category: Scripted analytics
Framework: GATE
Version: unknown

Runs a Groovy script as a processing resource

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

inputASName

—

java.lang.String

—

true

outputASName

—

java.lang.String

—

true

scriptParams

—

gate.FeatureMap

—

true

scriptURL

—

java.net.URL

—

JAPE Transducer

Category: Scripted analytics
Framework: GATE
Version: unknown

A module for executing Jape grammars.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationAccessors

—

java.util.List

—

binaryGrammarURL

—

java.net.URL

—

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

enableDebugging

—

java.lang.Boolean

—

false

—

true

encoding

—

java.lang.String

—

UTF-8

—

grammarURL

—

java.net.URL

—

inputASName

—

java.lang.String

—

true

ontology

—

gate.creole.ontology.Ontology

—

true

operators

—

java.util.List

—

outputASName

—

java.lang.String

—

true

JAPE-Plus Transducer

Category: Scripted analytics
Framework: GATE
Version: unknown

An optimised, JAPE-compatible transducer.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationAccessors

—

java.util.List

—

binaryGrammarURL

—

java.net.URL

—

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

enableDebugging

—

java.lang.Boolean

—

false

—

true

encoding

—

java.lang.String

—

UTF-8

—

grammarURL

—

java.net.URL

—

inputASName

—

java.lang.String

—

true

ontology

—

gate.creole.ontology.Ontology

—

true

operators

—

java.util.List

—

outputASName

—

java.lang.String

—

true

RunProlog

Category: Scripted analytics
Framework: AlvisNLP
Version:

Runs a Prolog program with the corpus data structure encoded as facts.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

addToLayer

—

java.lang.Boolean

False

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

createAnnotations

—

java.lang.Boolean

False

—

createDocuments

—

java.lang.Boolean

False

—

createRelations

—

java.lang.Boolean

False

—

createSections

—

java.lang.Boolean

False

—

createTuples

—

java.lang.Boolean

False

—

deleteElements

—

java.lang.Boolean

False

—

facts

—

org.bibliome.alvisnlp.modules.prolog.FactDefinition[]

True

—

goals

—

org.bibliome.alvisnlp.modules.prolog.GoalDefinition[]

True

—

removeFromLayer

—

java.lang.Boolean

False

—

setArguments

—

java.lang.Boolean

False

—

setFeatures

—

java.lang.Boolean

False

—

target

—

alvisnlp.corpus.expressions.Expression

True

—

theory

—

org.bibliome.util.streams.SourceStream

True

—

Script

Category: Scripted analytics
Framework: AlvisNLP
Version: 2010-10-28

Runs a script.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

language

—

java.lang.String

True

—

script

—

java.lang.String

True

—

UIMA Analysis Engine

Category: Scripted analytics
Framework: GATE
Version: unknown

Wrapper for a Text Analysis Engine from UIMA.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

analysisEngineDescriptor

—

java.net.URL

—

annotationSetName

—

java.lang.String

—

true

document

—

gate.Document

—

true

mappingDescriptor

—

java.net.URL

—

Segmenter (55)

ANNIE English Tokeniser

Category: Segmenter
Framework: GATE
Version: unknown

A customisable English tokeniser.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

tokeniserRulesURL

—

java.net.URL

—

resources/tokeniser/DefaultTokeniser.rules

—

transducerGrammarURL

—

java.net.URL

—

resources/tokeniser/postprocess.jape

—

ANNIE Sentence Splitter

Category: Segmenter
Framework: GATE
Version: unknown

ANNIE sentence splitter.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

gazetteerListsURL

—

java.net.URL

—

resources/sentenceSplitter/gazetteer/lists.def

—

inputASName

—

java.lang.String

—

true

outputASName

—

java.lang.String

—

true

transducerURL

—

java.net.URL

—

resources/sentenceSplitter/grammar/main-single-nl.jape

—

Arabic Tokeniser

Category: Segmenter
Framework: GATE
Version: unknown

A customisable English tokeniser.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

tokeniserRulesURL

—

java.net.URL

—

resources/tokeniser/arabicTokeniser.rules

—

transducerGrammarURL

—

java.net.URL

—

resources/tokeniser/postprocess.jape

—

ArktweetTokenizer

Category: Segmenter
Framework: DKPro Core (UIMA)
Version: 1.8.0

ArkTweet tokenizer.

Banner Base Tokenizer

Category: Segmenter
Framework: NaCTeM (UIMA)
Version: 1.0

Banner Simple Tokenizer

Category: Segmenter
Framework: NaCTeM (UIMA)
Version: 1.0

Tokens output by this tokenizer consist of a contiguous block of alphanumeric characters or a single punctuation mark. Note, therefore, that any construction that contains a punctuation mark (such as a contraction or a real number) will necessarily span at least three tokens.
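
The rule described above can be illustrated with a short Python sketch. It is not the Banner implementation; the regular expression and the function name `simple_tokenize` are assumptions chosen only to reproduce the stated behaviour on ASCII text.

```python
import re

# Assumed rule: a token is either a run of ASCII letters/digits
# or exactly one character that is neither alphanumeric nor whitespace.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")


def simple_tokenize(text):
    """Return alphanumeric runs and single punctuation marks as separate tokens."""
    return TOKEN_RE.findall(text)


print(simple_tokenize("don't"))  # ['don', "'", 't']
print(simple_tokenize("3.14"))   # ['3', '.', '14']
```

Both the contraction and the real number end up as three tokens, which is the consequence the description warns about.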

Banner Whitespace Tokenizer

Category: Segmenter
Framework: NaCTeM (UIMA)
Version: 1.0

Tokenizes sentences only at whitespace characters. All other boundaries (such as between alphabetic characters and punctuation) are ignored.

BreakIteratorSegmenter

Category: Segmenter
Framework: DKPro Core (UIMA)
Version: 1.8.0

BreakIterator segmenter.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

language

The language.

String

False

—

false

—

splitAtApostrophe

By default, the Java BreakIterator does not split contractions such as John’s into two tokens. When this parameter is enabled, an additional token split is generated wherever an apostrophe (') is encountered.

Boolean

True

—

false

—

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Boolean

True

—

false

—

writeSentence

Create Sentence annotations.

Boolean

True

—

false

—

writeToken

Create Token annotations.

Boolean

True

—

false

—

zoneTypes

A list of type names used for zoning.

String

False

—

true

—
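
The difference between strict and non-strict zoning described for strictZoning can be sketched as follows. This is an illustration in Python, not DKPro Core code: the function names, the `(begin, end)` offset tuples and the whitespace-based stand-in segmenter are all assumptions.

```python
import re


def tokens_in_span(text, start, end):
    """Whitespace-based stand-in for the real segmenter; returns (begin, end) offsets."""
    return [(start + m.start(), start + m.end())
            for m in re.finditer(r"\S+", text[start:end])]


def segment(text, zones, strict=True):
    """Segment text inside zones (strict) or between all zone boundaries (non-strict)."""
    if strict:
        # Strict zoning: only text inside a zone is segmented;
        # with no zones, the whole document is treated as one zone.
        spans = zones or [(0, len(text))]
    else:
        # Non-strict zoning: collect every zone start and end,
        # then segment each stretch between neighbouring boundaries.
        bounds = sorted({0, len(text), *[b for zone in zones for b in zone]})
        spans = list(zip(bounds, bounds[1:]))
    return [tok for begin, end in spans for tok in tokens_in_span(text, begin, end)]


text = "Title here. Body text follows."
print(segment(text, [(0, 11)], strict=True))   # tokens of the zone only
print(segment(text, [(0, 11)], strict=False))  # tokens of the whole text
```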

Cafetiere Sentence Splitter

Category: Segmenter
Framework: NaCTeM (UIMA)
Version: 1.0

Uses a set of heuristics and patterns to find sentence boundaries. Works with English.

CamelCaseTokenSegmenter

Category: Segmenter
Framework: DKPro Core (UIMA)
Version: 1.8.0

Split up existing tokens again if they are camel-case text.

|===
| Parameter | Description | Type | Mandatory | Default Value | Multi-value | Runtime

| deleteCover | Whether to remove the original token. Default: true | Boolean | True | — | false | —
|===
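
A minimal Python sketch of camel-case splitting with a deleteCover-style switch follows. The split rule (break before an upper-case letter that follows a lower-case letter or digit) and the name `split_camel_case` are assumptions; the actual component may use a different rule.

```python
import re

# Assumed boundary: a position preceded by a lower-case letter or digit
# and followed by an upper-case letter.
CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def split_camel_case(token, delete_cover=True):
    """Split a camel-case token; optionally keep the original (covering) token."""
    parts = CAMEL_BOUNDARY.split(token)
    if len(parts) == 1:
        return parts  # nothing to split, token is returned unchanged
    return parts if delete_cover else [token] + parts


print(split_camel_case("getFirstToken"))
# ['get', 'First', 'Token']
print(split_camel_case("getFirstToken", delete_cover=False))
# ['getFirstToken', 'get', 'First', 'Token']
```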

Cebuano Gazetteer Tokeniser

Category: Segmenter
Framework: GATE
Version: unknown

A list lookup component.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

caseSensitive

—

java.lang.Boolean

—

true

—

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

gazetteerFeatureSeparator

—

java.lang.String

—

listsURL

—

java.net.URL

—

resources/tokeniser/lists.def

—

longestMatchOnly

—

java.lang.Boolean

—

true

—

true

wholeWordsOnly

—

java.lang.Boolean

—

true

—

true

Cebuano Tokeniser

Category: Segmenter
Framework: GATE
Version: unknown

A customisable English tokeniser.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

tokeniserRulesURL

—

java.net.URL

—

resources/tokeniser/DefaultTokeniser.rules

—

transducerGrammarURL

—

java.net.URL

—

resources/tokeniser/postprocess.jape

—

Chinese Segmenter PR

Category: Segmenter
Framework: GATE
Version: unknown

Segment the Chinese text into words, based on the PAUM learning algorithm.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

learningAlg

—

java.lang.String

—

PAUM

—

true

learningMode

—

gate.chineseSeg.RunMode

—

SEGMENTING

—

true

modelURL

—

java.net.URL

—

true

textCode

—

java.lang.String

—

UTF-8

—

true

textFilesURL

—

java.net.URL

—

true

ClearNlpSegmenter

Category: Segmenter
Framework: DKPro Core (UIMA)
Version: 1.8.0

Tokenizer using Clear NLP.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

modelLocation

Load the model from this location instead of locating the model automatically.

String

False

—

false

—

modelVariant

Override the default variant used to locate the model.

String

False

—

false

—

strictZoning

Boolean

True

—

false

—

writeSentence

Create Sentence annotations.

Boolean

True

—

false

—

writeToken

Create Token annotations.

Boolean

True

—

false

—

zoneTypes

A list of type names used for zoning.

String

False

—

true

—

CompoundAnnotator

Category: Segmenter
Framework: DKPro Core (UIMA)
Version: 1.8.0

Annotates compound parts and linking morphemes.

Freeling Sentence Splitter

Category: Segmenter
Framework: NaCTeM (UIMA)
Version: 1.0

Performs tokenisation. Operates on English (en), Spanish (es), Catalan (ca), Asturian (ast), Welsh (cy), Galician (gl), Italian (it) and Portuguese (pt); the language is selected with the "language" parameter (default is English).

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

language

—

String

True

—

false

—

FreelingTokenizer

Category: Segmenter
Framework: NaCTeM (UIMA)
Version: 1.0

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

language

—

String

True

—

false

—

GATE Unicode Tokeniser

Category: Segmenter
Framework: GATE
Version: unknown

A customisable Unicode tokeniser.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

rulesURL

—

java.net.URL

—

resources/tokeniser/DefaultTokeniser.rules

—

GENIA Sentence Splitter

Category: Segmenter
Framework: GATE
Version: unknown

A processing resource that takes document and corpus parameters

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

corpus

—

gate.Corpus

—

true

debug

—

java.lang.Boolean

—

false

—

true

document

—

gate.Document

—

true

splitterBinary

—

java.net.URL

—

true

GENIA Sentence Splitter

Category: Segmenter
Framework: NaCTeM (UIMA)
Version: 1.0

Machine-learning-based sentence splitter optimized for biomedical texts. The classification model is based on a supervised learning method using maximum entropy modeling (using a simple MaxEnt library). It was trained on the GENIA corpus; the classifier achieved an F-score of 99.7 on 200 unseen GENIA abstracts. Website: http://www.nactem.ac.uk/y-matsu/geniass/

Hashtag Tokenizer

Category: Segmenter
Framework: GATE
Version: unknown

Tokenizes Multi-Word Hashtags

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

corpus

—

gate.Corpus

—

true

debug

—

java.lang.Boolean

—

false

—

true

document

—

gate.Document

—

true

gazetteerURL

—

java.net.URL

—

resources/hashtag/gazetteer/lists.def

—

inputASName

—

java.lang.String

—

true

outputASName

—

java.lang.String

—

true

Hindi Splitter

Category: Segmenter
Framework: GATE
Version: unknown

A Sentence Splitter.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

gazetteerListsURL

—

java.net.URL

—

resources/splitter/gazetteer/lists.def

—

inputASName

—

java.lang.String

—

true

outputASName

—

java.lang.String

—

true

transducerURL

—

java.net.URL

—

resources/splitter/grammar/main.jape

—

Hindi Tokeniser

Category: Segmenter
Framework: GATE
Version: unknown

A customisable Hindi tokeniser.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

tokeniserRulesURL

—

java.net.URL

—

resources/tokeniser/multiTokeniser.rules

—

transducerGrammarURL

—

java.net.URL

—

resources/tokeniser/postprocess.jape

—

ILSP Paragraph, Sentence and Token Segmentor

Category: Segmenter
Framework: ILSP (UIMA)
Version: 1.15

This module is a regex- and abbreviation-based segmentor targeting texts written in Greek.

|===
| Parameter | Description | Type | Mandatory | Default Value | Multi-value | Runtime

| Mode | Segmentation mode. default: let ilsp-sst decide on sentence splits; nla: force ilsp-sst to always use newlines as sentence splits; nlo: force ilsp-sst to use only newlines as sentence splits. | String | False | — | false | —
|===

IULATokenizer

Category: Segmenter
Framework: NaCTeM (UIMA)
Version: 1.0

Performs paragraph splitting, sentence splitting, and tokenisation. Also detects proper names. Operates on Spanish (es) and Catalan (ca), by setting the "language" parameter (default is Spanish).

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

language

—

String

True

—

false

—

JTokSegmenter

Category: Segmenter
Framework: DKPro Core (UIMA)
Version: 1.8.0

JTok segmenter.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

language

The language.

String

False

—

false

—

strictZoning

Boolean

True

—

false

—

writeParagraph

Create Paragraph annotations.

Boolean

True

—

false

—

writeSentence

Create Sentence annotations.

Boolean

True

—

false

—

writeToken

Create Token annotations.

Boolean

True

—

false

—

zoneTypes

A list of type names used for zoning.

String

False

—

true

—

LanguageToolSegmenter

Category: Segmenter
Framework: DKPro Core (UIMA)
Version: 1.8.0

Segmenter using LanguageTool to do the heavy lifting. LanguageTool internally uses different strategies for tokenization.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

language

The language.

String

False

—

false

—

strictZoning

Boolean

True

—

false

—

writeSentence

Create Sentence annotations.

Boolean

True

—

false

—

writeToken

Create Token annotations.

Boolean

True

—

false

—

zoneTypes

A list of type names used for zoning.

String

False

—

true

—

LineBasedSentenceSegmenter

Category: Segmenter
Framework: DKPro Core (UIMA)
Version: 1.8.0

Annotates each line in the source text as a sentence. This segmenter is not capable of creating tokens! All respective parameters have no functionality.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

language

The language.

String

False

—

false

—

strictZoning

Boolean

True

—

false

—

writeSentence

Create Sentence annotations.

Boolean

True

—

false

—

writeToken

Create Token annotations.

Boolean

True

—

false

—

zoneTypes

A list of type names used for zoning.

String

False

—

true

—

LingPipe Sentence Splitter

Category: Segmenter
Framework: NaCTeM (UIMA)
Version: 1.0

Sentence splitter based on LingPipe models. Website: http://alias-i.com/lingpipe/

|===
| Parameter | Description | Type | Mandatory | Default Value | Multi-value | Runtime

| useBiomedicalModel | true if the LingPipe MEDLINE sentence model should be used | Boolean | False | — | false | —
|===

LingPipe Sentence Splitter PR

Category: Segmenter
Framework: GATE
Version: unknown

Provides an interface to LingPipe sentence splitter API.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

document

—

gate.Document

—

true

outputASName

—

java.lang.String

—

true

LingPipe Tokenizer PR

Category: Segmenter
Framework: GATE
Version: unknown

Provides a LingPipe tokenizer.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

document

—

gate.Document

—

true

outputASName

—

java.lang.String

—

true

MLRS Maltese Tokeniser

Category: Segmenter
Framework: NaCTeM (UIMA)
Version: 1.0

Tokenises Maltese text

MLRS Paragraph Splitter

Category: Segmenter
Framework: NaCTeM (UIMA)
Version: 1.0

Identifies the paragraphs in the text, creating a Paragraph annotation for each one

MLRS Sentence Splitter

Category: Segmenter
Framework: NaCTeM (UIMA)
Version: 1.0

Identifies the sentences in the text, creating a Sentence annotation for each

OSCAR 4 Tokeniser

Category: Segmenter
Framework: NaCTeM (UIMA)
Version: 1.0

Segments text into tokens. Derived from the OSCAR 4 chemical NER tool, this tokeniser is specifically tuned for processing chemical text.

OgmiosTokenizer

Category: Segmenter
Framework: AlvisNLP
Version: 2010-10-28

Tokenizes the sections contents according to the Ogmios tokenizer specifications.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

separatorTokens

—

java.lang.Boolean

True

—

targetLayerName

—

java.lang.String

True

—

tokenTypeFeature

—

java.lang.String

True

—

OpenNLP Sentence Splitter

Category: Segmenter
Framework: GATE
Version: unknown

Sentence splitter using an OpenNLP maxent model

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

model

—

java.net.URL

—

models/english/en-sent.bin

—

OpenNLP Tokenizer

Category: Segmenter
Framework: GATE
Version: unknown

Tokenizer using an OpenNLP maxent model

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

model

—

java.net.URL

—

models/english/en-token.bin

—

OpenNLPTokenizer

Category: Segmenter
Framework: NaCTeM (UIMA)
Version: 1.0

Tokenizes the text and creates token annotations that span the tokens. Tokenization is performed by the OpenNLP MaxEnt tokenizer, which follows the Penn Treebank tokenization standard. In general, tokens are separated by white space, but punctuation marks (e.g., ".", ",", "!", "?") and apostrophe endings (e.g., "'s", "n't") are separate tokens.

|===
| Parameter | Description | Type | Mandatory | Default Value | Multi-value | Runtime

| ModelFile | OpenNLP MaxEnt model file for the tokenizer. | String | True | — | false | —
|===

OpenNlpSegmenter

Category: Segmenter
Framework: DKPro Core (UIMA)
Version: 1.8.0

Tokenizer and sentence splitter using OpenNLP.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

modelVariant

Override the default variant used to locate the model.

String

False

—

false

—

segmentationModelLocation

Load the segmentation model from this location instead of locating the model automatically.

String

False

—

false

—

strictZoning

Boolean

True

—

false

—

tokenizationModelLocation

Load the tokenization model from this location instead of locating the model automatically.

String

False

—

false

—

writeSentence

Create Sentence annotations.

Boolean

True

—

false

—

writeToken

Create Token annotations.

Boolean

True

—

false

—

zoneTypes

A list of type names used for zoning.

String

False

—

true

—

ParagraphSplitter

Category: Segmenter
Framework: DKPro Core (UIMA)
Version: 1.8.0

This class creates paragraph annotations for the given input document. It searches for the occurrence of two or more line-breaks (Unix and Windows) and regards this as the boundary between paragraphs.

|===
| Parameter | Description | Type | Mandatory | Default Value | Multi-value | Runtime

| splitPattern | A regular expression used to detect paragraph splits. Default: #DOUBLE_LINE_BREAKS_PATTERN (split on two consecutive line breaks) | String | True | — | false | —
|===
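
The paragraph detection described above can be sketched in Python. The regular expression below is an assumed stand-in for #DOUBLE_LINE_BREAKS_PATTERN (its exact value is not given here), and `paragraph_spans` is a hypothetical helper name.

```python
import re

# Assumed stand-in for the default pattern:
# two or more consecutive line breaks, Unix (\n) or Windows (\r\n).
DOUBLE_LINE_BREAKS = re.compile(r"(?:\r?\n){2,}")


def paragraph_spans(text, split_pattern=DOUBLE_LINE_BREAKS):
    """Return (begin, end) offsets of the paragraphs between split matches."""
    spans, start = [], 0
    for match in split_pattern.finditer(text):
        if match.start() > start:
            spans.append((start, match.start()))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


text = "One.\n\nTwo.\r\n\r\nThree."
print(paragraph_spans(text))  # [(0, 4), (6, 10), (14, 20)]
```

Returning offsets rather than strings mirrors how annotation-based frameworks usually record paragraphs, although that choice is also an assumption of this sketch.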

PatternBasedTokenSegmenter

Category: Segmenter
Framework: DKPro Core (UIMA)
Version: 1.8.0

Splits existing tokens again at particular split characters. The prefix states whether the split characters should be added as separate Tokens. If #INCLUDE_PREFIX precedes the split pattern, the matched characters are included as Tokens; patterns following #EXCLUDE_PREFIX are not added as Tokens.

|===
| Parameter | Description | Type | Mandatory | Default Value | Multi-value | Runtime

| deleteCover | Whether to remove the original token. Default: true | Boolean | True | — | false | —
| patterns | A list of regular expressions, prefixed with #INCLUDE_PREFIX or #EXCLUDE_PREFIX. If neither of the prefixes is used, #EXCLUDE_PREFIX is assumed. | String | True | — | true | —
|===
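
The include/exclude behaviour of the patterns parameter can be illustrated with the following Python sketch. The prefix strings `+|` and `-|` are hypothetical placeholders for #INCLUDE_PREFIX and #EXCLUDE_PREFIX, whose actual values are not stated here, and the helper names are likewise assumptions.

```python
import re

INCLUDE_PREFIX = "+|"  # hypothetical stand-in for #INCLUDE_PREFIX
EXCLUDE_PREFIX = "-|"  # hypothetical stand-in for #EXCLUDE_PREFIX


def parse_pattern(spec):
    """Return (compiled regex, include flag); no prefix means exclude."""
    if spec.startswith(INCLUDE_PREFIX):
        return re.compile(spec[len(INCLUDE_PREFIX):]), True
    if spec.startswith(EXCLUDE_PREFIX):
        return re.compile(spec[len(EXCLUDE_PREFIX):]), False
    return re.compile(spec), False


def split_token(token, spec):
    """Split one token at every match of the pattern described by spec."""
    regex, include = parse_pattern(spec)
    pieces, pos = [], 0
    for match in regex.finditer(token):
        if match.start() > pos:
            pieces.append(token[pos:match.start()])
        if include and match.group():
            pieces.append(match.group())  # split characters become their own token
        pos = match.end()
    if pos < len(token):
        pieces.append(token[pos:])
    return pieces


print(split_token("state-of-the-art", "-"))    # split characters dropped
print(split_token("state-of-the-art", "+|-"))  # split characters kept as tokens
```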

Penn BioTokenizer

Category: Segmenter
Framework: GATE
Version: unknown

Tokenizer for biomedical text

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

tokenizerURL

—

java.net.URL

—

resources/BioTok.bin.gz

—

RASP2 Tokenizer

Category: Segmenter
Framework: GATE
Version: unknown

RASP2 Tokenizer. Faster than the original GATE component but generates Tokens which have only a 'string' feature. Requires annotations of type Sentence. See RASP package for platform restrictions.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

charset

—

java.lang.String

—

ISO-8859-1

—

true

debug

—

java.lang.Boolean

—

false

—

true

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

outputASName

—

java.lang.String

—

true

RegEx Sentence Splitter

Category: Segmenter
Framework: GATE
Version: unknown

A sentence splitter based on regular expressions.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

externalSplitListURL

—

java.net.URL

—

resources/regex-splitter/external-split-patterns.txt

—

internalSplitListURL

—

java.net.URL

—

resources/regex-splitter/internal-split-patterns.txt

—

nonSplitListURL

—

java.net.URL

—

resources/regex-splitter/non-split-patterns.txt

—

outputASName

—

java.lang.String

—

true

RegexTokenizer

Category: Segmenter
Framework: DKPro Core (UIMA)
Version: 1.8.0

This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries.

The default behaviour is to split sentences by a line break and tokens by whitespace.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

language

The language.

String

False

—

false

—

sentenceBoundaryRegex

Define the sentence boundary. Default: \n (assume one sentence per line).

String

True

—

false

—

strictZoning

Boolean

True

—

false

—

tokenBoundaryRegex

Defines the pattern that is used as token end boundary. Default: [\s\n]+ (matching whitespace and linebreaks). When setting custom patterns, take into account that the final token is often terminated by a linebreak rather than the boundary character. Therefore, the newline typically has to be added to the group of matching characters, e.g. "tokenized-text" is correctly tokenized with the pattern [-\n].

String

True

—

false

—

writeSentence

Create Sentence annotations.

Boolean

True

—

false

—

writeToken

Create Token annotations.

Boolean

True

—

false

—

zoneTypes

A list of type names used for zoning.

String

False

—

true

—
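
The interplay of sentenceBoundaryRegex and tokenBoundaryRegex can be sketched in Python. This is a simplification under stated assumptions: it returns token strings instead of annotation offsets, and `regex_segment` is a hypothetical name, not part of DKPro Core.

```python
import re


def regex_segment(text, sentence_boundary=r"\n", token_boundary=r"[\s\n]+"):
    """Split text into sentences, then each sentence into tokens, by regex."""
    sentences = []
    for sentence in re.split(sentence_boundary, text):
        # Drop empty strings produced by leading or trailing boundaries.
        tokens = [tok for tok in re.split(token_boundary, sentence) if tok]
        if tokens:
            sentences.append(tokens)
    return sentences


# Defaults: one sentence per line, tokens separated by whitespace.
print(regex_segment("a b\nc d"))
# [['a', 'b'], ['c', 'd']]

# Custom token boundary including the newline, as the description recommends.
print(regex_segment("tokenized-text\n", token_boundary=r"[-\n]"))
# [['tokenized', 'text']]
```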

Romanian Tokeniser

Category: Segmenter
Framework: GATE
Version: unknown

A customisable Romanian tokeniser.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

tokeniserRulesURL

—

java.net.URL

—

resources/Tokeniser/OBtokeniser.rules

—

transducerGrammarURL

—

java.net.URL

—

resources/Tokeniser/postprocess.jape

—

Stanford PTB Tokenizer

Category: Segmenter
Framework: GATE
Version: unknown

Stanford Penn Treebank v3 Tokenizer, for English

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

failOnMissingInputAnnotations

—

java.lang.Boolean

—

false

—

true

inputASName

—

java.lang.String

—

true

outputASName

—

java.lang.String

—

true

spaceLabel

—

java.lang.String

—

SpaceToken

—

true

tokenLabel

—

java.lang.String

—

Token

—

true

StanfordSegmenter

Category: Segmenter
Framework: DKPro Core (UIMA)
Version: 1.8.0

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

allowEmptySentences

Whether to generate empty sentences.

Boolean

True

—

false

—

boundaryFollowers

A set of strings, matched with .equals(), that are allowed to be tacked onto the end of a sentence after a sentence boundary token, for example ")".

String

False

—

true

—

boundaryToDiscard

The set of regex for sentence boundary tokens that should be discarded.

String

False

—

true

—

boundaryTokenRegex

The set of boundary tokens. If null, use default.

String

False

—

false

—

isOneSentence

Whether to treat all input as one sentence.

Boolean

True

—

false

—

language

The language.

String

False

—

false

—

languageFallback

—

String

False

—

false

—

newlineIsSentenceBreak

Strategy for treating newlines as paragraph breaks.

String

False

—

false

—

regionElementRegex

A regular expression for element names containing a sentence region. Only tokens in such elements will be included in sentences. The start and end tags themselves are not included in the sentence.

String

False

—

false

—

strictZoning

Boolean

True

—

false

—

tokenRegexesToDiscard

The set of regex for sentence boundary tokens that should be discarded.

String

False

—

true

—

writeSentence

Create Sentence annotations.

Boolean

True

—

false

—

writeToken

Create Token annotations.

Boolean

True

—

false

—

xmlBreakElementsToDiscard

These are elements like "p" or "sent", which will be wrapped into regex for approximate XML matching. They will be deleted in the output, and will always trigger a sentence boundary.

String

False

—

true

—

zoneTypes

A list of type names used for zoning.

String

False

—

true

—

TokenMerger

Category: Segmenter
Framework: DKPro Core (UIMA)
Version: 1.8.0

Merges any Tokens that are covered by a given annotation type. E.g. this component can be used to create a single token from all tokens that constitute a multi-token named entity.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

POSMappingLocation

Override the tagset mapping.

String

False

—

false

—

annotationType

Annotation type for which tokens should be merged.

String

True

—

false

—

constraint

A constraint on the annotations that should be considered in form of a JXPath statement. Example: set #PARAM_ANNOTATION_TYPE to a NamedEntity type and set the #PARAM_CONSTRAINT to ".[value = 'LOCATION']" to merge only tokens that are part of a location named entity.

String

False

—

false

—

language

Use this language instead of the document language to resolve the model and tag set mapping.

String

False

—

false

—

lemmaMode

Configure what should happen to the lemma of the merged tokens. It is possible to JOIN the lemmata to a single lemma (space separated), to REMOVE the lemma or LEAVE the lemma of the first token as-is.

String

True

—

false

—

posType

Set a new POS tag for the new merged token. This is the mapped type. If this is specified, tag set mapping will not be performed. This parameter has no effect unless PARAM_POS_VALUE is also set.

String

False

—

false

—

posValue

Set a new POS value for the new merged token. This is the actual tag set value and is subject to tagset mapping. For example when merging tokens for named entities, the new POS value may be set to "NNP" (English/Penn Treebank Tagset).

String

False

—

false

—
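
A rough Python sketch of merging tokens under a covering annotation, including the three lemmaMode options, is given below. The dictionary-based token representation and the function name `merge_tokens` are assumptions for illustration; POS handling and the JXPath constraint are left out.

```python
def merge_tokens(tokens, covers, lemma_mode="JOIN"):
    """Merge tokens lying inside each cover span.

    tokens: list of dicts with 'begin', 'end', 'lemma', sorted by 'begin'.
    covers: list of (begin, end) spans of the covering annotation type.
    lemma_mode: 'JOIN', 'REMOVE' or 'LEAVE', as described for lemmaMode.
    """
    result, used = [], set()
    for begin, end in covers:
        idx = [i for i, tok in enumerate(tokens)
               if i not in used and begin <= tok["begin"] and tok["end"] <= end]
        if not idx:
            continue
        group = [tokens[i] for i in idx]
        if lemma_mode == "JOIN":
            lemma = " ".join(tok["lemma"] for tok in group)  # space separated
        elif lemma_mode == "REMOVE":
            lemma = None
        else:  # LEAVE: keep the lemma of the first token as-is
            lemma = group[0]["lemma"]
        result.append({"begin": group[0]["begin"], "end": group[-1]["end"], "lemma": lemma})
        used.update(idx)
    # Tokens outside every cover are passed through unchanged.
    result.extend(tok for i, tok in enumerate(tokens) if i not in used)
    return sorted(result, key=lambda tok: tok["begin"])


# "in New York" with a named-entity span over "New York"
tokens = [
    {"begin": 0, "end": 2, "lemma": "in"},
    {"begin": 3, "end": 6, "lemma": "new"},
    {"begin": 7, "end": 11, "lemma": "york"},
]
print(merge_tokens(tokens, [(3, 11)]))
```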

TokenTrimmer

Category: Segmenter
Framework: DKPro Core (UIMA)
Version: 1.8.0

Remove prefixes and suffixes from tokens.

|===
| Parameter | Description | Type | Mandatory | Default Value | Multi-value | Runtime

| prefixes | List of prefixes to remove. | String | True | — | true | —
| suffixes | List of suffixes to remove. | String | True | — | true | —
|===

TrailingCharacterRemover

Category: Segmenter
Framework: DKPro Core (UIMA)
Version: 1.8.0

Removes trailing characters or character sequences from tokens, e.g. punctuation.

|===
| Parameter | Description | Type | Mandatory | Default Value | Multi-value | Runtime

| minTokenLength | All tokens that are shorter than the minimum token length after removing trailing chars are completely removed. By default (1), empty tokens are removed. Set to 0 or a negative value if no tokens should be removed. Shorter tokens that do not have trailing chars removed are always retained, regardless of their length. | Integer | True | — | false | —
| pattern | A regex to be trimmed from the end of tokens. Default: "[\\Q,-“^»*’()&/\"'©§'—«·=\\E0-9A-Z]+" (removes punctuation, special characters and capital letters). | String | True | — | false | —
|===
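
How pattern and minTokenLength interact can be shown with a small Python sketch. The default pattern used here is a simplified, hypothetical replacement for the Java default (Python's `re` module has no `\Q...\E` quoting), and `remove_trailing` is an assumed helper name.

```python
import re


def remove_trailing(tokens, pattern=r"[.,;:!?]+", min_token_length=1):
    """Trim trailing matches of pattern; drop trimmed tokens that became too short."""
    trailing = re.compile(f"(?:{pattern})$")
    kept = []
    for token in tokens:
        trimmed = trailing.sub("", token)
        # Only tokens that actually lost characters are subject to the length check.
        if trimmed != token and len(trimmed) < min_token_length:
            continue
        kept.append(trimmed)
    return kept


print(remove_trailing(["end.", "...", "a"]))
# ['end', 'a']
print(remove_trailing(["ok.", "a"], min_token_length=3))
# ['a']
```

In the second call, "ok." is dropped because its trimmed form is shorter than 3 characters, while "a" is retained because nothing was trimmed from it.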

Twitter Tokenizer (EN)

Category: Segmenter
Framework: GATE
Version: unknown

Tokenizer tuned for Tweets

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

tokeniserRulesURL

—

java.net.URL

—

resources/tokeniser/DefaultTokeniser.rules

—

transducerGrammarURL

—

java.net.URL

—

resources/tokeniser/twitter+English.jape

—

UAICTokenizerDescriptor

Category: Segmenter
Framework: NaCTeM (UIMA)
Version: 1.0

WhitespaceTokenizer

Category: Segmenter
Framework: DKPro Core (UIMA)
Version: 1.8.0

A strict whitespace tokenizer, i.e. tokenizes according to whitespaces and linebreaks only.

If PARAM_WRITE_SENTENCES is set to true, one sentence per line is assumed. Otherwise, no sentences are created.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

language

The language.

String

False

—

false

—

strictZoning

Boolean

True

—

false

—

writeSentence

Create Sentence annotations.

Boolean

True

—

false

—

writeToken

Create Token annotations.

Boolean

True

—

false

—

zoneTypes

A list of type names used for zoning.

String

False

—

true

—

Semantics (2)

Semantic Enrichment PR

http://factforge.net/sparql

Category: Semantics
Framework: GATE
Version: unknown

The Semantic Enrichment PR allows adding new data to semantic annotations by querying external RDF (Linked Data) repositories.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationTypes

—

java.util.List

—

true

corpus

—

gate.Corpus

—

true

deleteOnNoRelations

—

java.lang.Boolean

—

false

—

true

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

query

—

java.lang.String

—

true

repositoryUrl

—

java.lang.String

—

—

version

—

java.lang.String

—

to be loaded from jar manifest

—

SemanticFieldAnnotator

Category: Semantics
Framework: DKPro Core (UIMA)
Version: 1.8.0

This Analysis Engine annotates English single words with semantic field information retrieved from an ExternalResource. This could be a lexical resource such as WordNet or a simple key-value map. The annotation is stored in the SemanticField annotation type.

|===
| Parameter | Description | Type | Mandatory | Default Value | Multi-value | Runtime

| annotationType | Annotation types which should be annotated with semantic fields | String | True | — | false | —
| constraint | A constraint on the annotations that should be considered in form of a JXPath statement. Example: set #PARAM_ANNOTATION_TYPE to a NamedEntity type and set the #PARAM_CONSTRAINT to ".[value = 'LOCATION']" to annotate only tokens with semantic fields that are part of a location named entity. | String | False | — | false | —
|===

Sentiment (1)

Textalytics Sentiment Analysis

http://textalytics.com/core/sentiment-1.1

Category: Sentiment
Framework: GATE
Version: unknown

Textalytics Sentiment Analysis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

apiURL

—

java.lang.String

—

—

true

concepts

—

java.lang.String

—

true

corpus

—

gate.Corpus

—

true

debug

—

java.lang.Boolean

—

true

document

—

gate.Document

—

true

entities

—

java.lang.String

—

true

inputASTypes

—

java.util.List

—

true

inputASname

—

java.lang.String

—

true

key

—

java.lang.String

—

true

model

—

java.lang.String

—

true

outputASname

—

java.lang.String

—

Textalytics

—

true

Spelling/Grammar (5)

CorrectionsContextualizer

Category: Spelling/Grammar
Framework: DKPro Core (UIMA)
Version: 1.8.0

This component assumes that some spell checker has already been applied upstream (e.g. Jazzy). It then uses ngram frequencies from a frequency provider in order to rank the provided corrections.
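
One way to picture the ranking step is the Python sketch below. It is only an illustration: the scoring (sum of the bigram counts with the left and right neighbour), the in-memory `ngram_counts` dictionary and the name `rank_corrections` are assumptions, not the component's actual frequency provider or formula.

```python
def rank_corrections(left, candidates, right, ngram_counts):
    """Order candidate corrections by how often they occur with their neighbours."""
    def score(candidate):
        # Assumed score: bigram count with the word before plus the word after.
        return (ngram_counts.get((left, candidate), 0)
                + ngram_counts.get((candidate, right), 0))
    return sorted(candidates, key=score, reverse=True)


# Toy bigram frequencies (invented for the example).
counts = {("over", "the"): 50, ("the", "lazy"): 40, ("over", "ten"): 5}

# Candidates proposed upstream for the misspelling in "over teh lazy".
print(rank_corrections("over", ["ten", "tea", "the"], "lazy", counts))
# ['the', 'ten', 'tea']
```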

JazzyChecker

Category: Spelling/Grammar
Framework: DKPro Core (UIMA)
Version: 1.8.0

This annotator uses Jazzy to decide whether a word is spelled correctly.

|===
| Parameter | Description | Type | Mandatory | Default Value | Multi-value | Runtime

| ScoreThreshold | Determines the maximum edit distance (as an int value) that a suggestion for a spelling error may have. E.g. if set to 1, suggestions are limited to words within edit distance 1 of the original word. | Integer | True | — | false | —
| modelEncoding | The character encoding used by the model. | String | True | — | false | —
| modelLocation | Location from which the model is read. The model file is a simple word-list with one word per line. | String | True | — | false | —
|===
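
The effect of ScoreThreshold can be illustrated with plain Levenshtein distance over a word list. This Python sketch is an assumption-laden stand-in: Jazzy's own scoring may weight edits differently, and `edit_distance` and `suggest` are hypothetical helper names.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit cost for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                       # deletion
                           cur[j - 1] + 1,                    # insertion
                           prev[j - 1] + (char_a != char_b))) # substitution
        prev = cur
    return prev[-1]


def suggest(word, word_list, score_threshold=1):
    """Return suggestions within the threshold; known words need none."""
    if word in word_list:
        return []
    return sorted(w for w in word_list if edit_distance(word, w) <= score_threshold)


words = {"the", "ten", "tea", "then"}  # stands in for a one-word-per-line model file
print(suggest("teh", words, score_threshold=1))  # ['tea', 'ten']
print(suggest("teh", words, score_threshold=2))  # ['tea', 'ten', 'the', 'then']
```

Because this distance has no transposition operation, "the" is two edits away from "teh" and only appears once the threshold is raised to 2.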

LanguageToolChecker

Category: Spelling/Grammar
Framework: DKPro Core (UIMA)
Version: 1.8.0

Detects grammatical errors in text using LanguageTool, a rule-based grammar checker.

|===
| Parameter | Description | Type | Mandatory | Default Value | Multi-value | Runtime

| language | Use this language instead of the document language to resolve the model. | String | False | — | false | —
|===

NorvigSpellingCorrector

Category: Spelling/Grammar
Framework: DKPro Core (UIMA)
Version: 1.8.0

Creates SofaChangeAnnotations containing corrections for previously identified spelling errors.

Textalytics Spell, Grammar and Style Proofreading

http://textalytics.com/core/stilus-1.1

Category: Spelling/Grammar
Framework: GATE
Version: unknown

Textalytics Spell, Grammar and Style Proofreading

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

apiURL

—

java.lang.String

—

—

true

confusion

—

java.lang.Boolean

—

true

consonantRed

—

java.lang.Boolean

—

true

corpus

—

gate.Corpus

—

true

debug

—

java.lang.Boolean

—

true

dictionary

—

java.lang.String

—

true

document

—

gate.Document

—

true

foreign

—

java.lang.Boolean

—

true

inputASTypes

—

java.util.List

—

true

inputASname

—

java.lang.String

—

true

key

—

java.lang.String

—

true

lang

—

java.lang.String

—

true

manyErrors

—

java.lang.String

—

true

openingClosing

—

java.lang.Boolean

—

true

outputASname

—

java.lang.String

—

Textalytics

—

true

percentage

—

java.lang.Boolean

—

true

prefixed

—

java.lang.Boolean

—

true

properNouns

—

java.lang.Boolean

—

true

punctuation

—

java.lang.Boolean

—

true

quotesOrItalics

—

java.lang.Boolean

—

true

spacing

—

java.lang.Boolean

—

true

tautologyAndLanMisuse

—

java.lang.Boolean

—

true

too_longSent

—

java.lang.Boolean

—

true

Stemmer (4)

BulStem

Category: Stemmer
Framework: GATE
Version: unknown

This plugin is an implementation of the BulStem stemmer algorithm for Bulgarian developed by Preslav Nakov.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

annotationType

—

java.lang.String

—

Token

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

failOnMissingInputAnnotations

—

java.lang.Boolean

—

true

—

true

pathToRules

—

java.net.URL

—

resources/stem_rules_context_2_UTF-8.txt

—

PorterStemmer

Category: Stemmer
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

formFeature

—

java.lang.String

True

—

language

—

java.lang.String

True

—

layerName

—

java.lang.String

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

stemFeature

—

java.lang.String

True

—

SnowballStemmer

Category: Stemmer
Framework: DKPro Core (UIMA)
Version: 1.8.0

UIMA wrapper for the Snowball stemmer. Annotation types to be stemmed can be configured by a FeaturePath.

If you use this component in a pipeline that also removes stop words, make sure that it runs after the stop word removal step, so that only words that are not stop words are stemmed.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

filterConditionOperator

Specifies the operator for a filtering condition. It is only used if PARAM_FILTER_FEATUREPATH is set.

String

False

—

false

—

filterConditionValue

Specifies the value for a filtering condition. It is only used if PARAM_FILTER_FEATUREPATH is set.

String

False

—

false

—

filterFeaturePath

Specifies a feature path that is used in the filter. If this is set, you also have to specify PARAM_FILTER_CONDITION_OPERATOR and PARAM_FILTER_CONDITION_VALUE.

String

False

—

false

—

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

lowerCase

Per default the stemmer runs in case-sensitive mode. If this parameter is enabled, tokens are lower-cased before being passed to the stemmer.

Examples:
Input	false (default)	true
EDUCATIONAL	EDUCATIONAL	educ
Educational	Educat	educ
educational	educ	educ

Boolean

False

—

false

—

paths

Specify a path that is used for annotation. Format is de.type.name/feature/path. All type objects will be annotated with an IndexTermAnnotation. The value of the IndexTerm is specified by the feature path.

String

False

—

true

—

Stemmer PR

Category: Stemmer
Framework: GATE
Version: unknown

Wrapper for the Snowball stemmer.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationFeature

—

java.lang.String

—

string

—

true

annotationSetName

—

java.lang.String

—

true

annotationType

—

java.lang.String

—

Token

—

true

document

—

gate.Document

—

true

language

—

java.lang.String

—

english

—

Tagger (52)

ABNER Tagger

Category: Tagger
Framework: GATE
Version: unknown

GATE wrapper over ABNER

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

abnerMode

—

gate.abner.AbnerRunMode

—

BIOCREATIVE

—

true

annotationName

—

java.lang.String

—

Tagger

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

outputASName

—

java.lang.String

—

true

ANNIE POS Tagger

Category: Tagger
Framework: GATE
Version: unknown

Mark Hepple's Brill-style POS tagger

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

baseSentenceAnnotationType

—

java.lang.String

—

Sentence

—

true

baseTokenAnnotationType

—

java.lang.String

—

Token

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

failOnMissingInputAnnotations

—

java.lang.Boolean

—

true

—

true

inputASName

—

java.lang.String

—

true

lexiconURL

—

java.net.URL

—

resources/heptag/lexicon

—

outputASName

—

java.lang.String

—

true

outputAnnotationType

—

java.lang.String

—

Token

—

true

posTagAllTokens

—

java.lang.Boolean

—

true

—

true

rulesURL

—

java.net.URL

—

resources/heptag/ruleset

—

Anatomical Entity Tagger

Category: Tagger
Framework: NaCTeM (UIMA)
Version: 1.0

Tags anatomical entities using Brown, UMLS and OBO Anatomy dictionary features

ArktweetPosTagger

Category: Tagger
Framework: DKPro Core (UIMA)
Version: 1.8.0

Wrapper for Twitter Tokenizer and POS Tagger, as described in: Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider and Noah A. Smith. Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters. In Proceedings of NAACL 2013.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

String

False

—

false

—

language

Use this language instead of the document language to resolve the model and tag set mapping.

String

False

—

false

—

modelLocation

Location from which the model is read.

String

False

—

false

—

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

String

False

—

false

—

BANNER CRF Tagger

Category: Tagger
Framework: NaCTeM (UIMA)
Version: 1.0

A UIMA wrapper for BANNER entity tagger. BANNER uses CRF.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

ModelFile

File location of CRF trained model generated by BANNER, abstract path is recommended. If not specified, BANNER’s default model is used.

String

False

—

false

—

TypeToBioSuffixMap

Mappings from BIO suffix to the UIMA type names.

String

True

—

true

—

UseNumericNormalization

—

Boolean

True

—

false

—

UseParenthesisPostProcessing

—

Boolean

True

—

false

—

BioCreative Gene Mention Tagger

Category: Tagger
Framework: NaCTeM (UIMA)
Version: 0.0.1-SNAPSHOT

Tags Gene mentions using a model trained on BioCreative GM task data, with Entrez Gene and UMLS dictionary features.

CCGPosTagger

Category: Tagger
Framework: AlvisNLP
Version: 2012-04-30

Applies the CCG POS tagger on annotations.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

executable

—

org.bibliome.util.files.ExecutableFile

True

—

formFeatureName

—

java.lang.String

True

—

internalEncoding

—

java.lang.String

True

—

keepPreviousPos

—

java.lang.Boolean

False

—

maxRuns

—

java.lang.Integer

True

—

model

—

org.bibliome.util.files.InputDirectory

True

—

posFeatureName

—

java.lang.String

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sentenceFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sentenceLayerName

—

java.lang.String

True

—

silent

—

java.lang.Boolean

False

—

wordLayerName

—

java.lang.String

True

—

CRF++ Tagger

Category: Tagger
Framework: NaCTeM (UIMA)
Version: 1.0

Uses Conditional Random Fields model for labeling. Based on CRF++, an implementation of CRF for labeling sequential data (http://crfpp.googlecode.com/svn/trunk/doc/index.html).

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
IgnoreMalformedSequences	Whether malformed sequences such as {O, I-X, O} or {B-X, I-Y} should be ignored. If false, the algorithm will attempt to create annotations.	Boolean	True	—	false	—
IgnoreUnknownTypes	—	Boolean	True	—	false	—
ModelFileName	Specifies the filename to store the model in.	String	True	—	false	—
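The malformed-sequence cases mentioned for IgnoreMalformedSequences can be illustrated with a small BIO decoder. This is a sketch under assumptions: the recovery strategy used when malformed tags are not ignored (treating a stray I-X as the start of a new span) is a guess for illustration, not the documented behaviour of the component, and `decode_bio` is a hypothetical name.

```python
def decode_bio(tags, ignore_malformed=True):
    """Turn a list of BIO tags into (start, end, type) spans; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        # An I- tag continues the open span only if its type matches.
        continues = prefix == "I" and current == label
        if not continues and current is not None:
            spans.append((start, i, current))
            start, current = None, None
        # B- always opens a span; a stray I- opens one only in lenient mode.
        if prefix == "B" or (prefix == "I" and not continues and not ignore_malformed):
            start, current = i, label
    if current is not None:
        spans.append((start, len(tags), current))
    return spans


print(decode_bio(["O", "I-X", "O"], ignore_malformed=True))
print(decode_bio(["O", "I-X", "O"], ignore_malformed=False))
print(decode_bio(["B-X", "I-Y"], ignore_malformed=False))
```

In strict mode the stray I-X is dropped, while in lenient mode it becomes a one-token span of its own.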

Cebuano POS Tagger

Category: Tagger
Framework: GATE
Version: unknown

Mark Hepple's Brill-style POS tagger, adapted for languages where entries are multiword

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

inputASName

—

java.lang.String

—

true

lexiconURL

—

java.net.URL

—

resources/postag/lexicon

—

rulesURL

—

java.net.URL

—

resources/postag/ruleset

—

Chemistry Tagger

Category: Tagger
Framework: GATE
Version: unknown

A tagger for chemical names.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

compoundListsURL

—

java.net.URL

—

resources/compound.def

—

document

—

gate.corpora.DocumentImpl

—

true

elementListsURL

—

java.net.URL

—

resources/element.def

—

elementMapURL

—

java.net.URL

—

resources/element_map.txt

—

removeElements

—

java.lang.Boolean

—

true

—

true

transducerGrammarURL

—

java.net.URL

—

resources/main.jape

—

ClearNlpPosTagger

Category: Tagger
Framework: DKPro Core (UIMA)
Version: 1.8.0

Part-of-Speech annotator using Clear NLP. Requires Sentences to be annotated before.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

String

False

—

false

—

dictLocation

Load the dictionary from this location instead of locating the dictionary automatically.

String

False

—

false

—

dictVariant

Override the default variant used to locate the dictionary.

String

False

—

false

—

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Boolean

False

—

false

—

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

modelLocation

Load the model from this location instead of locating the pos-tagging model automatically.

String

False

—

false

—

modelVariant

Override the default variant used to locate the pos-tagging model.

String

False

—

false

—

printTagSet

Log the tag set(s) when a model is loaded.

Boolean

True

—

false

—

FreelingTagger

Category: Tagger
Framework: NaCTeM (UIMA)
Version: 1.0

Performs tokenisation, lemmatisation and POS tagging. Operates on English (en), Spanish (es), Catalan (ca), Welsh (cy), Galician (gl), Italian (it) and Portuguese (pt); the language is selected by setting the "language" parameter (default is English).

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
language	—	String	True	—	false	—

GENIA Tagger

Category: Tagger
Framework: NaCTeM (UIMA)
Version: 1.0

Tags biological named entities: proteins, cell lines, cell types, DNAs, and RNAs. It has its own tokeniser, part-of-speech tagger, and shallow parser. The models were trained on the GENIA corpus. Project website: http://www.nactem.ac.uk/GENIA/tagger/

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

chunkTag

If true, chunk tags will be found (default is true)

Boolean

False

—

false

—

neTag

If true, ne tags will be found (default true)

Boolean

False

—

false

—

tokenize

True if the Sentences found should be tokenized, false if the tagger should use pre-set Tokens

Boolean

False

—

false

—

GenericTagger

Category: Tagger
Framework: GATE
Version: unknown

The Generic Tagger is Generic!

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

corpus

—

gate.Corpus

—

true

debug

—

java.lang.Boolean

—

false

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

ISO-8859-1

—

true

failOnMissingInputAnnotations

—

java.lang.Boolean

—

true

—

true

failOnUnmappableCharacter

—

java.lang.Boolean

—

true

—

true

featureMapping

—

gate.FeatureMap

—

string=1;category=2;lemma=3

—

true

inputASName

—

java.lang.String

—

true

inputAnnotationType

—

java.lang.String

—

Token

—

true

inputTemplate

—

java.lang.String

—

${string}

—

true

outputASName

—

java.lang.String

—

true

outputAnnotationType

—

java.lang.String

—

Token

—

true

postProcessURL

—

java.net.URL

—

preProcessURL

—

java.net.URL

—

regex

—

java.lang.String

—

(.) (.) (.+)

—

true

taggerBinary

—

java.net.URL

—

true

taggerDir

—

java.net.URL

—

true

taggerFlags

—

java.util.List

—

true

updateAnnotations

—

java.lang.Boolean

—

true

—

true
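The interplay between a regex parameter and a featureMapping value such as string=1;category=2;lemma=3 can be sketched as below. This is an illustration only: the tab-separated pattern and the sample tagger output line are assumptions made for the example, and the helper names are hypothetical rather than part of GATE.

```python
import re


def parse_feature_mapping(spec):
    """Parse 'string=1;category=2;lemma=3' into {'string': 1, 'category': 2, 'lemma': 3}."""
    mapping = {}
    for item in spec.split(";"):
        name, _, group = item.partition("=")
        mapping[name.strip()] = int(group)
    return mapping


def line_to_features(line, pattern, mapping):
    """Match one line of tagger output and map capture groups to feature names."""
    match = re.fullmatch(pattern, line)
    if match is None:
        return None
    return {name: match.group(index) for name, index in mapping.items()}


mapping = parse_feature_mapping("string=1;category=2;lemma=3")
pattern = r"(.+)\t(.+)\t(.+)"  # assumed format: token, tag and lemma separated by tabs
features = line_to_features("cats\tNNS\tcat", pattern, mapping)
print(features)
```

Each number in the mapping is a capture-group index, so changing the external tagger's column order only requires changing the mapping, not the pattern.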

GeniaTagger

Category: Tagger
Framework: AlvisNLP
Version: 2012-04-30

Runs Genia Tagger on annotations.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

chunk

—

java.lang.String

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

entity

—

java.lang.String

False

—

geniaCharset

—

java.lang.String

True

—

geniaDir

—

java.io.File

True

—

geniaTaggerExecutable

—

java.io.File

True

—

lemma

—

java.lang.String

True

—

pos

—

java.lang.String

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sentenceFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sentences

—

java.lang.String

True

—

wordForm

—

java.lang.String

True

—

words

—

java.lang.String

True

—

Hepple POS Tagger

Category: Tagger
Framework: NaCTeM (UIMA)
Version: 1.0

Mark Hepple's POS tagger, from dragontools/Banner toolkit.

HepplePosTagger

Category: Tagger
Framework: DKPro Core (UIMA)
Version: 1.8.0

GATE Hepple part-of-speech tagger.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

String

False

—

false

—

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Default: true

Boolean

False

—

false

—

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

lexiconLocation

Load the lexicon from this location instead of locating it automatically.

String

False

—

false

—

modelVariant

Override the default variant used to locate the model.

String

False

—

false

—

printTagSet

Log the tag set(s) when a model is loaded.

Default: false

Boolean

True

—

false

—

rulesetLocation

Load the ruleset from this location instead of locating it automatically.

String

False

—

false

—

Hindi POS Tagger

Category: Tagger
Framework: GATE
Version: unknown

Mark Hepple's Brill-style POS tagger, adapted for languages where entries are multiword

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

inputASName

—

java.lang.String

—

true

lexiconURL

—

java.net.URL

—

resources/tagger/hindi_lexicon

—

rulesURL

—

java.net.URL

—

resources/tagger/ruleset

—

HunPosTagger

Category: Tagger
Framework: DKPro Core (UIMA)
Version: 1.8.0

Part-of-Speech annotator using HunPos. Requires Sentences to be annotated before.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

String

False

—

false

—

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Default: true

Boolean

False

—

false

—

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

modelLocation

Load the model from this location instead of locating the model automatically.

String

False

—

false

—

modelVariant

Override the default variant used to locate the model.

String

False

—

false

—

printTagSet

Log the tag set(s) when a model is loaded.

Default: false

Boolean

True

—

false

—

ILSP FBT Tagger

Category: Tagger
Framework: ILSP (UIMA)
Version: 1.14

ILSP FBT Tagger is an adaptation of the Brill tagger trained on Greek text. It uses a PAROLE-compatible tagset of 584 different tags which capture the morphosyntactic particularities of the Greek language. Working on the output of a sentence detection and tokenisation tool, the tagger assigns initial tags by looking the words up in a lexicon created from a manually annotated corpus during training. A suffix lexicon is used for the initial tagging of unknown words. 799 contextual rules are then applied to improve the output of the initial phase.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
LexicaDir	The directory containing the Berkeley DB lexical resources. Default is /opt/ilsp-nlp/lexica/fbt.	String	False	—	false	—
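The three stages described for this tagger (lexicon lookup, suffix lexicon for unknown words, contextual rules) follow the general Brill-style scheme, which can be sketched as below. Everything in the sketch is a toy assumption: the English lexicon, the suffix table, the single rule and the tag names are invented for illustration and have nothing to do with the actual Greek resources.

```python
# Hypothetical resources standing in for the trained lexica.
LEXICON = {"the": "DET", "dog": "NOUN", "can": "AUX", "run": "VERB"}
SUFFIX_LEXICON = {"ing": "VERB", "ly": "ADV", "ed": "VERB"}
DEFAULT_TAG = "NOUN"

# Each contextual rule: (from_tag, to_tag, required_previous_tag).
CONTEXT_RULES = [("AUX", "NOUN", "DET")]


def initial_tag(word):
    """Assign an initial tag: lexicon first, then longest matching suffix, then default."""
    lower = word.lower()
    if lower in LEXICON:
        return LEXICON[lower]
    for suffix in sorted(SUFFIX_LEXICON, key=len, reverse=True):
        if lower.endswith(suffix):
            return SUFFIX_LEXICON[suffix]
    return DEFAULT_TAG


def tag(words):
    """Tag a sentence, then apply each contextual rule to correct initial tags."""
    tags = [initial_tag(w) for w in words]
    for from_tag, to_tag, previous_tag in CONTEXT_RULES:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == previous_tag:
                tags[i] = to_tag
    return list(zip(words, tags))


print(tag(["the", "can", "leaked", "slowly"]))
```

Here "can" is first tagged AUX from the lexicon and then corrected to NOUN because it follows a determiner, which is the kind of repair the contextual rules perform.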

IULATagger

Category: Tagger
Framework: NaCTeM (UIMA)
Version: 1.0

Performs paragraph splitting, sentence splitting, tokenisation and POS tagging. Also detects proper names. Operates on Spanish (es) and Catalan (ca), by setting the "language" parameter (default is Spanish).

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
language	—	String	True	—	false	—

LingPipe POS Tagger PR

Category: Tagger
Framework: GATE
Version: unknown

Provides a LingPipe part of speech tagger.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

applicationMode

—

gate.lingpipe.POSApplicationMode

—

FIRSTBEST

—

true

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

modelFileUrl

—

java.net.URL

—

false

MateMorphTagger

Category: Tagger
Framework: DKPro Core (UIMA)
Version: 1.8.0

DKPro Annotator for the MateToolsMorphTagger.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

modelLocation

Load the model from this location instead of locating the model automatically.

String

False

—

false

—

modelVariant

Override the default variant used to locate the model.

String

False

—

false

—

MatePosTagger

Category: Tagger
Framework: DKPro Core (UIMA)
Version: 1.8.0

DKPro Annotator for the MateToolsPosTagger

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

String

False

—

false

—

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

modelLocation

Load the model from this location instead of locating the model automatically.

String

False

—

false

—

modelVariant

Override the default variant used to locate the model.

String

False

—

false

—

printTagSet

Log the tag set(s) when a model is loaded.

Default: false

Boolean

True

—

false

—

MeCabTagger

Category: Tagger
Framework: DKPro Core (UIMA)
Version: 1.8.0

Annotator for the MeCab Japanese POS Tagger.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

language

The language.

String

False

—

false

—

strictZoning

—

Boolean

True

—

false

—

writeSentence

Create Sentence annotations.

Boolean

True

—

false

—

writeToken

Create Token annotations.

Boolean

True

—

false

—

zoneTypes

A list of type names used for zoning.

String

False

—

true

—

Measurement Tagger

Category: Tagger
Framework: GATE
Version: unknown

A measurement tagger based upon GNU Units

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

commonURL

—

java.net.URL

—

resources/common_words.txt

—

consumeNumberAnnotations

—

java.lang.Boolean

—

true

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

failOnMissingInputAnnotations

—

java.lang.Boolean

—

true

—

true

ignoredAnnotations

—

java.util.Set

—

Date;Money

—

true

inputASName

—

java.lang.String

—

true

japeURL

—

java.net.URL

—

resources/jape/main.jape

—

locale

—

java.lang.String

—

en_GB

—

outputASName

—

java.lang.String

—

true

unitsURL

—

java.net.URL

—

resources/units.dat

—

Medical Condition Tagger

Category: Tagger
Framework: NaCTeM (UIMA)
Version: 1.0

A tagger that recognises mentions of medical conditions. Implemented based on string matching against entries in the Index of Diseases (http://resource.nlm.nih.gov/63540040R) and the Nomenclature of Diseases (http://resource.nlm.nih.gov/31910070R).

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
useApproximateStringMatching	True if approximate string matching should be used.	Boolean	False	—	false	—
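The difference between exact and approximate matching against a term list can be sketched as follows. The matching method is an assumption: the listing does not say how the tagger implements approximate matching, so this example simply uses similarity ratios from Python's standard difflib module, with a hypothetical three-entry index and cutoff.

```python
import difflib

# Hypothetical stand-in for the disease index entries.
DISEASE_INDEX = ["asthma", "diabetes mellitus", "pneumonia"]


def find_condition(mention, use_approximate=False, cutoff=0.8):
    """Return the matching index entry for a mention, or None if nothing matches."""
    mention = mention.lower()
    if mention in DISEASE_INDEX:
        return mention
    if use_approximate:
        # get_close_matches returns entries whose similarity ratio is >= cutoff.
        close = difflib.get_close_matches(mention, DISEASE_INDEX, n=1, cutoff=cutoff)
        return close[0] if close else None
    return None


print(find_condition("pnuemonia"))
print(find_condition("pnuemonia", use_approximate=True))
```

Approximate matching recovers the misspelled mention at the cost of possible false positives, which is why a cutoff is needed.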

NormaGene Tagger

Category: Tagger
Framework: GATE
Version: unknown

A processing resource that takes document and corpus parameters

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
annotationSetName	—	java.lang.String	—	—	—	true
corpus	—	gate.Corpus	—	—	—	true
document	—	gate.Document	—	—	—	true
threshold	—	java.lang.Double	—	0.6	—	true

Numbers Tagger

Category: Tagger
Framework: GATE
Version: unknown

Finds numbers (in both words and digits) and annotates them with their numeric value.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

allowWithinWords

—

java.lang.Boolean

—

false

—

true

annotationSetName

—

java.lang.String

—

true

configURL

—

java.net.URL

—

resources/languages/all.xml

—

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

failOnMissingInputAnnotations

—

java.lang.Boolean

—

true

—

true

postProcessURL

—

java.net.URL

—

resources/jape/post-process.jape

—

useHintsFromOriginalMarkups

—

java.lang.Boolean

—

true

—

true
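As a rough illustration of what annotating numbers "in both words and digits" with a numeric value involves, here is a minimal sketch for English. It is not the plugin's algorithm: the word tables, the supported range (up to thousands) and the function names are assumptions made for the example.

```python
import re

UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}


def words_to_number(text):
    """Convert e.g. 'three hundred and twenty-one' to 321; None for unknown words."""
    total, current = 0, 0
    for word in re.split(r"[\s-]+", text.lower().strip()):
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word == "thousand":
            total += max(current, 1) * 1000
            current = 0
        else:
            return None
    return total + current


def numeric_value(text):
    """Return the value of a number written in digits or in words, as a float."""
    cleaned = text.replace(",", "")
    if re.fullmatch(r"\d+(\.\d+)?", cleaned):
        return float(cleaned)
    value = words_to_number(text)
    return None if value is None else float(value)


print(numeric_value("3,200"), numeric_value("two thousand five hundred"))
```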

OpenCalais Tagger

http://api.opencalais.com/enlighten/rest/

Category: Tagger
Framework: GATE
Version: unknown

An OpenCalais based semantic annotator

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

allowDistribution

—

java.lang.Boolean

—

false

—

true

allowSearch

—

java.lang.Boolean

—

false

—

true

calculateRelevanceScore

—

java.lang.Boolean

—

false

—

true

docRDFaccessible

—

java.lang.Boolean

—

false

—

true

document

—

gate.corpora.DocumentImpl

—

true

enableMetadataType

—

gate.opencalais.MetadataType

—

true

externalID

—

java.lang.String

—

true

licenseID

—

java.lang.String

—

openCalaisURL

—

java.net.URL

—

—

outputASName

—

java.lang.String

—

true

submitter

—

java.lang.String

—

true

OpenNLP POS Tagger

Category: Tagger
Framework: GATE
Version: unknown

POS Tagger using an OpenNLP maxent model

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

model

—

java.net.URL

—

models/english/en-pos-maxent.bin

—

outputASName

—

java.lang.String

—

true

OpenNlpPosTagger

Category: Tagger
Framework: DKPro Core (UIMA)
Version: 1.8.0

Part-of-Speech annotator using OpenNLP. Requires Sentences to be annotated before.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

String

False

—

false

—

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Default: true

Boolean

False

—

false

—

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

modelLocation

Load the model from this location instead of locating the model automatically.

String

False

—

false

—

modelVariant

Override the default variant used to locate the model.

String

False

—

false

—

printTagSet

Log the tag set(s) when a model is loaded.

Default: false

Boolean

True

—

false

—

POS Mapper

Category: Tagger
Framework: GATE
Version: unknown

Map complex Russian morphology tags into simpler POS categories

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

Penn BioTagger

Category: Tagger
Framework: GATE
Version: unknown

Ready-made application for the Penn BioTagger

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

—

java.util.List

—

pipelineURL

—

java.net.URL

—

Penn BioTagger: Genes

Category: Tagger
Framework: GATE
Version: unknown

Penn BioTagger for Genes

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

modelURL

—

java.net.URL

—

resources/geneModel.crf.gz

—

outputASName

—

java.lang.String

—

true

Penn BioTagger: Malignancy

Category: Tagger
Framework: GATE
Version: unknown

Penn BioTagger for malignancy types

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

modelURL

—

java.net.URL

—

resources/malignancyModel.crf.gz

—

outputASName

—

java.lang.String

—

true

Penn BioTagger: Variation

Category: Tagger
Framework: GATE
Version: unknown

Penn BioTagger for variations

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

modelURL

—

java.net.URL

—

resources/variationModel.crf.gz

—

outputASName

—

java.lang.String

—

true

PosMapper

Category: Tagger
Framework: DKPro Core (UIMA)
Version: 1.8.0

Maps existing POS tags from one tagset to another using a user-provided properties file.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
dkproMappingLocation	A properties file containing mappings from the new tagset to (fully qualified) DKPro POS classes. If such a file is not supplied, the DKPro POS classes stay the same regardless of the new POS tag value, and only the value is changed.	String	False	—	false	—
mappingFile	A properties file containing POS tagset mappings.	String	True	—	false	—
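A tagset mapping driven by a properties file can be sketched as below. The file contents are hypothetical, the parser handles only simple key=value lines, and leaving unmapped tags unchanged is an assumption for the example rather than documented behaviour of the component.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        mapping[key.strip()] = value.strip()
    return mapping


def map_tags(tags, mapping):
    """Replace each tag that has a mapping; leave unmapped tags unchanged (assumed)."""
    return [mapping.get(tag, tag) for tag in tags]


# Hypothetical mapping file contents: source tag = target tag.
MAPPING_TEXT = """
# coarse-grained targets
NN=NOUN
NNS=NOUN
VBD=VERB
"""

mapping = parse_properties(MAPPING_TEXT)
print(map_tags(["NN", "VBD", "XYZ"], mapping))
```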

RASP POS Converter

Category: Tagger
Framework: GATE
Version: unknown

Converts from PennTreebank POS tags to the C2 tagset used by RASP. Generates annotations of type MorphObj which hold the tag and lemma

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

grammarURL

—

java.net.URL

—

resources/main.jape

—

inputASName

—

java.lang.String

—

true

outputASName

—

java.lang.String

—

true

RASP2 POS Tagger

Category: Tagger
Framework: GATE
Version: unknown

RASP part-of-speech tagger, creating WordForm annotations

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

charset

—

java.lang.String

—

ISO-8859-1

—

true

debug

—

java.lang.Boolean

—

false

—

true

document

—

gate.Document

—

true

generateMultipleTags

—

java.lang.Boolean

—

true

—

true

inputASName

—

java.lang.String

—

true

outputASName

—

java.lang.String

—

true

raspHome

—

java.net.URL

—

file:/usr/local/bin/RASP

—

false

RfTagger

Category: Tagger
Framework: DKPro Core (UIMA)
Version: 1.8.0

RFTagger morphological analyzer.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

MorphMappingLocation

—

String

False

—

false

—

POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

String

False

—

false

—

language

Use this language instead of the document language to resolve the model.

String

False

—

false

—

modelEncoding

The character encoding used by the model.

String

False

—

false

—

modelLocation

Load the model from this location instead of locating the model automatically.

String

False

—

false

—

modelVariant

Override the default variant used to locate the model.

String

False

—

false

—

printTagSet

Write the tag set(s) to the log when a model is loaded.

Boolean

True

—

false

—

Roman Numerals Tagger

Category: Tagger
Framework: GATE
Version: unknown

Finds and annotates Roman numerals

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
allowLowerCase	—	java.lang.Boolean	—	false	—	true
corpus	—	gate.Corpus	—	—	—	true
document	—	gate.Document	—	—	—	true
maxTailLength	—	java.lang.Integer	—	0	—	true
outputASName	—	java.lang.String	—	—	—	true
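Finding Roman numerals and computing their value can be sketched as follows. The `allow_lower_case` flag mirrors the allowLowerCase parameter above in spirit only; the matching strategy (candidate words validated against a strict pattern, values up to 3999) is an assumption for illustration, and maxTailLength is not modelled.

```python
import re

CANDIDATE = r"\b[MDCLXVI]+\b"
# Strict form: thousands, hundreds, tens, units in canonical subtractive notation.
STRICT = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})", re.IGNORECASE)
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral):
    """Sum symbol values, subtracting a symbol that precedes a larger one."""
    upper = numeral.upper()
    total = 0
    for i, symbol in enumerate(upper):
        value = VALUES[symbol]
        if i + 1 < len(upper) and value < VALUES[upper[i + 1]]:
            total -= value
        else:
            total += value
    return total


def find_roman_numerals(text, allow_lower_case=False):
    """Return (start, end, value) for each well-formed Roman numeral word in text."""
    flags = re.IGNORECASE if allow_lower_case else 0
    found = []
    for match in re.finditer(CANDIDATE, text, flags):
        if STRICT.fullmatch(match.group()):
            found.append((match.start(), match.end(), roman_to_int(match.group())))
    return found


print(find_roman_numerals("Chapter XIV and section iv"))
print(find_roman_numerals("Chapter XIV and section iv", allow_lower_case=True))
```

Allowing lower case also makes ordinary words such as "mix" or "did" match, which is one plausible reason for keeping that option off by default.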

Russian POS Tagger

Category: Tagger
Framework: GATE
Version: unknown

Part-of-speech tagger for Russian

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

caseSensitive

—

java.lang.Boolean

—

true

—

config

—

java.net.URL

—

resources/morphology/main.conf

—

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

encoding

—

java.lang.String

—

UTF-8

—

SVMLight Tagger

Category: Tagger
Framework: NaCTeM (UIMA)
Version: 1.0

Applies an SVMLight-trained model on instances.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
ModelFile	The SVMLight model	String	True	—	false	—
NormFile	The file containing the value of the norm, generated during model training	String	True	—	false	—

Species Tagger

Category: Tagger
Framework: NaCTeM (UIMA)
Version: 1.0

Tags species

Stanford POS Tagger

Category: Tagger
Framework: GATE
Version: unknown

Stanford Part-of-Speech Tagger

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

baseSentenceAnnotationType

—

java.lang.String

—

Sentence

—

true

baseTokenAnnotationType

—

java.lang.String

—

Token

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

failOnMissingInputAnnotations

—

java.lang.Boolean

—

true

—

true

inputASName

—

java.lang.String

—

true

modelFile

—

java.net.URL

—

resources/english-left3words-distsim.tagger

—

outputASName

—

java.lang.String

—

true

outputAnnotationType

—

java.lang.String

—

Token

—

true

posTagAllTokens

—

java.lang.Boolean

—

true

—

true

useExistingTags

—

java.lang.Boolean

—

true

—

true

StanfordPosTagger

Category: Tagger
Framework: DKPro Core (UIMA)
Version: 1.8.0

Stanford Part-of-Speech tagger component.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types.	String	False	—	false	—
internTags	Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: false	Boolean	False	—	false	—
language	Use this language instead of the document language to resolve the model and tag set mapping.	String	False	—	false	—
maxSentenceLength	Sentences with more tokens than the specified maximum are ignored if this parameter is set to a value larger than zero. The default value of zero allows all sentences to be POS tagged.	Integer	False	—	false	—
modelLocation	Location from which the model is read.	String	False	—	false	—
modelVariant	Variant of a model. Used to address a specific model if there are multiple models for one language.	String	False	—	false	—
printTagSet	Log the tag set(s) when a model is loaded. Default: false	Boolean	True	—	false	—
ptb3Escaping	Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).	Boolean	True	—	false	—
quoteBegin	List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.	String	False	—	true	—
quoteEnd	List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.	String	False	—	true	—
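
The ptb3Escaping and maxSentenceLength parameters above can be illustrated with a minimal sketch. It covers only the bracket subset of the PTB3 transforms, and the names `ptb3_escape` and `should_tag` are hypothetical helpers, not part of the component.

```python
# Bracket escapes from the Penn Treebank conventions; the full PTB3
# transform set handles more cases than shown here.
PTB3_BRACKETS = {
    "(": "-LRB-", ")": "-RRB-",
    "[": "-LSB-", "]": "-RSB-",
    "{": "-LCB-", "}": "-RCB-",
}


def ptb3_escape(tokens):
    """Replace bracket tokens by their PTB3 escape; keep other tokens as-is."""
    return [PTB3_BRACKETS.get(token, token) for token in tokens]


def should_tag(tokens, max_sentence_length=0):
    """Return False only when a positive limit is set and the sentence exceeds it."""
    return max_sentence_length <= 0 or len(tokens) <= max_sentence_length


sentence = ["See", "(", "Fig.", "2", ")", "."]
print(ptb3_escape(sentence))
print(should_tag(sentence, max_sentence_length=5))
```

With the default limit of zero, `should_tag` accepts every sentence, which matches the parameter description.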

Stepp Tagger

Category: Tagger
Framework: NaCTeM (UIMA)
Version: 1.0

TreeTagger

Category: Tagger
Framework: AlvisNLP
Version: 2010-10-28

Runs tree-tagger.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

formFeature

—

java.lang.String

True

—

inputCharset

—

java.lang.String

True

—

lemmaFeature

—

java.lang.String

True

—

lexiconFile

—

org.bibliome.util.streams.SourceStream

False

—

noUnknownLemma

—

java.lang.Boolean

False

—

outputCharset

—

java.lang.String

True

—

parFile

—

org.bibliome.util.files.InputFile

True

—

posFeature

—

java.lang.String

True

—

recordCharset

—

java.lang.String

True

—

recordDir

—

org.bibliome.util.files.OutputDirectory

False

—

recordFeatures

—

java.lang.String[]

False

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sentenceLayerName

—

java.lang.String

True

—

treeTaggerExecutable

—

org.bibliome.util.files.ExecutableFile

True

—

wordLayerName

—

java.lang.String

True

—

TreeTaggerPosTagger

Category: Tagger
Framework: DKPro Core (UIMA)
Version: 1.8.0

Part-of-Speech and lemmatizer annotator using TreeTagger.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.	String	False	—	false	—
executablePath	Use this TreeTagger executable instead of trying to locate the executable automatically.	String	False	—	false	—
internTags	Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true	Boolean	False	—	false	—
language	Use this language instead of the document language to resolve the model.	String	False	—	false	—
modelEncoding	The character encoding used by the model.	String	False	—	false	—
modelLocation	Load the model from this location instead of locating the model automatically.	String	False	—	false	—
modelVariant	Override the default variant used to locate the model.	String	False	—	false	—
performanceMode	—	Boolean	True	—	false	—
printTagSet	Log the tag set(s) when a model is loaded. Default: false	Boolean	True	—	false	—
writeLemma	Write lemma information. Default: true	Boolean	True	—	false	—
writePOS	Write part-of-speech information. Default: true	Boolean	True	—	false	—

Twitter POS Tagger (EN)

Category: Tagger
Framework: GATE
Version: unknown

Stanford POS tagger trained on Tweets

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

baseSentenceAnnotationType

—

java.lang.String

—

Sentence

—

true

baseTokenAnnotationType

—

java.lang.String

—

Token

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

failOnMissingInputAnnotations

—

java.lang.Boolean

—

true

—

true

inputASName

—

java.lang.String

—

true

modelFile

—

java.net.URL

—

resources/pos/gate-EN-twitter.model

—

outputASName

—

java.lang.String

—

true

outputAnnotationType

—

java.lang.String

—

Token

—

true

posTagAllTokens

—

java.lang.Boolean

—

true

—

true

useExistingTags

—

java.lang.Boolean

—

true

—

true

UaicPosTagger

Category: Tagger
Framework: NaCTeM (UIMA)
Version: 1.0

Carries out sentence splitting, tokenisation, POS tagging and lemmatisation on plain text.

Topics (3)

MalletTopicModelEstimator

Category: Topics
Framework: DKPro Core (UIMA)
Version: 1.8.0

Estimates an LDA topic model using Mallet and writes it to a file. It stores all incoming CASes as Mallet Instances before estimating the model, using a ParallelTopicModel.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
alphaSum	The sum of alphas over all topics. Default: 1.0. Another recommended value is 50 / T (number of topics).	Float	True	—	false	—
beta	Beta for a single dimension of the Dirichlet prior. Default: 0.01.	Float	True	—	false	—
burninPeriod	The number of iterations before hyperparameter optimization begins. Default: 100	Integer	True	—	false	—
displayInterval	The interval in which to display the estimated topics. Default: 50.	Integer	True	—	false	—
displayNTopicWords	The number of top words to display during estimation. Default: 7.	Integer	True	—	false	—
minTokenLength	Ignore tokens (or lemmas, respectively) that are shorter than the given value. Default: 3.	Integer	True	—	false	—
modelEntityType	If specified, the text contained in the given segmentation type annotations is fed as separate units to the topic model estimator, e.g. de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.sentence. Text that is not within such annotations is ignored. By default, the full document text is used as a document.	String	False	—	false	—
nIterations	The number of iterations during model estimation. Default: 1000.	Integer	True	—	false	—
nThreads	The number of threads to use during model estimation. Default: 1.	Integer	True	—	false	—
nTopics	The number of topics to estimate for the topic model.	Integer	True	—	false	—
optimizeInterval	Interval for optimizing Dirichlet hyperparameters. Default: 50	Integer	True	—	false	—
randomSeed	Set random seed. If set to -1 (default), uses random generator.	Integer	True	—	false	—
saveInterval	Define how often to save a serialized model during estimation. Default: 0 (only save when estimation is done).	Integer	True	—	false	—
targetLocation	The target model file location.	String	True	—	false	—
typeName	The annotation type to use for the topic model. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token.	String	True	—	false	—
useLemma	If set, uses lemmas instead of original text as features.	Boolean	True	—	false	—
useSymmetricAlph	Use a symmetric alpha value during model estimation? Default: false.	Boolean	True	—	false	—

MalletTopicModelInferencer

Category: Topics
Framework: DKPro Core (UIMA)
Version: 1.8.0

Infers the topic distribution over documents using a Mallet ParallelTopicModel.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
burnIn	The number of iterations before hyperparameter optimization begins. Default: 1	Integer	True	—	false	—
maxTopicAssignments	Maximum number of topics to assign. If not set (or <= 0), the number of topics in the model divided by 10 is set.	Integer	True	—	false	—
minTokenLength	Ignore tokens (or lemmas, respectively) that are shorter than the given value. Default: 3.	Integer	True	—	false	—
minTopicProb	Minimum topic proportion for the document-topic assignment.	Float	True	—	false	—
modelLocation	—	String	True	—	false	—
nIterations	The number of iterations during inference. Default: 10.	Integer	True	—	false	—
thinning	—	Integer	True	—	false	—
typeName	The annotation type to use as tokens. Default: Token	String	True	—	false	—
useLemma	If set, uses lemmas instead of original text as features.	Boolean	True	—	false	—
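
As a rough illustration of how maxTopicAssignments and minTopicProb could combine, the sketch below picks topics from a document-topic distribution. How the component actually combines them, and the use of integer division for the default, are assumptions. `assign_topics` is a hypothetical name.

```python
def assign_topics(distribution, min_topic_prob, max_topic_assignments=0):
    """Return indices of the topics to assign, most probable first.

    `distribution` is a list of topic proportions for one document.
    """
    n_topics = len(distribution)
    if max_topic_assignments <= 0:
        # Fallback from the parameter description: number of topics / 10
        # (integer division is an assumption of this sketch).
        max_topic_assignments = n_topics // 10
    ranked = sorted(range(n_topics), key=lambda t: distribution[t], reverse=True)
    # Keep at most the top-k topics, and only those above the minimum proportion.
    return [t for t in ranked[:max_topic_assignments]
            if distribution[t] >= min_topic_prob]


proportions = [0.01] * 20
proportions[3], proportions[7], proportions[11] = 0.5, 0.3, 0.03
print(assign_topics(proportions, min_topic_prob=0.05))
```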

Textalytics Topics Extraction

http://textalytics.com/core/topics-1.2

Category: Topics
Framework: GATE
Version: unknown

Textalytics Topics Extraction

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

apiURL

—

java.lang.String

—

—

true

caseSensitive

—

java.lang.Boolean

—

true

context

—

java.lang.String

—

true

corpus

—

gate.Corpus

—

true

debug

—

java.lang.Boolean

—

true

dictionary

—

java.lang.String

—

true

disambiguationLevel

—

daedalus.textalytics.gate.param.DisambiguationLevel

—

strong_disambiguation

—

true

document

—

gate.Document

—

true

inputASTypes

—

java.util.List

—

true

inputASname

—

java.lang.String

—

true

key

—

java.lang.String

—

true

lang

—

java.lang.String

—

true

outputASname

—

java.lang.String

—

Textalytics

—

true

relaxedTypography

—

java.lang.Boolean

—

true

subTopics

—

java.lang.Boolean

—

true

timeref

—

java.lang.String

—

true

topicTypes

—

java.lang.String

—

true

udDictionaries

—

java.util.List

—

true

unknownWords

—

java.lang.Boolean

—

true

Validation (1)

Schema Enforcer

Category: Validation
Framework: GATE
Version: unknown

Produces an annotation set whose content is restricted by the specified set of schemas

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

outputASName

—

java.lang.String

—

true

schemas

—

java.util.List

—

true

useDefaults

—

java.lang.Boolean

—

false

—

true

Viewer/Editor (18)

Compound Document Editor

Category: Viewer/Editor
Framework: GATE
Version: unknown

Editor for compound documents.

GATE Ontology Editor

Category: Viewer/Editor
Framework: GATE
Version: unknown

Ontology editing tool.

GAZE

Category: Viewer/Editor
Framework: GATE
Version: unknown

Gazetteer viewer and editor

Gazetteer Editor

Category: Viewer/Editor
Framework: GATE
Version: unknown

Gazetteer viewer and editor.

JAPE-Plus Viewer

Category: Viewer/Editor
Framework: GATE
Version: unknown

A JAPE grammar file viewer

Jape Viewer

Category: Viewer/Editor
Framework: GATE
Version: unknown

A JAPE grammar file viewer

OAT

Category: Viewer/Editor
Framework: GATE
Version: unknown

Ontology Annotation Tool.

Pairbank Viewer

Category: Viewer/Editor
Framework: GATE
Version: unknown

viewer for the TermRaider Pairbank

RAT-C

Category: Viewer/Editor
Framework: GATE
Version: unknown

Relation Annotation Tool Class view.

RAT-I

Category: Viewer/Editor
Framework: GATE
Version: unknown

Relation Annotation Tool Instance view.

Schema Annotations Editor

Category: Viewer/Editor
Framework: GATE
Version: unknown

An annotation editor restricted by schemas.

Script Editor

Category: Viewer/Editor
Framework: GATE
Version: unknown

Editor for the Groovy script behind this PR

Shell

Category: Viewer/Editor
Framework: AlvisNLP
Version: 2012-04-30

Starts an interactive shell for querying the corpus data structure.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

historyFile

—

org.bibliome.util.files.OutputFile

False

—

prompt

—

java.lang.String

True

—

Shell2

Category: Viewer/Editor
Framework: AlvisNLP
Version:

Starts an interactive shell for querying the corpus data structure.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

constantAnnotationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantDocumentFeatures

—

alvisnlp.module.types.Mapping

False

—

constantRelationFeatures

—

alvisnlp.module.types.Mapping

False

—

constantSectionFeatures

—

alvisnlp.module.types.Mapping

False

—

constantTupleFeatures

—

alvisnlp.module.types.Mapping

False

—

Simple Schema Viewer

Category: Viewer/Editor
Framework: GATE
Version: unknown

A Simple Annotation Schema Viewer

Syntax tree viewer

Category: Viewer/Editor
Framework: GATE
Version: unknown

Viewer for syntax trees generated by a parser.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

tokenType

—

java.lang.String

—

Token

—

false

treeNodeAnnotationType

—

java.lang.String

—

SyntaxTreeNode

—

false

Termbank Viewer

Category: Viewer/Editor
Framework: GATE
Version: unknown

viewer for the TermRaider Termbank

WordNet Viewer

Category: Viewer/Editor
Framework: GATE
Version: unknown

WordNet viewer

Writer (64)

ADBWriter

Category: Writer
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

annotationType

—

alvisnlp.corpus.expressions.Expression

False

—

annotations

—

alvisnlp.corpus.expressions.Expression

False

—

aspectId

—

alvisnlp.corpus.expressions.Expression

True

—

docScopeAnnType

—

java.lang.String[]

False

—

documents

—

alvisnlp.corpus.expressions.Expression

True

—

fragments

—

alvisnlp.corpus.expressions.Expression

False

—

groups

—

alvisnlp.corpus.expressions.Expression

False

—

password

—

java.lang.String

True

—

relations

—

alvisnlp.corpus.expressions.Expression

False

—

schema

—

java.lang.String

False

—

sections

—

alvisnlp.corpus.expressions.Expression

True

—

toDocScopeAnnotation

—

alvisnlp.corpus.expressions.Expression[]

False

—

url

—

java.lang.String

True

—

username

—

java.lang.String

True

—

AlvisDBIndexer

Category: Writer
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

append

—

java.lang.Boolean

False

—

elements

—

org.bibliome.alvisnlp.modules.alvisdb.ADBElements[]

True

—

indexDir

—

org.bibliome.util.files.OutputDirectory

True

—

AlvisIRIndexer

Category: Writer
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

clearIndex

—

java.lang.Boolean

True

—

documents

—

org.bibliome.alvisnlp.modules.alvisir2.IndexedDocuments

True

—

fieldNames

—

java.lang.String[]

True

—

indexDir

—

org.bibliome.util.files.OutputDirectory

True

—

propertyKeys

—

java.lang.String[]

True

—

recordGlobalIndexAttributes

—

java.lang.Boolean

True

—

relations

—

alvisnlp.module.types.MultiMapping

True

—

tokenPositionGap

—

java.lang.Integer

True

—

BIO Format Writer Cas Consumer

Category: Writer
Framework: NaCTeM (UIMA)
Version: 1.0

Writes specified types of annotations to the specified directory in the BIO format. The BIO format has one line per token (token [tab] label) and an empty line at the end of each sentence. If SentencePerLine is true, it has one line per sentence, with tokens separated by spaces and each token followed by its label, like "token|label". The label is one of O, B-suffix, I-suffix. The suffix should be specified as a string list mapping each fully qualified type name to its suffix, separated by a comma, e.g. "org.u_compare.syntactic.Sentence,Sent". Sentence and Token annotations are required.
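
The two layouts described above can be sketched as follows, assuming tokens are already paired with their labels. The function `to_bio` and its arguments are illustrative only and are not part of the component.

```python
def to_bio(sentences, sentence_per_line=False):
    """Render labelled sentences in the BIO layout.

    `sentences` is a list of sentences, each a list of (token, label)
    pairs whose label is "O", "B-<suffix>" or "I-<suffix>".
    """
    lines = []
    for sentence in sentences:
        if sentence_per_line:
            # One line per sentence: "token|label" items separated by spaces.
            lines.append(" ".join(f"{tok}|{lab}" for tok, lab in sentence))
        else:
            # One line per token: token, tab, label.
            lines.extend(f"{tok}\t{lab}" for tok, lab in sentence)
            # An empty line closes each sentence.
            lines.append("")
    return "\n".join(lines) + "\n"


example = [[("Del", "B-PER"), ("Bosque", "I-PER"), ("played", "O")]]
print(to_bio(example))
print(to_bio(example, sentence_per_line=True))
```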

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
SentencePerLine	If true, merges one sentence into one line with a space as delimiter.	Boolean	True	—	false	—
TypeToBioSuffixMap	Fully qualified type name, comma, suffix string	String	True	—	true	—
outputDir	Output directory	String	True	—	false	—

BinaryCasWriter

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Write CAS in one of the UIMA binary formats.

Supported formats
Format	Description	Type system on load	CAS Addresses preserved
S	CAS structures are dumped to disc as they are using Java serialization (CASSerializer). Because these structures are pre-allocated in memory at larger sizes than what is actually required, files in this format may be larger than necessary. However, the CAS addresses of feature structures are preserved in this format. When the data is loaded back into a CAS, it must have been initialized with the same type system as the original CAS.	must be the same	yes
S+	CAS structures are dumped to disc as they are using Java serialization as in format 0, but now using the CASCompleteSerializer which includes CAS metadata like type system and index repositories.	is reinitialized	yes
0	CAS structures are dumped to disc as they are using Java serialization (CASSerializer). This is basically the same as format S but includes a UIMA header and can be read using org.apache.uima.cas.impl.Serialization#deserializeCAS.	must be the same	yes
4	UIMA binary serialization saving all feature structures (reachable or not). This format internally uses gzip compression and a binary representation of the CAS, making it much more efficient than format 0.	must be the same	yes
6	UIMA binary serialization as format 4, but saving only reachable feature structures.	must be the same	no
6+	UIMA binary serialization as format 6, but also contains the type system definition. This allows the BinaryCasReader to load data leniently into a CAS that has been initialized with a different type system.	lenient loading	no
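
When choosing a value for the format parameter, the last two columns of the table above are usually what matters. The lookup below only restates those columns as data. The dictionary layout and the helper `formats_where` are hypothetical, not part of DKPro Core.

```python
# "Type system on load" and "CAS addresses preserved" columns of the
# Supported formats table, keyed by format name.
FORMATS = {
    "S":  {"type_system_on_load": "must be the same", "addresses_preserved": True},
    "S+": {"type_system_on_load": "is reinitialized", "addresses_preserved": True},
    "0":  {"type_system_on_load": "must be the same", "addresses_preserved": True},
    "4":  {"type_system_on_load": "must be the same", "addresses_preserved": True},
    "6":  {"type_system_on_load": "must be the same", "addresses_preserved": False},
    "6+": {"type_system_on_load": "lenient loading", "addresses_preserved": False},
}


def formats_where(**criteria):
    """List the format names whose table columns match all given criteria."""
    return [name for name, props in FORMATS.items()
            if all(props[key] == value for key, value in criteria.items())]


print(formats_where(addresses_preserved=True))
print(formats_where(type_system_on_load="lenient loading"))
```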

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
compression	Choose a compression method. (default: CompressionMethod#NONE)	String	False	—	false	—
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)	Boolean	True	—	false	—
filenameExtension	—	String	True	—	false	—
format	—	String	True	—	false	—
overwrite	Allow overwriting target files (ignored when writing to ZIP archives).	Boolean	True	—	false	—
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.	Boolean	True	—	false	—
stripExtension	Remove the original extension.	Boolean	True	—	false	—
targetLocation	Target location. If this parameter is not set, data is written to stdout.	String	False	—	false	—
typeSystemLocation	Location to write the type system to. The type system is saved using Java serialization; it is not saved as an XML type system description. We recommend using the name typesystem.ser. The #PARAM_COMPRESSION parameter has no effect on the type system. Instead, whether the type system file is compressed is detected from the file name extension (e.g. ".gz"). If this parameter is set, the type system and index repository are no longer serialized into the same file as the rest of the CAS. The SerializedCasReader currently cannot read such files. Use this only if you really know what you are doing. This parameter has no effect if formats S+ or 6+ are used, as the type system information is embedded in each individual file. Otherwise, it is recommended that this parameter be set unless some other mechanism is used to initialize the CAS with the same type system and index repository during reading that was used during writing.	String	False	—	false	—
useDocumentId	Use the document ID as file name even if relative path information is present.	Boolean	True	—	false	—
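
The effect of escapeDocumentId can be approximated with the sketch below. It uses Python's standard percent-encoding as a stand-in; the component itself runs on Java, so its exact output (for example, how spaces are encoded) may differ. `escape_document_id` is a hypothetical name.

```python
from urllib.parse import quote


def escape_document_id(document_id):
    """Percent-encode a document ID so it is usable as a single file name."""
    # safe="" also encodes "/", so path separators cannot leak into the name.
    return quote(document_id, safe="")


print(escape_document_id("corpus\\part:1/doc.txt"))
```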

BioC Writer

Category: Writer
Framework: NaCTeM (UIMA)
Version: 1.0

Writes BioC annotations to files. Each output file will consist of a single document only. BioC website: http://bioc.sourceforge.net/

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
outputFile	A path to a file where an entire collection will be written to.	String	True	—	false	—

BioNLP ST Data Writer

Category: Writer
Framework: NaCTeM (UIMA)
Version: 1.0

Writes BioNLP entity and event annotations to files.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
OutputFolder	A folder where BioNLP ST files will be written to.	String	True	—	false	—

BratWriter

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Writer for the brat annotation format.

Known issues:

Brat is unable to read relation attributes created by this writer.
PARAM_TYPE_MAPPINGS not implemented yet

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
compression	Choose a compression method. (default: CompressionMethod#NONE)	String	False	—	false	—
enableTypeMappings	Enable type mappings.	Boolean	True	—	false	—
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)	Boolean	True	—	false	—
excludeTypes	Types that will not be written to the exported file.	String	True	—	true	—
filenameSuffix	Specify the suffix of output files. Default value .ann. If the suffix is not needed, provide an empty string as value.	String	True	—	false	—
overwrite	Allow overwriting target files (ignored when writing to ZIP archives).	Boolean	True	—	false	—
palette	Colors to be used for the visual configuration that is generated for brat.	String	False	—	true	—
relationTypes	—	String	True	—	true	—
shortAttributeNames	Whether to render attributes by their short name or by their qualified name.	Boolean	True	—	false	—
singularTarget	—	Boolean	True	—	false	—
spanTypes	Types that are text annotations (aka entities or spans).	String	True	—	true	—
stripExtension	Remove the original extension.	Boolean	True	—	false	—
targetLocation	Target location. If this parameter is not set, data is written to stdout.	String	False	—	false	—
textFilenameSuffix	Specify the suffix of text output files. Default value .txt. If the suffix is not needed, provide an empty string as value.	String	True	—	false	—
typeMappings	FIXME	String	False	—	true	—
useDocumentId	Use the document ID as file name even if relative path information is present.	Boolean	True	—	false	—
writeNullAttributes	Enable writing of features with null values.	Boolean	True	—	false	—
writeRelationAttributes	The brat web application currently cannot handle attributes on relations, thus they are disabled by default. Here they can be enabled again.	Boolean	True	—	false	—

CasDumpWriter

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Dumps CAS content to a text file. This is useful when setting up test cases which contain a reference output to which an actually produced CAS is compared. The format produced by this component is more easily comparable than the XCAS or XMI formats.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
featurePatterns	Include/exclude features according to the following patterns. Mind that the patterns do not actually match feature names but lines produced by FeatureStructure.toString().	String	True	—	true	—
sort	Sort increasing by begin, decreasing by end, increasing by name instead of relying on index order.	Boolean	True	—	false	—
targetLocation	Output file. If multiple CASes are processed, their contents are concatenated into this file. Mind that a test case using this consumer with multiple CASes requires a reader which always produces the CASes in the same order. When this file is set to "-", the dump goes to System#out (default).	String	True	—	false	—
typePatterns	Include/exclude specified UIMA types in the output.	String	True	—	true	—
writeDocumentMetaData	Whether to dump the content of the CAS#getDocumentAnnotation().	Boolean	True	—	false	—
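
The ordering named by the sort parameter (increasing begin, decreasing end, increasing name) can be written as a single sort key. This is a sketch over plain dictionaries; the field names `begin`, `end` and `name` and the helper `dump_order` are assumptions for illustration.

```python
def dump_order(annotations):
    """Sort by begin ascending, then end descending, then name ascending."""
    return sorted(annotations, key=lambda a: (a["begin"], -a["end"], a["name"]))


annotations = [
    {"begin": 0, "end": 5, "name": "Token"},
    {"begin": 0, "end": 12, "name": "Sentence"},
    {"begin": 0, "end": 5, "name": "Lemma"},
    {"begin": 6, "end": 12, "name": "Token"},
]
for annotation in dump_order(annotations):
    print(annotation["begin"], annotation["end"], annotation["name"])
```

Longer annotations come first among those starting at the same offset, so a covering annotation is dumped before the annotations it contains.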

CoNLL2007 Cas Consumer

Category: Writer
Framework: ILSP (UIMA)
Version: 1.12

Writes sentences from the CAS in the CoNLL 2007 format.

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
AppendExt	Extension to be appended to the output files.	String	False	—	false	—
OutputDirectory	Directory where the output files will be written	String	True	—	false	—
PrintDepRels	If true, prints dependency relations	Boolean	False	—	false	—
StripExt	Extension to be stripped from the input files.	String	False	—	false	—

Configurable Exporter

Category: Writer
Framework: GATE
Version: unknown

Allows annotations to be exported according to a specified format.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

configFileURL

—

java.net.URL

—

resources/configurableexporter/example.conf

—

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

instanceName

—

java.lang.String

—

true

outputURL

—

java.net.URL

—

true

Conll2000Writer

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Writes the CoNLL 2000 chunking format. The columns are separated by spaces.


He        PRP  B-NP
reckons   VBZ  B-VP
the       DT   B-NP
current   JJ   I-NP
account   NN   I-NP
deficit   NN   I-NP
will      MD   B-VP
narrow    VB   I-VP
to        TO   B-PP
only      RB   B-NP
#         #    I-NP
1.8       CD   I-NP
billion   CD   I-NP
in        IN   B-PP
September NNP  B-NP
.         .    O

FORM - token
POSTAG - part-of-speech tag
CHUNK - chunk (BIO encoded)

Sentences are separated by a blank new line.
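
Reading this layout back is straightforward. The sketch below splits text into sentences of (FORM, POSTAG, CHUNK) triples; `read_conll2000` is a hypothetical helper, and it assumes that no column value contains whitespace.

```python
def read_conll2000(text):
    """Split CoNLL 2000 text into sentences of (FORM, POSTAG, CHUNK) triples."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        form, postag, chunk = line.split()
        current.append((form, postag, chunk))
    if current:
        # The last sentence may not be followed by a blank line.
        sentences.append(current)
    return sentences


sample = "He PRP B-NP\nreckons VBZ B-VP\n\nwill MD B-VP\n. . O\n"
for sentence in read_conll2000(sample):
    print(sentence)
```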

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
compression	Choose a compression method. (default: CompressionMethod#NONE)	String	False	—	false	—
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)	Boolean	True	—	false	—
filenameSuffix	—	String	True	—	false	—
overwrite	Allow overwriting target files (ignored when writing to ZIP archives).	Boolean	True	—	false	—
singularTarget	—	Boolean	True	—	false	—
sourceEncoding	Name of configuration parameter that contains the character encoding used by the input files.	String	True	—	false	—
stripExtension	Remove the original extension.	Boolean	True	—	false	—
targetLocation	Target location. If this parameter is not set, data is written to stdout.	String	False	—	false	—
useDocumentId	Use the document ID as file name even if relative path information is present.	Boolean	True	—	false	—
writeChunk	—	Boolean	True	—	false	—
writePOS	—	Boolean	True	—	false	—

Conll2002Writer

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Writes the CoNLL 2002 named entity format. The columns are separated by a single space, unlike in the illustration below.


Wolff      B-PER
,          O
currently  O
a          O
journalist O
in         O
Argentina  B-LOC
,          O
played     O
with       O
Del        B-PER
Bosque     I-PER
in         O
the        O
final      O
years      O
of         O
the        O
seventies  O
in         O
Real       B-ORG
Madrid     I-ORG
.          O

FORM - token
NER - named entity (BIO encoded)

Sentences are separated by a blank new line.
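
To show how the BIO-encoded NER column is meant to be read, the sketch below groups the rows of one sentence into entities. `bio_entities` is a hypothetical helper; dropping an I- tag that does not continue an open entity of the same type is a choice of this sketch, not documented behaviour.

```python
def bio_entities(rows):
    """Group (FORM, NER) rows of one sentence into (type, text) entities."""
    entities, tokens, etype = [], [], None
    for form, tag in rows:
        if tag.startswith("I-") and etype == tag[2:]:
            # Continues the entity that is currently open.
            tokens.append(form)
            continue
        if tokens:
            # Any other tag closes the open entity.
            entities.append((etype, " ".join(tokens)))
            tokens, etype = [], None
        if tag.startswith("B-"):
            tokens, etype = [form], tag[2:]
    if tokens:
        entities.append((etype, " ".join(tokens)))
    return entities


rows = [("Wolff", "B-PER"), (",", "O"), ("in", "O"), ("Argentina", "B-LOC"),
        ("played", "O"), ("with", "O"), ("Del", "B-PER"), ("Bosque", "I-PER"),
        ("in", "O"), ("Real", "B-ORG"), ("Madrid", "I-ORG"), (".", "O")]
print(bio_entities(rows))
```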

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
compression	Choose a compression method. (default: CompressionMethod#NONE)	String	False	—	false	—
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)	Boolean	True	—	false	—
filenameSuffix	—	String	True	—	false	—
overwrite	Allow overwriting target files (ignored when writing to ZIP archives).	Boolean	True	—	false	—
singularTarget	—	Boolean	True	—	false	—
sourceEncoding	Name of configuration parameter that contains the character encoding used by the input files.	String	True	—	false	—
stripExtension	Remove the original extension.	Boolean	True	—	false	—
targetLocation	Target location. If this parameter is not set, data is written to stdout.	String	False	—	false	—
useDocumentId	Use the document ID as file name even if relative path information is present.	Boolean	True	—	false	—
writeNamedEntity	—	Boolean	True	—	false	—

Conll2006Writer

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Writes a file in the CoNLL-2006 format (aka CoNLL-X).


Heutzutage heutzutage ADV _ _ ADV _ _

ID - token number in sentence
FORM - token
LEMMA - lemma
CPOSTAG - part-of-speech tag (coarse grained)
POSTAG - part-of-speech tag
FEATS - unused
HEAD - target token for a dependency parsing
DEPREL - function of the dependency parsing
PHEAD - unused
PDEPREL - unused

Sentences are separated by a blank new line
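
The ten columns listed above can be emitted with a small formatter. This sketch assumes tab-separated columns, as in the CoNLL-X shared task data, and writes "_" for unused or missing values. `conll2006_row` and the sample HEAD and DEPREL values are hypothetical.

```python
COLUMNS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
           "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]


def conll2006_row(token):
    """Format one token (a dict keyed by column name) as a CoNLL-2006 line."""
    cells = []
    for column in COLUMNS:
        value = token.get(column)
        # Test for None explicitly so that a HEAD of 0 is kept.
        cells.append("_" if value is None else str(value))
    return "\t".join(cells)


token = {"ID": 1, "FORM": "Heutzutage", "LEMMA": "heutzutage",
         "CPOSTAG": "ADV", "POSTAG": "ADV", "HEAD": 0, "DEPREL": "ROOT"}
print(conll2006_row(token))
```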

Parameter	Description	Type	Mandatory	Default Value	Multi-value	Runtime
compression	Choose a compression method. (default: CompressionMethod#NONE)	String	False	—	false	—
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)	Boolean	True	—	false	—
filenameSuffix	—	String	True	—	false	—
overwrite	Allow overwriting target files (ignored when writing to ZIP archives).	Boolean	True	—	false	—
singularTarget	—	Boolean	True	—	false	—
sourceEncoding	Name of configuration parameter that contains the character encoding used by the input files.	String	True	—	false	—
stripExtension	Remove the original extension.	Boolean	True	—	false	—
targetLocation	Target location. If this parameter is not set, data is written to stdout.	String	False	—	false	—
useDocumentId	Use the document ID as file name even if relative path information is present.	Boolean	True	—	false	—
writeDependency	—	Boolean	True	—	false	—
writeLemma	—	Boolean	True	—	false	—
writeMorph	—	Boolean	True	—	false	—
writePOS	—	Boolean	True	—	false	—

Conll2009Writer

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Writes a file in the CoNLL-2009 format.

ID - (ignored) Token counter, starting at 1 for each new sentence.
FORM - (Token) Word form or punctuation symbol.
LEMMA - (Lemma) Lemma of FORM.
PLEMMA - (ignored) Automatically predicted lemma of FORM
POS - (POS) Fine-grained part-of-speech tag, where the tagset depends on the language, or identical to the coarse-grained part-of-speech tag if not available.
PPOS - (ignored) Automatically predicted major POS by a language-specific tagger
FEAT - (MorphologicalFeatures) Unordered set of syntactic and/or morphological features (depending on the particular language), separated by a vertical bar (|), or an underscore if not available.
PFEAT - (ignored) Automatically predicted morphological features (if applicable)
HEAD - (Dependency) Head of the current token, which is either a value of ID or zero ('0'). Note that depending on the original treebank annotation, there may be multiple tokens with a HEAD of zero.
PHEAD - (ignored) Automatically predicted syntactic head
DEPREL - (Dependency) Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'.
PDEPREL - (ignored) Automatically predicted dependency relation to PHEAD
FILLPRED - (auto-generated) Contains 'Y' for argument-bearing tokens
PRED - (SemanticPredicate) (sense) identifier of a semantic 'predicate' coming from a current token
APREDs - (SemanticArgument) Columns with argument labels for each semantic predicate (in the ID order)

Sentences are separated by a blank line.
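
As a small illustration of the FEAT column described above (features joined by a vertical bar, or an underscore when none are available), consider the sketch below. The class name and the feature strings in the usage note are hypothetical.

```java
import java.util.Collection;

final class FeatColumn {
    /** Joins morphological features with '|', or returns "_" if there are none. */
    static String render(Collection<String> features) {
        return features.isEmpty() ? "_" : String.join("|", features);
    }
}
```

A token with the features case=nom and num=sg is written as case=nom|num=sg; a token without features is written as _.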

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

compression

Choose a compression method. (default: CompressionMethod#NONE)

String

False

—

false

—

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Boolean

True

—

false

—

filenameSuffix

—

String

True

—

false

—

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Boolean

True

—

false

—

singularTarget

—

Boolean

True

—

false

—

sourceEncoding

Name of configuration parameter that contains the character encoding used by the input files.

String

True

—

false

—

stripExtension

Remove the original extension.

Boolean

True

—

false

—

targetLocation

Target location. If this parameter is not set, data is written to stdout.

String

False

—

false

—

useDocumentId

Use the document ID as the file name even if relative path information is present.

Boolean

True

—

false

—

writeDependency

—

Boolean

True

—

false

—

writeLemma

—

Boolean

True

—

false

—

writeMorph

—

Boolean

True

—

false

—

writePOS

—

Boolean

True

—

false

—

writeSemanticPredicate

—

Boolean

True

—

false

—

Conll2012Writer

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Writer for the CoNLL-2012 format.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

compression

Choose a compression method. (default: CompressionMethod#NONE)

String

False

—

false

—

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Boolean

True

—

false

—

filenameSuffix

—

String

True

—

false

—

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Boolean

True

—

false

—

singularTarget

—

Boolean

True

—

false

—

sourceEncoding

Name of configuration parameter that contains the character encoding used by the input files.

String

True

—

false

—

stripExtension

Remove the original extension.

Boolean

True

—

false

—

targetLocation

Target location. If this parameter is not set, data is written to stdout.

String

False

—

false

—

useDocumentId

Use the document ID as the file name even if relative path information is present.

Boolean

True

—

false

—

writeLemma

—

Boolean

True

—

false

—

writePOS

—

Boolean

True

—

false

—

writeSemanticPredicate

—

Boolean

True

—

false

—

EnrichedDocumentWriter

Category: Writer
Framework: AlvisNLP
Version: 2010-10-28

Writes the corpus in the infamous Alvis Enriched Document Format suitable for indexation with Zebra-Alvis.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

blockSize

—

java.lang.Integer

True

—

blockStart

—

java.lang.Integer

True

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

idMetaFeature

—

java.lang.String

True

—

lemmaFeature

—

java.lang.String

True

—

metaTrans

—

alvisnlp.module.types.Mapping

True

—

neCanonicalFormFeature

—

java.lang.String

True

—

neLayerName

—

java.lang.String

True

—

neTypeFeature

—

java.lang.String

True

—

outDir

—

org.bibliome.util.files.OutputDirectory

True

—

outFilePrefix

—

java.lang.String

True

—

outFileSuffix

—

java.lang.String

True

—

posFeature

—

java.lang.String

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

semanticFeature

—

java.lang.String

False

—

sentenceLayerName

—

java.lang.String

True

—

termCanonicalFormFeature

—

java.lang.String

True

—

termLayerName

—

java.lang.String

True

—

tokenLayerName

—

java.lang.String

True

—

tokenTypeFeature

—

java.lang.String

True

—

urlPrefix

—

java.lang.String

True

—

urlSuffixFeature

—

java.lang.String

True

—

wordLayerName

—

java.lang.String

True

—

ExportAlignmentPR

Category: Writer
Framework: GATE
Version: unknown

A PR to export alignment information in an xml file.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

inputASName

—

java.lang.String

—

true

outputDirectory

—

java.net.URL

—

true

parentOfUnitOfAlignment

—

java.lang.String

—

Sentence

—

true

parentOfUnitOfAlignmentFeatureName

—

java.lang.String

—

sentence-alignment

—

true

sourceDocumentID

—

java.lang.String

—

true

targetDocumentID

—

java.lang.String

—

true

unitAlignmentFeatureName

—

java.lang.String

—

word-alignment

—

true

unitOfAlignment

—

java.lang.String

—

Token

—

true

ExportCadixeJSON

Category: Writer
Framework: AlvisNLP
Version: 2012-04-30

Writes each document in a file in the AlvisAE protocol format.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

annotationSets

—

org.bibliome.alvisnlp.modules.cadixe.AnnotationSet[]

True

—

documentDescription

—

alvisnlp.corpus.expressions.Expression

True

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

documentProperties

—

alvisnlp.module.types.ExpressionMapping

False

—

fileName

—

alvisnlp.corpus.expressions.Expression

True

—

outDir

—

org.bibliome.util.files.OutputDirectory

True

—

owner

—

java.lang.Integer

True

—

schemaFile

—

org.bibliome.util.files.InputFile

False

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

ExpressionExtract

Category: Writer
Framework: AlvisNLP
Version: 2012-04-30

Write elements in a tab separated file.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

fields

—

alvisnlp.corpus.expressions.Expression[]

True

—

headers

—

java.lang.String[]

False

—

outFile

—

org.bibliome.util.streams.TargetStream

True

—

target

—

alvisnlp.corpus.expressions.Expression

True

—

Factored Tag Lem Consumer

Category: Writer
Framework: ILSP (UIMA)
Version: 1.2

Writes sentences from the CAS in the Factored Tag Lem format

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

AppendExt

Extension to be appended to the output files.

String

False

—

false

—

OutputDirectory

Directory where the output files will be written

String

True

—

false

—

StripExt

Extension to be stripped from the input files.

String

False

—

false

—

Fast Infoset Exporter

Category: Writer
Framework: GATE
Version: unknown

Export GATE documents to GATE XML stored in the binary Fast Infoset format

FillDB

Category: Writer
Framework: AlvisNLP
Version: 2012-04-30

Stores the corpus into a SQL database.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

jdbcDriver

—

java.lang.String

True

—

password

—

java.lang.String

True

—

schema

—

java.lang.String

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

url

—

java.lang.String

True

—

username

—

java.lang.String

True

—

Flexible Exporter

Category: Writer
Framework: GATE
Version: unknown

Exports a document with GATE annotations to its original format.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

annotationTypes

—

java.util.ArrayList

—

Person;Location;Date

—

true

document

—

gate.Document

—

true

dumpTypes

—

java.util.ArrayList

—

Person;Location;Date

—

true

includeFeatures

—

java.lang.Boolean

—

false

—

outputDirectoryUrl

—

java.net.URL

—

true

suffixForDumpFiles

—

java.lang.String

—

.gate

—

useStandOffXML

—

java.lang.Boolean

—

false

—

useSuffixForDumpFiles

—

java.lang.Boolean

—

true

—

GATE JSON Exporter

Category: Writer
Framework: GATE
Version: unknown

Export documents and corpora in JSON format

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationTypes

—

java.util.Set

—

true

documentAnnotationASName

—

java.lang.String

—

Original markups

—

true

documentAnnotationType

—

java.lang.String

—

true

entitiesAnnotationSetName

—

java.lang.String

—

true

exportAsArray

—

java.lang.Boolean

—

false

—

true

GATE XML Writer CAS Consumer

Category: Writer
Framework: ILSP (UIMA)
Version: 1.0

Writes the CAS to GATE XML format

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

AppendExt

Extension to be appended to the output files.

String

False

—

false

—

OutputDirectory

Directory where the XML files will be written

String

True

—

false

—

StripExt

Extension to be stripped from the input files.

String

False

—

false

—

GeniaWriter

Category: Writer
Framework: AlvisNLP
Version: 2012-04-30

Writes each section in three files in the BioNLP challenge format.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

dependencies

—

alvisnlp.corpus.expressions.Expression

False

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

entities

—

alvisnlp.module.types.ExpressionMapping

True

—

entityForm

—

alvisnlp.corpus.expressions.Expression

True

—

eventExtra

—

alvisnlp.corpus.expressions.Expression

False

—

events

—

alvisnlp.module.types.ExpressionMapping

True

—

fileName

—

alvisnlp.corpus.expressions.Expression

True

—

labelFeature

—

java.lang.String

False

—

outputDir

—

org.bibliome.util.files.OutputDirectory

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sentenceForm

—

alvisnlp.corpus.expressions.Expression

True

—

sentences

—

alvisnlp.corpus.expressions.Expression

False

—

wordForm

—

alvisnlp.corpus.expressions.Expression

True

—

words

—

alvisnlp.corpus.expressions.Expression

False

—

HTML5 Microdata Exporter

Category: Writer
Framework: GATE
Version: unknown

Exports Annotations as HTML5 Microdata

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

configURL

—

java.net.URL

—

resources/schema.org/ANNIE.xml

—

true

ILSP GrAF Consumer

Category: Writer
Framework: ILSP (UIMA)
Version: 0.9

Writes sentences from the CAS to GrAF standoff format.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

AppendExt

Extension to be appended to the output files.

String

False

—

false

—

OutputChunkFile

—

String

False

—

false

—

OutputDepFile

—

String

False

—

false

—

OutputDirectory

Directory where the XML files will be written

String

False

—

false

—

OutputDotFile

—

String

False

—

false

—

OutputHeaderFile

—

String

False

—

false

—

OutputNerFile

—

String

False

—

false

—

OutputPosFile

—

String

False

—

false

—

OutputRegFile

—

String

False

—

false

—

OutputSegFile

—

String

False

—

false

—

OutputSentFile

—

String

False

—

false

—

OutputTxtFile

—

String

False

—

false

—

StripExt

Extension to be stripped from the input files.

String

False

—

false

—

ILSP PML Cas Consumer

Category: Writer
Framework: ILSP (UIMA)
Version: 0.9

Writes sentences from the CAS in the Prague Markup Language format for editing dependency structures in TrEd

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

AppendExt

Extension to be appended to the output files.

String

False

—

false

—

OutputDirectory

Directory where the output files will be written

String

True

—

false

—

StripExt

Extension to be stripped from the input files.

String

False

—

false

—

ILSP XCES Consumer

Category: Writer
Framework: ILSP (UIMA)
Version: 0.9

Writes sentences from the CAS to the XCES format

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

AppendExt

Extension to be appended to the output files.

String

False

—

false

—

OutputDirectory

Directory where the XML files will be written

String

True

—

false

—

StripExt

Extension to be stripped from the input files.

String

False

—

false

—

ILSP Xmi Writer CAS Consumer

Category: Writer
Framework: ILSP (UIMA)
Version: 0.9

Serializes the CAS to XMI.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

AppendExt

Extension to be appended to the output files.

String

False

—

false

—

OutputDirectory

Directory where the XMI files will be written

String

True

—

false

—

StripExt

Extension to be stripped from the input files.

String

False

—

false

—

ImsCwbWriter

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

This consumer writes the content of all CASes in the IMS Corpus Workbench (CWB) format. The writer produces a text file that must be converted to the binary IMS CWB index files using the command line tools that come with the CWB. Setting the cqpHome parameter makes the writer create output directly in the native binary CQP format via the original CWB command line tools.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

additionalFeatures

Write additional token-level annotation features. These have to be given as an array of fully qualified feature paths (fully.qualified.classname/featureName). The names for these annotations in CQP are their lowercase shortnames.

String

False

—

true

—

corpusName

The name of the generated corpus.

String

True

—

false

—

cqpCompress

Set this parameter to compress the token streams and the indexes using cwb-huffcode and cwb-compress-rdx. With modern hardware, this may actually slow down queries, so it is turned off by default. For large data sets, it is best to test which setting works better. (default: false)

Boolean

True

—

false

—

cqpHome

Set this parameter to the directory containing the cwb-encode and cwb-makeall commands if you want the writer to encode directly into the CQP binary format.

String

False

—

false

—

cqpwebCompatibility

Make document IDs compatible with CQPweb. CQPweb demands an id consisting of only letters, numbers and underscore.

Boolean

True

—

false

—

sentenceTag

—

String

True

—

false

—

targetEncoding

Character encoding of the output data.

String

True

—

false

—

targetLocation

Location to which the output is written.

String

True

—

false

—

writeCPOS

Write coarse-grained part-of-speech tags. These are the simple names of the UIMA types used to represent the part-of-speech tag.

Boolean

True

—

false

—

writeDocId

Write the document ID for each token. It is usually a better idea to generate a document tag (writeDocumentTag) or a text tag (writeTextTag), which also contain the document ID and can be queried in CQP.

Boolean

True

—

false

—

writeDocumentTag

Write a pseudo-XML tag with the name document to mark the start and end of a document.

Boolean

True

—

false

—

writeLemma

Write lemmata.

Boolean

True

—

false

—

writeOffsets

Write the start and end position of each token.

Boolean

True

—

false

—

writePOS

Write part-of-speech tags.

Boolean

True

—

false

—

writeTextTag

Write a pseudo-XML tag with the name text to mark the start and end of a document. This is used by CQPweb.

Boolean

True

—

false

—
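
The cqpwebCompatibility parameter above concerns document IDs made up only of letters, numbers and underscore. One way to picture such a normalization is sketched below; the class name is hypothetical, the restriction to ASCII letters and the choice of underscore as replacement character are assumptions, and the writer's real mapping may differ.

```java
final class CqpwebId {
    /** Replaces every character outside [A-Za-z0-9_] with an underscore. */
    static String sanitize(String documentId) {
        return documentId.replaceAll("[^A-Za-z0-9_]", "_");
    }
}
```

With this sketch, corpus/doc-1.txt becomes corpus_doc_1_txt.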

InlineXmlWriter

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Writes an approximation of the content of a textual CAS as an inline XML file. Optionally applies an XSLT stylesheet.

Note this component inherits the restrictions from CasToInlineXml:

Features whose values are FeatureStructures are not represented.
Feature values which are strings longer than 64 characters are truncated.
Feature values which are arrays of primitives are represented by strings that look like [ xxx, xxx ]
The Subject of analysis is presumed to be a text string.
Some characters in the document's Subject-of-analysis are replaced by blanks, because the characters aren't valid in xml documents.
It doesn't work for annotations which are overlapping, because these cannot be represented as properly nested XML.
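
Two of the restrictions listed above can be pictured with a short sketch: string values longer than 64 characters are cut, and arrays of primitives are rendered in the [ xxx, xxx ] style. The class and method names are hypothetical; the list does not say whether a truncation marker is appended, so this sketch simply cuts the string.

```java
import java.util.StringJoiner;

final class InlineXmlValues {
    /** Cuts a string value to at most 64 characters. */
    static String truncate(String value) {
        return value.length() > 64 ? value.substring(0, 64) : value;
    }

    /** Renders an int array as "[ 1, 2, 3 ]". */
    static String renderArray(int[] values) {
        StringJoiner joiner = new StringJoiner(", ", "[ ", " ]");
        for (int v : values) {
            joiner.add(Integer.toString(v));
        }
        return joiner.toString();
    }
}
```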

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

Xslt

XSLT stylesheet to apply.

String

False

—

false

—

compression

Choose a compression method. (default: CompressionMethod#NONE)

String

False

—

false

—

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Boolean

True

—

false

—

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Boolean

True

—

false

—

singularTarget

—

Boolean

True

—

false

—

stripExtension

Remove the original extension.

Boolean

True

—

false

—

targetLocation

Target location. If this parameter is not set, data is written to stdout.

String

False

—

false

—

useDocumentId

Use the document ID as the file name even if relative path information is present.

Boolean

True

—

false

—

JsonWriter

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

UIMA JSON format writer.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

compression

Choose a compression method. (default: CompressionMethod#NONE)

String

False

—

false

—

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Boolean

True

—

false

—

jsonContextFormat

—

String

True

—

false

—

omitDefaultValues

—

Boolean

True

—

false

—

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Boolean

True

—

false

—

prettyPrint

—

Boolean

True

—

false

—

singularTarget

—

Boolean

True

—

false

—

stripExtension

Remove the original extension.

Boolean

True

—

false

—

targetLocation

Target location. If this parameter is not set, data is written to stdout.

String

False

—

false

—

typeSystemFile

Location to write the type system to. If this is not set, a file called typesystem.xml will be written to the XMI output path. If this is set, it is expected to be a file relative to the current work directory or an absolute file. If this parameter is set, the compression parameter has no effect on the type system. Instead, if the file name ends in ".gz", the file will be compressed, otherwise not.

String

False

—

false

—

useDocumentId

Use the document ID as the file name even if relative path information is present.

Boolean

True

—

false

—
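
The typeSystemFile row above says the type system file is compressed only when its name ends in ".gz". A minimal sketch of that decision is shown below, using the JDK's GZIPOutputStream; the class and method names are hypothetical and this is not the writer's actual code.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

final class TypeSystemTarget {
    /** Wraps the raw stream in GZIP compression only for names ending in ".gz". */
    static OutputStream wrap(String fileName, OutputStream raw) {
        if (!fileName.endsWith(".gz")) {
            return raw; // plain output, no compression
        }
        try {
            return new GZIPOutputStream(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The caller remains responsible for closing the returned stream so that the GZIP trailer is written.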

Legacy Coref Data Writer

Category: Writer
Framework: GATE
Version: unknown

A simple PR that converts co-reference data from the Relations-based model to the legacy format (based on 'matches' annotation and document features).

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

corpus

—

gate.Corpus

—

true

document

—

gate.Document

—

true

MalletTopicProportionsWriter

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Writes topic proportions to a file. This depends on the TopicDistribution annotation, which should have been created by MalletTopicModelInferencer beforehand.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

compression

Choose a compression method. (default: CompressionMethod#NONE)

String

False

—

false

—

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Boolean

True

—

false

—

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Boolean

True

—

false

—

singularTarget

—

Boolean

True

—

false

—

stripExtension

Remove the original extension.

Boolean

True

—

false

—

targetLocation

—

String

True

—

false

—

useDocumentId

Use the document ID as the file name even if relative path information is present.

Boolean

True

—

false

—

MalletTopicsProportionsSortedWriter

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Write the topic proportions according to an LDA topic model to an output file. The proportions need to be inferred in a previous step using MalletTopicModelInferencer.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

compression

Choose a compression method. (default: CompressionMethod#NONE)

String

False

—

false

—

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Boolean

True

—

false

—

nTopics

—

Integer

True

—

false

—

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Boolean

True

—

false

—

singularTarget

—

Boolean

True

—

false

—

stripExtension

Remove the original extension.

Boolean

True

—

false

—

targetLocation

—

String

True

—

false

—

useDocumentId

Use the document ID as the file name even if relative path information is present.

Boolean

True

—

false

—

PennTreebankCombinedWriter

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Penn Treebank combined format writer.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

compression

Choose a compression method. (default: CompressionMethod#NONE)

String

False

—

false

—

emptyRootLabel

—

Boolean

True

—

false

—

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Boolean

True

—

false

—

filenameSuffix

Specify the suffix of output files. Default value: .penn. If the suffix is not needed, provide an empty string as value.

String

True

—

false

—

noRootLabel

—

Boolean

True

—

false

—

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Boolean

True

—

false

—

singularTarget

—

Boolean

True

—

false

—

sourceEncoding

Name of configuration parameter that contains the character encoding used by the input files.

String

True

—

false

—

stripExtension

Remove the original extension.

Boolean

True

—

false

—

targetLocation

Target location. If this parameter is not set, data is written to stdout.

String

False

—

false

—

useDocumentId

Use the document ID as the file name even if relative path information is present.

Boolean

True

—

false

—
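
The stripExtension and filenameSuffix parameters above together determine the output file name. The sketch below shows one plausible reading: drop the original extension when asked, then append the suffix. The class and method names are hypothetical, and how the writer treats names without an extension or paths with dots is an assumption.

```java
final class TargetFileName {
    /** Optionally removes the last extension, then appends the suffix. */
    static String build(String originalName, boolean stripExtension, String suffix) {
        int dot = originalName.lastIndexOf('.');
        // dot > 0 keeps names like ".hidden" intact.
        String base = (stripExtension && dot > 0)
                ? originalName.substring(0, dot)
                : originalName;
        return base + suffix;
    }
}
```

Under these assumptions, report.txt becomes report.penn with stripping enabled and report.txt.penn without it.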

RDF Writer

Category: Writer
Framework: NaCTeM (UIMA)
Version: 1.0

Saves Common Analysis Structures (CAS) into RDF files.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

outputFilePrefix

A name that will be attached to the beginning of an output filename. Filenames will have the form of "<outputFilePrefix><count>.rdf".

String

True

—

false

—

outputFolder

A folder where RDF files will be written to.

String

True

—

false

—

RDFExport

Category: Writer
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

charset

—

java.lang.String

True

—

fileName

—

alvisnlp.corpus.expressions.Expression

True

—

files

—

alvisnlp.corpus.expressions.Expression

True

—

format

—

org.apache.jena.riot.RDFFormat

True

—

outDir

—

org.bibliome.util.files.OutputDirectory

True

—

prefixes

—

alvisnlp.module.types.Mapping

True

—

statements

—

alvisnlp.corpus.expressions.Expression[]

True

—

RelpWriter

Category: Writer
Framework: AlvisNLP
Version: 2012-04-30

Writes the corpus in relp format.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

dependencyLabelFeature

—

java.lang.String

True

—

dependencyRelation

—

java.lang.String

True

—

dependentForm

—

alvisnlp.corpus.expressions.Expression

True

—

dependentRole

—

java.lang.String

True

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

headForm

—

alvisnlp.corpus.expressions.Expression

True

—

headRole

—

java.lang.String

True

—

lemmaForm

—

alvisnlp.corpus.expressions.Expression

True

—

linkageNumberFeature

—

java.lang.String

False

—

outFile

—

org.bibliome.util.streams.TargetStream

True

—

pmid

—

alvisnlp.corpus.expressions.Expression

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sentenceLayer

—

java.lang.String

True

—

sentenceRole

—

java.lang.String

True

—

wordForm

—

alvisnlp.corpus.expressions.Expression

True

—

wordLayer

—

java.lang.String

True

—

SFTP XMI Writer

Category: Writer
Framework: NaCTeM (UIMA)
Version: 1.0

Saves Common Analysis Structures (CAS) to an SFTP server.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

password

—

String

True

—

false

—

port

—

Integer

False

—

false

—

recorderEnabled

—

Boolean

True

—

false

—

recorderJdbcUrl

—

String

False

—

false

—

recorderPassword

—

String

False

—

false

—

recorderUsername

—

String

False

—

false

—

remoteDirectory

—

String

True

—

false

—

server

—

String

True

—

false

—

username

—

String

True

—

false

—

SerializedCasWriter

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

compression

Choose a compression method. (default: CompressionMethod#NONE)

String

False

—

false

—

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Boolean

True

—

false

—

filenameExtension

—

String

True

—

false

—

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Boolean

True

—

false

—

singularTarget

—

Boolean

True

—

false

—

stripExtension

Remove the original extension.

Boolean

True

—

false

—

targetLocation

Target location. If this parameter is not set, data is written to stdout.

String

False

—

false

—

typeSystemLocation

—

String

False

—

false

—

useDocumentId

Use the document ID as the file name even if relative path information is present.

Boolean

True

—

false

—

Simplified Text Exporter

Category: Writer
Framework: GATE
Version: unknown

Simplified text exporter (HTML output)

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

Simplified Text Exporter

Category: Writer
Framework: GATE
Version: unknown

Simplified text exporter (plain text output)

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

annotationSetName

—

java.lang.String

—

true

SolrWriter

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

A simple implementation of SolrWriter_ImplBase

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

optimizeIndex

If set to true, the index is optimized once all documents are uploaded. Default is false.

Boolean

True

—

false

—

queueSize

The buffer size before the documents are sent to the server (default: 10000).

Integer

True

—

false

—

solrIdField

The name of the id field in the Solr schema (default: "id").

String

True

—

false

—

targetLocation

Solr server URL string in the form <prot>://<host>:<port>/<path>, e.g. http://localhost:8983/solr/collection1.

String

True

—

false

—

textField

The name of the text field in the Solr schema (default: "text").

String

True

—

false

—

threads

The number of background threads used to empty the queue. Default: 1.

Integer

True

—

false

—

update

Define whether existing documents with the same ID are updated (true) or overwritten (false). Default: true (update).

Boolean

True

—

false

—

waitFlush

When committing to the index, i.e. when all documents are processed, block until index changes are flushed to disk? Default: true.

Boolean

True

—

false

—

waitSearcher

When committing to the index, i.e. when all documents are processed, block until a new searcher is opened and registered as the main query searcher, making the changes visible? Default: true.

Boolean

True

—

false

—
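
The targetLocation row above gives the server URL in the form <prot>://<host>:<port>/<path>. The sketch below splits such a string into those parts with java.net.URI, which can be handy for validating a configuration value before a pipeline starts. The class and method names are hypothetical and unrelated to SolrWriter's own code.

```java
import java.net.URI;

final class SolrTargetLocation {
    /** Splits a target location into protocol, host, port and path. */
    static String describe(String targetLocation) {
        URI uri = URI.create(targetLocation);
        return "prot=" + uri.getScheme()
                + " host=" + uri.getHost()
                + " port=" + uri.getPort()
                + " path=" + uri.getPath();
    }
}
```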

TGrepWriter

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

TGrep2 corpus file writer. Requires PennTrees to be annotated before.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

compression

Method to compress the tgrep file (only used if writeT2c is true). Only NONE, GZIP and BZIP2 are supported. Default: CompressionMethod#NONE

String

True

—

false

—

dropMalformedTrees

If true, silently drops malformed Penn Trees instead of throwing an exception. Default: false

Boolean

True

—

false

—

targetLocation

Path to which the output is written.

String

True

—

false

—

writeComments

Set this parameter to true if you want to add a comment to each PennTree which is written to the output files. The comment is of the form documentId,beginOffset,endOffset. Default: true

Boolean

True

—

false

—

writeT2c

Set this parameter to true if you want to encode directly into the tgrep2 binary format. Default: true

Boolean

True

—

false

—

TSV Writer

Category: Writer
Framework: NaCTeM (UIMA)
Version: 0.1

Saves annotations of a selected type to a file in tab-separated-value format.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

FeaturePathSizeLimit

The maximum size of feature paths. Features of complex types will be traversed if this value is greater than 0.

Integer

True

—

false

—

OutputFile

—

String

True

—

false

—

OutputTypeShortNames

If true, short names of types will be used in the resulting file, e.g., "Annotation" instead of "uima.tcas.Annotation".

Boolean

False

—

false

—

TargetType

A UIMA type whose instances will be saved. For example, uima.tcas.Annotation.

String

True

—

false

—
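
The OutputTypeShortNames row above contrasts "Annotation" with "uima.tcas.Annotation". Deriving a short name from a fully qualified type name can be sketched as follows; the class and method names are hypothetical.

```java
final class TypeNames {
    /** Returns the part of a dotted type name after the last dot. */
    static String shortName(String qualifiedName) {
        // lastIndexOf returns -1 when there is no dot, so the whole name is kept.
        return qualifiedName.substring(qualifiedName.lastIndexOf('.') + 1);
    }
}
```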

TabularExport

Category: Writer
Framework: AlvisNLP
Version: 2012-04-30

Writes the corpus data structure in files in tabular format.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

append

—

java.lang.Boolean

False

—

charset

—

java.lang.String

True

—

columns

—

alvisnlp.corpus.expressions.Expression[]

True

—

fileName

—

alvisnlp.corpus.expressions.Expression

True

—

files

—

alvisnlp.corpus.expressions.Expression

True

—

footers

—

alvisnlp.corpus.expressions.Expression[]

False

—

headers

—

alvisnlp.corpus.expressions.Expression[]

False

—

lines

—

alvisnlp.corpus.expressions.Expression

True

—

outDir

—

org.bibliome.util.files.OutputDirectory

True

—

separator

—

java.lang.String

True

—

trim

—

java.lang.Boolean

False

—

TcfWriter

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Writer for the WebLicht TCF format.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

compression

Choose a compression method. (default: CompressionMethod#NONE)

String

False

—

false

—

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Boolean

True

—

false

—
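As a rough illustration of what URL-encoding a document ID does, the sketch below percent-encodes every reserved character so the result is safe as a file name. It uses Python's urllib rather than the writer's own code, so details (for example how spaces are encoded) may differ from the component; escape_document_id is a hypothetical name.

```python
from urllib.parse import quote


def escape_document_id(document_id: str) -> str:
    """Percent-encode all reserved characters, including '/', '\\' and ':'.

    Assumption: the component behaves similarly; this is not its actual code.
    """
    return quote(document_id, safe="")


if __name__ == "__main__":
    print(escape_document_id("corpus\\doc:1"))
```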

filenameSuffix

Specify the suffix of output files. Default value: .tcf. If no suffix is needed, provide an empty string as the value.

String

True

—

false

—

merge

Merge with source TCF file if one is available. Default: true

Boolean

True

—

false

—

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Boolean

True

—

false

—

preserveIfEmpty

If there are no annotations for a particular layer in the CAS, preserve any potentially existing annotations in the original TCF. Default: false

Boolean

True

—

false

—

singularTarget

Boolean

True

—

false

—

stripExtension

Remove the original extension.

Boolean

True

—

false

—
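How stripExtension and filenameSuffix might combine can be sketched as follows. This is an assumption about the interplay of the two parameters, not the writer's actual implementation; target_file_name and its arguments are hypothetical.

```python
from pathlib import PurePosixPath


def target_file_name(source_name: str, suffix: str = ".tcf",
                     strip_extension: bool = False) -> str:
    """Build an output file name from a source name.

    Assumption: the original extension is removed first (if requested),
    then the configured suffix is appended.
    """
    base = source_name
    if strip_extension:
        base = str(PurePosixPath(source_name).with_suffix(""))
    return base + suffix


if __name__ == "__main__":
    print(target_file_name("texts/doc1.txt", ".tcf", strip_extension=True))
    print(target_file_name("texts/doc1.txt", ".tcf", strip_extension=False))
```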

targetLocation

Target location. If this parameter is not set, data is written to stdout.

String

False

—

false

—

useDocumentId

Use the document ID as the file name even if relative path information is present.

Boolean

True

—

false

—

TeiWriter

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

UIMA CAS consumer writing the CAS document text in TEI format.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

cTextPattern

A token matching this pattern is rendered as a TEI "c" element instead of a "w" element.

String

True

—

false

—
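The choice between "c" and "w" elements can be sketched as below. The pattern used here (one or more characters that are neither word characters nor whitespace) is only an illustrative assumption, as is the requirement that the whole token must match; render_token is a hypothetical helper, not part of TeiWriter.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical pattern: tokens made up only of punctuation-like characters.
C_TEXT_PATTERN = r"[^\w\s]+"


def render_token(text: str, c_text_pattern: str = C_TEXT_PATTERN) -> str:
    """Render a token as <c> if it fully matches the pattern, else as <w>."""
    tag = "c" if re.fullmatch(c_text_pattern, text) else "w"
    return f"<{tag}>{escape(text)}</{tag}>"


if __name__ == "__main__":
    print("".join(render_token(t) for t in ["Hello", ",", "world", "!"]))
```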

compression

Choose a compression method. (default: CompressionMethod#NONE)

String

False

—

false

—

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Boolean

True

—

false

—

filenameSuffix

Specify the suffix of output files. Default value: .xml. If no suffix is needed, provide an empty string as the value.

String

True

—

false

—

indent

Indent the XML.

Boolean

True

—

false

—

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Boolean

True

—

false

—

singularTarget

Boolean

True

—

false

—

stripExtension

Remove the original extension.

Boolean

True

—

false

—

targetLocation

Target location. If this parameter is not set, data is written to stdout.

String

False

—

false

—

useDocumentId

Use the document ID as the file name even if relative path information is present.

Boolean

True

—

false

—

writeConstituent

Write constituent annotations to the CAS. Disabled by default because it requires type priorities to be set up (Constituents must have a higher priority than Tokens).

Boolean

True

—

false

—

writeNamedEntity

Write named entity annotations to the CAS. Overlapping named entities are not supported.

Boolean

True

—

false

—

TextWriter

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

UIMA CAS consumer writing the CAS document text as plain text file.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

compression

Choose a compression method. (default: CompressionMethod#NONE)

String

False

—

false

—

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Boolean

True

—

false

—

filenameSuffix

Specify the suffix of output files. Default value: .txt. If no suffix is needed, provide an empty string as the value.

String

True

—

false

—

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Boolean

True

—

false

—

singularTarget

Boolean

True

—

false

—

stripExtension

Remove the original extension.

Boolean

True

—

false

—

targetLocation

Target location. If this parameter is not set, data is written to stdout.

String

False

—

false

—

useDocumentId

Use the document ID as the file name even if relative path information is present.

Boolean

True

—

false

—

TfidfConsumer

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

This consumer builds a DfModel. It collects the df (document frequency) counts for the processed collection. The counts are serialized as a DfModel-object.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

featurePath

This annotator is type agnostic, so it is mandatory to specify the type of the working annotation and how to obtain the string representation with the feature path.

String

True

—

false

—

lowercase

If set to true, the whole text is handled in lower case.

Boolean

True

—

false

—

targetLocation

Specifies the path and filename where the model file is written.

String

True

—

false

—
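The document-frequency counting that TfidfConsumer is described as performing can be sketched in a few lines. The sketch assumes each document has already been reduced to a list of term strings (the role of featurePath) and ignores serialization; document_frequencies is a hypothetical name, and the real DfModel may store more than these counts.

```python
from collections import Counter
from typing import Iterable, List


def document_frequencies(documents: Iterable[List[str]],
                         lowercase: bool = False) -> Counter:
    """Count, for each term, the number of documents it occurs in."""
    df = Counter()
    for terms in documents:
        # A set ensures a term is counted once per document.
        unique = {t.lower() if lowercase else t for t in terms}
        df.update(unique)
    return df


if __name__ == "__main__":
    docs = [["The", "cat"], ["the", "dog", "dog"]]
    print(document_frequencies(docs, lowercase=True))
```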

TigerXmlWriter

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

UIMA CAS consumer writing the CAS document text in the TIGER-XML format.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

compression

Choose a compression method. (default: CompressionMethod#NONE)

String

False

—

false

—

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Boolean

True

—

false

—

filenameSuffix

Specify the suffix of output files. Default value: .xml. If no suffix is needed, provide an empty string as the value.

String

True

—

false

—

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Boolean

True

—

false

—

singularTarget

Boolean

True

—

false

—

stripExtension

Remove the original extension.

Boolean

True

—

false

—

targetLocation

Target location. If this parameter is not set, data is written to stdout.

String

False

—

false

—

useDocumentId

Use the document ID as the file name even if relative path information is present.

Boolean

True

—

false

—

TokenizedTextWriter

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

This class writes a set of pre-processed documents into a large text file containing one sentence per line, with tokens separated by whitespace. Optionally, annotations other than tokens (e.g. lemmas) are written, as specified by #PARAM_FEATURE_PATH.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

compression

Choose a compression method. (default: CompressionMethod#NONE)

String

False

—

false

—

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Boolean

True

—

false

—

featurePath

The feature path, e.g. de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma/value for lemmas. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token (i.e. token texts). In order to specify a different annotation use the annotation class' type name (e.g. Token.class.getTypeName()) and optionally append a field, e.g. /value to specify the feature path. If you do not specify a field, the covered text is used.

String

True

—

false

—
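The structure of a feature path as described above (a type name, optionally followed by a slash and a field) can be illustrated with a small parser. This is only a sketch of the notation; split_feature_path is hypothetical and does not reproduce how the writer resolves the path.

```python
from typing import Optional, Tuple


def split_feature_path(feature_path: str) -> Tuple[str, Optional[str]]:
    """Split 'type.Name/field' into (type name, field).

    The field is None when no '/' is present, in which case the
    description above says the covered text is used.
    """
    type_name, sep, field = feature_path.partition("/")
    return type_name, (field if sep else None)


if __name__ == "__main__":
    print(split_feature_path(
        "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma/value"))
```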

numberRegex

All tokens that match this regex are replaced by NUM. Examples:
^$
^[0-9,\.]$
^[0-9]+(\.[0-9]*)?$
Make sure that these regular expressions fit the segmentation, e.g. if you work on tokens, your tokenizer might split prefixes such as + and - from the rest of the number.

String

False

—

false

—
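The NUM replacement can be sketched as follows, using the third example regex from the description. Python's re module stands in for the Java regex engine here, and replace_numbers is a hypothetical helper, so treat this as an approximation of the behavior rather than the writer's code.

```python
import re
from typing import List

# One of the example expressions given in the parameter description.
NUMBER_REGEX = r"^[0-9]+(\.[0-9]*)?$"


def replace_numbers(tokens: List[str], number_regex: str = NUMBER_REGEX) -> List[str]:
    """Replace every token matching the regex with the placeholder NUM."""
    pattern = re.compile(number_regex)
    return ["NUM" if pattern.search(tok) else tok for tok in tokens]


if __name__ == "__main__":
    # A tokenizer that splits off the sign leaves "+" as its own token.
    print(" ".join(replace_numbers(["up", "+", "3.14", "from", "42"])))
```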

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Boolean

True

—

false

—

singularTarget

Boolean

True

—

false

—

stopwordsFile

All the tokens listed in this file (one token per line) are replaced by STOP. Empty lines and lines starting with # are ignored. Casing is ignored.

String

False

—

false

—
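The stopword handling described above (one token per line, empty lines and # lines ignored, case-insensitive) can be sketched like this. Both function names are hypothetical, and whether the component trims surrounding whitespace is an assumption.

```python
from typing import Iterable, List, Set


def load_stopwords(lines: Iterable[str]) -> Set[str]:
    """Read stopwords, skipping empty lines and lines starting with '#'."""
    stopwords = set()
    for line in lines:
        entry = line.strip()  # assumption: surrounding whitespace is ignored
        if not entry or entry.startswith("#"):
            continue
        stopwords.add(entry.lower())
    return stopwords


def replace_stopwords(tokens: List[str], stopwords: Set[str]) -> List[str]:
    """Replace stopwords with STOP, comparing case-insensitively."""
    return ["STOP" if tok.lower() in stopwords else tok for tok in tokens]


if __name__ == "__main__":
    words = load_stopwords(["# articles", "", "the", "Of"])
    print(" ".join(replace_stopwords(["The", "king", "of", "Spain"], words)))
```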

stripExtension

Remove the original extension.

Boolean

True

—

false

—

targetEncoding

Encoding for the target file. Default is UTF-8.

String

True

—

false

—

targetLocation

Target location. If this parameter is not set, data is written to stdout.

String

False

—

false

—

useDocumentId

Use the document ID as the file name even if relative path information is present.

Boolean

True

—

false

—

TwitterDatabaseConsumer

Category: Writer
Framework: NaCTeM (UIMA)
Version: 0.0.1-SNAPSHOT

Web1TWriter

Category: Writer
Framework: DKPro Core (UIMA)
Version: 1.8.0

Web1T n-gram index format writer.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

contextType

The type being used for segments

String

True

—

false

—

createIndexes

Create the indexes that jWeb1T needs to operate. (default: true)

Boolean

False

—

false

—

inputTypes

Types to generate n-grams from.

Example: Token.class.getName() + "/pos/PosValue" for part-of-speech n-grams

String

True

—

true

—

lowercase

Create a lower case index.

Boolean

False

—

false

—

maxNgramLength

Maximum n-gram length.

Default: 3

Integer

False

—

false

—

minFreq

Specifies the minimum frequency an n-gram must have to be written to the final index. The value is inclusive and defaults to 1, so by default every n-gram that occurs at least once is written.

Integer

False

—

false

—
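The inclusive minimum-frequency filter can be sketched as a simple dictionary filter. The function name and the dictionary-of-counts input are assumptions made for illustration only.

```python
from typing import Dict


def filter_by_min_freq(counts: Dict[str, int], min_freq: int = 1) -> Dict[str, int]:
    """Keep n-grams whose frequency is at least min_freq (inclusive)."""
    return {ngram: n for ngram, n in counts.items() if n >= min_freq}


if __name__ == "__main__":
    counts = {"the cat": 3, "cat sat": 1, "sat on": 2}
    print(filter_by_min_freq(counts, min_freq=2))
```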

minNgramLength

Minimum n-gram length.

Default: 1

Integer

False

—

false

—
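How minNgramLength and maxNgramLength bound the generated n-grams can be sketched as below. The sketch works on a single list of strings, standing in for the values found inside one segment of the context type; the function name and the space-joined output are assumptions.

```python
from typing import List


def ngrams(tokens: List[str], min_n: int = 1, max_n: int = 3) -> List[str]:
    """Return all contiguous n-grams with min_n <= n <= max_n."""
    result = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result


if __name__ == "__main__":
    print(ngrams(["to", "be", "or", "not"], min_n=2, max_n=3))
```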

splitFileTreshold

The input files are split into smaller files for quick access. A separate file is created for a prefix if the first two letters of a word (or the single letter, for one-character words) account for at least x% of all starting letters in the input files. The default threshold is 1.0%. Words whose starting letters do not reach the threshold are written together into a single file for miscellaneous words. A high threshold leads to a few large files and most likely a very large miscellaneous file; a low threshold results in many small files. Use zero or a negative value to write everything to one file.

Float

False

—

false

—
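The threshold-based splitting can be sketched as follows. This is a simplified reading of the description: prefixes are counted over word occurrences, and the bucket labels "misc" and "all" are hypothetical stand-ins for whatever file names the writer actually uses.

```python
from collections import Counter
from typing import Dict, List


def assign_files(words: List[str], threshold_percent: float = 1.0) -> Dict[str, str]:
    """Map each (non-empty) word to a bucket named after its prefix.

    The prefix is the first two letters, or the single letter for
    one-character words. Prefixes below the threshold share "misc".
    """
    if threshold_percent <= 0:
        # Zero or negative: everything goes into one file.
        return {word: "all" for word in words}
    prefix_counts = Counter(word[:2] for word in words)
    total = sum(prefix_counts.values())
    buckets = {}
    for word in set(words):
        share = 100.0 * prefix_counts[word[:2]] / total
        buckets[word] = word[:2] if share >= threshold_percent else "misc"
    return buckets


if __name__ == "__main__":
    print(assign_files(["the", "then", "there", "cat"], threshold_percent=50.0))
```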

targetEncoding

Character encoding of the output data.

String

False

—

false

—

targetLocation

Location to which the output is written.

String

True

—

false

—

WhatsWrongExport

Category: Writer
Framework: AlvisNLP
Version: 2012-04-30

Writes files in What's Wrong with my NLP format.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

dependent

—

java.lang.String

True

—

documentFilter

—

alvisnlp.corpus.expressions.Expression

True

—

entities

—

java.lang.String[]

False

—

entityType

—

java.lang.String

False

—

head

—

java.lang.String

True

—

label

—

java.lang.String

True

—

outFile

—

org.bibliome.util.streams.TargetStream

True

—

relationName

—

java.lang.String

True

—

sectionFilter

—

alvisnlp.corpus.expressions.Expression

True

—

sentence

—

java.lang.String

True

—

sentences

—

alvisnlp.corpus.expressions.Expression

True

—

wordForm

—

java.lang.String

True

—

words

—

java.lang.String

True

—

XMI Writer

Category: Writer
Framework: NaCTeM (UIMA)
Version: 1.1

Serialises entire common analysis structures (CAS) to XMI format.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

outputFolder

The folder to which XMI files are written.

String

True

—

false

—

overwrite

—

Boolean

True

—

false

—

XMLWriter

Category: Writer
Framework: AlvisNLP
Version: 2010-10-28

Writes an XML serialization of the corpus into a file.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

outFile

—

org.bibliome.util.streams.TargetStream

True

—

XMLWriter2

Category: Writer
Framework: AlvisNLP
Version: 2012-04-30

Writes the corpus data structure into a file via an XSLT stylesheet.

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

fileName

—

alvisnlp.corpus.expressions.Expression

True

—

indent

—

java.lang.Boolean

True

—

outDir

—

org.bibliome.util.files.OutputDirectory

True

—

roots

—

alvisnlp.corpus.expressions.Expression

True

—

xslTransform

—

org.bibliome.util.streams.SourceStream

True

—

XMLWriter2ForINIST

Category: Writer
Framework: AlvisNLP
Version:

synopsis

Parameter

Description

Type

Mandatory

Default Value

Multi-value

Runtime

active

—

alvisnlp.corpus.expressions.Expression

True

—

fileName

—

alvisnlp.corpus.expressions.Expression

True

—

outDir

—

org.bibliome.util.files.OutputDirectory

True

—

roots

—

alvisnlp.corpus.expressions.Expression

True

—

xslTransform

—

org.bibliome.util.streams.SourceStream

True

—

XmiWriter