UIMA’s logging facility can be used even outside of components:

// Logger with current class as name
UIMAFramework.getLogger().log(Level.INFO, "Processing started");

// Logger with a specific class as name
UIMAFramework.getLogger(Main.class).log(Level.INFO, "Processing started");

Detailed information on logging in UIMA (how to configure,…) can be found in the UIMA tutorial, 1.2.2 Logging.

References

  • [1] UIMA tutorial (2.1.0) – 1.2.2.5 Using the logger outside an annotator

This article describes how to serialize and deserialize JCas objects using DKPro’s  XmiWriter and XmiReader components. A runnable Maven project can be found on GitHub.

Dependencies

Only one dependency is necessary, which is available on Maven Central:

<dependency>
  <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
  <artifactId>de.tudarmstadt.ukp.dkpro.core.io.xmi-asl</artifactId>
  <version>1.5.0</version>
</dependency>

As usual in the context of DKPro Core, it is better to omit the version tag and to configure the version of DKPro Core centrally:

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
      <artifactId>de.tudarmstadt.ukp.dkpro.core-asl</artifactId>
      <version>1.5.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

Serialization

The basic code for serialization looks as follows:

// import de.tudarmstadt.ukp.dkpro.core.io.xmi.XmiWriter;
final AnalysisEngineDescription xmiWriter = 
AnalysisEngineFactory.createEngineDescription(
        XmiWriter.class,
        XmiWriter.PARAM_TARGET_LOCATION, "./target/cache");

The target location is the folder where the cached JCases will be stored. You may either pass a String or a File object. Each JCas needs a DocumentMetaData feature structure in order to know the target filename. The filename can either be configured via DocumentMetaData.setDocumentId(String) or via setBaseURI(String) and setURI(String). For details, look at the provided sample project.

Deserialization

The deserialization works analogously, but of course, the XmiReader is not a consumer but a reader component and has to be the first component in the Pipeline:

//import de.tudarmstadt.ukp.dkpro.core.io.xmi.XmiReader;
final CollectionReaderDescription xmiReader =
  CollectionReaderFactory.createReaderDescription(
      XmiReader.class,
      XmiReader.PARAM_SOURCE_LOCATION,  "./target/cache",
      XmiReader.PARAM_PATTERNS, "[+]*.xmi");

The source location is identical to the target location of the writer. Additionally, the reader requires a pattern, that describes files to include (“[+]”) and exclude (“[-]”). Patterns obey to the format of Ant patterns.

Download

If you are interested in a “minimal working example”, you can find a Maven project on GitHub.

References

  • [1] Ant patterns

 

In a previous post, I described how to simplify the task of creating initialize and collectionProcessComplete methods using Eclipse templates. A similar technique can be applied to generate the idiomatic code for a uimaFit configuration parameter.

The shape of UIMA configuration parameters

An example of configuring UIMA components using uimaFit’s @ConfigurationParameter looks like this:

public static final String PARAM_PARAMETER = "parameter";
@ConfigurationParameter(name = PARAM_PARAMETER, mandatory = false, description = "Description", defaultValue = "true")
private boolean parameter;

This snippet demonstrates some best practices that the community or I introduced:

  • The name-giving constant (here: PARAM_PARAMETER) is always public and bears the name of the parameter field as its value (here: “parameter”)
  • You should always provide a default value for non-mandatory parameters. The defaultValue property is always a String and the toString method normally cannot be applied due to compile time restrictions.
  • If a defaultValue attribute is given, uimaFit injects the default value automatically. That means that after the initialize method is complete, parameter has the value true.

For mandatory configuration parameters, only two parts have to be changed:

  1. There is no more defaultValue.
  2. The attribute mandatory is set to true.

Template for optional parameters

In order to create a new code template, open Window -> Preferences -> Java/Editor/Templates and Use the following metadata:

  • Name: uima_optional_param
  • Context: Java
  • Description: Generates an optional uimaFit configuration parameter

Pattern:

${imp:import(org.apache.uima.fit.descriptor.ConfigurationParameter)}
public static final String PARAM_${paramNameCapitalized} = "${paramName}";
@ConfigurationParameter(name = PARAM_${paramNameCapitalized}, mandatory = false, 
    description = "${description}", defaultValue = "${defaultValue}")
private ${type} ${paramName};${cursor}

When you now type uima_optional_param in an editor, you will be prompted for the different parts of the template:

  • type is the type of the new parameter. You may choose any primitive type (int, boolean,…), any class that has a ‘single String’ constructor (e.g. File), enum types, and Locale/Pattern.
  • paramName is the parameter name. In Java, you should camel-case it, e.g. shallDeleteAll would be a valid parameter name.
  • paramNameCaptialized is the capitalized form of the parameter name. You should follow Java coding conventions, as illustrated in the following example: Parameter shallDeleteAll yields the  constant PARAM_SHALL_DELETE_ALL.
  • defaultValue is the default value of the configuration parameter. It will be injected into the class member by uimaFit.

Template for mandatory configuration parameters

Most things that I described about optional configuration parameters also apply to mandatory configuration parameters. The template for mandatory parameters can be created as follows: Create a new code template in Window -> Preferences -> Java/Editor/Templates and use the following metadata:

  • Name: uima_mandatory_param
  • Context: Java
  • Description: Generates a mandatory uimaFit configuration parameter

Pattern:

${imp:import(org.apache.uima.fit.descriptor.ConfigurationParameter)}
public static final String PARAM_${paramNameCapitalized} = "${paramName}";
@ConfigurationParameter(name = PARAM_${paramNameCapitalized}, mandatory = true, description = "${description}")
private ${type} ${paramName};${cursor}

The variables in the template have the exact same meaning as in the template for optional parameters.

 References

  • [1] uimaFit wiki on @ConfigurationParameter

When I write UIMA annotators/consumers in Java, often code needs to be executed before and after the processing of all CASes. For this purpose, the base classes for annotators (JCasAnnotator_ImplBase) and consumers (JCasConsumer_ImplBase) offer two methods:

  • initialize(UimaContext) gets called in the beginning and
  • collectionProcessComplete() get called when the complete collection has been processed

Re-typing the skeletons of these two methods is tedious and you may even forget to execute the essential super.initialize(UimaContext) in the first method (otherwise your component will be in a rotten state!).

In Eclipse you can create templates for such tasks. Open Window -> Preferences -> Java/Editor/Templates and select New. We create templates for both methods separately.

Initialize

Use the following metadata:

  • Name: uima_init
  • Context: Java
  • Description: Generates a stub of the overriden initialize(UimaContext) method.

As pattern:

${imp:import(org.apache.uima.UimaContext, org.apache.uima.resource.ResourceInitializationException)}

@Override
public void initialize(final UimaContext context) throws ResourceInitializationException
{
    super.initialize(context);
    ${cursor}
}

Besides the known Java code, the pattern defines that ResourceInitializationException should be imported and that the cursor is located after the call to super.initialize.

The template can be inserted by typing uima_init in the editor and selecting the template from the menu via Enter.

CollectionProcessComplete

Use the following metadata:

  • Name: uima_complete
  • Context:  Java
  • Description: Generates a stub of the overriden collectionProcessComplete() method.

As pattern:

${imp:import(org.apache.uima.analysis_engine.AnalysisEngineProcessException)}

@Override
public void collectionProcessComplete() throws AnalysisEngineProcessException
{
    ${cursor}
}

Besides the known Java code, the pattern defines that AnalysisEngineProcessException should be imported and that the cursor is located inside the new method.

The template can be inserted by typing uima_complete in the editor and selecting the template from the menu via Enter.

References

  • [1] Stackoverflow thread with many useful templates (or links to them)
  • [2] JCasConsumer_ImplBase (uimaFit 2.0.0)
  • [3] JCasAnnotator_ImplBase (uimaFit 2.0.0)

DKPro Core contains a component that wraps the popular TreeTagger.

Unfortunately, only the core component de.tudarmstadt.ukp.dkpro.core.treetagger-asl is directly available as Maven artifact, while license restrictions disallow to redistribute the binaries (de.tudarmstadt.ukp.dkpro.core.treetagger-bin) and the models (de.tudarmstadt.ukp.dkpro.core.treetagger-model-{de,en,fr,…}). The DKPro Core developer team provides instructions on how to create the latter artifacts, using an ant build.xml script.

The Maven dependencies of the TreeTagger component look as follows. It is important to use dependency management in order to coordinate the versions of the three artifacts.

<dependencies>
  <dependency>
    <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
    <artifactId>de.tudarmstadt.ukp.dkpro.core.treetagger-asl</artifactId>
  </dependency>
  <dependency>
    <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
    <artifactId>de.tudarmstadt.ukp.dkpro.core.treetagger-bin</artifactId>
  </dependency>
  <dependency>
    <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
    <artifactId>de.tudarmstadt.ukp.dkpro.core.treetagger-model-de</artifactId>
  </dependency>
</dependencies>
<dependencyManagement>
  <dependency>
    <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
    <artifactId>de.tudarmstadt.ukp.dkpro.core.treetagger-asl</artifactId>
    <version>1.5.0</version>
    <type>pom</type>
    <scope>import</scope>
  </dependency>
</dependencyManagement>

 References

  • [1] TreeTagger project site
  • [2] Instructions on packaging the binary and model artifacts

This article describes how to make use of Google’s Web1T corpus. We use the reader provided by DKPro Core.

In 2006, Google Inc. released a corpus of n-grams with a length of up to 5, as announced in their research blog. The data can be obtained from the Linguistics Data Consortium (LDC, see here) for a fee of $150,- (non-members). For those who find this to costly, there is also a way to build one’s own corpus in Web1T format, using DKPro Core.

How to use it

Reading Web1T files is relatively easy. Include the corresponding Maven dependency in your pom.xml and it is a one-liner. The following snippet extracts all n-grams with a length of 1 to 3. Note that the lower bound must be 1, which is a known bug in version 1.4.0.

final String dkproHome = System.getenv("DKPRO_HOME");
final JWeb1TSearcher web1TSearcher = new JWeb1TSearcher(
   new File(dkproHome, "web1t/ENGLISH"), 1, 3);

In the context of DKPro, it is always advisable to keep your corpora organized at at directory that is reflected by the environment variable DKPRO_HOME. Many of the DKPro readers will try to find documents below this directory automatically.

Afterwards, you can query the count of any phrase you like (separate multiple tokens with whitespaces):

web1TSearcher.getFrequency("house");
web1TSearcher.getFrequency("like you");
web1TSearcher.getFrequency("what a wonderful life");

Results:

Count for ‘house’: 350467
Count for ‘like you’: 1632
Count for ‘What a wonderful’: 40

If you query n-grams that are not in the index, the reader will complain about this. An earlier post describes how to silence these complaints – another way would be to filter n-grams before handing them to the reader.

Where to get it

The code for this tutorial is available on GitHub.

Maven dependency for the Web1T reader:

<dependency>
  <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
  <artifactId>de.tudarmstadt.ukp.dkpro.core.io.web1t-asl</artifactId>
</dependency>

The version information of the two dependencies is provided through Maven’s Dependency Management:

<dependency>
  <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
  <artifactId>de.tudarmstadt.ukp.dkpro.core-asl</artifactId>
  <version>1.4.0</version>
  <type>pom</type>
  <scope>import</scope>
</dependency>

Links

  • [1] Announcement by Google Inc
  • [2] Download Web1T corpus from LDC
  • [3] Instructions on how to build custom Web1T files
  • [4] Code on GitHub

Many applications in NLP use n-grams, for instance for calculating text similarity or machine translation. This tutorial demonstrates how to use DKPro Core in order to obtain the n-grams from a text.

DKPro Core is a UIMA-based framework, so many components heavily build upon UIMA types. There is, however, a way to extract n-grams directly from a list of strings, thereby relieving us of the effort to first kick off a UIMA pipeline in order to tokenize our text. The tutorial comprises two parts:

  1. Token-based n-grams that can be used in UIMA pipelines and
  2. String-based n-grams that can be obtained with a minimum of effort from a list of strings.

The code for this tutorial is available on GitHub.

Token-based N-grams

For this tutorial I use the example sentence “Mary gives John the apple.”  First, we split the sentence into its tokens. As we need to use UIMA components anyway, I use the BreakIteratorSegmenter component for tokenization:

final String sentence = "Mary gives John the apple.";

final JCas jCas = JCasFactory.createJCas();
jCas.setDocumentText(sentence);
jCas.setDocumentLanguage("en");

final AnalysisEngineDescription breakIteratorSegmenter = 
   AnalysisEngineFactory.createPrimitiveDescription(
   BreakIteratorSegmenter.class);
SimplePipeline.runPipeline(jCas, breakIteratorSegmenter);

Afterwards, jCas contains the tokens of the sentence and we can build the n-grams from theses tokens: NGramIterable‘s factory method create takes an iterable of tokens and a maximum number for the n in our n-grams. In our case, I want to extract all bigrams and choose n=2.

final Collection<Token> tokens = JCasUtil.select(jCas, Token.class);
final NGramIterable<Token> ngrams = NGramIterable.create(tokens, n);
final Iterator<NGram> ngramIterator = ngrams.iterator();

As with every iterator, we can now use the iterator methods hasNext and next in order to retrieve the n-grams. Unfortunately, the iterator will return all n-grams up to a length of n, i.e., all unigrams/tokens and bigrams. but we only want the bigrams! We can use a little trick to identify the bigrams: A bigram always covers exactly two tokens and so we can use JCasUtil.selectCovered to check how may tokens an n-gram actually subsumes:

final NGram ngram = ngramIterator.next();
if (JCasUtil.selectCovered(Token.class, ngram).size() == n) {
    System.out.print(ngram.getCoveredText());
}

That’s it, when we run the application, we get the following output (I omitted some boilerplate/formatting code in the above listings):

Mary gives, gives John, John the, the apple

We notice, that the final period is not included in the bigrams.

String-based N-grams

As for the previous example, I use the sample sentence “Mary gives John the apple.”

Compared to the token-based example, this one is much easier. We replace the segmenter component with a call to String.split. To keep the regular expression simple, I add a whitespace before the period:

final String[] tokens = 
    sentence.replace(".", " .").split("\\s");

A second line of code already produces our desired iterator over the n-grams. Note that we may specify a minimal and maximal n for our n-grams here:

final Iterator<String> ngramIterator = 
    new NGramStringIterable(tokens, 2, 2).iterator();

The rest  is almost identical, but our n-grams are now Strings and we do not need to care about the n-grams length. The output code reduces to

final String ngram = ngramIterator.next();
System.out.print(ngram);

In contrast to the token-based approach, this n-gram iterator also produces the bigrams with the period in it:

Mary gives, gives John, John the, the apple, apple .

Where to get it

The code for this tutorial is available on GitHub.

Maven Dependency for the n-gram tools:

<dependency>
  <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
  <artifactId>de.tudarmstadt.ukp.dkpro.core.ngrams-asl</artifactId>
</dependency>

Maven dependency for the segmenter/tokenizer components:

<dependency>
  <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
  <artifactId>de.tudarmstadt.ukp.dkpro.core.tokit-asl</artifactId>
</dependency>

The version information of the two dependencies is provided through Maven’s Dependency Management:

<dependency>
  <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
  <artifactId>de.tudarmstadt.ukp.dkpro.core-asl</artifactId>
  <version>1.4.0</version>
  <type>pom</type>
  <scope>import</scope>
</dependency>

Links

  • [1] Working with n-grams (official DKPro Core ASL site)
  • [2] NGramIterable Javadoc (latest)
  • [3] Code on GitHub

The toolkit uimaFIT allows you to design annotation types in the so-called Component Descriptor Editor, save the descriptor as XML and generate the corresponding Java classes. When you run an analysis engine that makes use of these annotations uimaFit checks whether it finds the type system descriptor XML file on the classpath. In certain circumstances, it may happen that it falls short of finding this file and will abort with the following error message:

Caused by: org.apache.uima.cas.CASRuntimeException: 
JCas type "de.tudarmstadt.ukp.teaching.nlp4web.ml.type.NamedEntity" used in Java code, but was not declared in the XML type descriptor.
    at org.apache.uima.jcas.impl.JCasImpl.getType(JCasImpl.java:412)
    at org.apache.uima.jcas.cas.TOP.<init>(TOP.java:92)
    at org.apache.uima.jcas.cas.AnnotationBase.<init>(AnnotationBase.java:53)
    at org.apache.uima.jcas.tcas.Annotation.<init>(Annotation.java:54)

In this case you need to explicitly point uimaFIT to the location of your type descriptor file. For this to be done, there exist two different ways.

Solution via VM argument

As a quick and dirty solution you add the following VM argument which in Eclipse is configured in the Launch Configuration dialog

-Dorg.uimafit.type.import_pattern=classpath*:desc/types/**/*.xml

In my case, I stored the XML file in src/main/resources/desc/types/TypeSystem.xml where src/main/resources is set to be a source folder in Eclipse (Maven project layout).

Solution via types.txt (suggested)

The issue with the solution above is that your code will break if other people try to execute it without knowing about the VM parameter. There exists a more stable way to solve this issue. uimaFIT looks into the file

  • <classpath>/META-INF/org.uimafit/types.txt  (for uimaFit until 1.4.x, still supported by the uimafit-legacy-support package)
  • <classpath>/META-INF/org.apache.uima.fit/types.txt (for uimaFit 2.0.0 and above)

in order to find out where to search for the XML type descriptor files (In a Maven context META-INF should be located in src/main/resources). Each line in this types.txt file describes one search pattern. In our example, types.txt should contain one line:

classpath*:desc/types/**/*.xml

Links

  • [1] uimaFit Guide and Reference – 8.1. Making types auto-detectable
  • [2] DKPro tutorial (UIMA part): Type System Auto-Discovery (Slide 37)
  • [3] TypeDescriptorDetection