Infinitest is a really great plugin for Eclipse. It executes all JUnit tests it can find whenever something changes. After I downloaded DKPro Core, a multi-module Maven project, my CPU was busy for quite a long time until all unit tests had been executed once, even though I just wanted to look at the code…

To keep Infinitest from executing these tests over and over again on the next occasion, I put an infinitest.filters file in every single one of the 72(!) modules as follows:

Surely, there is also a one-liner that solves the problem, but this one works very reliably for me :-)
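For readers who prefer to script this step, here is a minimal Java sketch of the same idea. It is not the approach used above: the class and method names are made up for illustration, and it assumes that Infinitest reads each line of infinitest.filters as a regular expression for test classes to skip, so that the catch-all pattern .* disables every test in a module. A directory counts as a module if it contains a pom.xml.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Infinitest or DKPro Core.
public class InfinitestFilterWriter {

    /** Writes an infinitest.filters file next to every pom.xml below root and returns the files written. */
    static List<Path> writeFilters(Path root) throws IOException {
        // Collect the module directories first, so we do not modify the tree while walking it.
        List<Path> moduleDirs;
        try (Stream<Path> paths = Files.walk(root)) {
            moduleDirs = paths
                    .filter(p -> p.getFileName() != null
                            && p.getFileName().toString().equals("pom.xml"))
                    .map(Path::getParent)
                    .collect(Collectors.toList());
        }

        List<Path> written = new ArrayList<>();
        for (Path dir : moduleDirs) {
            Path filterFile = dir.resolve("infinitest.filters");
            // Assumption: one regex per line; ".*" matches every test class name.
            Files.write(filterFile, ".*\n".getBytes(StandardCharsets.UTF_8));
            written.add(filterFile);
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path file : writeFilters(root)) {
            System.out.println("Wrote " + file);
        }
    }
}
```

Run it with the checkout directory as the only argument. Note that it overwrites any existing infinitest.filters files.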

This article describes how to make use of Google’s Web1T corpus. We use the reader provided by DKPro Core.

In 2006, Google Inc. released a corpus of n-grams with a length of up to 5, as announced in their research blog. The data can be obtained from the Linguistic Data Consortium (LDC) for a fee of $150 (non-members). For those who find this too costly, there is also a way to build one’s own corpus in Web1T format using DKPro Core.

How to use it

Reading Web1T files is relatively easy. Include the corresponding Maven dependency in your pom.xml and it is a one-liner. The following snippet extracts all n-grams with a length of 1 to 3. Note that the lower bound must be 1, which is a known bug in version 1.4.0.

In the context of DKPro, it is always advisable to keep your corpora organized in a directory that the environment variable DKPRO_HOME points to. Many of the DKPro readers will try to find documents below this directory automatically.
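As a rough illustration of this convention, the following sketch builds a corpus path below DKPRO_HOME. The class name and the subdirectory layout (web1t/ENGLISH) are assumptions made for this example; your own corpus may live elsewhere below that directory.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of DKPro Core.
public class DkproHome {

    /** Resolves a corpus directory below the given DKPRO_HOME value. */
    static Path resolveCorpus(String dkproHome, String... subdirs) {
        if (dkproHome == null || dkproHome.isEmpty()) {
            throw new IllegalStateException("DKPRO_HOME is not set");
        }
        return Paths.get(dkproHome, subdirs);
    }

    public static void main(String[] args) {
        // The layout "web1t/ENGLISH" is an assumed example, not a fixed convention.
        Path web1t = resolveCorpus(System.getenv("DKPRO_HOME"), "web1t", "ENGLISH");
        System.out.println("Looking for Web1T files in " + web1t);
    }
}
```

Failing early with a clear message when the variable is missing is usually easier to debug than a reader that silently finds no files.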

Afterwards, you can query the count of any phrase you like (separate multiple tokens with whitespace):


Count for ‘house’: 350467
Count for ‘like you’: 1632
Count for ‘What a wonderful’: 40

If you query n-grams that are not in the index, the reader will complain about this. An earlier post describes how to silence these complaints – another way would be to filter n-grams before handing them to the reader.
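One way to read the filtering suggestion is to drop phrases whose length falls outside the n-gram range the reader was opened with (1 to 3 in this tutorial). The sketch below does only that, in plain Java and without touching the reader API. The class and method names are hypothetical, and whether this covers every case the reader complains about is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical pre-filter, independent of the DKPro Web1T reader.
public class NGramQueryFilter {

    /** Counts whitespace-separated tokens in a phrase. */
    static int tokenCount(String phrase) {
        String trimmed = phrase.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Keeps only phrases with between minN and maxN tokens (inclusive). */
    static List<String> keepInRange(List<String> phrases, int minN, int maxN) {
        List<String> kept = new ArrayList<>();
        for (String phrase : phrases) {
            int n = tokenCount(phrase);
            if (n >= minN && n <= maxN) {
                kept.add(phrase);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> queries = Arrays.asList(
                "house", "like you", "What a wonderful", "this phrase is far too long");
        // Only the first three phrases have a length between 1 and 3.
        for (String query : keepInRange(queries, 1, 3)) {
            System.out.println("Would query: " + query);
        }
    }
}
```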

Where to get it

The code for this tutorial is available on GitHub.

Maven dependency for the Web1T reader:

The version information of the two dependencies is provided through Maven’s Dependency Management:


  • [1] Announcement by Google Inc
  • [2] Download Web1T corpus from LDC
  • [3] Instructions on how to build custom Web1T files
  • [4] Code on GitHub

Many applications in NLP use n-grams, for instance for calculating text similarity or machine translation. This tutorial demonstrates how to use DKPro Core in order to obtain the n-grams from a text.

DKPro Core is a UIMA-based framework, so many components heavily build upon UIMA types. There is, however, a way to extract n-grams directly from a list of strings, thereby relieving us of the effort to first kick off a UIMA pipeline in order to tokenize our text. The tutorial comprises two parts:

  1. Token-based n-grams that can be used in UIMA pipelines and
  2. String-based n-grams that can be obtained with a minimum of effort from a list of strings.

The code for this tutorial is available on GitHub.

Token-based N-grams

For this tutorial I use the example sentence “Mary gives John the apple.” First, we split the sentence into its tokens. As we need to use UIMA components anyway, I use the BreakIteratorSegmenter component for tokenization:

Afterwards, jCas contains the tokens of the sentence and we can build the n-grams from these tokens: NGramIterable’s factory method create takes an iterable of tokens and a maximum value for the n in our n-grams. In our case, I want to extract all bigrams and choose n=2.

As with every iterator, we can now use the iterator methods hasNext and next in order to retrieve the n-grams. Unfortunately, the iterator will return all n-grams up to a length of n, i.e., all unigrams/tokens and bigrams, but we only want the bigrams! We can use a little trick to identify the bigrams: a bigram always covers exactly two tokens, so we can use JCasUtil.selectCovered to check how many tokens an n-gram actually subsumes:

That’s it. When we run the application, we get the following output (I omitted some boilerplate/formatting code in the above listings):

Mary gives, gives John, John the, the apple

Note that the final period is not included in the bigrams.
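The filtering trick is independent of UIMA, so it can be sketched in plain Java as well. The following is not DKPro Core code: it uses a hand-written stand-in for an iterator that yields all n-grams up to a maximum length, and then keeps only those that cover exactly two tokens. All names are hypothetical, and the token list leaves out the final period for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of "generate up to n, keep exactly n"; not the DKPro API.
public class BigramFilter {

    /** Returns all n-grams of length 1..maxN, each as a list of tokens. */
    static List<List<String>> nGramsUpTo(List<String> tokens, int maxN) {
        List<List<String>> result = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int n = 1; n <= maxN && start + n <= tokens.size(); n++) {
                result.add(new ArrayList<>(tokens.subList(start, start + n)));
            }
        }
        return result;
    }

    /** Keeps only the n-grams that cover exactly n tokens and joins them with spaces. */
    static List<String> onlyLength(List<List<String>> nGrams, int n) {
        List<String> kept = new ArrayList<>();
        for (List<String> nGram : nGrams) {
            if (nGram.size() == n) {
                kept.add(String.join(" ", nGram));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Mary", "gives", "John", "the", "apple");
        List<String> bigrams = onlyLength(nGramsUpTo(tokens, 2), 2);
        // prints: Mary gives, gives John, John the, the apple
        System.out.println(String.join(", ", bigrams));
    }
}
```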

String-based N-grams

As for the previous example, I use the sample sentence “Mary gives John the apple.”

Compared to the token-based example, this one is much easier. We replace the segmenter component with a call to String.split. To keep the regular expression simple, I add a whitespace before the period:

A second line of code already produces our desired iterator over the n-grams. Note that we may specify a minimal and maximal n for our n-grams here:

The rest is almost identical, but our n-grams are now Strings and we do not need to care about an n-gram’s length. The output code reduces to

In contrast to the token-based approach, this n-gram iterator also produces the bigram that contains the period:

Mary gives, gives John, John the, the apple, apple .
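To make the string-based steps concrete without any DKPro dependency, here is a self-contained sketch that follows the same recipe: add a whitespace before the period, split on spaces, and build all n-grams between a minimal and a maximal n. The nGrams method is a hand-written stand-in, not the DKPro Core class, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for a string-based n-gram iterator; not the DKPro API.
public class StringNGrams {

    /** Returns all n-grams with minN <= n <= maxN, ordered by start position. */
    static List<String> nGrams(String[] tokens, int minN, int maxN) {
        List<String> result = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            for (int n = minN; n <= maxN && start + n <= tokens.length; n++) {
                result.add(String.join(" ", Arrays.copyOfRange(tokens, start, start + n)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The whitespace before the period keeps the split pattern trivial.
        String sentence = "Mary gives John the apple .";
        String[] tokens = sentence.split(" ");

        // minN = maxN = 2 yields bigrams only, so no length check is needed afterwards.
        List<String> bigrams = nGrams(tokens, 2, 2);
        // prints: Mary gives, gives John, John the, the apple, apple .
        System.out.println(String.join(", ", bigrams));
    }
}
```

Because the period is an ordinary token after the split, it shows up in the last bigram, just as in the output above.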

Where to get it

The code for this tutorial is available on GitHub.

Maven dependency for the n-gram tools:

Maven dependency for the segmenter/tokenizer components:

The version information of the two dependencies is provided through Maven’s Dependency Management:


  • [1] Working with n-grams (official DKPro Core ASL site)
  • [2] NGramIterable Javadoc (latest)
  • [3] Code on GitHub