DKPro Core: n-grams (token-based and String-based)

Many NLP applications use n-grams, for instance to compute text similarity or in machine translation. This tutorial demonstrates how to obtain the n-grams of a text with DKPro Core.

DKPro Core is a UIMA-based framework, so many of its components build heavily on UIMA types. There is, however, a way to extract n-grams directly from a list of strings, which spares us the effort of first running a UIMA pipeline just to tokenize our text. The tutorial comprises two parts:

  1. Token-based n-grams that can be used in UIMA pipelines and
  2. String-based n-grams that can be obtained with a minimum of effort from a list of strings.

The code for this tutorial is available on GitHub.

Token-based N-grams

For this tutorial, I use the example sentence “Mary gives John the apple.” First, we split the sentence into its tokens. As we need UIMA components anyway, I use the BreakIteratorSegmenter component for tokenization:

final String sentence = "Mary gives John the apple.";

final JCas jCas = JCasFactory.createJCas();
jCas.setDocumentText(sentence);
jCas.setDocumentLanguage("en");

final AnalysisEngineDescription breakIteratorSegmenter = 
   AnalysisEngineFactory.createPrimitiveDescription(
   BreakIteratorSegmenter.class);
SimplePipeline.runPipeline(jCas, breakIteratorSegmenter);

Afterwards, jCas contains the tokens of the sentence and we can build the n-grams from these tokens: NGramIterable’s factory method create takes an iterable of tokens and the maximum value of n for our n-grams. In our case, I want to extract all bigrams and therefore choose n=2.

// Maximum n-gram length; the text above chooses n=2 for bigrams.
final int n = 2;
final Collection<Token> tokens = JCasUtil.select(jCas, Token.class);
final NGramIterable<Token> ngrams = NGramIterable.create(tokens, n);
final Iterator<NGram> ngramIterator = ngrams.iterator();

As with every iterator, we can now use the methods hasNext and next to retrieve the n-grams. Unfortunately, the iterator returns all n-grams up to a length of n, i.e., all unigrams (the tokens themselves) and all bigrams, but we only want the bigrams! A little trick identifies them: a bigram always covers exactly two tokens, so we can use JCasUtil.selectCovered to check how many tokens an n-gram actually spans:

final NGram ngram = ngramIterator.next();
if (JCasUtil.selectCovered(Token.class, ngram).size() == n) {
    System.out.print(ngram.getCoveredText());
}
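
To make the filtering idea concrete without a UIMA pipeline, here is a minimal plain-Java sketch. It is not DKPro Core code: the class UpToNGrams and its two methods are hypothetical names of my own, and the tokens are plain strings with the final period already left out. It only mirrors the idea described above: first collect every n-gram up to a maximum length, then keep those that span exactly n tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of DKPro Core.
final class UpToNGrams {

    // Collects all n-grams of length 1..maxN; each n-gram is kept as the list of its tokens.
    static List<List<String>> upTo(List<String> tokens, int maxN) {
        List<List<String>> result = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int len = 1; len <= maxN && start + len <= tokens.size(); len++) {
                result.add(new ArrayList<>(tokens.subList(start, start + len)));
            }
        }
        return result;
    }

    // Keeps only the n-grams that span exactly n tokens and joins each with single spaces.
    static List<String> exactly(List<List<String>> ngrams, int n) {
        List<String> result = new ArrayList<>();
        for (List<String> ngram : ngrams) {
            if (ngram.size() == n) {
                result.add(String.join(" ", ngram));
            }
        }
        return result;
    }
}
```

For the five tokens Mary, gives, John, the, apple and a maximum length of 2, upTo yields nine n-grams (five unigrams and four bigrams), and exactly(..., 2) keeps the four bigrams.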

That’s it. When we run the application, we get the following output (I omitted some boilerplate and formatting code in the listings above):

Mary gives, gives John, John the, the apple

Note that the final period is not included in the bigrams.

String-based N-grams

As in the previous example, I use the sample sentence “Mary gives John the apple.”

Compared to the token-based example, this one is much simpler. We replace the segmenter component with a call to String.split. To keep the regular expression simple, I insert a space before the period:

final String[] tokens = 
    sentence.replace(".", " .").split("\\s");
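
To see what this one-liner produces, the following small sketch wraps it in a method. The class name SplitTokens and the method tokenize are hypothetical names chosen for this illustration; the body is the same expression as above.

```java
// Hypothetical wrapper for illustration only.
final class SplitTokens {

    static String[] tokenize(String sentence) {
        // Put a space in front of every period, then split at each single whitespace character.
        return sentence.replace(".", " .").split("\\s");
    }
}
```

For the sample sentence this gives the six tokens Mary, gives, John, the, apple and the period. Two caveats of this shortcut: replace touches every period, including those in abbreviations or decimal numbers, and because "\\s" matches a single whitespace character, two consecutive spaces produce an empty token. Splitting on "\\s+" instead would avoid the latter.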

A second line of code already produces the desired iterator over the n-grams. Note that we can specify both a minimum and a maximum n for our n-grams here:

final Iterator<String> ngramIterator = 
    new NGramStringIterable(tokens, 2, 2).iterator();
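
The effect of the two bounds can be sketched in plain Java. This is my own illustration of the idea, not the implementation of NGramStringIterable: the class StringNGrams and its method are hypothetical, and the order in which the library returns n-grams of different lengths may differ from this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of DKPro Core.
final class StringNGrams {

    // Returns all n-grams with minN <= n <= maxN, each joined with single spaces.
    static List<String> ngrams(String[] tokens, int minN, int maxN) {
        List<String> result = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            // Only windows that still fit into the token array are emitted.
            for (int n = minN; n <= maxN && start + n <= tokens.length; n++) {
                result.add(String.join(" ", Arrays.copyOfRange(tokens, start, start + n)));
            }
        }
        return result;
    }
}
```

With both bounds set to 2, only bigrams are produced, so no filtering by length is needed afterwards. With bounds 1 and 2, the six tokens of the sample sentence yield eleven n-grams: six unigrams and five bigrams.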

The rest is almost identical, but our n-grams are now Strings and we do not need to care about the n-grams’ length. The output code reduces to

final String ngram = ngramIterator.next();
System.out.print(ngram);
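
The loop and formatting code that the listings leave out could look roughly like the following sketch. The class NGramPrinter and the method joinAll are hypothetical names of mine; the sketch only assumes an Iterator<String>, such as the one created above, and separates the n-grams with a comma and a space.

```java
import java.util.Iterator;

// Hypothetical helper for illustration only.
final class NGramPrinter {

    // Drains the iterator and joins all n-grams with ", ".
    static String joinAll(Iterator<String> ngramIterator) {
        StringBuilder out = new StringBuilder();
        while (ngramIterator.hasNext()) {
            out.append(ngramIterator.next());
            // Add a separator only if another n-gram follows.
            if (ngramIterator.hasNext()) {
                out.append(", ");
            }
        }
        return out.toString();
    }
}
```

A call such as System.out.println(NGramPrinter.joinAll(ngramIterator)) then prints all n-grams on one line.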

In contrast to the token-based approach, this n-gram iterator also produces the bigram that contains the period:

Mary gives, gives John, John the, the apple, apple .

Where to get it

The code for this tutorial is available on GitHub.

Maven dependency for the n-gram tools:

<dependency>
  <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
  <artifactId>de.tudarmstadt.ukp.dkpro.core.ngrams-asl</artifactId>
</dependency>

Maven dependency for the segmenter/tokenizer components:

<dependency>
  <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
  <artifactId>de.tudarmstadt.ukp.dkpro.core.tokit-asl</artifactId>
</dependency>

The version of both dependencies is supplied through Maven’s dependency management. The following import belongs in the dependencyManagement section of the POM:

<dependency>
  <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
  <artifactId>de.tudarmstadt.ukp.dkpro.core-asl</artifactId>
  <version>1.4.0</version>
  <type>pom</type>
  <scope>import</scope>
</dependency>
