Many applications in NLP use n-grams, for instance to calculate text similarity or in machine translation. This tutorial demonstrates how to use DKPro Core to obtain the n-grams from a text.
DKPro Core is a UIMA-based framework, so many of its components build heavily on UIMA types. There is, however, a way to extract n-grams directly from a list of strings, which saves us from first running a UIMA pipeline just to tokenize our text. The tutorial comprises two parts:
- Token-based n-grams that can be used in UIMA pipelines and
- String-based n-grams that can be obtained with a minimum of effort from a list of strings.
The code for this tutorial is available on GitHub.
Token-based N-grams
For this tutorial I use the example sentence “Mary gives John the apple.” First, we split the sentence into its tokens. As we need to use UIMA components anyway, I use the BreakIteratorSegmenter component for tokenization:
    final String sentence = "Mary gives John the apple.";
    final JCas jCas = JCasFactory.createJCas();
    jCas.setDocumentText(sentence);
    jCas.setDocumentLanguage("en");
    final AnalysisEngineDescription breakIteratorSegmenter =
            AnalysisEngineFactory.createPrimitiveDescription(BreakIteratorSegmenter.class);
    SimplePipeline.runPipeline(jCas, breakIteratorSegmenter);
Afterwards, jCas contains the tokens of the sentence, and we can build the n-grams from these tokens: NGramIterable's factory method create takes an iterable of tokens and the maximum value of n for our n-grams. In our case, I want to extract all bigrams and choose n=2.
    final int n = 2; // maximum n-gram length; 2 for bigrams
    final Collection<Token> tokens = JCasUtil.select(jCas, Token.class);
    final NGramIterable<Token> ngrams = NGramIterable.create(tokens, n);
    final Iterator<NGram> ngramIterator = ngrams.iterator();
As with every iterator, we can now use the methods hasNext and next to retrieve the n-grams. Unfortunately, the iterator returns all n-grams up to a length of n, i.e., all unigrams (tokens) and bigrams, but we only want the bigrams. We can use a little trick to identify them: a bigram always covers exactly two tokens, so we can use JCasUtil.selectCovered to check how many tokens an n-gram actually subsumes:
    final NGram ngram = ngramIterator.next();
    if (JCasUtil.selectCovered(Token.class, ngram).size() == n) {
        System.out.print(ngram.getCoveredText());
    }
That’s it. When we run the application, we get the following output (I omitted some boilerplate and formatting code in the listings above):
Mary gives, gives John, John the, the apple
We notice that the final period is not included in the bigrams.
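To see the filtering idea in isolation, without any UIMA setup, here is a minimal plain-Java sketch. It only mimics the behaviour described above: it generates every n-gram up to a maximum length and then keeps those that span exactly n tokens. The class NGramFilterSketch and its methods are hypothetical and not part of DKPro Core, the order in which DKPro's iterator returns n-grams may differ, and the period is left out of the input list by hand to mirror the output shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NGramFilterSketch {

    /** Returns all n-grams of length 1..maxN, each as a list of tokens. */
    public static List<List<String>> upToN(List<String> tokens, int maxN) {
        List<List<String>> result = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            // Grow the n-gram from this start position until maxN or the end of the input.
            for (int len = 1; len <= maxN && start + len <= tokens.size(); len++) {
                result.add(new ArrayList<>(tokens.subList(start, start + len)));
            }
        }
        return result;
    }

    /** Keeps only the n-grams that cover exactly n tokens, joined with spaces. */
    public static List<String> exactly(List<List<String>> ngrams, int n) {
        List<String> result = new ArrayList<>();
        for (List<String> ngram : ngrams) {
            if (ngram.size() == n) {
                result.add(String.join(" ", ngram));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: the period is omitted here to match the tutorial's output.
        List<String> tokens = Arrays.asList("Mary", "gives", "John", "the", "apple");
        List<List<String>> all = upToN(tokens, 2); // unigrams and bigrams mixed
        System.out.println(String.join(", ", exactly(all, 2)));
    }
}
```

For the five tokens above, upToN yields nine n-grams (five unigrams and four bigrams), and the size check reduces them to the four bigrams.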
String-based N-grams
As for the previous example, I use the sample sentence “Mary gives John the apple.”
Compared to the token-based example, this one is much easier. We replace the segmenter component with a call to String.split. To keep the regular expression simple, I insert a space before the period:
final String[] tokens = sentence.replace(".", " .").split("\\s");
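As a quick check of what this split produces, the following sketch wraps the same expression in a small helper and prints the resulting tokens. The class name SplitSketch and the method tokenize are made up for illustration.

```java
import java.util.Arrays;

public class SplitSketch {

    /** Detaches every period with a space, then splits on single whitespace characters. */
    public static String[] tokenize(String sentence) {
        return sentence.replace(".", " .").split("\\s");
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("Mary gives John the apple.");
        // prints [Mary, gives, John, the, apple, .]
        System.out.println(Arrays.toString(tokens));
    }
}
```

This shortcut is only good enough for simple sentences: every period is detached, including those inside abbreviations, and because the pattern matches a single whitespace character, two consecutive spaces would yield an empty token.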
With the tokens in hand, a single further line of code produces our desired iterator over the n-grams. Note that we can specify both a minimum and a maximum n for our n-grams here:
final Iterator<String> ngramIterator = new NGramStringIterable(tokens, 2, 2).iterator();
The rest is almost identical, but our n-grams are now Strings, and because the minimum and maximum n are both 2, we no longer need to filter by n-gram length. The output code reduces to
    final String ngram = ngramIterator.next();
    System.out.print(ngram);
In contrast to the token-based approach, this n-gram iterator also produces the bigram that contains the period:
Mary gives, gives John, John the, the apple, apple .
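To illustrate what a minimum and maximum n mean for string-based n-grams, here is a small sketch in plain Java. It is not DKPro's implementation of NGramStringIterable. The class StringNGramSketch and its ngrams method are hypothetical, and the order of the results may differ from what the library returns when the minimum and maximum differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StringNGramSketch {

    /** Returns all n-grams with minN <= n <= maxN, joined with single spaces. */
    public static List<String> ngrams(String[] tokens, int minN, int maxN) {
        List<String> result = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            // Emit one n-gram per allowed length that still fits into the input.
            for (int n = minN; n <= maxN && start + n <= tokens.length; n++) {
                result.add(String.join(" ", Arrays.copyOfRange(tokens, start, start + n)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = "Mary gives John the apple.".replace(".", " .").split("\\s");
        // With minN == maxN == 2, only bigrams are produced.
        System.out.println(String.join(", ", ngrams(tokens, 2, 2)));
    }
}
```

Because the period is an ordinary string in the token array, nothing distinguishes it from a word, which is why the final bigram "apple ." appears here.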
Where to get it
The code for this tutorial is available on GitHub.
Maven dependency for the n-gram tools:
    <dependency>
        <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
        <artifactId>de.tudarmstadt.ukp.dkpro.core.ngrams-asl</artifactId>
    </dependency>
Maven dependency for the segmenter/tokenizer components:
    <dependency>
        <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
        <artifactId>de.tudarmstadt.ukp.dkpro.core.tokit-asl</artifactId>
    </dependency>
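The versions of both dependencies are supplied through Maven’s dependency management. An import-scoped POM dependency such as the following belongs inside the dependencyManagement section of your POM: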
    <dependency>
        <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
        <artifactId>de.tudarmstadt.ukp.dkpro.core-asl</artifactId>
        <version>1.4.0</version>
        <type>pom</type>
        <scope>import</scope>
    </dependency>