Many applications in NLP use n-grams, for instance to calculate text similarity or for machine translation. This tutorial demonstrates how to use DKPro Core to obtain the n-grams from a text.
DKPro Core is a UIMA-based framework, so many components build heavily on UIMA types. There is, however, a way to extract n-grams directly from a list of strings, which spares us the effort of first running a UIMA pipeline to tokenize our text. The tutorial comprises two parts:
- Token-based n-grams that can be used in UIMA pipelines and
- String-based n-grams that can be obtained with a minimum of effort from a list of strings.
The code for this tutorial is available on GitHub.
Token-based N-grams
For this tutorial I use the example sentence “Mary gives John the apple.” First, we split the sentence into its tokens. As we need to use UIMA components anyway, I use the BreakIteratorSegmenter component for tokenization:
final String sentence = "Mary gives John the apple.";
final JCas jCas = JCasFactory.createJCas();
jCas.setDocumentText(sentence);
jCas.setDocumentLanguage("en");
final AnalysisEngineDescription breakIteratorSegmenter =
    AnalysisEngineFactory.createPrimitiveDescription(
        BreakIteratorSegmenter.class);
SimplePipeline.runPipeline(jCas, breakIteratorSegmenter);
Afterwards, jCas contains the tokens of the sentence and we can build the n-grams from these tokens: NGramIterable’s factory method create takes an iterable of tokens and the maximum value of n for our n-grams. In our case, I want to extract all bigrams and therefore choose n=2.
final int n = 2; // maximum n-gram length, as chosen above
final Collection<Token> tokens = JCasUtil.select(jCas, Token.class);
final NGramIterable<Token> ngrams = NGramIterable.create(tokens, n);
final Iterator<NGram> ngramIterator = ngrams.iterator();
As with every iterator, we can now use the methods hasNext and next to retrieve the n-grams. Unfortunately, the iterator returns all n-grams up to a length of n, i.e., all unigrams (the tokens themselves) and all bigrams, but we only want the bigrams! A little trick identifies them: a bigram always covers exactly two tokens, so we can use JCasUtil.selectCovered to check how many tokens an n-gram actually subsumes:
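while (ngramIterator.hasNext()) {
    final NGram ngram = ngramIterator.next();
    // keep only the n-grams that cover exactly n tokens, i.e., the bigrams
    if (JCasUtil.selectCovered(Token.class, ngram).size() == n) {
        System.out.print(ngram.getCoveredText());
    }
}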
That’s it. When we run the application, we get the following output (I omitted some boilerplate and formatting code in the listings above):
Mary gives, gives John, John the, the apple
Note that the final period is not included in the bigrams.
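The length check itself does not depend on UIMA. As a rough, library-free sketch of the idea, the following plain Java first builds all n-grams up to a maximum length and then keeps only those with exactly that many tokens. The class NGramLengthFilterSketch and its methods are hypothetical names made up for illustration, not part of DKPro Core, and the tokens are hard-coded without the period to mirror the output above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Library-free sketch; class and method names are hypothetical, not DKPro Core API.
class NGramLengthFilterSketch {

    /** Returns every n-gram of length 1..maxN, each as a list of tokens. */
    static List<List<String>> ngramsUpTo(List<String> tokens, int maxN) {
        List<List<String>> result = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int len = 1; len <= maxN && start + len <= tokens.size(); len++) {
                result.add(new ArrayList<>(tokens.subList(start, start + len)));
            }
        }
        return result;
    }

    /** Keeps only the n-grams that consist of exactly n tokens. */
    static List<String> exactly(List<List<String>> ngrams, int n) {
        List<String> result = new ArrayList<>();
        for (List<String> ngram : ngrams) {
            if (ngram.size() == n) {
                result.add(String.join(" ", ngram));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Mary", "gives", "John", "the", "apple");
        List<List<String>> all = ngramsUpTo(tokens, 2); // unigrams and bigrams mixed
        System.out.println(String.join(", ", exactly(all, 2)));
    }
}
```

Comparing the number of tokens in each n-gram against n plays the same role here as the selectCovered check in the token-based listing.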
String-based N-grams
As for the previous example, I use the sample sentence “Mary gives John the apple.”
Compared to the token-based example, this one is much easier. We replace the segmenter component with a call to String.split. To keep the regular expression simple, I insert a space before the period:
final String[] tokens =
    sentence.replace(".", " .").split("\\s");
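The one-liner above works for the sample sentence, but splitting on a single whitespace character yields empty tokens as soon as two spaces follow each other. A slightly more forgiving variant, sketched under the assumption that splitting on whitespace runs is acceptable for your text, could look like this (the class and method names are hypothetical):

```java
import java.util.Arrays;

// Plain-Java sketch; the class and method names are hypothetical.
class WhitespaceTokenizerSketch {

    /** Separates each period from the preceding word, then splits on whitespace runs. */
    static String[] tokenize(String sentence) {
        return sentence.replace(".", " .").trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("Mary gives John the apple.")));
        // Two spaces between words: splitting on "\\s" alone would yield an empty token here.
        System.out.println(Arrays.toString(tokenize("Mary  gives John.")));
    }
}
```

This is still a crude tokenizer: it only treats the period specially and knows nothing about abbreviations or other punctuation.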
A second line of code already produces our desired iterator over the n-grams. Note that here we may specify both a minimum and a maximum n for our n-grams:
final Iterator<String> ngramIterator =
    new NGramStringIterable(tokens, 2, 2).iterator();
The rest is almost identical, but our n-grams are now Strings and we no longer need to care about the n-grams’ length. The output code reduces to:
final String ngram = ngramIterator.next();
System.out.print(ngram);
In contrast to the token-based approach, this n-gram iterator also produces the bigram that contains the period:
Mary gives, gives John, John the, the apple, apple .
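To make the effect of the minimum and maximum n more tangible, here is a small library-free sketch that produces string n-grams from an array of tokens. It is not the NGramStringIterable implementation. The class StringNGramSketch and its method are hypothetical, and joining the tokens with a single space is an assumption based on the output shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Library-free sketch of the idea; not the NGramStringIterable implementation.
class StringNGramSketch {

    /** Returns all n-grams with minN <= n <= maxN (minN >= 1), tokens joined by a space. */
    static List<String> ngrams(String[] tokens, int minN, int maxN) {
        List<String> result = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            for (int n = minN; n <= maxN && start + n <= tokens.length; n++) {
                result.add(String.join(" ", Arrays.copyOfRange(tokens, start, start + n)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = "Mary gives John the apple.".replace(".", " .").split("\\s");
        // minN == maxN == 2 restricts the result to bigrams, so no length filter is needed.
        System.out.println(String.join(", ", ngrams(tokens, 2, 2)));
    }
}
```

Because the lower bound is part of the call, the unigrams never show up in the result, which is why the string-based variant gets by without the length filter.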
Where to get it
The code for this tutorial is available on GitHub.
Maven dependency for the n-gram tools:
<dependency>
  <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
  <artifactId>de.tudarmstadt.ukp.dkpro.core.ngrams-asl</artifactId>
</dependency>
Maven dependency for the segmenter/tokenizer components:
<dependency>
  <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
  <artifactId>de.tudarmstadt.ukp.dkpro.core.tokit-asl</artifactId>
</dependency>
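The versions of both dependencies are supplied through Maven’s dependency management, by importing the DKPro Core POM in the dependencyManagement section of the pom.xml:
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
      <artifactId>de.tudarmstadt.ukp.dkpro.core-asl</artifactId>
      <version>1.4.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>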
Links
- [1] Working with n-grams (official DKPro Core ASL site)
- [2] NGramIterable Javadoc (latest)
- [3] Code on GitHub