WordNet is an invaluable resource for NLP research. John Didion has developed JWNL (Java WordNet Library), a Java library for accessing WordNet data programmatically. To access WordNet from Java, the following steps are necessary:
- Download WordNet
- Add a dependency on JWNL to your project, or download the library.
- Configure properties.xml so that JWNL knows where to find WordNet and which version is used.
- Create a Dictionary instance for querying WordNet.
The configuration is stored in an XML file that sets the path where WordNet can be found. If you use a standard WordNet distribution, the path should end in dict, as the following minimal properties.xml illustrates:
<?xml version="1.0" encoding="UTF-8"?>
<jwnl_properties language="en">
  <version publisher="Princeton" number="3.0" language="en"/>
  <dictionary>
    <param name="dictionary_element_factory" value="net.didion.jwnl.princeton.data.PrincetonWN17FileDictionaryElementFactory"/>
    <param name="file_manager" value="net.didion.jwnl.dictionary.file_manager.FileManagerImpl">
      <param name="file_type" value="net.didion.jwnl.princeton.file.PrincetonRandomAccessDictionaryFile"/>
      <param name="dictionary_path" value="path-to-dict"/>
    </param>
  </dictionary>
  <resource/>
</jwnl_properties>
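To check the configured path before handing the file to JWNL, something like the following sketch could be used. It relies only on the JDK's XML parser. The class DictionaryPathCheck, its two methods, and the example path /opt/WordNet-3.0/dict are assumptions made for illustration; none of them is part of JWNL.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper (not part of JWNL): inspects a JWNL properties file.
class DictionaryPathCheck {

    // Returns the value of <param name="dictionary_path" .../>, or null if there is none.
    static String readDictionaryPath(InputStream propertiesXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(propertiesXml);
            // getElementsByTagName also finds nested <param> elements.
            NodeList params = doc.getElementsByTagName("param");
            for (int i = 0; i < params.getLength(); i++) {
                Element param = (Element) params.item(i);
                if ("dictionary_path".equals(param.getAttribute("name"))) {
                    return param.getAttribute("value");
                }
            }
            return null;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Cannot read properties file", e);
        }
    }

    // True if the last path segment is "dict", as in a standard WordNet distribution.
    static boolean endsInDict(String path) {
        if (path == null) {
            return false;
        }
        String trimmed = path.replaceAll("[/\\\\]+$", ""); // drop trailing separators
        int cut = Math.max(trimmed.lastIndexOf('/'), trimmed.lastIndexOf('\\'));
        return trimmed.substring(cut + 1).equals("dict");
    }

    public static void main(String[] args) {
        // Inline stand-in for properties.xml; the path is an assumed example location.
        String xml = "<jwnl_properties language=\"en\"><dictionary>"
                + "<param name=\"file_manager\" value=\"x\">"
                + "<param name=\"dictionary_path\" value=\"/opt/WordNet-3.0/dict\"/>"
                + "</param></dictionary></jwnl_properties>";
        String path = readDictionaryPath(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(path + " ends in dict: " + endsInDict(path));
    }
}
```

In a real project, the stream would come from the properties file itself, for example via a FileInputStream, and the check could be extended to test that the directory exists.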
On GitHub, you will find two prepared properties files:
- properties_min.xml uses only a minimum of the possible settings
- properties.xml includes a rule-based morphological stemmer that allows you to query for inflected forms, e.g., houses, runs, dogs
A singleton instance of Dictionary is used to query WordNet with JWNL. Setting it up takes only two statements:
JWNL.initialize(new FileInputStream("src/main/resources/properties.xml"));
final Dictionary dictionary = Dictionary.getInstance();
Afterwards, you can query the dictionary for a lemma of your choice (try house, houses, dog). For each lemma, you also specify which of the four part-of-speech classes you are looking for: POS.ADJECTIVE, POS.ADVERB, POS.NOUN, or POS.VERB. For house you would choose POS.NOUN or POS.VERB. The whole process is a bit clumsy, so I have listed the steps below:
- Lookup: Is the lemma in the dictionary?
final IndexWord indexWord = dictionary.lookupIndexWord(pos, lemma);
- If the lookup fails, indexWord is null.
- What different senses may the lemma have?
final Synset[] senses = indexWord.getSenses();
- For each sense, we may get a short description of the sense, called the gloss.
final String gloss = synset.getGloss();
- What other lemmas are in a synset?
final Word[] words = synset.getWords();
- For each word, we may get its lemma and its POS: word.getLemma(); and word.getPOS().getKey();
Where to get it
The code for this tutorial is available on GitHub. You need to copy the template properties file(s) in src/main/resources before you can run the code. Given a lemma and a part of speech, the program returns the list of synsets that contain the lemma. For house/v the output looks like this:
Aug 23, 2013 9:13:40 AM net.didion.jwnl.dictionary.Dictionary doLog
INFO: Installing dictionary net.didion.jwnl.dictionary.FileBackedDictionary@6791d8c1
1 Lemmas: [house/v] (Gloss: contain or cover; “This box houses the gears”)
2 Lemmas: [house/v, put_up/v, domiciliate/v] (Gloss: provide housing for; “The immigrants were housed in a new development outside the town”)
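As a rough sketch of how a line in this format could be assembled, the following example works on plain strings so that it runs without JWNL or WordNet. The class SynsetLineFormatter and its format method are hypothetical and not taken from the tutorial's repository; with JWNL, the lemmas would come from word.getLemma(), the key from word.getPOS().getKey(), and the gloss from synset.getGloss().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical formatter (an assumption, not the tutorial's actual code).
class SynsetLineFormatter {

    // Builds one output line: running number, lemma/POS pairs, and the gloss.
    static String format(int number, List<String> lemmas, String posKey, String gloss) {
        List<String> tagged = new ArrayList<>();
        for (String lemma : lemmas) {
            tagged.add(lemma + "/" + posKey); // e.g. "house" and "v" give "house/v"
        }
        // List.toString() yields the bracketed, comma-separated form.
        return number + " Lemmas: " + tagged + " (Gloss: " + gloss + ")";
    }

    public static void main(String[] args) {
        // prints: 1 Lemmas: [house/v] (Gloss: contain or cover)
        System.out.println(format(1, Arrays.asList("house"), "v", "contain or cover"));
    }
}
```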
Maven dependencies for JWNL and the logging library it needs:
<dependency>
  <groupId>net.didion.jwnl</groupId>
  <artifactId>jwnl</artifactId>
  <version>1.4.0.rc2</version>
</dependency>
<dependency>
  <groupId>commons-logging</groupId>
  <artifactId>commons-logging</artifactId>
  <version>1.1.3</version>
</dependency>