This article describes how to make use of Google’s Web1T corpus. We use the reader provided by DKPro Core.
In 2006, Google Inc. released a corpus of n-grams with a length of up to 5, as announced in their research blog. The data can be obtained from the Linguistic Data Consortium (LDC, see here) for a fee of $150 (non-members). For those who find this too costly, there is also a way to build one’s own corpus in Web1T format using DKPro Core.
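To give an idea of what such a corpus contains, here is a minimal sketch that counts n-grams in a short text and renders them as lines of the form n-gram, tab, count, which is how the Web1T data files list their entries. The class `NgramCounter` and its methods are made up for illustration; this is not how DKPro Core builds a corpus, and it leaves out the per-length directory layout and frequency cut-offs of the real data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of DKPro Core or jweb1t.
final class NgramCounter {

    /** Counts every n-gram with minN <= n <= maxN in a whitespace-tokenized text. */
    static Map<String, Long> count(String text, int minN, int maxN) {
        String trimmed = text.trim();
        String[] tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        // A TreeMap keeps the n-grams sorted by their natural string order.
        Map<String, Long> counts = new TreeMap<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= tokens.length; i++) {
                String ngram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
                counts.merge(ngram, 1L, Long::sum);
            }
        }
        return counts;
    }

    /** Renders each entry as "<n-gram>TAB<count>". */
    static List<String> toLines(Map<String, Long> counts) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            lines.add(entry.getKey() + "\t" + entry.getValue());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Prints all unigrams and bigrams of the sample sentence with their counts.
        toLines(count("to be or not to be", 1, 2)).forEach(System.out::println);
    }
}
```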
How to use it
Reading Web1T files is relatively easy. Include the corresponding Maven dependency in your pom.xml and it is a one-liner. The following snippet extracts all n-grams with a length of 1 to 3. Note that the lower bound must be 1 because of a known bug in version 1.4.0.
final String dkproHome = System.getenv("DKPRO_HOME");
final JWeb1TSearcher web1TSearcher = new JWeb1TSearcher(
        new File(dkproHome, "web1t/ENGLISH"), 1, 3);
In the context of DKPro, it is always advisable to keep your corpora organized in a directory that the environment variable DKPRO_HOME points to. Many of the DKPro readers will try to find documents below this directory automatically.
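One thing to watch out for: if DKPRO_HOME is not set, `System.getenv` returns null, and `new File(null, "web1t/ENGLISH")` quietly resolves the path relative to the working directory instead of failing. A small guard makes this visible early. This is only a sketch; `DkproHomeResolver` is a hypothetical helper and not something DKPro Core provides.

```java
import java.io.File;
import java.util.Map;

// Hypothetical helper, not part of DKPro Core.
final class DkproHomeResolver {

    /** Returns <DKPRO_HOME>/<relativePath>, failing early when the variable is unset or blank. */
    static File resolve(Map<String, String> env, String relativePath) {
        String home = env.get("DKPRO_HOME");
        if (home == null || home.trim().isEmpty()) {
            throw new IllegalStateException("Environment variable DKPRO_HOME is not set");
        }
        return new File(home, relativePath);
    }

    public static void main(String[] args) {
        // Throws IllegalStateException when DKPRO_HOME is missing from the environment.
        File indexDir = resolve(System.getenv(), "web1t/ENGLISH");
        System.out.println(indexDir + " is a directory: " + indexDir.isDirectory());
    }
}
```

Passing the environment in as a map keeps the helper easy to test without touching the real process environment.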
Afterwards, you can query the count of any phrase you like (separate multiple tokens with whitespace):
web1TSearcher.getFrequency("house");
web1TSearcher.getFrequency("like you");
web1TSearcher.getFrequency("what a wonderful life");
Count for ‘house’: 350467
Count for ‘like you’: 1632
Count for ‘What a wonderful’: 40
If you query n-grams that are not in the index, the reader will complain about this. An earlier post describes how to silence these complaints – another way would be to filter n-grams before handing them to the reader.
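The filtering approach could look roughly like the sketch below. It assumes that the complaints are triggered by phrases whose token count lies outside the range the searcher was opened with (1 to 3 in the snippet above); that assumption is not verified against the reader itself. `NgramQueryFilter` is a hypothetical helper, not part of DKPro Core or the Web1T reader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of DKPro Core or jweb1t.
final class NgramQueryFilter {

    /** Number of whitespace-separated tokens in a phrase (0 for a blank phrase). */
    static int tokenCount(String phrase) {
        String trimmed = phrase.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Keeps only the phrases whose token count lies within [minN, maxN]. */
    static List<String> keepQueryable(List<String> phrases, int minN, int maxN) {
        List<String> kept = new ArrayList<>();
        for (String phrase : phrases) {
            int n = tokenCount(phrase);
            if (n >= minN && n <= maxN) {
                kept.add(phrase);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList("house", "like you", "what a wonderful life");
        // The four-token phrase is dropped for a searcher covering n = 1..3.
        System.out.println(keepQueryable(phrases, 1, 3)); // prints [house, like you]
    }
}
```

Only the phrases that survive the filter would then be passed on to `getFrequency`.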
Where to get it
The code for this tutorial is available on GitHub.
Maven dependency for the Web1T reader:
<dependency>
  <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
  <artifactId>de.tudarmstadt.ukp.dkpro.core.io.web1t-asl</artifactId>
</dependency>
The version information of the two dependencies is provided through Maven’s Dependency Management:
<dependency>
  <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
  <artifactId>de.tudarmstadt.ukp.dkpro.core-asl</artifactId>
  <version>1.4.0</version>
  <type>pom</type>
  <scope>import</scope>
</dependency>