(De-)serializing a JCas in XMI with DKPro Core

This article describes how to serialize and deserialize JCas objects using DKPro’s XmiWriter and XmiReader components. A runnable Maven project can be found on GitHub.

Dependencies

Only one dependency is necessary, which is available on Maven Central:

<dependency>
  <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
  <artifactId>de.tudarmstadt.ukp.dkpro.core.io.xmi-asl</artifactId>
  <version>1.5.0</version>
</dependency>

As usual with DKPro Core, it is better to omit the version tag from the individual dependency and to configure the DKPro Core version centrally in the dependencyManagement section:

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
      <artifactId>de.tudarmstadt.ukp.dkpro.core-asl</artifactId>
      <version>1.5.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

Serialization

The basic code for serialization looks as follows:

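import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import de.tudarmstadt.ukp.dkpro.core.io.xmi.XmiWriter;

// Describes a writer that stores each JCas as an XMI file in the target folder
final AnalysisEngineDescription xmiWriter =
    AnalysisEngineFactory.createEngineDescription(
        XmiWriter.class,
        XmiWriter.PARAM_TARGET_LOCATION, "./target/cache");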

The target location is the folder in which the serialized JCases are stored. You may pass either a String or a File object. To determine the target filename, each JCas needs a DocumentMetaData feature structure. The filename can be configured either via DocumentMetaData.setDocumentId(String) or via setDocumentBaseUri(String) and setDocumentUri(String). For details, see the provided sample project.
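To make the role of the document metadata more concrete, here is a small stand-alone sketch of how a target filename could be derived from it. This is an illustrative model only, not DKPro Core code: the class TargetFileNameSketch and the method resolveTarget are hypothetical, and the precedence shown (base URI plus URI first, document ID as fallback) as well as the appended ".xmi" extension are assumptions made for the example.

```java
/** Illustrative model only; hypothetical names, not part of DKPro Core. */
final class TargetFileNameSketch {

    /**
     * Derives the output file name for one document.
     * Assumption: if a base URI is set, the part of the document URI that
     * follows the base URI is used; otherwise the document ID is used.
     */
    public static String resolveTarget(String targetLocation, String documentId,
                                       String baseUri, String documentUri) {
        String relative;
        if (baseUri != null && !baseUri.isEmpty()) {
            if (documentUri == null || !documentUri.startsWith(baseUri)) {
                throw new IllegalStateException(
                        "Document URI must start with the base URI");
            }
            // Keep only the path below the base URI
            relative = documentUri.substring(baseUri.length());
        } else if (documentId != null && !documentId.isEmpty()) {
            relative = documentId;
        } else {
            throw new IllegalStateException(
                    "Neither a document ID nor a base URI/URI pair is set");
        }
        return targetLocation + "/" + relative + ".xmi";
    }

    public static void main(String[] args) {
        // Variant 1: only a document ID is set
        System.out.println(resolveTarget("./target/cache", "doc1", null, null));
        // Variant 2: base URI and document URI are set
        System.out.println(resolveTarget("./target/cache", null,
                "file:/data/", "file:/data/sub/a.txt"));
    }
}
```

The sketch also shows why a JCas without any of this metadata cannot be written: there is simply nothing to build a filename from.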

Deserialization

Deserialization works analogously. Note, however, that the XmiReader is not a consumer but a reader component, so it has to be the first component in the pipeline:

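import org.apache.uima.collection.CollectionReaderDescription;
import org.apache.uima.fit.factory.CollectionReaderFactory;
import de.tudarmstadt.ukp.dkpro.core.io.xmi.XmiReader;

// Describes a reader that loads all XMI files found in the source folder
final CollectionReaderDescription xmiReader =
    CollectionReaderFactory.createReaderDescription(
        XmiReader.class,
        XmiReader.PARAM_SOURCE_LOCATION, "./target/cache",
        XmiReader.PARAM_PATTERNS, "[+]*.xmi");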

The source location is identical to the target location of the writer. Additionally, the reader requires a pattern that describes which files to include (“[+]”) and which to exclude (“[-]”). Patterns follow the format of Ant patterns.
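The include/exclude semantics can be tried out without DKPro at all. The following sketch is only an approximation: it uses the glob matcher from java.nio instead of a real Ant pattern matcher (the two agree on simple patterns such as *.xmi, but differ in some details, for example in how ** is handled). The class PatternSketch and the method isSelected are hypothetical names, and treating a pattern without a prefix as an include is an assumption.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

/** Illustrative model only; hypothetical names, not part of DKPro Core. */
final class PatternSketch {

    /**
     * Returns true if the path matches at least one include pattern ("[+]")
     * and no exclude pattern ("[-]").
     */
    public static boolean isSelected(String relativePath, String... patterns) {
        boolean included = false;
        boolean excluded = false;
        for (String pattern : patterns) {
            boolean isExclude = pattern.startsWith("[-]");
            boolean hasPrefix = isExclude || pattern.startsWith("[+]");
            // Strip the three-character prefix to get the plain glob
            String glob = hasPrefix ? pattern.substring(3) : pattern;
            PathMatcher matcher =
                    FileSystems.getDefault().getPathMatcher("glob:" + glob);
            if (matcher.matches(Paths.get(relativePath))) {
                if (isExclude) {
                    excluded = true;
                } else {
                    included = true;
                }
            }
        }
        return included && !excluded;
    }

    public static void main(String[] args) {
        String[] patterns = { "[+]*.xmi", "[-]broken*.xmi" };
        for (String file : new String[] { "doc1.xmi", "doc1.txt", "broken-1.xmi" }) {
            System.out.println(file + " selected: " + isSelected(file, patterns));
        }
    }
}
```

With the single pattern "[+]*.xmi" from the reader configuration above, every file ending in .xmi directly inside the source location is selected; an additional "[-]" pattern removes files from that selection again.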

Download

If you are interested in a “minimal working example”, you can find a Maven project on GitHub.

References

  • [1] Ant patterns

 
