This article describes how to serialize and deserialize JCas objects using DKPro Core's XmiWriter and XmiReader components. A runnable Maven project can be found on GitHub.

Dependencies

Only one dependency is necessary, which is available on Maven Central:

<dependency>
  <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
  <artifactId>de.tudarmstadt.ukp.dkpro.core.io.xmi-asl</artifactId>
  <version>1.5.0</version>
</dependency>

As usual in the context of DKPro Core, it is better to omit the version tag and to configure the version of DKPro Core centrally:

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
      <artifactId>de.tudarmstadt.ukp.dkpro.core-asl</artifactId>
      <version>1.5.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

Serialization

The basic code for serialization looks as follows:

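// import org.apache.uima.analysis_engine.AnalysisEngineDescription;
// import org.apache.uima.fit.factory.AnalysisEngineFactory;
// import de.tudarmstadt.ukp.dkpro.core.io.xmi.XmiWriter;

// Describes an XmiWriter that stores each processed JCas as an XMI file in ./target/cache.
final AnalysisEngineDescription xmiWriter =
    AnalysisEngineFactory.createEngineDescription(
        XmiWriter.class,
        XmiWriter.PARAM_TARGET_LOCATION, "./target/cache");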

The target location is the folder in which the serialized JCas objects are stored. You may pass either a String or a File object. Each JCas needs a DocumentMetaData feature structure so that the writer can derive the target filename. The filename can be configured either via DocumentMetaData.setDocumentId(String) or via setDocumentBaseUri(String) and setDocumentUri(String). For details, see the provided sample project.

Deserialization

Deserialization works analogously. However, XmiReader is a reader component rather than a consumer, so it has to be the first component in the pipeline:

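// import org.apache.uima.collection.CollectionReaderDescription;
// import org.apache.uima.fit.factory.CollectionReaderFactory;
// import de.tudarmstadt.ukp.dkpro.core.io.xmi.XmiReader;

// Describes an XmiReader that loads every *.xmi file found in ./target/cache.
final CollectionReaderDescription xmiReader =
    CollectionReaderFactory.createReaderDescription(
        XmiReader.class,
        XmiReader.PARAM_SOURCE_LOCATION, "./target/cache",
        XmiReader.PARAM_PATTERNS, "[+]*.xmi");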

The source location is identical to the target location of the writer. Additionally, the reader requires a pattern that describes which files to include (“[+]”) and which to exclude (“[-]”). The patterns follow the Ant pattern format [1].
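To make the include/exclude idea more tangible, here is a small JDK-only sketch. It is not DKPro Core's implementation: the class name PatternSketch is made up for this illustration, and it uses the glob matcher from java.nio.file as a stand-in for real Ant patterns (the two syntaxes differ in details, for example in how "**" treats directories). The sketch assumes that a file is selected if it matches at least one include pattern and no exclude pattern, and for simplicity it rejects patterns without a prefix.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of DKPro Core or Ant.
final class PatternSketch {

    private final List<PathMatcher> includes = new ArrayList<>();
    private final List<PathMatcher> excludes = new ArrayList<>();

    // Sorts the prefixed patterns into include and exclude matchers.
    PatternSketch(final String... patterns) {
        for (final String pattern : patterns) {
            if (pattern.startsWith("[+]")) {
                includes.add(glob(pattern.substring(3)));
            } else if (pattern.startsWith("[-]")) {
                excludes.add(glob(pattern.substring(3)));
            } else {
                // Simplification of this sketch: a prefix is mandatory.
                throw new IllegalArgumentException(
                        "Pattern needs a [+] or [-] prefix: " + pattern);
            }
        }
    }

    // Assumption: JDK glob syntax approximates Ant patterns for simple cases.
    private static PathMatcher glob(final String pattern) {
        return FileSystems.getDefault().getPathMatcher("glob:" + pattern);
    }

    // Selected if at least one include matches and no exclude matches.
    boolean accepts(final String relativePath) {
        final Path path = Paths.get(relativePath);
        return matchesAny(includes, path) && !matchesAny(excludes, path);
    }

    private static boolean matchesAny(final List<PathMatcher> matchers, final Path path) {
        for (final PathMatcher matcher : matchers) {
            if (matcher.matches(path)) {
                return true;
            }
        }
        return false;
    }

    public static void main(final String[] args) {
        final PatternSketch filter = new PatternSketch("[+]*.xmi", "[-]tmp*.xmi");
        System.out.println(filter.accepts("doc1.xmi"));  // included by *.xmi
        System.out.println(filter.accepts("tmp1.xmi"));  // excluded by tmp*.xmi
        System.out.println(filter.accepts("notes.txt")); // matches no include
    }
}
```

With the single pattern "[+]*.xmi" used in the reader configuration, every file ending in .xmi directly inside the source location would be selected under these assumptions.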

Download

If you are interested in a “minimal working example”, you can find a Maven project on GitHub.

References

  • [1] Ant patterns

 

If your software produces objects that are costly to create, object serialization can save you some bootstrapping time, e.g., when you repeatedly restart your application during development. Apache Commons Lang offers a serialization implementation that is the epitome of ease of use: SerializationUtils. Its core methods are serialize and deserialize.

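Given an object that implements java.io.Serializable, the actual serialization is a one-line statement (split up here for readability). Note that serialize closes the stream once the object has been written: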

final File targetFile = new File("./target/serializedObject.ser");
final BufferedOutputStream outStream = new BufferedOutputStream(new FileOutputStream(targetFile));
SerializationUtils.serialize(object, outStream);

The same holds for deserialization. In the following lines, we assume that the object to be deserialized is an instance of java.lang.String:

final BufferedInputStream inStream = new BufferedInputStream(new FileInputStream(targetFile));
final String string = (String) SerializationUtils.deserialize(inStream);
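For comparison, the following sketch shows roughly what the same round trip looks like with plain java.io object streams and no library. The class name PlainJdkSerialization and its method names are invented for this illustration and are not part of Commons Lang. The main differences are that you have to close the streams yourself and deal with checked exceptions, whereas SerializationUtils reports failures as an unchecked SerializationException.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical helper for illustration; not part of Commons Lang.
final class PlainJdkSerialization {

    // Writes the object to the file; try-with-resources closes the streams.
    static void serialize(final Serializable object, final File targetFile)
            throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(targetFile)))) {
            out.writeObject(object);
        }
    }

    // Reads one object back; the caller casts it to the expected type.
    static Object deserialize(final File sourceFile)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(sourceFile)))) {
            return in.readObject();
        }
    }

    public static void main(final String[] args) throws Exception {
        // A temporary file stands in for ./target/serializedObject.ser.
        final File targetFile = File.createTempFile("serializedObject", ".ser");
        targetFile.deleteOnExit();

        serialize("a costly object", targetFile);
        final String restored = (String) deserialize(targetFile);
        System.out.println(restored);
    }
}
```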

A complete executable example can be found on GitHub.

Maven dependency (a JAR file for download can be found here):

<dependency>
    <groupId>commons-lang</groupId>
    <artifactId>commons-lang</artifactId>
    <version>2.2</version>
</dependency>

Links

  • [1] SerializationUtils Javadoc
  • [2] Executable example code (Maven project)