Reading and writing SSSOM mapping sets

1. Reading mappings

1.1. TSV format

The TSV format is the main supported format. Both the “internal metadata” variant (where the metadata are included directly within the TSV file, as a commented header) and the “external” variant (where the metadata are in a distinct file) are supported.

Use the TSVReader class to read a mapping set from a TSV file:

import java.io.IOException;

import org.incenp.obofoundry.sssom.TSVReader;
import org.incenp.obofoundry.sssom.SSSOMFormatException;
import org.incenp.obofoundry.sssom.model.MappingSet;

try {
    TSVReader reader = new TSVReader("my-mappings.sssom.tsv");
    MappingSet ms = reader.read();
} catch (IOException ioe) {
    // A non-SSSOM I/O error occurred
} catch (SSSOMFormatException sfe) {
    // Invalid SSSOM data
}

If a metadata header is not found in the specified file, the reader will automatically look for a file with the same name as the TSV file but with a yml extension, and read the metadata from that file. If no such file exists, the reader will throw a SSSOMFormatException.

Use the two-argument variant of the constructor to explicitly specify the name of an external metadata file:

TSVReader reader = new TSVReader("my-mappings.sssom.tsv", "my-meta.yml");

Sometimes it may be useful to read only the metadata, not the mappings themselves. There are two ways to do so: either by setting the first argument to the TSVReader constructor to null, or by calling reader.read(true) instead of reader.read() (the second option is the only possible one if the metadata to read are embedded in the same file as the mappings themselves).

When reading, all short identifiers (aka “CURIEs”) in the file, whether in the metadata or in the mappings themselves, are systematically expanded into their corresponding full form according to the “CURIE map” found in the metadata (and the built-in prefixes defined in the SSSOM specification). That is, the client code only has access to the expanded, full-length identifiers. The fact that the identifiers may have been stored in the file in a shortened form is considered a serialisation detail that the client code need not know about.
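To make the expansion step concrete, here is a minimal sketch of CURIE expansion against a prefix map. It is not the library's code: the CurieExpansionSketch class and its expand method are hypothetical names introduced for illustration, and the UBERON prefix is only an example value.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a hypothetical helper, not a class from the library.
class CurieExpansionSketch {

    // Expands "prefix:local" into a full identifier using the prefix map.
    // Values without a known prefix are returned unchanged.
    static String expand(String curie, Map<String, String> prefixMap) {
        int colon = curie.indexOf(':');
        if (colon <= 0) {
            return curie;
        }
        String base = prefixMap.get(curie.substring(0, colon));
        return base == null ? curie : base + curie.substring(colon + 1);
    }

    public static void main(String[] args) {
        Map<String, String> prefixMap = new HashMap<>();
        // A prefix as it could be declared in a file's CURIE map (example value)
        prefixMap.put("UBERON", "http://purl.obolibrary.org/obo/UBERON_");
        // One of the prefixes built into the SSSOM specification
        prefixMap.put("skos", "http://www.w3.org/2004/02/skos/core#");

        System.out.println(expand("UBERON:0000105", prefixMap));
        System.out.println(expand("skos:exactMatch", prefixMap));
    }
}
```

With the real reader none of this is needed in client code, since the identifiers it returns are already expanded.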

The parser automatically propagates metadata from the mapping set down to the individual mappings, unless that behaviour is explicitly disabled with setPropagationEnabled(false).
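The idea behind propagation can be sketched as follows. This is a simplified illustration, not the library's implementation: the PropagationSketch class is hypothetical, mapping sets are reduced to plain maps of slot names to values, and the exact rules deciding which slots are propagated (and when) are those of the SSSOM specification rather than the ones shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical helper, not a class from the library.
class PropagationSketch {

    // Copies the set-level value of each listed slot into every mapping
    // that does not already carry its own value for that slot.
    static void propagate(Map<String, String> setMetadata,
                          List<Map<String, String>> mappings,
                          List<String> propagatableSlots) {
        for (String slot : propagatableSlots) {
            String value = setMetadata.get(slot);
            if (value == null) {
                continue;
            }
            for (Map<String, String> mapping : mappings) {
                mapping.putIfAbsent(slot, value);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> setMetadata = new HashMap<>();
        setMetadata.put("mapping_tool", "example-tool");

        List<Map<String, String>> mappings = new ArrayList<>();
        mappings.add(new HashMap<>());
        mappings.add(new HashMap<>());

        List<String> slots = new ArrayList<>();
        slots.add("mapping_tool");

        propagate(setMetadata, mappings, slots);
        System.out.println(mappings);
    }
}
```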

The parser is deliberately liberal, in that: a) slots that it does not recognise are silently ignored and do not cause a parsing failure; b) it aims to deal gracefully with obsolete fields, converting them to their newer equivalents whenever possible. The second point in particular means that the parser should always be able to read a file regardless of which version of the specification the file conforms to.

By default, the parser automatically checks that the mapping set and all its mappings are compliant with the SSSOM specification. For more details on this check, see the validation section.

As a convenience, the TSV parser will automatically detect if the input file could be a file in the JSON format (if the first byte is a { character), and in that case will delegate the parsing to the JSON parser. This means you can always use the TSV parser when you don’t know in advance whether the input file is a TSV file or a JSON file.
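The first-byte test described above is simple enough to sketch. The following is an illustration of the technique only, not the reader's actual code; FormatSniffSketch and looksLikeJson are hypothetical names. It peeks at the first byte of a stream and then rewinds, so that whichever parser is chosen afterwards would still see the whole input.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Illustrative only: a hypothetical helper, not a class from the library.
class FormatSniffSketch {

    // Returns true if the first byte of the stream is '{'. The stream is
    // rewound afterwards, so it must support mark/reset.
    static boolean looksLikeJson(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(1);
            int first = in.read();
            in.reset();
            return first == '{';
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] json = "{\"mappings\": []}".getBytes(StandardCharsets.UTF_8);
        byte[] tsv = "#curie_map:\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(looksLikeJson(new ByteArrayInputStream(json)));
        System.out.println(looksLikeJson(new ByteArrayInputStream(tsv)));
    }
}
```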

1.2. SSSOM/CSV format

The TSVReader class described in the previous section also supports reading files in a “SSSOM/CSV” format, which is identical to the SSSOM/TSV format except that columns are separated by commas rather than by tabs.

Note that the SSSOM/CSV format is not described by the SSSOM specification. It is supported by this implementation solely as a convenience for users, since the SSSOM reference implementation (SSSOM-Py) supports it and it is already in use in the wild.

The TSVReader will try to automatically determine, when reading a file, whether it is a proper SSSOM/TSV file or really a SSSOM/CSV file. To force the reader to expect a tab-separated file, and trigger an error if the file happens to be a CSV file, call the setSeparatorMode(SeparatorMode.TAB) method; conversely, to force the reader to expect a comma-separated file, call the setSeparatorMode(SeparatorMode.COMMA) method.
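How the reader actually tells the two formats apart is not described here. Purely as an illustration of what such a detection could look like, the sketch below guesses the separator from the column header line. The heuristic, the SeparatorSniffSketch class and the guessSeparator method are all assumptions made for this example and should not be taken as the library's behaviour.

```java
// Illustrative only: a hypothetical heuristic, not the library's detection logic.
class SeparatorSniffSketch {

    // Guesses the column separator from the column header line, taken here to be
    // the first non-empty line that is not part of the commented metadata block.
    // Falls back to a tab if no such line is found.
    static char guessSeparator(String content) {
        for (String line : content.split("\n")) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            return line.indexOf('\t') >= 0 ? '\t' : ',';
        }
        return '\t';
    }

    public static void main(String[] args) {
        String tsv = "#mapping_set_id: https://example.org/set\nsubject_id\tpredicate_id\tobject_id\n";
        String csv = "#mapping_set_id: https://example.org/set\nsubject_id,predicate_id,object_id\n";
        System.out.println(guessSeparator(tsv) == '\t' ? "tab" : "comma");
        System.out.println(guessSeparator(csv) == '\t' ? "tab" : "comma");
    }
}
```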

1.3. JSON format

The JSON format is mostly a direct JSON rendering of the SSSOM data model. It is supported by the JSONReader class.

Note that the JSON format is currently not precisely defined by the SSSOM specification. Therefore, the behaviour of the reader may change in the future.

In particular, the specification says nothing about how the CURIE map (which may be needed to resolve CURIEs if the file makes use of them) should be stored in the JSON format. The parser supports two different ways:

  • The CURIE map is in a curie_map key in the top-level dictionary. Presumably, this is intended to become the standard way of storing the map.
  • The CURIE map is stored in a JSON-LD-like @context in the top-level dictionary. This is what SSSOM-Py produces, at least for now.

The parser is used in much the same way as the TSV parser:

import java.io.IOException;

import org.incenp.obofoundry.sssom.JSONReader;
import org.incenp.obofoundry.sssom.SSSOMFormatException;
import org.incenp.obofoundry.sssom.model.MappingSet;

try {
    JSONReader reader = new JSONReader("my-mappings.sssom.json");
    MappingSet ms = reader.read();
} catch (IOException ioe) {
    // A non-SSSOM I/O error occurred
} catch (SSSOMFormatException sfe) {
    // Invalid SSSOM data
}

It also behaves mostly in the same fashion. Notably:

  • CURIEs are automatically expanded.
  • Propagatable slots are automatically propagated, again unless propagation is explicitly disabled.
  • Obsolete slots are silently converted to their newer equivalents when possible.
  • Validation of mappings is done by default.

1.4. Validation

By default, all parsers (TSV, CSV, JSON) automatically check that the mapping set and all its mappings are valid according to the SSSOM specification.

Validation checks include:

  • checking that all required slots are indeed present;
  • checking for constructs that are syntactically valid but logically forbidden (and therefore cannot be enforced at parsing time); for example, checking that a literal mapping, that is a mapping whose subject (resp. object) is a literal, has a non-empty subject_label (resp. object_label) slot.
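The two kinds of checks listed above can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's validator: the ValidationSketch class is hypothetical, a mapping is reduced to a map of slot names to values, the list of required slots is limited to a few well-known ones, and a side of the mapping is treated as a literal when its type slot holds the value "rdfs literal".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical helper, not a class from the library.
class ValidationSketch {

    // Returns a list of problems found in the mapping (empty if none).
    static List<String> validate(Map<String, String> mapping) {
        List<String> errors = new ArrayList<>();
        requireSlot(mapping, "predicate_id", errors);
        requireSlot(mapping, "mapping_justification", errors);
        checkSide(mapping, "subject", errors);
        checkSide(mapping, "object", errors);
        return errors;
    }

    // A literal side needs a non-empty label; any other side needs an ID.
    private static void checkSide(Map<String, String> mapping, String side, List<String> errors) {
        boolean literal = "rdfs literal".equals(mapping.get(side + "_type"));
        requireSlot(mapping, literal ? side + "_label" : side + "_id", errors);
    }

    private static void requireSlot(Map<String, String> mapping, String slot, List<String> errors) {
        String value = mapping.get(slot);
        if (value == null || value.trim().isEmpty()) {
            errors.add("missing " + slot);
        }
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new HashMap<>();
        mapping.put("subject_type", "rdfs literal");
        mapping.put("predicate_id", "http://www.w3.org/2004/02/skos/core#exactMatch");
        mapping.put("object_id", "http://purl.obolibrary.org/obo/UBERON_0000105");
        mapping.put("mapping_justification", "https://w3id.org/semapv/vocab/ManualMappingCuration");
        // The literal subject has no subject_label, so one problem is reported
        System.out.println(validate(mapping));
    }
}
```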

Post-parsing validation can be controlled in two ways:

  • the setEnableValidation(boolean) method on the parser allows validation to be disabled completely;
  • the setValidation(ValidationLevel) method allows finer control over how strict the checks should be.

2. Writing mappings

2.1. TSV format

Use the TSVWriter class to write a mapping set to a file:

import java.io.IOException;

import org.incenp.obofoundry.sssom.TSVWriter;
import org.incenp.obofoundry.sssom.model.MappingSet;

MappingSet ms = new MappingSet();
// Fill the mapping set...
try {
    TSVWriter writer = new TSVWriter("my-new-mappings.sssom.tsv");
    writer.write(ms);
} catch (IOException ioe) {
    // A non-SSSOM I/O error occurred
}

When the constructor is called with a single argument, the metadata block will be written in the same file as the TSV section (“embedded mode”). Use the two-argument variant of the constructor to write the metadata block in a separate file (“external mode”):

TSVWriter writer = new TSVWriter("my-mappings.sssom.tsv", "my-meta.yml");

When writing, the CURIE map of the mapping set will be used to try shortening all identifiers. If a given identifier cannot be shortened, it will be written under its full-length form. The built-in prefixes defined in the SSSOM specification are always used even if they are not explicitly listed in the CURIE map.
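Shortening is the reverse of the expansion performed on reading. A minimal sketch, assuming that the longest matching URL prefix wins when several prefixes match (an assumption of this example, not a documented behaviour of the writer); CurieShorteningSketch and shorten are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a hypothetical helper, not a class from the library.
class CurieShorteningSketch {

    // Shortens a full identifier using the longest matching URL prefix.
    // Returns the identifier unchanged if no prefix matches.
    static String shorten(String iri, Map<String, String> prefixMap) {
        String bestName = null;
        String bestBase = "";
        for (Map.Entry<String, String> entry : prefixMap.entrySet()) {
            String base = entry.getValue();
            if (iri.startsWith(base) && base.length() > bestBase.length()) {
                bestName = entry.getKey();
                bestBase = base;
            }
        }
        return bestName == null ? iri : bestName + ":" + iri.substring(bestBase.length());
    }

    public static void main(String[] args) {
        Map<String, String> prefixMap = new HashMap<>();
        prefixMap.put("OBO", "http://purl.obolibrary.org/obo/");
        prefixMap.put("UBERON", "http://purl.obolibrary.org/obo/UBERON_");

        System.out.println(shorten("http://purl.obolibrary.org/obo/UBERON_0000105", prefixMap));
        System.out.println(shorten("https://example.org/unknown/123", prefixMap));
    }
}
```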

Some metadata slots are automatically condensed by the writer: if all the mappings in the set have the same value for a given slot, the value is written once in the mapping set metadata instead of being written for all mappings. This “condensation” is the inverse of the “propagation” that happens upon parsing. Condensation may be disabled by calling setCondensationEnabled(false).
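The rule stated above (same value everywhere, so write it once at the set level) can be sketched like this. It is a simplified illustration rather than the writer's implementation: CondensationSketch is a hypothetical class, mappings are plain maps of slot names to values, and the question of which slots are eligible for condensation is left aside.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical helper, not a class from the library.
class CondensationSketch {

    // If every mapping has the same non-null value for the slot, moves that
    // value to the set-level metadata and removes it from the mappings.
    // Returns true if the slot was condensed.
    static boolean condense(String slot, Map<String, String> setMetadata,
                            List<Map<String, String>> mappings) {
        if (mappings.isEmpty()) {
            return false;
        }
        String common = mappings.get(0).get(slot);
        if (common == null) {
            return false;
        }
        for (Map<String, String> mapping : mappings) {
            if (!common.equals(mapping.get(slot))) {
                return false;
            }
        }
        for (Map<String, String> mapping : mappings) {
            mapping.remove(slot);
        }
        setMetadata.put(slot, common);
        return true;
    }

    public static void main(String[] args) {
        List<Map<String, String>> mappings = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            Map<String, String> mapping = new HashMap<>();
            mapping.put("mapping_tool", "example-tool");
            mappings.add(mapping);
        }
        Map<String, String> setMetadata = new HashMap<>();
        System.out.println(condense("mapping_tool", setMetadata, mappings));
        System.out.println(setMetadata);
    }
}
```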

2.2. SSSOM/CSV format

The TSVWriter class described in the previous section can be made to produce a comma-separated file (SSSOM/CSV format) instead of a tab-separated file, by calling the enableCSV(true) method.

2.3. JSON format

Use the JSONWriter class to write a mapping set to a file:

import java.io.IOException;

import org.incenp.obofoundry.sssom.JSONWriter;
import org.incenp.obofoundry.sssom.model.MappingSet;

MappingSet ms = new MappingSet();
// Fill the mapping set...
try {
    JSONWriter writer = new JSONWriter("my-new-mappings.sssom.json");
    writer.write(ms);
} catch (IOException ioe) {
    // A non-SSSOM I/O error occurred
}

Since the JSON format is currently unspecified, the JSON writer offers control over some aspects of its behaviour:

  • By default, all identifiers are written in their full-length form. Call setShortenIRIs(true) to enable writing the identifiers in CURIE form.
  • When a CURIE map is needed to interpret the set (when identifiers are written in CURIE form), it is by default written in a curie_map slot. Call setWriteCurieMapInContext(true) to write the CURIE map in a JSON-LD-like @context dictionary. This is intended for compatibility with SSSOM-Py.