Whenever I need to programmatically manipulate an OWL ontology, my toolkit of choice is usually Java and the OWLAPI, and I will typically write the code as a ROBOT pluggable command, so that I can delegate the boring stuff (e.g. loading the ontology from disk, parsing the command-line options) to ROBOT and focus instead on the interesting bits of the task at hand. This is especially useful and efficient if said task is supposed to be part of a larger ROBOT pipeline.
But when the task is supposed to be part of a larger Python-based pipeline instead, using Java is suddenly much less practical, and I’d rather perform the task directly in Python if possible.
This raises the question of which Python library to use to manipulate ontologies. The same question in Java is a no-brainer for me, because the OWLAPI is the one library to rule them all. But in Python, things are much less clear-cut.
In this post, I put several ontology-related Python libraries to the test by using them to perform a simple task.
Given an ontology (ideally in any format) and a list of terms, I need to check for each term whether it corresponds exactly to the label or to an exact synonym of one of the classes in the ontology. If it corresponds to a label, then I need to get the shortened identifier (“CURIE”) of the matching class; if it corresponds to an exact synonym, then I need to get both the label and the shortened identifier of the matching class; if it doesn’t correspond to anything, I must get an “unknown term” error.
For example, if the ontology is the Drosophila Anatomy Ontology (hereafter “FBbt”) and the list of terms is as follows:
```
adult dorsal vessel
T neuron T2
frobnicator muscle
```
then the expected output should be:
```
adult dorsal vessel ; FBbt:00003152
T neuron T2 -> T2 neuron ; FBbt:00003728
Unknown term: frobnicator muscle
```
because “adult dorsal vessel” is the label of the FBbt:00003152 class, “T neuron T2” is an exact synonym for the FBbt:00003728 class (whose label is “T2 neuron”), and there is no such thing as a “frobnicator muscle” (at least not in Drosophila).
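Independently of any library, the contract can be sketched with plain dictionaries. In this miniature sketch, the two lookup tables are filled by hand for illustration; each library tested below is essentially a different way of building them from the ontology file:

```python
# Hypothetical miniature of the task: the ontology reduced to two
# hand-filled lookup tables (label -> CURIE, exact synonym -> label).
labels = {"adult dorsal vessel": "FBbt:00003152", "T2 neuron": "FBbt:00003728"}
synonyms = {"T neuron T2": "T2 neuron"}


def lookup(term: str) -> str:
    if term in labels:
        return f"{term} ; {labels[term]}"
    if term in synonyms:
        label = synonyms[term]
        return f"{term} -> {label} ; {labels[label]}"
    return f"Unknown term: {term}"


for t in ["adult dorsal vessel", "T neuron T2", "frobnicator muscle"]:
    print(lookup(t))
```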
The first tested library is Pronto, because it happens to be the one I already knew about and had already used.
Here’s what code performing the task described above could look like with Pronto:
```python
import sys

from pronto import Ontology
from pronto.term import Term


class OntologyWrapper:
    ont: Ontology
    terms_by_label: dict[str, Term] = {}
    terms_by_synonym: dict[str, Term] = {}

    def __init__(self, path: str, prefix: str):
        self.ont = Ontology(path)
        for term in self.ont.terms():
            if not term.id.startswith(prefix):
                # Ignore non-FBbt terms
                continue
            if term.name is None:
                # Should not happen because all terms in FBbt should
                # have a label, but Mypy does not know that, so guarding
                # against the absence of label keeps Mypy happy
                continue
            self.terms_by_label[term.name] = term
            for synonym in [s for s in term.synonyms if s.scope == "EXACT"]:
                self.terms_by_synonym[synonym.description] = term

    def lookup(self, s: str) -> str:
        term = self.terms_by_label.get(s)
        if term:
            return f"{term.name} ; {term.id}"
        else:
            term = self.terms_by_synonym.get(s)
            if term:
                return f"{s} -> {term.name} ; {term.id}"
            else:
                return f"Unknown term: {s}"


wrapper = OntologyWrapper("fbbt.obo", "FBbt:")
for file in sys.argv[1:]:
    with open(file, "r") as f:
        for line in f:
            print(wrapper.lookup(line.strip()))
```
This is pretty straightforward. We define an OntologyWrapper class to do two things:1 upon initialisation, it indexes all the terms of the ontology by their label and by their exact synonyms; its lookup method then resolves a query string against those indexes.
Pronto supports three different input formats: OBO, OBOGraph-JSON, and RDF/XML. However, the best performance is obtained with the OBO format. Loading an ontology in OBOGraph-JSON or RDF/XML is, according to my tests, three to five times slower than loading the same ontology in OBO (on my machine and with the latest development version of FBbt,2 the code above takes ~1.5s in OBO, versus ~5s in JSON or RDF/XML). More importantly, loading from RDF/XML causes the library to emit lots of warnings.
Unfortunately, because Pronto relies on Fastobo (see further below) for OBO parsing, it will fail to parse some OBO files that make use of syntactic constructs recently allowed in the OWLAPI OBO parser and serialiser, but that are not described in the latest available specification for the format.
It so happens that FBbt, the main ontology I work with, makes no use of such constructs (yet?), and so its OBO version is fully parseable with Fastobo and therefore with Pronto. But ontologies like the Uberon anatomy ontology and the Cell Ontology (CL) do use constructs that make them unusable with Fastobo/Pronto.
Verdict: Pronto is good when working with OBO files, with the important caveat that not all OBO files will be supported. If your files are supported, then Pronto is quite fast and has a reasonably intuitive interface.
Fastobo is a Rust library (with Python bindings) specifically intended to parse OBO files. As noted above, it is the backend used by the Pronto library when loading from OBO files.
Here’s a Fastobo version of the OntologyWrapper class we’ve seen above with Pronto:3
```python
import fastobo
from fastobo.doc import OboDoc
from fastobo.term import TermFrame, NameClause, SynonymClause


class OntologyWrapper:
    ont: OboDoc
    curies_by_label: dict[str, str] = {}
    curies_by_synonym: dict[str, str] = {}
    labels_by_curie: dict[str, str] = {}

    def __init__(self, path: str, prefix: str):
        self.ont = fastobo.load(path)
        for term in [
            frame
            for frame in self.ont
            if type(frame) == TermFrame and frame.id.prefix == prefix
        ]:
            curie = term.id.prefix + ":" + term.id.local
            for clause in term:
                if type(clause) == NameClause:
                    self.curies_by_label[clause.name] = curie
                    self.labels_by_curie[curie] = clause.name
                elif type(clause) == SynonymClause:
                    if clause.synonym.scope == "EXACT":
                        self.curies_by_synonym[clause.synonym.desc] = curie

    def lookup(self, s: str) -> str:
        curie = self.curies_by_label.get(s)
        if curie:
            return f"{s} ; {curie}"
        else:
            curie = self.curies_by_synonym.get(s)
            if curie:
                label = self.labels_by_curie[curie]
                return f"{s} -> {label} ; {curie}"
            else:
                return f"Unknown term: {s}"


wrapper = OntologyWrapper("fbbt.obo", "FBbt")
```
Compared to Pronto, Fastobo has a lower-level interface that is much closer to the structure of an OBO file. It helps to be familiar with that structure (e.g. to know what “frames” and “clauses” are) to understand how the library can be used. For example, if a class has a logical definition, what would normally be represented as a single EquivalentClasses axiom in a higher-level library will be represented in Fastobo as a list of IntersectionOfClause objects, because logical definitions are represented in the OBO format as a list of intersection_of tags.
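For illustration, here is what such a logical definition looks like in OBO syntax (a made-up frame, not an actual FBbt term): each intersection_of line becomes its own clause object when parsed by Fastobo.

```
[Term]
id: FBbt:99999999
name: example term
intersection_of: FBbt:00005069 ! some genus class
intersection_of: part_of FBbt:00003152 ! adult dorsal vessel
```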
Performance-wise, Fastobo lives up to its name: the library is fast. In fact, it is the fastest of all the libraries tested here. It reads FBbt in less than half a second – no other library comes in under the one-second mark.
Unfortunately, Fastobo suffers from the same problem as Pronto (expectedly, since Pronto uses Fastobo under the hood): it will not be able to read all OBO files, including the OBO versions of Uberon and CL.
In fairness to Fastobo, it’s not really the library’s fault. Fastobo implements the OBO format as described in the closest thing to a formal specification that the format has: the OBO Flat File Format 1.4 Syntax and Semantics. But that document has little normative value. OBO hackers don’t seem to feel particularly constrained by it (certainly less than an OWL hacker feels compelled to follow the OWL specifications). In effect, the OBO format is practically defined by what is produced and accepted by the OBO serialiser and parser of the OWLAPI library – the de facto reference implementation of the format – regardless of how much this deviates from the tentative specification.4
This is in fact one of the reasons I am strongly in favour of ditching the OBO format and using any of the other OWL serialisation formats (OWL Functional Syntax, Manchester Syntax, RDF/XML…) instead. Those formats are much better defined than the OBO format, which is, at its core, a hack – a hack that has served its purpose and which should now be allowed to rest in peace.
Verdict: Fastobo is good if ① you have only compatible OBO files, ② you know enough of the OBO format to be happy with the low-level interface, and ③ you need top-notch performance. I can’t really fault it for supporting only the OBO format (the way I would normally do for other libraries), since it is specifically an OBO library, not a generic ontology library.
Next up is Owlready2. Without further ado, here’s the code:
```python
from os.path import realpath

from owlready2 import get_ontology
from owlready2.namespace import Ontology
from owlready2.entity import ThingClass


class OntologyWrapper:
    ont: Ontology
    terms_by_label: dict[str, ThingClass] = {}
    terms_by_synonym: dict[str, ThingClass] = {}
    prefix_len: int
    prefix_name: str

    def __init__(self, path: str, prefix_name: str, prefix: str):
        self.ont = get_ontology("file://" + realpath(path))
        self.ont.load()
        for klass in [k for k in self.ont.classes() if k.iri.startswith(prefix)]:
            for label in klass.label:
                self.terms_by_label[label] = klass
            for syn in klass.hasExactSynonym:
                self.terms_by_synonym[syn] = klass
        self.prefix_len = len(prefix)
        self.prefix_name = prefix_name

    def to_curie(self, iri: str) -> str:
        return self.prefix_name + ":" + iri[self.prefix_len:]

    def lookup(self, s: str) -> str:
        term = self.terms_by_label.get(s)
        if term:
            curie = self.to_curie(term.iri)
            return f"{term.label[0]} ; {curie}"
        else:
            term = self.terms_by_synonym.get(s)
            if term:
                curie = self.to_curie(term.iri)
                return f"{s} -> {term.label[0]} ; {curie}"
            else:
                return f"Unknown term: {s}"


wrapper = OntologyWrapper("fbbt.owl", "FBbt", "http://purl.obolibrary.org/obo/FBbt_")
```
Owlready2 supports reading from RDF/XML, OWL/XML, and N-Triples. It does not support the OBO format (which I don’t mind – I’d rather have a library that supports RDF/XML but not OBO than the other way around). More annoyingly, it does not support OWL Functional Syntax.
Regardless of the syntax, Owlready2 does not support the presence of punned entities. It is a known issue that is seemingly going to be left unfixed on purpose. This unfortunately means the library cannot be used to work with a standard release of CL, which does contain a few punned entities.
Also, because it is an OWL library and not an OBO library, it has – expectedly – no built-in concept of “CURIE”. Entities are only ever identified by their full-length IRI; if you need CURIEs for some reason, you have to shorten the identifiers yourself (as the code above does). This may come as an annoyance to the OBO folks, but I personally don’t mind – in fact, I prefer it that way.
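The shortening is simple enough to do by hand. A minimal sketch, assuming OBO Foundry-style IRIs and a hand-maintained prefix map (the helper name and the CL entry are illustrative, not from any library):

```python
# Hypothetical helper: shorten full-length IRIs to CURIEs using a
# hand-maintained prefix map (only the FBbt entry is needed for this post).
PREFIXES = {
    "FBbt": "http://purl.obolibrary.org/obo/FBbt_",
    "CL": "http://purl.obolibrary.org/obo/CL_",
}


def to_curie(iri: str) -> str:
    for name, expansion in PREFIXES.items():
        if iri.startswith(expansion):
            return name + ":" + iri[len(expansion):]
    return iri  # No known prefix: return the full IRI unchanged


print(to_curie("http://purl.obolibrary.org/obo/FBbt_00003152"))
# FBbt:00003152
```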
However, I do dislike the way annotations are made accessible in the interface: they appear as built-in attributes of the object representing the entity, under a name derived from the name of the annotation property (as in klass.hasExactSynonym to access an annotation with the http://www.geneontology.org/formats/oboInOwl#hasExactSynonym property). From experience, this kind of syntactic sugar regularly ends up doing more harm than good. For example, what would happen if the ontology had two different annotation properties with an identical local name, but in two different namespaces? I did the test with a class carrying both a http://www.geneontology.org/formats/oboInOwl#hasExactSynonym annotation and a https://example.org/hasExactSynonym annotation: klass.hasExactSynonym only returns the latter annotation value, and I have no idea how to get the former, or if it is even possible. (Granted, I do not have any real ontology where this happens, but I don’t think this is a far-fetched scenario.)
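The root of the problem is easy to model: if annotation values end up keyed on the local name of the property alone, two properties sharing a local name silently overwrite each other. A contrived sketch of the effect (this is an illustration of the failure mode, not Owlready2’s actual internals):

```python
# Contrived illustration: store annotation values under the local name
# of their property, as attribute-based access effectively does.
def local_name(iri: str) -> str:
    # Strip everything up to the last '#' or '/'
    return iri.rsplit("#", 1)[-1].rsplit("/", 1)[-1]


annotations: dict[str, str] = {}
for prop, value in [
    ("http://www.geneontology.org/formats/oboInOwl#hasExactSynonym", "T neuron T2"),
    ("https://example.org/hasExactSynonym", "bogus synonym"),
]:
    annotations[local_name(prop)] = value  # the second assignment silently wins

print(annotations["hasExactSynonym"])
# bogus synonym
```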
Performance-wise, Owlready2 reads the RDF/XML version of FBbt in about 7 seconds on my machine, which is not too bad (some other libraries fare far worse) but is at the upper end of what I am willing to accept.
Verdict: Owlready2 is good if ① you are not working with OBO files, ② you need not worry about the presence of punned entities (lucky you!), and ③ you do not mind the quirks of the interface. For my part, I will avoid it, since I do mind those quirks, as explained above, and I do have punned entities in at least some of the ontologies I work with.
PyOBO leaves me with very mixed feelings.
On the surface, that library seemingly offers the easiest way to do exactly what I need, with the pyobo.ground high-level function:
```
>>> pyobo.ground("fbbt", "T neuron T2")
NormalizedNamableReference(prefix="fbbt", identifier="00003728", name="T2 neuron")
```
Nice, right?
Well, except for one thing: this function is automatically using the latest published version of FBbt, downloaded from the Internet if it is not already in PyOBO’s local cache. That’s great if you are a user of FBbt, but in my case I edit the ontology and most of the time, I want to work with the “development” version of the ontology – the version that is checked out locally on my computer and that may contain dozens, sometimes hundreds, of edits that have not yet found their way to a published version.
Unfortunately, there doesn’t seem to be a way to force pyobo.ground to use a local version of an ontology. I was hoping it would be possible to manually load an ontology from file and put it into PyOBO’s cache, like this:
```
>>> fbbt = pyobo.from_obo_path("fbbt.obo", prefix="FBbt", version="dev")
>>> fbbt.write_cache()
```
But while this does write a bunch of files in the cache, it has no effect on pyobo.ground, which will always attempt to download FBbt from the Internet no matter what is in the cache. At this point I am not sure if it’s a bug, or if what I’m trying to do is simply not a supported use case – it certainly doesn’t look like a supported use case according to what little documentation is available.
So, I have to forget about pyobo.ground, and more generally about almost all the high-level functions in pyobo (which share the same tendency to always download a fresh version from the Internet), and instead work with the ontology object returned by pyobo.from_obo_path – an object whose methods will at least use the data read from the provided file.
Here’s then the PyOBO version of the OntologyWrapper class:
```python
from pyobo import from_obo_path, Obo, Term


class OntologyWrapper:
    ont: Obo
    prefix: str
    terms_by_label: dict[str, Term] = {}
    terms_by_synonym: dict[str, Term] = {}

    def __init__(self, path: str, prefix: str):
        self.ont = from_obo_path(path, prefix, version="dev")
        self.prefix = prefix
        for term in self.ont.iter_terms():
            self.terms_by_label[term.name] = term
            for synonym in [s for s in term.synonyms if s.specificity == "EXACT"]:
                self.terms_by_synonym[synonym.name] = term

    def lookup(self, s: str) -> str:
        term = self.terms_by_label.get(s)
        if term:
            return f"{term.name} ; {self.prefix}:{term.identifier}"
        else:
            term = self.terms_by_synonym.get(s)
            if term:
                return f"{s} -> {term.name} ; {self.prefix}:{term.identifier}"
            else:
                return f"Unknown term: {s}"


wrapper = OntologyWrapper("fbbt.obo", "FBbt")
```
The interface overall is very similar to that of other libraries like Pronto. But the performance is painfully bad: it takes 45 seconds to read FBbt, which makes PyOBO the slowest of the OBO parsers I have tested, by a large margin.
Verdict: I suppose PyOBO must be very useful for people who just want to use information from OBO ontologies without ever worrying about how said information is obtained – in fact, all the examples given in the project’s README do just that. But it is clearly not intended for working efficiently with local ontologies, which happens to be what I need to do the most.
Then comes FunOWL, which seems much more useful for building ontologies than for querying one – as readily acknowledged in the project’s README: the library is firstly intended as a generator, and only secondarily as a consumer. As far as I can tell, the library offers no function to, say, get a given class in an ontology, or even to get all classes. Once an ontology has been parsed, all we can get from it is the complete list of axioms, which we must sift through to obtain the information we need. So it’s a very low-level interface, a bit like Fastobo for OBO files.
Here’s what the OntologyWrapper class could look like with FunOWL:
```python
from funowl import AnnotationAssertion, OntologyDocument
from funowl.converters.functional_converter import to_python


class OntologyWrapper:
    ont: OntologyDocument
    curies_by_label: dict[str, str] = {}
    curies_by_synonym: dict[str, str] = {}
    labels_by_curie: dict[str, str] = {}

    def __init__(self, path: str, prefix: str):
        self.ont = to_python(path)
        for axiom in [
            ax
            for ax in self.ont.ontology.axioms
            if type(ax) == AnnotationAssertion and ax.subject.v.v.startswith(prefix)
        ]:
            curie = self.to_curie(axiom.subject.v.v)
            if axiom.property.v == "rdfs:label":
                label = axiom.value.v.v
                self.curies_by_label[label] = curie
                self.labels_by_curie[curie] = label
            elif axiom.property.v == "oboInOwl:hasExactSynonym":
                self.curies_by_synonym[axiom.value.v.v] = curie

    def to_curie(self, s: str) -> str:
        return ":".join(s.split(":", 1)[1].split("_", 1))

    def lookup(self, s: str) -> str:
        curie = self.curies_by_label.get(s)
        if curie:
            return f"{s} ; {curie}"
        else:
            curie = self.curies_by_synonym.get(s)
            if curie:
                label = self.labels_by_curie[curie]
                return f"{s} -> {label} ; {curie}"
            else:
                return f"Unknown term: {s}"


wrapper = OntologyWrapper("fbbt.ofn", "obo:FBbt_")
```
Alas, FunOWL’s parsing performance is horrendously bad: it reads the Functional Syntax version of FBbt in… 3 minutes and 30 seconds! Also, it fails to parse the latest version of CL for some reason (some uncaught AttributeError; I didn’t investigate further, since at this point it was clear I was not going to use the library anyway).
Verdict: I will make no judgment on FunOWL’s usefulness as an OWL generator (its primary intended use), as I did not test that aspect of the library at all (generating ontologies is not something I often need to do, or rather not something I often need to do in Python). But as a Functional Syntax parser, it is practically unusable.
Py-Horned-OWL is a set of Python bindings for the Horned-OWL Rust library.
As usual, here is the Py-Horned-OWL version of the OntologyWrapper class:
```python
from pyhornedowl import open_ontology, PyIndexedOntology


class OntologyWrapper:
    ont: PyIndexedOntology
    iris_by_label: dict[str, str] = {}
    iris_by_synonym: dict[str, str] = {}
    labels_by_iri: dict[str, str] = {}

    def __init__(self, path: str, prefix_name: str, prefix: str):
        self.ont = open_ontology(path)
        self.ont.add_prefix_mapping(prefix_name, prefix)
        self.ont.add_prefix_mapping("rdfs", "http://www.w3.org/2000/01/rdf-schema#")
        self.ont.add_prefix_mapping(
            "oio", "http://www.geneontology.org/formats/oboInOwl#"
        )
        self.ont.build_indexes()
        for klass in [c for c in self.ont.get_classes() if c.startswith(prefix)]:
            for label in self.ont.get_annotations(klass, "rdfs:label"):
                self.iris_by_label[label] = klass
                self.labels_by_iri[klass] = label
            for synonym in self.ont.get_annotations(klass, "oio:hasExactSynonym"):
                self.iris_by_synonym[synonym] = klass

    def lookup(self, s: str) -> str:
        klass = self.iris_by_label.get(s)
        if klass:
            curie = self.ont.get_id_for_iri(klass)
            return f"{s} ; {curie}"
        else:
            klass = self.iris_by_synonym.get(s)
            if klass:
                curie = self.ont.get_id_for_iri(klass)
                label = self.labels_by_iri[klass]
                return f"{s} -> {label} ; {curie}"
            else:
                return f"Unknown term: {s}"


wrapper = OntologyWrapper("fbbt.owl", "FBbt", "http://purl.obolibrary.org/obo/FBbt_")
```
The library supports the RDF/XML, OWL/XML, and Functional Syntax formats, can read without a glitch all the ontologies I routinely work with, and does so with very decent performance (about 2 seconds to read the RDF/XML version of FBbt).
Its interface is somewhat reminiscent of that of the Java OWLAPI, for example in that the annotations for a given entity must be obtained from the top-level ontology object, rather than from an object representing the entity itself (as can be seen in the wrapper above: ont.get_annotations(klass, ...), rather than klass.get_annotations(...)). I suspect many OBO folks might not like that and would prefer a Pronto- or Owlready2-like interface, but I personally don’t mind at all. In fact, for someone used to the OWLAPI, Py-Horned-OWL’s interface has a nice “homey” feeling.
Verdict: Py-Horned-OWL is great if you ① are working with any kind of OWL files, ② have no need for OBO support, and ③ are a defector from the land of Java who misses the OWLAPI. ;) In fact, Py-Horned-OWL is the closest thing to the ideal Python OWL library that I have been looking for – I only wish I had discovered it sooner.
Finally, there’s the Ontology Access Kit (OAK).
This one is a bit special in that it does not provide its own parsers (apart from the “simpleobo” OBO parser), but instead aims to provide a common, high-level interface to several underlying libraries and their parsers.
Here’s the OAK-based OntologyWrapper class:
```python
from oaklib import get_adapter
from oaklib.datamodels.search import SearchConfiguration, SearchProperty
from oaklib.interfaces import SearchInterface


class OntologyWrapper:
    ont: SearchInterface
    cfg: SearchConfiguration = SearchConfiguration(
        properties=[SearchProperty.LABEL, SearchProperty.ALIAS]
    )
    prefix: str

    def __init__(self, selector: str, prefix: str):
        self.ont = get_adapter(selector)
        self.prefix = prefix

    def lookup(self, s: str) -> str:
        found = [
            c
            for c in self.ont.basic_search(s, config=self.cfg)
            if c.startswith(self.prefix)
        ]
        if not found:
            return f"Unknown term: {s}"
        found = found[0]
        label = self.ont.label(found)
        if label == s:
            return f"{label} ; {found}"
        else:
            return f"{s} -> {label} ; {found}"


wrapper = OntologyWrapper("pronto:fbbt.obo", "FBbt")
```
It is significantly different from all the other versions, since the OAK high-level interface spares us from having to do most of the heavy lifting.
We can’t really discuss the performance and limitations (e.g. in terms of supported formats) of the Ontology Access Kit itself, since they are almost entirely dependent on the library used under the hood.
In the example above, the backend library is Pronto, and we thus inherit the aforementioned limitations of that library, including the inability to parse some OBO files (such as recent versions of CL) and slower performance when parsing an RDF/XML file compared to an OBO file.
The best performance, sometimes by a large margin, is obtained with the sqlite backend (which seems clearly intended as the “primary” backend), which requires that the ontology be converted to the SemSQL format first. Assuming a fbbt.owl file exists in the current directory, this can be done with:

```
$ semsql make fbbt.db
```
On my machine, this takes approximately 3 minutes and yields a SQLite file of about 2.5GB (from a 112MB RDF/XML file). But this only needs to be done once (at least as long as the ontology doesn’t change), and it is really the only way to ① load any ontology (bypassing all the limitations of the other backends) and ② get decent performance.
You can also let OAK automatically download and cache pre-built SemSQL versions of the most common OBO ontologies, if you do not want to do that yourself. But in my case, as explained above when discussing PyOBO, this is not an option, as I need to work with my local version of FBbt, not the latest published version.
Verdict: The OAK is good if ① you want/need the high-level interface or features and ② you do not mind having to generate SemSQL versions of your ontologies.
For completeness, let us try RDFLib. It is not, strictly speaking, a library for manipulating ontologies, but since an OWL ontology can be represented as an RDF graph, it can also be manipulated as one.
Its use is very similar to that of FunOWL above, except that instead of sifting through axioms, we have to sift through triples – which in effect makes little difference for the use case considered here, since we are only interested in annotation assertion axioms, and each one of those corresponds to exactly one triple.
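For instance, the label of the class from the earlier example boils down to a single triple, shown here in Turtle syntax for readability:

```
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix obo: <http://purl.obolibrary.org/obo/> .

obo:FBbt_00003152 rdfs:label "adult dorsal vessel" .
```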
Here’s the RDFLib-based version of the OntologyWrapper class:
```python
from typing import ClassVar

from rdflib import Graph, URIRef


class OntologyWrapper:
    ont: Graph
    iris_by_label: dict[str, str] = {}
    iris_by_synonym: dict[str, str] = {}
    labels_by_iri: dict[str, str] = {}
    prefix_name: str
    prefix_len: int

    LABEL: ClassVar = URIRef("http://www.w3.org/2000/01/rdf-schema#label")
    SYNONYM: ClassVar = URIRef(
        "http://www.geneontology.org/formats/oboInOwl#hasExactSynonym"
    )

    def __init__(self, path: str, prefix_name: str, prefix: str):
        self.ont = Graph().parse(path)
        self.prefix_name = prefix_name
        self.prefix_len = len(prefix)
        for subject in [
            s
            for s in self.ont.subjects(unique=True)
            if type(s) == URIRef and s.startswith(prefix)
        ]:
            for label in self.ont.objects(subject=subject, predicate=self.LABEL):
                self.iris_by_label[str(label)] = str(subject)
                self.labels_by_iri[str(subject)] = str(label)
            for synonym in self.ont.objects(subject=subject, predicate=self.SYNONYM):
                self.iris_by_synonym[str(synonym)] = str(subject)

    def to_curie(self, iri: str) -> str:
        return self.prefix_name + ":" + iri[self.prefix_len:]

    def lookup(self, s: str) -> str:
        iri = self.iris_by_label.get(s)
        if iri:
            curie = self.to_curie(iri)
            return f"{s} ; {curie}"
        else:
            iri = self.iris_by_synonym.get(s)
            if iri:
                label = self.labels_by_iri[iri]
                curie = self.to_curie(iri)
                return f"{s} -> {label} ; {curie}"
            else:
                return f"Unknown term: {s}"


wrapper = OntologyWrapper("fbbt.owl", "FBbt", "http://purl.obolibrary.org/obo/FBbt_")
```
This works for all the ontologies I need to work with, but, performance-wise, it is a bit too slow for my taste (about 20 seconds to parse FBbt). Still, it would have been a reasonable fallback for working with CL or Uberon if Py-Horned-OWL had not been available.