Your Database Is Talking; Is Anybody Listening?

Biological linguists develop new ontologies for richer,
cross-database searches

By Amy Adams 
 
During most of the 1990s, a linguistic chasm divided the worlds of flies,
worms, mice, and other model organisms. People in one world remained largely
ignorant about related genes and proteins being studied in the others,
in part because each group stored data using its own peculiar vocabulary.
Even within a single organism, a search for genes involved in "translation"
might not pull up those described using the term "protein synthesis,"
and vice versa.

Michael Ashburner, a fly geneticist at Cambridge University, thought what the
genetics field needed was a universal language to bring the data together.
"It seemed to me self-evident that if all model organism databases used common
language for describing gene products, then we'd be able to have some
unification," he says.

His idea finally took hold, and in 1998 resulted in what is now the most widely
used structured language, or ontology, to describe the biological world: the
Gene Ontology (GO). Although the GO originally encompassed only fly, mouse,
and yeast data, it is now broadly used by databases for most model organisms.

Seven years later, Ashburner is helping to guide the burgeoning field through
growing pains brought on by the success of the original ontology. About 50
ontologies now belong to the Open Biomedical Ontologies (OBO) - also
administered by Ashburner - and together make up a formal way of describing
everything from human disease to animal natural history. These languages are
internally consistent, but that's not necessarily true externally, ontology to
ontology. What is needed is a way for these disparate ontologies to talk to
each other - a biological lingua franca.

AN ONTOLOGY PRIMER

Like human languages, each ontology has a slightly different structure. But
the GO is a good model for how the languages are formed. Each term in the
ontology has an identification number and a definition. For example, the
term "aging" has the identifier GO:0016280 and the definition: "The inherent
decline over time, from the optimal fertility and viability of early maturity
that culminates in death and may be preceded by other indications such as
sterility."

A term may also carry synonyms, allowing a search for "translation" to
pull up entries using the term "protein synthesis" instead. Every term belongs
to one of three categories: "cellular_component," "molecular_function," or
"biological_process." Aging is classified as a biological_process. Each term can
also be related to other terms through the relations is_a and part_of. Thus,
Aging is_a Development (GO:0007275) and is part_of Death (GO:0016265).
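
The term structure described above can be sketched in a few lines of Python. This is only an illustration, not the GO's actual file format or any database's search code: the Term class, its field names, and the find_terms function are assumptions made for the example. The identifiers for aging, development, and death come from the text; the identifier given to "translation" is a placeholder, not a real GO ID.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One ontology term with the fields described in the text."""
    term_id: str
    name: str
    namespace: str                                  # e.g. "biological_process"
    synonyms: list = field(default_factory=list)
    is_a: list = field(default_factory=list)        # IDs of parent terms
    part_of: list = field(default_factory=list)     # IDs of containing terms


# IDs for aging, development, and death are quoted from the text.
# "EX:0000001" is a placeholder, not a real GO identifier.
TERMS = [
    Term("GO:0016280", "aging", "biological_process",
         is_a=["GO:0007275"], part_of=["GO:0016265"]),
    Term("GO:0007275", "development", "biological_process"),
    Term("GO:0016265", "death", "biological_process"),
    Term("EX:0000001", "translation", "biological_process",
         synonyms=["protein synthesis"]),
]


def find_terms(query, terms=TERMS):
    """Return terms whose name or any synonym equals the query (case-insensitive)."""
    q = query.strip().lower()
    return [t for t in terms
            if q == t.name.lower() or q in (s.lower() for s in t.synonyms)]


# A search on the synonym finds the term stored under its primary name.
print([t.name for t in find_terms("protein synthesis")])
```

Because the synonym is stored on the term itself, a database only has to annotate a gene product once, under the term's identifier, for both search phrases to find it.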

Other ontologies follow a similar approach, with some exceptions in the
relational terms. The Mouse Anatomy Ontology uses only part_of, whereas the
Drosophila Anatomy Ontology uses both relations from the GO plus the additional
develops_from.

The recently released Sequence Ontology (SO)[1] takes the approach one step
further. Assembled in part by Ashburner and Chris Mungall of the University of
California, Berkeley, who has helped develop several structured languages, it
uses the terms difference and overlap to ask questions about the part_of
relationship. The SO's primary goal was to unify gene annotations from
different genomes so that a single set of tools could search and display
results from any database. Among the inconsistencies it cleared up is the
placement of the stop codon, which may be part of the coding sequence in one
annotation but part of the 3' untranslated region in others.

The additional terms allow the ontology to express relationships between parts
of a whole. For example, two different transcripts may be defined as part_of the
same gene. If the gene has three exons, then an exon found in both transcripts
would overlap and be part_of both transcripts. Exons found in only one or the
other transcript would be part_of the transcript and the gene, and would be a
difference between the two transcripts. This additional information, Ashburner
says, makes it possible to mine even more information from a database through
the SO.
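
The exon example above can be read as simple set arithmetic. The sketch below is one such reading, not the SO's formal definition of its operators: the exon and transcript names are hypothetical, and the overlap and difference functions are assumptions made for illustration.

```python
# Hypothetical gene with three exons and two transcripts.
gene_exons = {"exon1", "exon2", "exon3"}
transcripts = {
    "transcript_A": {"exon1", "exon2"},   # exons that are part_of transcript A
    "transcript_B": {"exon2", "exon3"},   # exons that are part_of transcript B
}


def overlap(a, b):
    """Exons that are part_of both transcripts."""
    return transcripts[a] & transcripts[b]


def difference(a, b):
    """Exons that are part_of exactly one of the two transcripts."""
    return transcripts[a] ^ transcripts[b]


shared = overlap("transcript_A", "transcript_B")
unique = difference("transcript_A", "transcript_B")
print(sorted(shared), sorted(unique))
```

Every exon in either result is still part_of the gene; the two operations only describe how the transcripts relate to each other.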

ONTOLOGIES IN THE LAB

The end result of this linguistic tinkering is something that remains largely
invisible to the bench scientist. He or she simply knows that by entering a
search term such as Aging into FlyBase, the database will retrieve all gene
products that are defined as being involved in that process. And all the terms
are in what appears to be plain English. In fact, inclusion in the OBO requires
that the terms and their associated definitions be clear to the reader.

"I think we've found a middle ground that looks familiar to biologists, but is
more systematized," says Midori Harris, GO editor at the Wellcome Trust Genome
Campus in Cambridge, UK.

Other widely available tools, particularly those for gene expression analysis,
make use of GO terminology. For any gene that is up- or down-regulated in a
given sample, associated GO information adds context about what that gene's
product does in a cell. Likewise, it's possible to find GO terms in common
between clusters of differentially regulated genes, indicating that the
genes in a particular group are all involved in the cell cycle or are all
expressed in a given cellular compartment.
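
Finding GO terms in common across a cluster can be sketched as a set intersection plus a tally. The gene names and annotations below are hypothetical, and the two functions are assumptions made for this example. Real expression-analysis tools typically go further and apply a statistical test, which this sketch does not attempt.

```python
from collections import Counter

# Hypothetical annotations: gene name -> set of GO term names.
annotations = {
    "geneA": {"cell cycle", "nucleus"},
    "geneB": {"cell cycle", "cytoplasm"},
    "geneC": {"cell cycle", "nucleus"},
}


def shared_terms(genes, annotations):
    """GO terms annotated to every gene in the cluster."""
    sets = [annotations[g] for g in genes]
    return set.intersection(*sets) if sets else set()


def term_counts(genes, annotations):
    """How many genes in the cluster carry each GO term."""
    return Counter(term for g in genes for term in annotations[g])


cluster = ["geneA", "geneB", "geneC"]
print(shared_terms(cluster, annotations))
print(term_counts(cluster, annotations).most_common())
```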

For now, GO is the only ontology in widespread use. But Ashburner expects the
recently released Sequence Ontology and Cell Ontology will be integrated into
tools to broaden their applications. By combining ontologies in this way, a
scientist could, in theory, identify a gene with an interesting expression
pattern, then follow that lead to related genes in other organisms via the
Sequence Ontology, to cellular pathways in the Biochemical Ontology, and to
cell type definitions within the Cell Ontology.

REWRITING THE DICTIONARY

For the moment, however, ontologies don't work together as seamlessly as they
could. Barry Smith of the University at Buffalo points out, for example, that
the GO defines Menopause as part_of Aging, and Aging as part_of Death. "If A
is part of B and B is part of C, then menopause is part of death," he says.
That conclusion is false: certain diseases become more common after
menopause, but the process itself isn't lethal.
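
Smith's point can be reproduced mechanically. The sketch below treats part_of as transitive and follows the links, which is the inference a reasoning tool would make from the two assertions quoted above. Term names stand in for GO identifiers, and the all_wholes function is an assumption made for illustration.

```python
# part_of assertions quoted from the text (term names stand in for GO IDs).
PART_OF = {
    "menopause": {"aging"},
    "aging": {"death"},
}


def all_wholes(term, part_of=PART_OF):
    """Every term reachable from `term` by following part_of links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for whole in part_of.get(current, ()):
            if whole not in seen:
                seen.add(whole)
                stack.append(whole)
    return seen


# Following the chain, menopause is inferred to be part_of death.
print(sorted(all_wholes("menopause")))
```

The code is doing what transitivity demands; the unwanted conclusion comes from the assertions it was given, which is why the definitions of the relations matter.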

Another problem is that relational terms have slightly different meanings in
each ontology, which can make it impossible to draw logical conclusions
across two or more of them. According to Smith, annotators don't use terms
such as part_of and is_a consistently.
"One goal of ontologies is to link different sets of data about proteins,
diseases, cell pathways, and so on. If each ontology uses the relations to
mean slightly different things, then you can't link them together," Smith says.

What's needed to bring the fields together, Smith says, is a consistent
structure for all of the biological ontologies. Smith and a group including
Mungall described such a unification plan in a recent Genome Biology paper[2].
This Relational Ontology (RO) includes 10 relations, each of which is rigorously
defined: is_a, part_of, located_in, contained_in, adjacent_to,
transformation_of, derives_from, preceded_by, has_participant, and has_agent.
Though ontologies already in the OBO don't need to rewrite their terms, the GO
is slowly beginning to update its terms. All new ontologies, however, must
conform to the RO as a prerequisite for inclusion in the OBO, Ashburner says.
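
One narrow piece of that requirement, using only relation names from the agreed list, is easy to check automatically. The sketch below is a rough illustration under stated assumptions: the triple format, the placeholder term names, and the nonconforming_relations function are invented for the example, and a real conformance review would also have to confirm that each relation is used according to its definition, not just that its name matches.

```python
# The ten relation names listed in the text.
RO_RELATIONS = frozenset({
    "is_a", "part_of", "located_in", "contained_in", "adjacent_to",
    "transformation_of", "derives_from", "preceded_by",
    "has_participant", "has_agent",
})


def nonconforming_relations(triples):
    """Return, sorted, the relation names in `triples` that are not in the list."""
    return sorted({rel for _subj, rel, _obj in triples if rel not in RO_RELATIONS})


# Hypothetical (subject, relation, object) statements from a candidate ontology.
candidate = [
    ("term_a", "is_a", "term_b"),
    ("term_a", "part_of", "term_c"),
    ("term_d", "develops_from", "term_e"),
]

print(nonconforming_relations(candidate))
```

In this example the check flags develops_from, the extra relation the text attributes to the Drosophila Anatomy Ontology, because that name is not among the ten listed.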

For the bench biologist, the conversion will mean applications that allow
searches for, say, all dicistronic genes in mice, flies, and worms that are
expressed in a given cell type - a search that would currently be impossible,
says Ashburner. This potential for more powerful applications is what convinced
both the GO and OBO to adopt the Relational Ontology.

In early June, the GO held an annotation camp for experienced annotators and
people who have an interest in submitting new terms. Jim Zheng, from the
Medical University of South Carolina, attended the camp to become better
acquainted with the ontology for his own work designing an object-oriented
computer language based on GO. This work could also apply to other ontologies
and would allow programmers to design more complex applications modeling
biological processes. He explains it could take a while for annotators to
incorporate the new definitions in their work. "It is a very large task to
revamp a finished ontology, but we have to do it, and it will be an ongoing
process for a very long time," he says.

Jane Lomax, a GO editor at UC, Berkeley, who helped lead the camp, says
bringing the GO in line with the Relational Ontology will be a long-term
project. According to July 2005 statistics, the GO includes 19,247 terms:
9,960 biological_processes, 1,694 cellular_components, and 7,563
molecular_functions. But Mungall provided some help for ontology curators
by developing Obol[3]. Obol provides a way to parse ontologies, looking for
inconsistencies, redundancies, and hidden relationships. "This work should
allow us to more easily detect and fix inconsistencies in the GO - and
obviously the more consistently the relations are used in GO, the more
effective the reasoning can be," Lomax says.

Though most bench scientists will likely never know about the changes that
have taken place under the hoods of their favorite databases, the results could
be big news: a closing of the chasm that once separated people working in
different fields. "I think all bench biologists should be interested in
efficiently discovering what is already known," says Ashburner.

References

1. K Eilbeck et al., "The Sequence Ontology: A tool for the unification of
genome annotations," Genome Biol 6: R44, April 29, 2005.

2. B Smith et al., "Relations in biomedical ontologies," Genome Biol 6: R46,
April 28, 2005.

3. C Mungall, "Obol: Integrating language and meaning in bio-ontologies,"
Comp Funct Genom 5: 509-20, 2004.