Nature Reviews Genetics 5, 213-222 (2004); doi:10.1038/nrg1295
ONTOLOGIES IN BIOLOGY: DESIGN, APPLICATIONS AND FUTURE CHALLENGES
Jonathan B. L. Bard1 & Seung Y. Rhee2
1 Bioinformatics, Biomedical Sciences, University of Edinburgh, Edinburgh EH8 9XD, UK.
2 Plant Biology, Carnegie Institution of Washington, Stanford, California 94305, USA.
Biological knowledge is inherently complex and so cannot readily be integrated into existing databases of molecular (for example, sequence) data. An ontology is a formal way of representing knowledge in which concepts are described both by their meaning and their relationship to each other. Unique identifiers that are associated with each concept in biological ontologies (bio-ontologies) can be used for linking to and querying molecular databases. This article reviews the principal bio-ontologies and the current issues in their design and development: these include the ability to query across databases and the problems of constructing ontologies that describe complex knowledge, such as phenotypes.
"If the nineteenth century was the century of chemistry and the twentieth century the century of physics, the twenty-first century promises to be the century of biology"1. Determining factors in the success of the biological sciences have been the advances in technology and communications: these have enabled data to be generated in a high-throughput manner and to be distributed to scientists across the globe. Until recently, the most important task of bioinformatics was thought to be the storage, retrieval and analysis of molecular data, such as nucleotide sequences and protein structures2. However, as experimental technologies move from producing relatively simple data, such as nucleotide sequences, to more complex data, such as that for microarray results, images and molecular interactions, we need comparable advances in bioinformatics to manage and relate these data.
There is also a great deal of sophisticated biological knowledge, often hierarchical in nature, that needs to be integrated with molecular data: obvious examples include anatomies, signal-transduction pathways and, of particular current importance, phenotypic data. One way to do this is to represent such biological knowledge as ontologies: the resulting 'bio-ontologies' are formal representations of areas of knowledge in which the essential terms are combined with structuring rules that describe the relationship between the terms. Knowledge that is structured within a bio-ontology can then be linked to the molecular databases.
This review aims to cover the essential features of bio-ontologies, a relatively new area of bioinformatics. It deals only briefly with the formal study of ontologies in computer science, as this is an established and well-documented field3. We discuss a few bio-ontologies that are in use, focusing on key examples that will be relevant for the description of phenotypes. We then consider several important issues for the development of bio-ontologies, such as the production of ontologies for complex areas of knowledge (including phenotypic data), the ability to query across databases and the use of ontologies to analyse large data sets. The review concludes with a brief discussion of what the field can expect in the near future.
It is first worth pointing out that, for any ontology to be of public value, it has to be widely disseminated and accepted by the field that it aims to summarize. Sociological factors are important in ontology production and acceptance, and a strong community involvement is crucial to ensure that only single ontologies for each area are placed in the public domain. In this respect, the important standard is the Open Biological Ontologies (OBO) web site, in which many bio-ontologies are archived in a standard format (Table 1).
Table 1 | Some principal biological ontologies and other web sites
Although there are more technical definitions, here we can consider an ontology to be an area of knowledge that is formalized, such that the individual terms (or concepts) are defined by a set of assertions that connect them to other terms. In an anatomy ontology, for example, the developing humerus might be defined as: part of (in the sense of a component piece of) the arm; has cell type osteoblast; has adhesion points for muscles; and is-a bone. Note that the terms do not represent an individual item but the associated set — that is, not the particular humerus of Eve Smith, but all humeri. As well as being described by their relationships, terms in an ontology also contain a unique identifier (ID) (such as GO:0019505), a name ('resorcinol metabolism', for example), a textual definition (such as 'the chemical reactions and physical changes involving resorcinol (C6H4(OH)2), a benzene derivative with many applications (including dyes, explosives, resins and as an antiseptic)') and synonyms (such as '1,3-benzenediol metabolism' or '1,3-dihydroxybenzene metabolism').
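The anatomy of a term described above can be sketched as a small data structure. This is only an illustration: the class name `Term`, the relationship labels and the ID `EXAMPLE:0000001` are assumptions made for the sketch, whereas the GO ID, name and synonyms are those quoted in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One ontology term: ID, name, definition, synonyms and assertions."""
    term_id: str
    name: str
    definition: str = ""
    synonyms: list = field(default_factory=list)
    # Each assertion is a (relationship, target term name) pair.
    assertions: list = field(default_factory=list)


# Values taken from the example in the text.
resorcinol = Term(
    term_id="GO:0019505",
    name="resorcinol metabolism",
    definition="the chemical reactions and physical changes involving resorcinol",
    synonyms=["1,3-benzenediol metabolism", "1,3-dihydroxybenzene metabolism"],
)

# Hypothetical ID and relationship labels; the assertions follow the humerus example.
humerus = Term(
    term_id="EXAMPLE:0000001",
    name="developing humerus",
    assertions=[("part_of", "arm"), ("has_cell_type", "osteoblast"), ("is_a", "bone")],
)


def targets(term, relationship):
    """Return the terms that `term` is linked to by the given relationship."""
    return [target for rel, target in term.assertions if rel == relationship]
```

Here `targets(humerus, "part_of")` returns the list containing 'arm', which is the kind of assertion that a query tool can follow automatically.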
Ontologies are different from annotations (descriptions of data objects) in that they formalize the meaning of terms through a set of assertions and rules that are collectively known as a 'description logic'. An advantage of ontologies is that the description logic can be used both for querying an information set and for facilitating analyses across information sets that are not traditionally accessible to searching and comparing. If, for example, a database stores gene-expression data for the mouse forelimb skeleton under its individual parts (the humerus, radius, ulna, carpal bones, and so on), then a query on gene expression in the forelimb skeleton can automatically use the part of relationship to identify the constituent tissues, search on their database entries for expression and then combine them to list all the genes that are expressed in the forelimb skeleton. Furthermore, the structure of ontologies can be represented and viewed as graphs (see Box 1 for more details).
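The forelimb query can be sketched in a few lines. This is a toy illustration under stated assumptions: the `PART_OF` table, the gene names and the function names are invented for the example and do not come from any real database.

```python
# Hypothetical part_of assertions: each tissue maps to the structure it is part of.
PART_OF = {
    "humerus": "forelimb skeleton",
    "radius": "forelimb skeleton",
    "ulna": "forelimb skeleton",
    "carpal bones": "forelimb skeleton",
    "forelimb skeleton": "forelimb",
}

# Made-up expression records, stored only against the individual parts.
EXPRESSION = {
    "humerus": {"geneA", "geneB"},
    "radius": {"geneB"},
    "ulna": {"geneC"},
}


def parts_of(tissue):
    """Return the tissue plus everything that is transitively part of it."""
    found = {tissue}
    changed = True
    while changed:
        changed = False
        for child, parent in PART_OF.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def genes_expressed_in(tissue):
    """Combine the expression records of a tissue and all of its parts."""
    genes = set()
    for part in parts_of(tissue):
        genes |= EXPRESSION.get(part, set())
    return genes
```

A query on "forelimb skeleton" collects the genes stored under the humerus, radius and ulna even though no record is filed under the parent term itself.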
Ontology IDs. Each term in the ontologies that are associated with the OBO has an ID that has two components: a letter code that specifies the ontology type and a number. For example, CL:0000188 represents a skeletal muscle cell in the Cell Ontology (OBO): the ontology type is defined by the prefix CL and the number represents a unique entity in the CL ontology. IDs can be used in two ways: to link a biological database to ontologies and to connect different biological databases (interoperability). If a database, such as a sequence repository, associates its data objects with ontology IDs, a user can query the database for data that is associated with a particular ontology ID and also use the logic of the rules in the ontology to ask further questions about the associated data. Ontology IDs can also be used to allow one database to query another directly. If different biological databases use the same ontologies to describe their data objects, the ontology IDs can be used as the currency with which associated data in individual databases can be retrieved4.
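As a minimal sketch of both uses, the following splits an ID into its two components and then uses a shared ID to retrieve records from two databases. The database contents, record names and the ID `CL:9999999` are invented for illustration; only `CL:0000188` comes from the text.

```python
def split_id(term_id):
    """Split an ID such as 'CL:0000188' into (ontology prefix, number)."""
    prefix, _, number = term_id.partition(":")
    if not prefix or not number.isdigit():
        raise ValueError(f"not a prefix:number ID: {term_id!r}")
    return prefix, number


# Two hypothetical databases that tag their records with ontology IDs.
SEQUENCE_DB = {"seq1": {"CL:0000188"}, "seq2": {"CL:9999999"}}
IMAGE_DB = {"img7": {"CL:0000188"}}


def records_for(database, term_id):
    """Return the records in one database that carry the given ontology ID."""
    return sorted(rec for rec, ids in database.items() if term_id in ids)


def query_all(term_id):
    """Use the shared ID as the common currency across both databases."""
    return {
        "sequences": records_for(SEQUENCE_DB, term_id),
        "images": records_for(IMAGE_DB, term_id),
    }
```

Because both databases use the same ID for skeletal muscle cell, a single call to `query_all("CL:0000188")` reaches the records in each of them.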
Examples of bio-ontologies
Ontologies have been used in biology for some time, although they have not necessarily been recognized as such. Indeed, the field of systematics could be considered to be a classic example of ontologies in biology. An important large-scale example for molecular biology is the Unified Medical Language System (UMLS) (see Table 1) hierarchy of terms and relationships, which is used for searching PubMed and many other resources. A typical, small-scale example of an ontology would be the controlled vocabulary for sexual phenotype, in which male, female and hermaphrodite are the three obvious classes of gender (the is-a relationship). Here, we focus on three ontologies that should be instrumental in describing phenotypic data.
Gene Ontology. The Gene Ontology (GO) is by far the most widely used bio-ontology. It aims to formalize our knowledge about biological processes (Fig. 1), molecular functions and cell components, in three orthogonal (mutually independent) hierarchies. The GO has reached a substantial size, containing approximately 16,500 terms, with the nodes and leaves within each hierarchy being connected by is-a or part of relationships. As the terms can have more than one parent, the structure is represented graphically as a directed acyclic graph (DAG, see Box 1b).
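The consequence of allowing more than one parent can be sketched as follows. The term names and the `PARENTS` table are placeholders chosen for the example, not real GO terms.

```python
# Hypothetical DAG: each term lists (relationship, parent) pairs.
# 'term D' has two parents, which a simple tree could not represent.
PARENTS = {
    "term D": [("is_a", "term B"), ("part_of", "term C")],
    "term B": [("is_a", "root")],
    "term C": [("is_a", "root")],
    "root": [],
}


def ancestors(term):
    """Return every term reachable by following parent links upwards."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in PARENTS.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

The `seen` set ensures that 'root', which is reachable along two different paths, is visited only once.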
A practical importance of the GO is that it is linked to a database of more than 120,000 gene products from almost 20 experimental organisms — including animals, plants, fungi, bacteria and viruses — in which the proteins are tagged with GO IDs. This means that a user can identify both the proteins associated with a specific GO term (there are, for example, 16 proteins associated with virus–host interactions) and all of the GO terms associated with a given protein by using an appropriate browser, such as AmiGO (see Table 1). For each gene, the user is directed to the database that contributed the annotation to find more detailed information about that gene. This type of infrastructure provides a simple way of traversing between areas of knowledge.
Anatomical ontologies. These ontologies include the supracellular physical structures that make up a particular organism. They can be organized using appropriate rules, such as relative location (the atrium is part of the heart), lineage (the gut is derived from the endoderm) and class (the cardiovascular system is-a organ system).
Although it might seem that making such ontologies is straightforward, the handling of anatomies actually highlights some interesting problems. For example, it is worth considering the requirements for two very different users interested in human anatomy. A surgeon will want an ontology to specify those tissues he might have to cut through if he enters the body from a particular angle to operate on the diaphragm. This ontology would have to include both tissues and their spatial relationships (such as next to) and can be self-contained. A developmental biologist who wants to identify the genes associated with a particular tissue at a particular developmental stage would require an ontology of tissue names (ordered by a part-of rule) that were linked to a database that contains gene-expression data, and spatial relationships might not be essential.
For human clinical anatomy, two sophisticated ontology frameworks are available: Galen and the Digital Anatomist (Table 1). Both handle geographical and other knowledge about adult tissues and include many thousands of terms in a wide variety of relationships. There are, for example, several relationships that can be subsumed under part of: the coronoid process is a physical component of the ulna bone, marrow is contained within the ulna bone and the pancreas is a member of the glandular system (these types of relationship are the subject of an area of logic called 'mereology'5; see Box 1). Many such relationships are included in both of these ontologies, which set out to be comprehensive. However, they are not always easy to use.
By contrast, the developmental biologist will probably find it useful to consult the ontology of Human Developmental Anatomy, which is modelled on the Mouse Anatomy Ontology (see Table 1) and is designed for archiving gene-expression data on human embryos in their first seven weeks of development. It has several thousand tissues that are linked by a simple part-of rule that usually means is a physical component of. This ontology is intuitive to use but includes no spatial rules; its use is therefore limited to handling tissue-associated data and to defining the tissues that are present at a given developmental stage.
The OBO web site provides access to a further ten anatomies for common plants and animals, with all using part-of, is-a and, in some cases, is-derived-from relationships. All are embedded in the core databases for their species, and several are now linked to gene-expression data.
Cell Ontology. This new and still unfinished ontology is being designed to provide all model species with a common language and ID set for cell phenotypes. Its production illustrates some of the core aspects of ontology design. First, its conceptual framework required an analysis of the contexts in which cells are used and described (morphology, function, species, and so on) and the sorts of relationships required (is-a and is-derived-from). Second, in attempting to make it useful for all organisms, ubiquitous cell types are at a much higher level in the hierarchy than those restricted to specific families of organisms. Third, it required the garnering of data from a range of standard textbooks. Fourth, it involved people from widely different fields freely giving expert knowledge: Michael Ashburner and J.B.L.B. initiated the ontology and provided invertebrate and vertebrate data, whereas David States and S.Y.R. provided blood-cell and plant data, respectively. The prototype ontology is now publicly available for comment and is being improved by the community. In the end, the ontology will reflect the expertise of the community that uses it rather than that of any individual and it will be available for anyone who wishes to code cell-type identities in a standard way.
Creating and displaying ontologies
There are several tools for editing and viewing ontologies (Table 1). To the user with the appropriate viewing software, an ontology appears as a tree or network (Fig. 1). However, the underlying textual syntax is far more opaque, as an inspection of any ontology in a text editor rapidly makes clear. Ontologies can be written in either frames or flat files. Programs such as Protégé or Ontolingua generate a separate frame (or page) for each knowledge item that contains information such as its links, definitions and its relationships to other items. By contrast, flat files include knowledge of all the items of an ontology within a single file, and are the more commonly used format. Flat files can be written in a range of functionally similar formats, such as OWL, GO, RDF and XML, by using ontology editors (see Table 1). The GO flat file, for example, provides a common format that can be edited and viewed using several editors: the standard editor for GO is DAG-edit, which allows a user to create, edit and view an ontology (Fig. 1). A recently developed tool, COBrA, can also translate one format into another and allows a user to make links between two ontologies.
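To give a feel for what a flat file holds, here is a small parser for a simplified tag-value layout. The layout is an assumption modelled loosely on OBO-style stanzas; the text does not give the exact syntax of the GO flat file, and the IDs and names in `SAMPLE` are invented.

```python
# Assumed, simplified flat-file layout: one stanza per term, one tag per line.
SAMPLE = """\
[Term]
id: EX:0000001
name: limb

[Term]
id: EX:0000002
name: forelimb
is_a: EX:0000001
"""


def parse_flat_file(text):
    """Read every term in the file into a dictionary keyed by term ID."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "[Term]":
            current = {"is_a": []}      # start a new stanza
        elif not line:
            current = None              # a blank line closes the stanza
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            if tag == "is_a":
                current["is_a"].append(value)
            else:
                current[tag] = value
            if tag == "id":
                terms[value] = current
    return terms
```

All items sit in the one file, so a single pass is enough to recover both the terms and the is-a links between them.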
Bio-ontologies now have a wide variety of uses, the most important of which are the representation of knowledge in a computer-comprehensible way, interoperability across databases, and the annotation and analysis of large-scale data with ontology IDs. Here, we consider each of these roles, starting with how the field is approaching the representation of complex knowledge for which a single, simple ontology is inadequate.
Handling complex areas of knowledge — the phenotype example. Most bio-ontologies are relatively simple in that they describe the essential features of well-defined and local domains of knowledge. However, there are complex areas of knowledge — a challenging example being the description of mutant phenotypes — that cannot easily be described in this way. 'Phenotype' can be defined as the observable and measurable characteristics of an organism, which result from the interaction of the organism's genetic 'blueprint' (its genotype) and the environment. Phenotype information is currently described as free-text in most biological databases6-8, although efforts have been made to store the information in more structured ways9, 10. Free-text, database-specific phenotypic descriptions cannot be queried and compared easily, especially if they lie outside a researcher's immediate research focus.
Phenotypic descriptions can be handled using ontologies in several ways. The first and most straightforward is to make a dedicated ontology specific for an organism. The Mouse Genome Database (the Jackson Laboratory, USA; see Table 1) has taken this approach for the mouse, and its ontology is being used to code phenotypes of mutant mice11. One caveat with this approach is that terms might be needed to represent all of the variety of phenotypes under different conditions. This could result in a rapid increase in the size of the ontology, making it laborious to maintain. Another problem lies in the difficulty of extending the mouse phenotype ontology to any other organism.
The second approach to describing phenotype is to make a composite annotation using several simpler ontologies. We can deconstruct a phenotype into three independent components: the observable character, the trait and the experimental condition. Each of these components can then be described using one or more ontologies (Fig. 2). The observable character or quantity can be anything for which characteristics can be observed or measured. Examples include an anatomical structure of an organism, such as a stem, or a metabolite whose amount can be quantified. Some existing ontologies, such as GO or the Cell, Anatomy and Biochemical substance ontologies at OBO, can be used to describe such an observable character. The trait is the attribute or characteristic that is being measured; examples of traits include height, weight, viability and enzymatic activity. In addition to the GO function ontology, there are a couple of trait ontologies (still in development) at OBO, which can be found under the Attribute_and_value and Plant trait categories. Finally, the experimental condition under which the trait is measured can be described by considering the assay method and the environmental and genotypic conditions under which the measurement is taken. There are several ontologies that deal with experimental methods at OBO, and these fall under Experimental methods and a preliminary Environmental condition ontology. Fig. 2 shows how the 'dwarf plant' phenotype can be translated into a plant that has 'short height' (Trait ontology) of the stem (Anatomy ontology) that is assayed using a 'ruler measurement' (Assay ontology).
Making composite annotations using these ontologies requires that the relationship context of each ontology to the data object (for example, having a trait, of a tissue, measured by an assay) be included. Table 2 illustrates some simple examples of composite annotations to describe a gene's expression patterns and biological roles using multiple ontologies. The difficulty with this approach is that a mutant phenotype can no longer be assigned a single unique code; its code is instead the set of IDs from the relevant ontologies. This set is, however, easy to search.
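A composite annotation of this kind can be sketched as a set of (context, ID) pairs, which also shows why the set is easy to search. All IDs, context labels and mutant names below are hypothetical placeholders.

```python
# Each phenotype is a set of (relationship context, ontology ID) pairs.
# Every ID here is made up for the illustration.
PHENOTYPES = {
    "mutant-1": {
        ("has_trait", "TRAIT:0000001"),      # e.g. short height
        ("of_structure", "ANAT:0000010"),    # e.g. stem
        ("measured_by", "ASSAY:0000100"),    # e.g. ruler measurement
    },
    "mutant-2": {
        ("has_trait", "TRAIT:0000001"),
        ("of_structure", "ANAT:0000020"),
        ("measured_by", "ASSAY:0000100"),
    },
}


def find(**wanted):
    """Return the phenotypes whose annotation contains every requested pair."""
    query = set(wanted.items())
    return sorted(name for name, pairs in PHENOTYPES.items() if query <= pairs)
```

Searching on the trait alone, `find(has_trait="TRAIT:0000001")`, matches both mutants, whereas adding `of_structure="ANAT:0000010"` narrows the result to the first.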
Table 2 | Annotations of Arabidopsis genes using multiple ontologies
Several databases have used multiple ontologies to describe complex information (see Table 1). For example, Pathbase is a database of mouse pathological images that uses separate anatomy, pathology, cell and other ontology IDs to access the relevant image12. MetaCyc, which handles metabolic pathways13, has been available for several years. It uses ontologies for metabolic pathways, reactions, compounds and cellular components to describe metabolism, but does not yet link metabolism with anatomy and developmental stages. A further example is PharmGKB, which handles pharmacogenetic information14. PharmGKB aims to represent the relationship between genotype and phenotype for drug response in humans, but its individual component ontologies are not yet enumerated in detail. Although this approach has been used to describe complex information and is being used to describe phenotype, powerful query and analysis tools that take advantage of such structured knowledge have yet to come.
The third, and still new, way of handling complex data, such as phenotype, is to combine terms in multiple, orthogonal ontologies to create a single new ontology15. The process of heart development can, for example, be described as a combination of the relevant terms in the anatomy (for example, heart) and the GO process (such as development) ontologies. Although this set of joint terms might provide some novel concepts that are worth investigating, it suffers from the problem that many of the terms might not be biologically valid (for example, 'heart' plus 'photosynthesis'). Deciding which terms should be excluded in the primary ontologies to make the cross-product ontology can be time-consuming, to the extent that if more than two ontologies are needed for the description, the task of examining and validating all the cross products will simply become impractical.
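The size of the validation task is easy to see in a short sketch. The term lists and the set of rejected pairs are invented for the example; in practice, deciding what belongs in `invalid` is the time-consuming curation step.

```python
from itertools import product
from math import prod

# Tiny stand-ins for an anatomy ontology and the GO process ontology.
anatomy = ["heart", "leaf", "stem"]
process = ["development", "photosynthesis"]

# Hypothetical curation decision: pairs judged not to be biologically valid.
invalid = {("heart", "photosynthesis")}

# Every combination of one anatomy term with one process term.
candidates = list(product(anatomy, process))
valid = [pair for pair in candidates if pair not in invalid]


def n_cross_products(*ontology_sizes):
    """Number of candidate terms when one term is taken from each ontology."""
    return prod(ontology_sizes)
```

Three ontologies of a thousand terms each would yield `n_cross_products(1000, 1000, 1000)` candidates, that is, a billion combinations to examine.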
Much work will be needed to optimize the way in which ontologies handle complexity such as phenotypes. A good solution will ensure that implementation will be organism-independent as much as possible to facilitate interoperability across databases. During the past two years, the curators of approximately 15 biological databases (see online link box) have met to discuss the problem of representing phenotype information. This resulted not only in identifying the issues at hand for each community, but also in defining the ontologies that are needed for interoperability and for describing phenotype robustly (Fig. 2). Such a forum will be crucial to the success of handling complex information across many biological databases.
Interoperability. Interoperability, or the querying of one database by another, is becoming increasingly important. Ontologies are beginning to be valuable here through the use of the unique IDs that are associated with each of their terms. A simple example typifies the importance of this: the Edinburgh Mouse Atlas Project (EMAP) web site (Edinburgh, UK) includes two-dimensional section sets of many three-dimensional models of early mouse development in which the individual tissues have been delineated and assigned EMAP IDs (in essence, this is a graphical ontology in which the knowledge is visual rather than textual). In the user interface, a tissue is highlighted when the cursor reaches it and a click of the mouse sends the ID of that tissue to GXD, the Mouse Gene Expression Database (Jackson Laboratory, USA), as a query. As GXD uses the same anatomy IDs as EMAP, it responds to the query by producing a table of all the genes that are expressed in that tissue and returns them to the user's screen through EMAP (Fig. 3). GXD also carries GO IDs and, therefore, searches can also be made on the basis of gene annotations to GO terms. For example, the query can be given as 'return only those genes that are expressed in the developing heart and have transcription-factor activity'.
Exploring large data sets. An important application of bio-ontologies is their use in investigating gene function. Several biological databases now use the GO terms to assign functions, biological roles and sub-cellular locations of proteins. These annotations can be used in combination with sequence-similarity analysis to infer the function, role and location of proteins in, say, agronomically important animals and plants, even when their genomes have not been fully sequenced. For example, to identify candidate genes that correspond to quantitative trait loci in swine and cattle, Harhay and Keele annotated expressed sequence tags (ESTs) from swine and cattle with the GO terms by sequence comparison with model species for which genomes have been annotated with GO16. Similarly, groups of annotated genes can be compared to determine over- or under-representation of the annotated terms. Several bioinformatics tools, such as GeneCensus17, OntoExpress18 and TermFinder19 (see Table 1), can compare the statistical significance of the representation of GO terms between two sets of genes (for example, from a pair of expression clusters identified from microarray analysis). Furthermore, annotations of genes using ontologies can lead to the development of algorithms that can use these annotations to predict function20.
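One common way to score over-representation is a hypergeometric test of a gene cluster against a background set. The text does not say which statistic the tools named above use, so the following is only a sketch of the general idea; the function name and its parameters are assumptions.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes in the sample.

    population: number of genes in the background set
    annotated:  background genes that carry the GO term
    sample:     number of genes in the cluster being tested
    hits:       cluster genes that carry the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total
```

A small value suggests that the term appears in the cluster more often than chance alone would explain; when many terms are tested together, a correction for multiple testing would also be needed.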
Mapping knowledge domains. Different ontologies can be mapped to each other and these links provide hooks from one expert domain of knowledge to another, thereby creating an ontology network that allows a user working in one area to take advantage of knowledge from a related area21-23 (see Table 1). For example, Bodenreider and colleagues mapped UMLS, a highly specialized medical ontology, onto WordNet, an electronic lexical database for the English language, in an attempt to identify an overlap between the two24. Their work shows how the knowledge domains of two different types of community (medical specialists and the general public) can be linked. For instance, a patient can search for a common disease name in WordNet and then be linked to the comparable term in UMLS and therefore to the details of the disease. Further advances in such mapping analysis between ontologies, using, for example, semantics or even natural language processing, could be a key factor in closing the distances between different experts, disciplines and even sociological boundaries. At a more biological level, the XSPAN project seeks to make mappings across the anatomies of the main model organisms on the basis of cell type, homology and analogy. These links could be useful in identifying related mutant phenotypes in different model organisms.
Once the ontologies are networked and data objects such as genes are annotated to the ontologies, we can start to ask questions about the genes that are involved in, say, a process such as the tricarboxylic acid (TCA) cycle in Escherichia coli, Arabidopsis, Drosophila and humans. We can, for example, address how the TCA cycle has evolved on the basis of the properties (and therefore mechanisms) of its constituent proteins. We might also be able to apply ontological approaches to address systematics, taxonomy and evolution. It is, for example, known that there was a large radiation of flowering plants (angiosperms) approximately 150 million years ago that diversified the morphology of flowering plants25 (Fig. 4). When genes from a wide range of flowering plants are annotated to angiosperm anatomy and developmental-stage ontologies, it will be possible to perform a systematic analysis of the genetic changes associated with taxonomic diversification. Recently initiated projects, such as the Plant Ontology Consortium (see Table 1), which attempts to develop unified anatomy and developmental-stage ontologies for angiosperms, and the Floral Genome Project, which attempts to identify genes involved in floral development in angiosperms, will facilitate such analyses.
Figure 4 | Diversity of floral morphology in angiosperms.
a | Antirrhinum filipes. b | Illicium floridanum (sexual organs only). c | Houttuynia cordata filled cultivar. d | Acorus calamus s.S. e | Aponogeton distachyos. f | Tasmania moorei male. g | Iris japonica (petaloid stigma). h | Amborella trichopoda male. i | Illicium floridanum. j | Amborella trichopoda female. k | Nymphaea hybrida var. escarbuncle. l | Tasmania moorei female. Images courtesy of M. Buzgo, University of Florida, USA.
Future projects, prospects and challenges
Bio-ontologies for the obvious knowledge domains are now in place and are under active curation. Attention is beginning to be focused on ontologies that describe in vivo cell imaging, molecular interactions and data that are linked to space rather than text26. For example, as gene-expression domains might not be restricted to tissue boundaries, it is better to represent them as volume units (voxels) in a three-dimensional model of the anatomy (see EMAP web site). As more knowledge of genetic networks is gleaned, we can also look forward to an ontology of signalling pathways. In addition, to fully correlate phenotype with genotype, a systematic and standard way of describing genotype will be needed.
It cannot be emphasized too strongly, however, that the key to the general use of ontologies will be access to the data in biological databases that are annotated with the knowledge in these ontologies. Many biological databases are now incorporating ontology IDs (particularly those from the GO) and using them to annotate data objects. The more that this is done, the more useful these resources will be for the community. Unfortunately, most of the current search and analysis tools for mining these data are not as powerful as might be liked. However, analysis of the ontologies using graph-theoretical approaches27, 28 might provide interesting insights about the representation of knowledge. We hope that more tools will be produced that exploit the ontologies and their associations with data objects.
A difficult problem that the field has yet to confront is how to deal with a term that is represented in several, possibly overlapping, ontologies. For example, MetaCyc contains a term for which the ID is 'NAD BIOSYNTHESIS III'. This term is synonymous with GO:0019360 in GO, which corresponds to nicotinamide nucleotide biosynthesis from niacinamide. A term having several apparently unique IDs cannot be fully interoperable on the basis of any one of them. GO provides a mapping of different ontologies to GO, but this is mostly a manual effort and keeping it updated is a major challenge. There is no easy answer to this problem, but one possibility is that the OBO (or a similar site) could hold a look-up table for all IDs and their alternatives that can be accessed automatically. This will of course only be achieved if there is both funding and a communal agreement to share codes, and even then, appropriate software needs to be implemented before such a system would itself be fully interoperable.
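The proposed look-up table might, in its simplest form, resemble the sketch below. No such shared service is described in the text beyond the suggestion itself, so the structure, the function names and the 'MetaCyc:' prefix are all assumptions; only the pairing of the two IDs comes from the example above.

```python
# Groups of IDs that refer to the same concept in different resources.
# The 'MetaCyc:' prefix is an assumption added to keep IDs unambiguous.
EQUIVALENTS = [
    {"MetaCyc:NAD BIOSYNTHESIS III", "GO:0019360"},
]


def build_lookup(groups):
    """Map every ID to the set of its alternatives in other resources."""
    lookup = {}
    for group in groups:
        for term_id in group:
            lookup[term_id] = set(group) - {term_id}
    return lookup


LOOKUP = build_lookup(EQUIVALENTS)


def alternatives(term_id):
    """Return the known alternative IDs for a term (empty if none recorded)."""
    return LOOKUP.get(term_id, set())
```

A database that knows only the GO ID could call `alternatives("GO:0019360")` to obtain the MetaCyc ID, and the reverse, without either resource changing its own codes.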
It is nevertheless reasonable to expect that the development of ontologies, annotation of data objects using the ontologies and sophisticated search tools should enable us to start to systematically address the gaps in our knowledge. For example, once genes with known function are linked to the ontologies, we can ask how many genes in a genome are not associated with a molecular function, biological process, expression pattern or cellular location. Similarly, we can examine which processes, functions and cellular components are described by known molecular entities and which aspects of biology have no, or unexpectedly few, genes associated with them. There is much to be explored.
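Both kinds of gap can be found with very simple queries once annotations are in place. The annotation table, gene names and IDs below are invented for the sketch.

```python
# Hypothetical annotation table: gene -> aspect -> set of ontology IDs.
# All IDs are made up for the illustration.
ANNOTATIONS = {
    "gene1": {
        "molecular_function": {"EX:0000001"},
        "biological_process": {"EX:0000101"},
    },
    "gene2": {"biological_process": {"EX:0000101"}},
    "gene3": {},
}

# Every term that the (toy) ontologies contain.
ALL_TERMS = {"EX:0000001", "EX:0000002", "EX:0000101"}


def genes_lacking(aspect):
    """Genes that have no annotation at all for the given aspect."""
    return sorted(
        gene for gene, by_aspect in ANNOTATIONS.items() if not by_aspect.get(aspect)
    )


def terms_without_genes():
    """Ontology terms to which no gene has yet been annotated."""
    used = set()
    for by_aspect in ANNOTATIONS.values():
        for ids in by_aspect.values():
            used |= ids
    return sorted(ALL_TERMS - used)
```

The first function lists genes that still lack, for instance, a molecular function; the second lists the areas of biology that have no genes associated with them.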
An ontology makes explicit knowledge that is usually diffusely embedded in notebooks, textbooks and journals or just held in academic memories, and therefore represents a formalization of the current state of a field. Integrating this knowledge poses two problems. First, not everyone in a field agrees on either the facts or the relationships. Second, knowledge changes with time, even in apparently ossified subjects such as anatomy — for example, we still do not know all the cell-lineage relationships of human anatomy. The first problem can be adequately handled if ontologies are felt to be owned by the field rather than just the individual authors. Mechanisms for sharing the development with those in the field — including establishing a forum for those interested in similar areas of ontology development and soliciting or incorporating feedback from individual researchers — will facilitate public ownership. Public support is just as important for maintenance of ontologies as it is for databases. This will only happen if ontologies are actively curated and this, of course, is the solution to the second problem.
If ontologies are properly curated over the longer term, they will come to be seen as modern-day (albeit terse) textbooks providing online and up-to-date biological expertise for their area. In another sense, they will provide the common standards needed for producing a strong biological framework for integrating data sets. Ontologies therefore provide the formal basis for an integrative approach to biology that complements the traditional deductive methodology.