New Approaches and Tools
for Zebrafish Phenotype Curation
Erik Segerdell, Melissa Haendel, Ceri Van Slyke and Monte
Westerfield
Zebrafish Information
Network (ZFIN), http://zfin.org
The zebrafish is a premier
model organism for the study of vertebrate genetics and development. Genetic
screens have identified thousands of zebrafish mutations affecting many
developmental and physiological processes, while new projects are expected to
produce a vast amount of phenotypic data in the next few years. Furthermore,
many laboratories are now using antisense oligonucleotides, or morpholinos, to
produce phenotypes through targeted disruption of gene function.
A major goal at ZFIN is to
provide comprehensive informatics support for this rapidly growing set of
information that will foster comparisons of vertebrate gene structure and
function across species, including human. ZFIN's mutant database currently
supports a small constrained list of phenotypic keywords and brief free-text descriptions.
These fields, however, are difficult to maintain and they limit curators'
ability to consistently annotate data that is easily searchable and that can be
linked to other bioinformatics resources. To make our phenotype coverage much
more robust and flexible, we have developed orthogonal ontologies to describe
developmental staging, anatomy, biological processes, and phenotypic
attributes. We have begun trial curation with these ontologies, including the
structured vocabulary and syntax developed by the cross-species Phenotype and
Trait Ontology (PATO) consortium. We also have distributed a portable
phenotype-annotation database to curators taking part in trial annotation and
to several zebrafish laboratories as a means to collect new data and to encourage
feedback on the ontologies and user interface.
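
As an illustration of why annotations built from separate ontologies are easier to search than free text, here is a minimal Python sketch. It is not ZFIN's schema: the field names, the entity/attribute/value split, and all term strings are placeholders assumed for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhenotypeAnnotation:
    """One structured phenotype statement (hypothetical layout)."""
    genotype: str   # mutant allele or morpholino label (placeholder)
    stage: str      # term from a developmental staging ontology
    entity: str     # term from an anatomy or biological process ontology
    attribute: str  # phenotypic attribute term
    value: str      # value recorded for that attribute


def find_by_entity(annotations, entity):
    """Return every annotation whose entity matches the given term exactly."""
    return [a for a in annotations if a.entity == entity]


# Invented records; none of these are real curated data.
annotations = [
    PhenotypeAnnotation("allele-A", "stage-1", "eye", "size", "small"),
    PhenotypeAnnotation("allele-B", "stage-1", "heart", "shape", "abnormal"),
    PhenotypeAnnotation("allele-C", "stage-2", "eye", "pigmentation", "absent"),
]

eye_hits = find_by_entity(annotations, "eye")
```

Because each field holds a single controlled term, a query such as `find_by_entity(annotations, "eye")` is an exact match rather than a search through free-text descriptions.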
Curation of phenotypic data in FlyBase
Gillian Millburn
FlyBase, University of Cambridge, UK
The focus of my talk will be curation of phenotypic data in
FlyBase. I will describe how both single mutant phenotypes and genetic interactions involving mutants of more than one gene are captured using controlled vocabularies and a uniform syntax, which allows consistent searching across all the data. FlyBase does not yet use
the PATO (Phenotype and Trait Ontology) system for describing phenotypic data, but has been involved in developing the PATO vocabularies
and intends to implement PATO-type curation in the future. I will end with a list of things I would like to learn from the meeting to help with implementation of PATO in FlyBase: any tools that groups are using for PATO curation, or lessons learnt from retrofitting old phenotypic descriptions to the PATO model, would be most useful!
The Mammalian Phenotype Ontology: Development and Use in Phenotypic Data Annotation
at Mouse Genome Informatics
Cynthia L Smith, Carroll-Ann W. Goldsmith, Janan T. Eppig.
Mouse Genome Informatics
(MGI, http://www.informatics.jax.org), The
Jackson Laboratory, Bar Harbor, ME.
The Mammalian Phenotype
Ontology has been developed as part of the Mouse Genome Informatics (MGI)
program to provide a standardized phenotypic trait vocabulary for annotating
phenotypic data. This vocabulary contains more than 3000 phenotypic terms and
is actively used by the mouse and rat model organism databases (MGD and RGD) to
annotate phenotypes of mutants, QTLs, complex genotypes, and strains. The
Mammalian Phenotype Ontology is organized as a directed acyclic graph (DAG),
with its highest level terms consisting of phenotypic trait descriptors covering
physiological systems, behavior, embryogenesis and survival/aging. Physiological systems are further
divided into morphological and physiological branches at the next node level.
The Mammalian Phenotype (MP) Ontology may be accessed in browser format at http://www.informatics.jax.org/searches/MP_form.shtml
and the files may be accessed in both GO flat file format and OBO format from
our ftp site at ftp://ftp.informatics.jax.org/pub/reports/index.html#pheno.
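
To illustrate what a directed acyclic graph organization allows, the following Python sketch shows how an annotation to a specific term can be retrieved by a query on any of its ancestors, including when a term has more than one parent. The term names, the `PARENTS` table, and the annotations are invented placeholders, not actual MP terms or MGI data.

```python
# Hypothetical ontology fragment: each term maps to its direct parents.
# "term-shared" has two parents, which a tree could not represent.
PARENTS = {
    "term-root": [],
    "term-system-A": ["term-root"],
    "term-system-B": ["term-root"],
    "term-shared": ["term-system-A", "term-system-B"],
    "term-leaf": ["term-shared"],
}


def ancestors(term, parents=PARENTS):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def annotated_under(query, annotations, parents=PARENTS):
    """Return objects annotated to the query term or to any descendant of it."""
    return sorted(
        obj
        for obj, term in annotations
        if term == query or query in ancestors(term, parents)
    )


# Invented (object, term) annotation pairs.
ANNOTATIONS = [
    ("genotype-1", "term-leaf"),
    ("genotype-2", "term-system-B"),
    ("genotype-3", "term-system-A"),
]

hits = annotated_under("term-system-A", ANNOTATIONS)
```

Here `genotype-1` is found under both `term-system-A` and `term-system-B`, because its term inherits from both branches.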
We
will describe new strategies being developed for improving accessibility to
mouse phenotype data at MGI in structured formats using the MP vocabulary to
support computational analyses, as well as provide human-friendly interfaces
that enhance our ability to explore genotype-phenotype relationships. These
include standardized annotation methods for baseline allele and mutant data for
each genotype assayed (genotype=mutations/alleles on defined strain background)
and associations between mouse phenotypes and orthologous human gene mutations
or human disease syndromes for which particular mouse genotypes serve as
phenotypic models.
The Mouse Genome Database (MGD) is supported by NIH/NHGRI grant HG00330.
Phenotype Data at the Rat Genome
Database
Charles Wang
RGD
The rat's use as a model for
human disease results in extensive pathophysiology, toxicology and pharmacology
studies. In order to meet the needs
of these research communities, the Rat Genome Database emphasizes acquiring and
presenting phenotype and disease data in standard formats that will allow the
user to search, organize and mine data according to interest area. In the past year, RGD added two
ontologies, Mammalian Phenotype Ontology and Disease Ontology, to prioritize
and annotate biological data for genes, QTLs and strains. RGD is collaborating with the Mouse
Genome Database to create the Mammalian Phenotype Ontology and has adapted the
disease branch of the MeSH vocabulary for its Disease Ontology. RGD currently has over 1800 phenotype
and 2000 disease annotations.
In accordance with the PATO design for phenotype data presentation, RGD
has developed an expanded annotation system that allows for the addition of
assays and values as well as environmental conditions and genetic factors that
influence the phenotype. This
expanded annotation format will initially be used for phenotype data generated
for the strains of the Programs for Genomic Applications project (http://pga.mcw.edu/) followed by full annotation
of QTLs and associated strains.
RGD also has developed tools that incorporate the power of the Mammalian
Phenotype Ontology. The Mammalian
Phenotype Ontology and Disease Ontology annotations provide users with powerful
search, sort and navigation functions.
As shown in a later presentation, data mining at RGD has been further
enhanced with the addition of these annotations to our Gene Annotation Tool,
as well as the creation of ontology tracks in GBrowse as a means of
attaching biology to the genome.
Future developments at RGD for phenotype data include the expansion of
phenotype reports, the provision of baseline data for major phenotypes and the
expansion of disease annotations to include information on genetic factors and
environmental conditions.
Mutant phenotypes in the
Saccharomyces Genome Database and Candida Genome Database
Costanzo, M.C., Arnaud, M., Oughtred, R., Skrzypek, M.,
Balakrishnan, R., Christie, K.R., Dolinski, K., Dwight, S.S., Engel, S.R.,
Fisk, D.G., Hirschman, J.E., Hong, E.L., Nash, R., Theesfeld, C.L., Andrada,
R., Binkley, G., Dong, Q., Lane, C., Miyasato, S., Schroeder, M., Sethuraman,
A., Weng, S., Botstein, D., Sherlock, G., and Cherry, J.M.
The
Saccharomyces Genome Database (SGD) gathers information about genes and
proteins of the yeast Saccharomyces cerevisiae. The Candida Genome Database (CGD) is a new database for the
fungal pathogen Candida albicans developed at Stanford in collaboration with
the SGD project. Data, including
information about mutant phenotypes, are collected from one or more of the
following sources: from the published literature, via manual curation of
individual research articles; from datasets generated by large-scale studies;
and from researchers, via direct submission of results.
SGD contains over 23,000
phenotype entries; more than 4,000 of these come from individual annotations,
and the rest are derived from a handful of systematic mutant analysis
studies. Information about each
phenotype is divided into three free-text categories: Mutant Type, Mutant
Phenotype, and Notes. The great variety of free-text entries in each of these
categories currently makes it very difficult to perform systematic searches or
bioinformatics analysis. However,
having already collected such a vast amount of information, SGD is well
positioned to overhaul its phenotype data and implement a system that is amenable
to computer-assisted analysis.
CGD
has implemented a more controlled system for phenotype curation than SGD, and
is in the process of refining these procedures. CGD uses controlled vocabularies that describe broad groups
of related phenotypes, coupled with free-text comments that provide further
details.
We
are now developing and testing a controlled phenotype vocabulary for SGD as
well as some other phenotype curation methods. We will present for discussion
examples of the challenges we are facing.
Phenotype Curation Strategy in Gramene
Junjian Ni for Gramene Project
Results of phenotypic studies, including mutant genes and quantitative trait loci (QTL), on cereal crops are rapidly accumulating. To enhance the utility and accessibility of this data resource, Gramene has developed structures and guidelines for curating and housing phenotype results and relevant background information, and for making these data available on-line to the research community. To facilitate rapid development and curation, we have separated phenotypic curation into three levels.
LEVEL-1 curation, currently underway, provides users with core information
about phenotype description, trait association, map position, and related database cross references. LEVEL-2 curation is under development and will enhance the core information by providing associations to controlled vocabulary, information about phenotype allele effects and allelic interaction, nucleotide and protein sequence links if feasible, details regarding the phenotypic assays and environmental conditions under which each phenotype was evaluated, a description of the germplasm or genetic population(s) used in the analysis, and the statistical methodology employed, etc. LEVEL-3 will solicit direct contributions from the research community and will provide a repository of raw genotype and phenotype data so that future re-analysis of large data sets will be possible. The current release of the phenotype databases houses LEVEL-1 information for more than 7,000 QTLs in rice, maize, barley, oat and wild rice
curated from the published literature, and for 424 mutant genes from rice. Future releases will include more phenotype studies for mutant genes and QTL in other cereal crops as well as LEVEL-2 and LEVEL-3 curation for existing phenotypes. This talk will focus on our curation strategy, curation tools and our progress to date.
Phenotype Annotation Tools at MaizeGDB: www.maizegdb.org
Mary Polacco1, Carolyn Lawrence2, Sanford Baran3
1 USDA-ARS, University of Missouri, Columbia, MO 65211;
2 Iowa State University, Ames, IA 50011; 3 Sanford Baran, Boulder, CO 80306
Tools for curator and
community annotation of MaizeGDB phenotypes will be presented and demonstrated
on-line. Phenotypes are
described by a number of fields, including alleles, stocks, images, references, comments and controlled
vocabularies for traits, body
parts, developmental stages and metabolic pathways. In MaizeGDB, the Plant
Ontology accessions are associated with the "body parts" and will soon be associated with "developmental
stages". The first data were
obtained in 1991 from the Maize
Newsletter gene and stock center lists, with updates made using an APT Sybase
client whose design is similar in many respects to the community tool. We are
especially grateful to the Maize Genetics Cooperation Stock Center as an
ongoing source of curation for phenotype data. We thank the research groups of MG Neuffer and
Rob Martiensen for providing systematized phenotype
descriptors of mutant stocks, and MG Neuffer for providing thousands of
images for permanent stocks now in
the Stock Center.
The Neurospora crassa community genome annotation project.
Heather M. Hood1, James E. Galagan2, Bruce W.
Birren2, Jay C. Dunlap3, and Matthew S. Sachs1
1 Department of
Environmental and Biomolecular Systems, Oregon Health & Science University,
Beaverton, OR 97007; 2 Broad Institute, Cambridge, MA 02141; 3
Department of Genetics, Dartmouth Medical School, Hanover, NH 03755.
The
filamentous fungus Neurospora crassa is a key model organism.
Its genome sequence was publicly released in 2001. Approximately 10,000 protein-coding
genes are predicted in its ~40 megabase genome based on automated
annotation. While automated
annotation, including prediction of protein-coding regions and intron-exon boundaries,
is crucial for analyzing genome sequences, errors in automated processes occur
(e.g., incorrect prediction of splice sites). Moreover, automated annotations only provide structural
predictions without significant functional information. To produce richer and more accurate
annotation, additional manual annotation and curation are necessary. Therefore, we are developing community
annotation resources in conjunction with curator-based manual annotation to
improve our understanding of the Neurospora genome.
A central goal of this project is to integrate phenotypic annotation
with genomic sequence. A
substantial body of phenotypic information has already been collected for
approximately 1000 genetic loci and ongoing gene knockout experiments are expanding
our phenotypic knowledge of the effects of mutations. To produce the most valuable annotation data, we will
establish a controlled vocabulary to describe abnormal (mutant)
phenotypes. By providing multiple
layers of annotation and integration of genotypic and phenotypic data, the
research community will receive a first-rate resource on the biology of Neurospora.
WormBase Gene Expression
Curation and Anatomy/Developmental Ontology
Wen J. Chen, Igor Antoshechkin, Raymond Lee, WormBase
Consortium.
There are three classes of
gene expression data in WormBase. The first class, expression patterns,
comprises data obtained from individual studies, such as reporter gene analysis,
antibody staining and Northern analysis. To curate expression patterns,
information regarding target genes, anatomical and temporal expression
patterns, experimental methods, and references was manually extracted from
published literature following standard operating procedures. The other two
classes are results from microarray and serial analysis of gene expression
(SAGE). These large-scale data sets were usually obtained from authors and
parsed with scripts according to WormBase data models before being entered into
the database. Since October 2003, we have also been curating data on the
regulation of gene expression by mutation, transgene, RNAi, drug, or
temperature.
All gene expression and
regulation data can be directly accessed from the WormBase gene page. Users can
also access related information such as transgenic lines and antibodies.
WormBase has a complete and up-to-date collection of expression patterns and
antibodies published in the literature. Curation of integrated transgenic lines,
assisted by Textpresso software, is ~80% complete. The remainder
awaits a complete full-text paper collection.
We have developed
anatomy/developmental ontologies covering cells, anatomical parts and life
stages for C. elegans. There are
~5,000 anatomy ontology terms and 68 terms for developmental life stages
organized in a DAG structure. Developmental ontology terms have been applied to
all gene expression data. In July 2004, we started to apply anatomy ontology
terms to gene expression curation. Old gene expression data will be
retrofitted using the anatomy ontology in the near future.
Overview of the Plant Ontology: development and
general structure
Katica Ilic, TAIR,
Carnegie Institution of Washington, 260 Panama Street, Stanford, CA 94305
The Plant Ontology
Consortium (POC) is a collaboration among plant databases and experts in plant
systematics, botany and genomics (www.plantontology.org). Our goal is to
develop, curate and share controlled vocabularies that uniformly describe
flowering plant anatomy/morphology and growth/developmental stages. These
ontologies will provide a semantic framework for meaningful cross-species
queries across databases such as Gramene, TAIR, MaizeGDB and others.
Implementation of standard descriptors will ensure consistent use of these
vocabularies in the annotation of tissue and/or growth stage specific
expression of genes, proteins and phenotypes. The first task of the POC project
was to efficiently integrate existing species-specific vocabularies currently
in use to describe Arabidopsis, maize and rice anatomy and morphology. The
first version of Plant Structure Ontology, released in July 2004, achieved this
goal by spanning two major taxonomic divisions: monocots and dicots. We are
currently working on the Growth and Developmental Stages Ontology, and are
primarily focused on Arabidopsis, rice and maize. In coming years, POC will
extend its controlled vocabularies to encompass legumes, Solanaceae and other flowering plants. The latest version of
the Plant Structure Ontology will be presented as well as key organizing
principles and rules followed in developing plant ontologies.
The project is
supported by National Science Foundation grant No. 0321666 to the Plant
Ontology Consortium.
dictyBase: A New Gene Page
Petra Fey, dictyBase
The
dictyBase database holds sequence data from 3 different sources: Gene Predictions from the Sequencing
Centers, GenBank records, and ESTs.
Using the available sequences and other resources, curators compare the
different sequences and add a 'Curated Model,' which represents the best
possible gene model. As a result,
many different sequences and gene models are associated with a curated gene.
The
objective in developing a new Gene Page was to give user-friendly access to
this data through one page. The
original data model (generously provided by SGD) was designed to represent a
one-to-one relationship of locus to sequence. When more sequences and gene models were added to a locus it
became confusing to users; we were unable to clearly display which sequence
they were retrieving from the page.
As a consequence, we implemented a new data model, which allows the
addition of as many sequences/gene models as necessary to a Gene Page.
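
The shift from a one-to-one to a one-to-many relationship can be sketched in Python as follows. This is only an illustration of the idea under assumed names: the `Sequence` and `GenePage` classes, their fields, and the identifiers are hypothetical and do not reflect dictyBase's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Sequence:
    """One sequence or gene model, with its own detail page (hypothetical)."""
    identifier: str
    source: str  # "Gene Prediction", "GenBank", "EST", or "Curated Model"


@dataclass
class GenePage:
    """A gene that may carry any number of sequences and gene models."""
    name: str
    sequences: list = field(default_factory=list)

    def add_sequence(self, sequence):
        self.sequences.append(sequence)

    def by_source(self, source):
        """List the sequences that came from one source."""
        return [s for s in self.sequences if s.source == source]

    def curated_model(self):
        """Return the curated model if one exists, otherwise None."""
        models = self.by_source("Curated Model")
        return models[0] if models else None


# Placeholder gene and identifiers, invented for this example.
gene = GenePage("geneX")
gene.add_sequence(Sequence("pred-001", "Gene Prediction"))
gene.add_sequence(Sequence("gb-002", "GenBank"))
gene.add_sequence(Sequence("est-003", "EST"))
gene.add_sequence(Sequence("cur-004", "Curated Model"))
```

Keeping the source on each sequence record is one way to make clear which sequence a user is retrieving, the ambiguity that the original one-to-one model could not resolve.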
Every
sequence on the newly designed Gene Page is linked to its own Sequence Page
with extensive sequence information.
The Gene Page is therefore at the top of a hierarchy with information
such as Gene Ontology, Literature, Phenotype, and a summary paragraph. From the Gene Page the user can access
more specific information about the different sequences or other available
information. We would like to get
comments about this new Gene Page and generate discussion in light of the GMOD
meeting earlier this year 'Towards a Unified Gene Page.'
Morphological Data in
GrainGenes: Challenging
Nomenclature and Creating Image-Friendly Maps.
Victoria Carollo, David Matthews, Gerard Lazo, Olin Anderson
USDA-ARS, Albany, CA 94710
In 2004, the GrainGenes
curators completely reorganized the GrainGenes database into a relational
database architecture and integrated the website and database portal into
one tool. Throughout this
restructuring, we have reviewed all data types and updated our models to
reflect new trends in molecular data.
A companion database, Grainotypes, has been designed to serve data
generated by wheat and barley genotyping centers, to correlate molecular marker
alleles with genes and quantitative traits. The Triticeae have a longstanding authoritative system
of gene and trait nomenclature under the auspices of the Wheat Gene Catalog and
the Barley Genetics Newsletter, which may be both helpful and thorny for any
attempt at a Trait Ontology. Our method of assigning GO terms to ESTs will be
discussed. Finally, a simple
html-based tool to visualize graphical data describing mapped mutations will be
shown.
Textpresso: A Tool to
Expedite Literature Curation Tasks
Andrei Petcherski, Juancarlos Chan, Eimear Kenny, Hans-Michael Mueller
and
Paul Sternberg
We have previously reported
Textpresso, a web-based system to automatically retrieve and extract
information from biological literature (http://www.textpresso.org). The Textpresso system is based on shallow text
parsing to identify terms of interest, which have been organized into
categories, in a corpus of ASCII journal abstract and article text. The parsed
text is output as XML and is searchable by keywords and/or categories
via multiple user-friendly interfaces. This tool is being developed for use as
a general user-level search engine, a specialized tool for curators, and
ultimately, for automatic extraction of biological facts from text.
We have used Textpresso to
automatically mine some simple types of data from C. elegans literature. For example, Textpresso was used to
locate all integrated transgenes and to extract the associations between
alleles and genes. Some more complicated biological facts expressed within a
single sentence have been extracted in a semi-automated fashion. For example,
we have integrated Textpresso into the WormBase curation pipeline to expedite the
extraction of genetic interaction information from the literature. A prototype
curation interface has been developed to enable a curator to extract data from
sentences returned by a Textpresso query for genetic interaction. We found that
sentences matching a Textpresso query for gene-gene interactions are enriched
3-fold compared to sentences that mention two or more gene names and 39-fold
compared to random sentences from the literature. We have successfully
completed one large-scale experiment in which 3,702 distinct gene-gene interacting
pairs were extracted from the sentences returned by Textpresso from 2,200
full-text journal articles in less than 200 curator hours. The project
currently focuses on C. elegans
literature; however, an expansion to the literature of other model organisms and
biological domains is underway. The project is part of WormBase (http://www.wormbase.org) and GMOD (http://www.gmod.org).
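
The kind of category-based sentence query described in this abstract can be approximated with a short Python sketch. It is a toy illustration, not Textpresso's implementation: the gene names, the interaction terms, the tokenizer, and the example text are all invented, and real category lexicons are far larger.

```python
import re

# Hypothetical category lexicons (placeholders, not real gene names).
GENES = {"foo-1", "bar-2", "baz-3"}
INTERACTION_TERMS = {"suppresses", "enhances", "interacts"}


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(sentence):
    """Lowercase word tokens, keeping gene-style names such as 'foo-1' whole."""
    return re.findall(r"[A-Za-z]+(?:-\d+)?", sentence.lower())


def candidate_sentences(text):
    """Keep sentences naming at least two genes and one interaction term."""
    hits = []
    for sentence in split_sentences(text):
        words = set(tokens(sentence))
        if len(words & GENES) >= 2 and words & INTERACTION_TERMS:
            hits.append(sentence)
    return hits


# Invented example text.
TEXT = (
    "A mutation in foo-1 suppresses the bar-2 phenotype. "
    "Both foo-1 and baz-3 are expressed in neurons. "
    "The bar-2 mutant is sterile."
)

matches = candidate_sentences(TEXT)
```

Only the first sentence is returned: the second names two genes but no interaction term, which is the kind of sentence the category query filters out relative to a search for two gene names alone.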
Multiple Ontologies, Data
Mining and Attaching Biology to the Genome
Mary Shimoyama
Rat Genome Database
The use of multiple
ontologies allows the Rat Genome Database not only to provide comprehensive,
standardized biological information on genes, QTLs and strains, but also
to develop sophisticated search and data mining tools and to
provide genome-wide visualization of categories of data. The Multiple Ontology Browser, Gene
Annotation Tool, GBrowse with annotation tracks and Gview will be presented and
discussed. In addition, future developments
will be profiled.
Report on the 2004 BioCreAtIvE Workshop
Lynette Hirschman, Marc Colosimo, Alexander Yeh, Alexander Morgan
The MITRE Corporation, Bedford, MA
Christian Blaschke, Alfonso Valencia
Centro Nacional de Biotecnologia,
Universidad Autonoma, Madrid, Spain
The first BioCreAtIvE
Workshop (Critical Assessment of Information Extraction in Biology) was held in
Granada, Spain, March 28-31, 2004.
The goal of the workshop was to provide a set of common challenge
evaluation tasks to assess the state of the art for text mining applied to
biological problems. The assessment focused on two tasks. The first dealt with extraction of gene
or protein names from MEDLINE text, and their mapping into standardized gene
identifiers for three model organism databases (fly, mouse, yeast). The second
task addressed issues of functional annotation, requiring systems to provide
Gene Ontology annotations for proteins, given full text articles. Overall, 27 groups participated in the
assessment.
The results for gene/protein name extraction showed that four systems
were able to extract general gene names from sentences at over 80% balanced
precision and recall. For the name
normalization subtask, the results ranged from a high for yeast of 92% balanced
precision and recall, to somewhat lower scores for fly (82%) and mouse (79%),
due to extensive ambiguity among gene synonyms and overlap with standard
English vocabulary.
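
For readers unfamiliar with these scores, the Python sketch below computes precision, recall, and their harmonic mean (the balanced F-measure) for a set of predicted identifiers against a gold standard. Reading "balanced precision and recall" as the balanced F-measure is an assumption made here, and the identifiers are invented; this is not the workshop's scoring software.

```python
def precision_recall_f(predicted, gold):
    """Return (precision, recall, balanced F-measure) for two identifier sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean weights precision and recall equally.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented gene identifiers for one hypothetical abstract.
gold_ids = {"ID-1", "ID-2", "ID-3", "ID-4"}
predicted_ids = {"ID-1", "ID-2", "ID-3", "ID-9"}

p, r, f = precision_recall_f(predicted_ids, gold_ids)
```

In this example three of the four predictions are correct and three of the four gold identifiers are found, so precision and recall are both 0.75.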
For
the functional annotation task, systems were asked to identify a segment of
text as evidence for a GO annotation, given the protein. When both protein name
and the GO annotation were given, several systems provided correct evidence for
the GO predictions 25-30% of the time; two systems achieved much higher rates
of correct predictions (50% and 75-80%) by predicting only for
high-confidence cases. When the systems
were given only the protein name, the results were significantly lower (~10%
for systems providing predictions for all proteins and ~30-35% for the high
precision systems providing only a few answers).
Milwaukee, WI 2003 Continued: Proposal to Journals for a Database Section
Mary Polacco, MaizeGDB Curator
USDA-ARS, 203 Curtis Hall,
University of Missouri, Columbia, MO 65211
As follow-up to last year's
discussion, I will present a simple method whereby some information in the
literature may be rendered more amenable to electronic tool mining and
annotation. Datatypes include:
species and subspecies; cultivar/strain/stock; GenBank accessions; gene and allele names and/or symbols, and
their association with GenBank accessions; gene products; GO terms. Last year,
during the final wrap-up session, this group had a very good discussion about
the utility of such information being included in journals, as a "database
section", with "tags", and provided to the community in the same manner that
abstracts are currently made available. I will ask the group to review the
datatypes useful for their species during my 15 min interactive presentation,
and work towards a strategy for implementation. The implementation is
envisioned to require the cooperation of major journals and of literature services
that currently provide citations and abstracts electronically, so that they also provide
the database section.