2004 International Biocurator Meeting
September 20-21 - Eugene, Oregon
ABSTRACTS


New Approaches and Tools for Zebrafish Phenotype Curation

 

Erik Segerdell, Melissa Haendel, Ceri Van Slyke and Monte Westerfield

Zebrafish Information Network (ZFIN), http://zfin.org

 

The zebrafish is a premier model organism for the study of vertebrate genetics and development. Genetic screens have identified thousands of zebrafish mutations affecting many developmental and physiological processes, while new projects are expected to produce a vast amount of phenotypic data in the next few years. Furthermore, many laboratories are now using antisense oligonucleotides, or morpholinos, to produce phenotypes through targeted disruption of gene function.

 

A major goal at ZFIN is to provide comprehensive informatics support for this rapidly growing set of information that will foster comparisons of vertebrate gene structure and function across species, including human. ZFIN's mutant database currently supports a small constrained list of phenotypic keywords and brief free-text descriptions. These fields, however, are difficult to maintain and they limit curators' ability to consistently annotate data that is easily searchable and that can be linked to other bioinformatics resources. To make our phenotype coverage much more robust and flexible, we have developed orthogonal ontologies to describe developmental staging, anatomy, biological processes, and phenotypic attributes. We have begun trial curation with these ontologies, including the structured vocabulary and syntax developed by the cross-species Phenotype and Trait Ontology (PATO) consortium. We also have distributed a portable phenotype-annotation database to curators taking part in trial annotation and to several zebrafish laboratories as a means to collect new data and to encourage feedback on the ontologies and user interface.
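As a rough illustration of why annotations built from orthogonal ontologies are easier to search than free text, the sketch below stores each phenotype as a combination of a stage term, an entity term, and an attribute/value pair. The class name, field names, and all term strings are hypothetical placeholders chosen for this example; they are not ZFIN's schema or actual PATO terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhenotypeAnnotation:
    """One structured phenotype statement (hypothetical layout)."""
    genotype: str   # mutant or morpholino identifier (placeholder)
    stage: str      # term from a developmental staging ontology
    entity: str     # term from an anatomy or biological-process ontology
    attribute: str  # PATO-style attribute, e.g. "size"
    value: str      # PATO-style value, e.g. "small"


def find_by_entity(annotations, entity):
    """Return every annotation whose entity term matches exactly."""
    return [a for a in annotations if a.entity == entity]


# Placeholder records; none of these are real curated data.
annotations = [
    PhenotypeAnnotation("mutant-A", "stage-1", "eye", "size", "small"),
    PhenotypeAnnotation("mutant-B", "stage-1", "eye", "shape", "irregular"),
    PhenotypeAnnotation("mutant-A", "stage-2", "heart", "size", "large"),
]

eye_hits = find_by_entity(annotations, "eye")
```

Because each field holds a controlled term rather than prose, the same query works uniformly across all records, which is the property that free-text descriptions lack.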



Curation of phenotypic data in FlyBase

 

Gillian Millburn

FlyBase, University of Cambridge, UK

 

The focus of my talk will be curation of phenotypic data in FlyBase. I will describe how both single mutant phenotypes and genetic interactions involving mutants of more than one gene are captured using controlled vocabularies and a uniform syntax, which allows consistent searching across all the data. FlyBase does not yet use the PATO (Phenotype and Trait Ontology) system for describing phenotypic data, but has been involved in developing the PATO vocabularies and intends to implement PATO-type curation in the future. I will end with a list of things I would like to learn from the meeting to help with implementation of PATO in FlyBase - any tools that groups are using for PATO curation, or lessons learnt from retrofitting old phenotypic descriptions to the PATO model, would be most useful!



The Mammalian Phenotype Ontology: Development and Use in Phenotypic Data Annotation at Mouse Genome Informatics

 

Cynthia L Smith, Carroll-Ann W. Goldsmith, Janan T. Eppig.

Mouse Genome Informatics (MGI, http://www.informatics.jax.org), The Jackson Laboratory, Bar Harbor, ME.

 

The Mammalian Phenotype Ontology has been developed as part of the Mouse Genome Informatics (MGI) program to provide a standardized phenotypic trait vocabulary for annotating phenotypic data. This vocabulary contains more than 3000 phenotypic terms and is actively used by the mouse and rat model organism databases (MGD and RGD) to annotate phenotypes of mutants, QTLs, complex genotypes, and strains. The Mammalian Phenotype Ontology is organized as a directed acyclic graph (DAG), with its highest level terms consisting of phenotypic trait descriptors covering physiological systems, behavior, embryogenesis and survival/aging. Physiological systems are further divided into morphological and physiological branches at the next node level. The Mammalian Phenotype (MP) Ontology may be accessed in browser format at http://www.informatics.jax.org/searches/MP_form.shtml and the files may be accessed in both GO flat file format and OBO format from our ftp site at ftp://ftp.informatics.jax.org/pub/reports/index.html#pheno.
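The practical consequence of a DAG organization is that a term may have more than one parent, so a query on a high-level term should also return annotations made to any of its descendants. The following is a minimal sketch of that idea; the term names, parent links, and genotype labels are invented for illustration and are not actual MP terms or MGI data.

```python
# Hypothetical term names and parent links, for illustration only.
PARENTS = {
    "phenotype": [],
    "cardiovascular system phenotype": ["phenotype"],
    "embryogenesis phenotype": ["phenotype"],
    "abnormal cardiovascular morphology": ["cardiovascular system phenotype"],
    # A DAG allows a term to sit under two different branches.
    "abnormal embryonic heart morphology": [
        "abnormal cardiovascular morphology",
        "embryogenesis phenotype",
    ],
}


def ancestors(term, parents=PARENTS):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def annotated_under(query, annotations, parents=PARENTS):
    """Genotypes annotated to the query term or to any of its descendants."""
    return sorted(
        genotype
        for genotype, term in annotations
        if term == query or query in ancestors(term, parents)
    )


# Placeholder (genotype, term) pairs; not real annotations.
ANNOTATIONS = [
    ("genotype-1", "abnormal embryonic heart morphology"),
    ("genotype-2", "abnormal cardiovascular morphology"),
    ("genotype-3", "embryogenesis phenotype"),
]
```

In this sketch, "genotype-1" is returned for queries on either the cardiovascular branch or the embryogenesis branch, because its term has a parent in each.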

 

We will describe new strategies being developed for improving accessibility to mouse phenotype data at MGI in structured formats using the MP vocabulary to support computational analyses, as well as provide human-friendly interfaces that enhance our ability to explore genotype-phenotype relationships. These include standardized annotation methods for baseline allele and mutant data for each genotype assayed (genotype=mutations/alleles on defined strain background) and associations between mouse phenotypes and orthologous human gene mutations or human disease syndromes for which particular mouse genotypes serve as phenotypic models.


 

The Mouse Genome Database (MGD) is supported by NIH/NHGRI grant HG00330.



Phenotype Data at the Rat Genome Database

 

Charles Wang

RGD

 

The rat's use as a model for human disease results in extensive pathophysiology, toxicology and pharmacology studies. In order to meet the needs of these research communities, the Rat Genome Database emphasizes acquiring and presenting phenotype and disease data in standard formats that will allow the user to search, organize and mine data according to interest area. In the past year, RGD added two ontologies, Mammalian Phenotype Ontology and Disease Ontology, to prioritize and annotate biological data for genes, QTLs and strains. RGD is collaborating with the Mouse Genome Database to create the Mammalian Phenotype Ontology and has adapted the disease branch of the MeSH vocabulary for its Disease Ontology. RGD currently has over 1800 phenotype and 2000 disease annotations. In accordance with the PATO design for phenotype data presentation, RGD has developed an expanded annotation system that allows for the addition of assays and values as well as environmental conditions and genetic factors that influence the phenotype. This expanded annotation format will initially be used for phenotype data generated for the strains of the Programs for Genomic Applications project (http://pga.mcw.edu/) followed by full annotation of QTLs and associated strains. RGD also has developed tools that incorporate the power of the Mammalian Phenotype Ontology. The Mammalian Phenotype Ontology and Disease Ontology annotations provide users with powerful search, sort and navigation functions. As shown in a later presentation, data mining at RGD has been further enhanced with the addition of these annotations to our Gene Annotation Tool, as well as the creation of ontology tracks on GBrowse as a means of attaching biology to the genome.
Future developments at RGD for phenotype data include the expansion of phenotype reports, the provision of baseline data for major phenotypes and the expansion of disease annotations to include information on genetic factors and environmental conditions.


 


Mutant phenotypes in the Saccharomyces Genome Database and Candida Genome Database

 

Costanzo, M.C., Arnaud, M., Oughtred, R., Skrzypek, M., Balakrishnan, R., Christie, K.R., Dolinski, K., Dwight, S.S., Engel, S.R., Fisk, D.G., Hirschman, J.E., Hong, E.L., Nash, R., Theesfeld, C.L., Andrada, R., Binkley, G., Dong, Q., Lane, C., Miyasato, S., Schroeder, M., Sethuraman, A., Weng, S., Botstein, D., Sherlock, G., and Cherry, J.M.

 

The Saccharomyces Genome Database (SGD) gathers information about genes and proteins of the yeast Saccharomyces cerevisiae.  The Candida Genome Database (CGD) is a new database for the fungal pathogen Candida albicans developed at Stanford in collaboration with the SGD project.  Data, including information about mutant phenotypes, are collected from one or more of the following sources: from the published literature, via manual curation of individual research articles; from datasets generated by large-scale studies; and from researchers, via direct submission of results. 



SGD contains over 23,000 phenotype entries; more than 4,000 of these come from individual annotations, and the rest are derived from a handful of systematic mutant analysis studies.  Information about each phenotype is divided into three free-text categories: Mutant Type, Mutant Phenotype, and Notes. The great variety of free-text entries in each of these categories currently makes it very difficult to perform systematic searches or bioinformatics analysis.  However, having already collected such a vast amount of information, SGD is well positioned to overhaul its phenotype data and implement a system that is amenable to computer-assisted analysis.



CGD has implemented a more controlled system for phenotype curation than SGD, and is in the process of refining these procedures.  CGD uses controlled vocabularies that describe broad groups of related phenotypes, coupled with free-text comments that provide further details.



We are now developing and testing a controlled phenotype vocabulary for SGD as well as some other phenotype curation methods. We will present for discussion examples of the challenges we are facing.



Phenotype Curation Strategy in Gramene

 

Junjian Ni for Gramene Project

 

Results of phenotypic studies including mutant genes and quantitative trait loci (QTL) on cereal crops are rapidly accumulating. To enhance the utility and accessibility of this data resource, Gramene has developed structures and guidelines for curating and housing phenotype results and relevant background information, and for making these data available on-line to the research community. To facilitate rapid development and curation, we have separated phenotypic curation into three levels.


LEVEL-1 curation, currently underway, provides users with core information about phenotype description, trait association, map position, and related database cross references. LEVEL-2 curation is under development and will enhance the core information by providing associations to controlled vocabulary, information about phenotype allele effects and allelic interaction, nucleotide and protein sequence links if feasible, details regarding the phenotypic assays and environmental conditions under which each phenotype was evaluated, a description of the germplasm or genetic population(s) used in the analysis, and the statistical methodology employed. LEVEL-3 will solicit direct contributions from the research community and will provide a repository of raw genotype and phenotype data so that future re-analysis of large data sets will be possible. The current release of the phenotype databases houses LEVEL-1 information for more than 7,000 QTLs in rice, maize, barley, oat and wild rice curated from the published literature, as well as 424 mutant genes from rice. Future releases will include more phenotype studies for mutant genes and QTL in other cereal crops as well as LEVEL-2 and LEVEL-3 curation for existing phenotypes. This talk will focus on our curation strategy, curation tools and our progress to date.



Phenotype Annotation Tools at MaizeGDB: www.maizegdb.org

 

Mary Polacco1, Carolyn Lawrence2, Sanford Baran3

1 USDA-ARS, University of Missouri, Columbia, MO 65211; 2 Iowa State University, Ames, IA 50011; 3 Sanford Baran, Boulder, CO 80306

 

Tools for curator and community annotation of MaizeGDB phenotypes will be presented and demonstrated on-line. Phenotypes are described by a number of fields, including alleles, stocks, images, references, comments and controlled vocabularies for traits, body parts, developmental stages and metabolic pathways. In MaizeGDB, the Plant Ontology accessions are associated with the "body parts" and will soon be associated with the "developmental stages". The first data were obtained in 1991 from the Maize Newsletter gene and stock center lists, with updates made using an APT Sybase client, whose design is similar in many respects to the community tool. We are especially grateful to the Maize Genetics Cooperation Stock Center as an ongoing source of curation for phenotype data. We acknowledge the research groups of MG Neuffer and Rob Martienssen for providing systematized phenotype descriptors of mutant stocks, and MG Neuffer for providing thousands of images for permanent stocks now in the Stock Center.



The Neurospora crassa community genome annotation project.

 

Heather M. Hood1, James E. Galagan2, Bruce W. Birren2, Jay C. Dunlap3, and Matthew S. Sachs1

1 Department of Environmental and Biomolecular Systems, Oregon Health & Science University, Beaverton, OR 97007; 2 Broad Institute, Cambridge, MA 02141; 3 Department of Genetics, Dartmouth Medical School, Hanover, NH  03755.

 

The filamentous fungus Neurospora crassa is a key model organism.  Its genome sequence was publicly released in 2001.  Approximately 10,000 protein-coding genes are predicted in its ~40 megabase genome based on automated annotation.  While automated annotation, including prediction of protein-coding regions and intron-exon boundaries, is crucial for analyzing genome sequences, errors in automated processes occur (e.g., incorrect prediction of splice sites).  Moreover, automated annotations only provide structural predictions without significant functional information.  To produce richer and more accurate annotation, additional manual annotation and curation is necessary.  Therefore, we are developing community annotation resources in conjunction with curator-based manual annotation to improve our understanding of the Neurospora genome.  A central goal of this project is to integrate phenotypic annotation with genomic sequence.  A substantial body of phenotypic information has already been collected for approximately 1000 genetic loci and ongoing gene knockout experiments are expanding our phenotypic knowledge of the effects of mutations.  To produce the most valuable annotation data, we will establish a controlled vocabulary to describe abnormal (mutant) phenotypes.  By providing multiple layers of annotation and integration of genotypic and phenotypic data, the research community will receive a first-rate resource on the biology of Neurospora.


 


WormBase Gene Expression Curation and Anatomy/Developmental Ontology

 

Wen J. Chen, Igor Antoshechkin, Raymond Lee, WormBase Consortium.

 

There are three classes of gene expression data in WormBase. The first class, expression patterns, comprises data obtained from individual studies, such as reporter gene analysis, antibody staining and Northern analysis. To curate expression patterns, information regarding target genes, anatomical and temporal expression patterns, experimental methods, and references was manually extracted from published literature following standard operating procedures. The other two classes are results from microarray and serial analysis of gene expression (SAGE). These large-scale data sets were usually obtained from authors and parsed with scripts according to WormBase data models before being entered into the database. Since October 2003, we have also started to curate data about regulation of gene expression by mutation, transgene, RNAi, drug, or temperature.

 

All gene expression and regulation data can be directly accessed from the WormBase gene page. Users can also access related information such as transgenic lines and antibodies. WormBase has a complete and up-to-date collection of expression patterns and antibodies published in the literature. Curation of integrated transgenic lines, which is assisted by Textpresso software, is ~80% complete. The remainder awaits a complete full-text paper collection.

 

We have developed anatomy/developmental ontologies covering cells, anatomical parts and life stages for C. elegans. There are ~5,000 anatomy ontology terms and 68 terms for developmental life stages, organized in a DAG structure. Developmental ontology terms have been applied to all gene expression data. In July 2004, we started to apply anatomy ontology terms to gene expression curation. Old gene expression data will be retrofitted using the anatomy ontology in the near future.



 

Overview of the Plant Ontology: development and general structure

 

Katica Ilic, TAIR, Carnegie Institution of Washington, 260 Panama Street, Stanford, 94305 CA

 

The Plant Ontology Consortium (POC) is a collaboration among plant databases and experts in plant systematics, botany and genomics (www.plantontology.org). Our goal is to develop, curate and share controlled vocabularies that uniformly describe flowering plant anatomy/morphology and growth/developmental stages. These ontologies will provide a semantic framework for meaningful cross-species queries across databases such as Gramene, TAIR, MaizeGDB and others. Implementation of standard descriptors will ensure consistent use of these vocabularies in the annotation of tissue and/or growth stage specific expression of genes, proteins and phenotypes. The first task of the POC project was to efficiently integrate existing species-specific vocabularies currently in use to describe Arabidopsis, maize and rice anatomy and morphology. The first version of the Plant Structure Ontology, released in July 2004, achieved this goal by spanning two major taxonomic divisions: monocots and dicots. We are currently working on the Growth and Developmental Stages Ontology, and are primarily focused on Arabidopsis, rice and maize. In coming years, POC will extend its controlled vocabularies to encompass legumes, Solanaceae and other flowering plants. The latest version of the Plant Structure Ontology will be presented as well as key organizing principles and rules followed in developing plant ontologies.

 

The project is supported by National Science Foundation grant No. 0321666 to the Plant Ontology Consortium.



dictyBase: A New Gene Page

 

Petra Fey, dictyBase

 

The dictyBase database holds sequence data from 3 different sources:  Gene Predictions from the Sequencing Centers, GenBank records, and ESTs.  Using the available sequences and other resources, curators compare the different sequences and add a 'Curated Model,' which represents the best possible gene model.  As a result, many different sequences and gene models are associated with a curated gene.

 

The objective in developing a new Gene Page was to give user-friendly access to this data through one page.  The original data model (generously provided by SGD) was designed to represent a one-to-one relationship of locus to sequence.  When more sequences and gene models were added to a locus it became confusing to users; we were unable to clearly display which sequence they were retrieving from the page.  As a consequence, we implemented a new data model, which allows the addition of as many sequences/gene models as necessary to a Gene Page.
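The shift from a one-to-one locus-to-sequence model to one that permits many sequences per gene can be pictured with a small sketch. The class and field names below are hypothetical and are not dictyBase's actual schema; only the four source labels follow the categories named in this abstract.

```python
from dataclasses import dataclass, field


@dataclass
class Sequence:
    name: str
    source: str  # "Gene Prediction", "GenBank", "EST", or "Curated Model"


@dataclass
class Gene:
    """A gene that may carry any number of sequences and gene models."""
    name: str
    sequences: list = field(default_factory=list)

    def add_sequence(self, seq):
        self.sequences.append(seq)

    def curated_model(self):
        """Return the curated model if one has been added, else None."""
        for seq in self.sequences:
            if seq.source == "Curated Model":
                return seq
        return None


# Placeholder gene and sequence names; not real dictyBase records.
gene = Gene("geneX")
gene.add_sequence(Sequence("prediction-1", "Gene Prediction"))
gene.add_sequence(Sequence("genbank-1", "GenBank"))
gene.add_sequence(Sequence("curated-1", "Curated Model"))
```

Keeping the source on each sequence is one way a page could show users exactly which sequence they are retrieving.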

 

Every sequence on the newly designed Gene Page is linked to its own Sequence Page with extensive sequence information.  The Gene Page is therefore at the top of a hierarchy with information such as Gene Ontology, Literature, Phenotype, and a summary paragraph.  From the Gene Page the user can access more specific information about the different sequences or other available information.  We would like to get comments about this new Gene Page and generate discussion in light of the GMOD meeting earlier this year 'Towards a Unified Gene Page.'


 


Morphological Data in GrainGenes:  Challenging Nomenclature and Creating Image-Friendly Maps.

 

Victoria Carollo, David Matthews, Gerard Lazo, Olin Anderson

USDA-ARS, Albany, CA  94710

 

In 2004, the GrainGenes curators completely reorganized the GrainGenes database into a relational database architecture, and integrated the website and database portal into one tool. Throughout this restructuring, we have reviewed all data types and updated our models to reflect new trends in molecular data. A companion database, Grainotypes, has been designed to serve data generated by wheat and barley genotyping centers, to correlate molecular marker alleles with genes and quantitative traits. The Triticeae have a longstanding authoritative system of gene and trait nomenclature under the auspices of the Wheat Gene Catalog and the Barley Genetics Newsletter, which may be both helpful and thorny for any attempt at a Trait Ontology. Our method of assigning GO terms to ESTs will be discussed. Finally, a simple HTML-based tool to visualize graphical data describing mapped mutations will be shown.



Textpresso: A Tool to Expedite Literature Curation Tasks

 

Andrei Petcherski, Juancarlos Chan, Eimear Kenny, Hans-Michael Mueller and Paul Sternberg

 

We have previously reported Textpresso, a web-based system to automatically retrieve and extract information from biological literature (http://www.textpresso.org). The Textpresso system is based on shallow text parsing to identify terms of interest, which have been organized into categories, in a corpus of ASCII journal abstract and article text. The parsed text is output as XML and is searchable by keywords and/or categories via multiple user-friendly interfaces. This tool is being developed for use as a general user-level search engine, a specialized tool for curators, and ultimately, for automatic extraction of biological facts from text.

 

We have used Textpresso to automatically mine some simple types of data from C. elegans literature. For example, Textpresso was used to locate all integrated transgenes and to extract the associations between alleles and genes. Some more complicated biological facts expressed within a single sentence have been extracted in a semi-automated fashion. For example, we have integrated Textpresso into the WormBase curation pipeline to expedite the extraction of genetic interaction information from the literature. A prototype curation interface has been developed to enable a curator to extract data from sentences returned by a Textpresso query for genetic interaction. We found that sentences matching a Textpresso query for gene-gene interactions are enriched 3-fold compared to sentences that mention two or more gene names and 39-fold compared to random sentences from the literature. We have successfully completed one large-scale experiment in which 3,702 distinct gene-gene interacting pairs were extracted from the sentences returned by Textpresso from 2,200 full-text journal articles in less than 200 curator hours. The project currently focuses on C. elegans literature; however, an expansion to the literature of other model organisms and biological domains is underway. The project is part of WormBase (http://www.wormbase.org) and GMOD (http://www.gmod.org).
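The idea of a category-based sentence query can be sketched in a few lines. This is a toy illustration only, not Textpresso's implementation: the lexicons, gene names, and sentences are invented placeholders, and the matching rule (at least two distinct gene names plus one interaction term in the same sentence) is an assumption made for the example.

```python
import re

# Tiny hypothetical category lexicons; real systems use far larger ones.
GENES = {"gene-a", "gene-b", "gene-c"}
INTERACTION_TERMS = {"suppresses", "enhances", "interacts"}


def tokenize(sentence):
    """Lowercase the sentence and split it into word-like tokens."""
    return re.findall(r"[A-Za-z0-9-]+", sentence.lower())


def matches_interaction_query(sentence):
    """True if the sentence names two or more genes and an interaction term."""
    tokens = tokenize(sentence)
    genes = {t for t in tokens if t in GENES}
    has_term = any(t in INTERACTION_TERMS for t in tokens)
    return len(genes) >= 2 and has_term


# Invented sentences, not taken from any paper.
sentences = [
    "Loss of gene-a enhances the gene-b phenotype.",
    "Both gene-a and gene-b are expressed in the embryo.",
    "gene-c suppresses growth.",
]
candidates = [s for s in sentences if matches_interaction_query(s)]
```

Only the first sentence passes the filter; the second mentions two genes without an interaction term, which mirrors the abstract's comparison between query matches and sentences that merely mention two or more gene names.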



Multiple Ontologies, Data Mining and Attaching Biology to the Genome

 

Mary Shimoyama

Rat Genome Database

 

The use of multiple ontologies allows the Rat Genome Database not only to provide comprehensive, standardized biological information on genes, QTLs and strains, but also to develop sophisticated search and data mining tools and to provide genome-wide visualization of categories of data. The Multiple Ontology Browser, Gene Annotation Tool, GBrowse with annotation tracks and Gview will be presented and discussed. In addition, future developments will be profiled.



Report on the 2004 BioCreAtIvE Workshop

 

Lynette Hirschman, Marc Colosimo, Alexander Yeh, Alexander Morgan

The MITRE Corporation, Bedford, MA

 

Christian Blaschke, Alfonso Valencia

Centro Nacional de Biotecnologia, Universidad Autonoma, Madrid, Spain

 

The first BioCreAtIvE Workshop (Critical Assessment of Information Extraction in Biology) was held in Granada, Spain, March 28-31, 2004. The goal of the workshop was to provide a set of common challenge evaluation tasks to assess the state of the art for text mining applied to biological problems. The assessment focused on two tasks. The first dealt with extraction of gene or protein names from MEDLINE text, and their mapping into standardized gene identifiers for three model organism databases (fly, mouse, yeast). The second task addressed issues of functional annotation, requiring systems to provide gene ontology annotations for proteins, given full text articles. Overall, 27 groups participated in the assessment.



The results for gene/protein name extraction showed that a number of systems (4) were able to extract general gene names from sentences at over 80% balanced precision and recall.  For the name normalization subtask, the results ranged from a high for yeast of 92% balanced precision and recall, to somewhat lower scores for fly (82%) and mouse (79%), due to extensive ambiguity among gene synonyms and overlap with standard English vocabulary.
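For readers unfamiliar with the scoring, the sketch below computes precision, recall, and their balanced combination. It assumes that "balanced precision and recall" refers to the balanced F-measure (the harmonic mean of precision and recall), which is the usual convention in text mining evaluations; the counts in the example are made up and are not workshop results.

```python
def precision(tp, fp):
    """Fraction of the system's answers that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of the correct answers that the system found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def balanced_f(tp, fp, fn):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    p = precision(tp, fp)
    r = recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Made-up counts: 80 correct names found, 20 spurious, 20 missed.
score = balanced_f(tp=80, fp=20, fn=20)
```

With these hypothetical counts, precision and recall are both 0.8, so the balanced score is also about 0.8; the harmonic mean drops quickly when the two diverge, which is why it is a stricter summary than a simple average.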



For the functional annotation task, systems were asked to identify a segment of text as evidence for a GO annotation, given the protein. When both protein name and the GO annotation were given, several systems provided correct evidence for the GO predictions 25-30% of the time; two systems provided a much higher rate of predictions being correct (50% and 75-80%) by predicting only for high confidence cases.  When the systems were given only the protein name, the results were significantly lower (~10% for systems providing predictions for all proteins and ~30-35% for the high precision systems providing only a few answers).


 


Milwaukee, WI 2003 Continued: Proposal to Journals for a Database Section

 

Mary Polacco, MaizeGDB Curator

USDA-ARS, 203 Curtis Hall, University of Missouri, Columbia, MO 65211

 

As a follow-up to last year's discussion, I will present a simple method whereby some information in the literature may be rendered more amenable to mining and annotation by electronic tools. Datatypes include: species and subspecies; cultivar/strain/stock; GenBank accessions; gene and allele names and/or symbols, and their association with GenBank accessions; gene products; GO terms. Last year, during the final wrap-up session, this group had a very good discussion about the utility of including such information in journals as a "database section", with "tags", provided to the community in the same manner that abstracts are currently made available. I will ask the group to review the datatypes useful for their species during my 15 min interactive presentation, and to work towards a strategy for implementation. The implementation is envisioned to require the cooperation of major journals and of literature services that currently provide citations and abstracts electronically, so that they also provide the database section.