001

Literature Curation at the Mouse Genome Informatics Databases

Harold J. Drabkin for the Mouse Genome Informatics Group, The Jackson Laboratory, 600 Main Street, Bar Harbor, ME 04609

The Mouse Genome Database (MGD), the Gene Expression Database (GXD), and the Mouse Tumor Biology database (MTB) are integrated components of the Mouse Genome Informatics (MGI) resource. The MGI system presents both a consensus view and extensive experimental data sets concerning the genetics and genomics of the laboratory mouse. From genotype to phenotype, this information resource integrates information about genes, sequences, maps, expression analyses, alleles, strains and mutant phenotypes. Comparative mammalian data is also presented. This data is collected through literature curation as well as downloads of large datasets (SwissProt, LocusLink, etc.).

Peer-reviewed literature is collected through a library triage system whereby curators determine if there is information in a paper that falls under roughly eleven categories. Selected articles are entered into a master bibliography and indexed to particular areas of interest such as "GO" or "homology" or "mapping". Each article is then either indexed to a marker already contained in the database, or funneled through a separate nomenclature database to add new markers. The master bibliography and associated indexing provide information for various curator reports such as "papers selected for GO that refer to genes with NO GO annotation or only IEA annotations for molecular function". Once a paper is indexed, curators with expertise in the appropriate disciplines enter the pertinent information. MGI makes use of several controlled vocabularies that ensure uniform data encoding, enable robust analysis, and support the construction of complex queries. These vocabularies range from pick-lists to structured vocabularies such as the GO. All data associations are supported with statements of evidence as well as with reference to supporting citations.
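The curator report quoted above can be sketched as a filter over the indexed bibliography. This is a minimal illustration only, not the MGI implementation: the record types, field names, and reference identifiers are hypothetical, and the report is read here as "the gene has no GO annotation at all, or all of its molecular function annotations carry the IEA evidence code".

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    aspect: str    # assumed coding: "F" function, "P" process, "C" component
    evidence: str  # GO evidence code, e.g. "IDA" or "IEA"


@dataclass
class Gene:
    symbol: str
    annotations: list = field(default_factory=list)


@dataclass
class Paper:
    ref_id: str         # hypothetical bibliography identifier
    triage_areas: set   # areas assigned at triage, e.g. {"GO", "mapping"}
    gene_symbols: set   # markers the paper has been indexed to


def needs_go_curation(gene):
    """True if the gene has no GO annotation, or only IEA for molecular function."""
    if not gene.annotations:
        return True
    function_annots = [a for a in gene.annotations if a.aspect == "F"]
    return bool(function_annots) and all(a.evidence == "IEA" for a in function_annots)


def go_report(papers, genes_by_symbol):
    """List (paper, gene) pairs where a GO-selected paper cites an under-annotated gene."""
    rows = []
    for paper in papers:
        if "GO" not in paper.triage_areas:
            continue
        for symbol in sorted(paper.gene_symbols):
            gene = genes_by_symbol.get(symbol)
            if gene is not None and needs_go_curation(gene):
                rows.append((paper.ref_id, symbol))
    return rows
```

A report of this kind lets curators work from the papers most likely to yield a first experimental annotation for a gene.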

MGD is supported by NHGRI grant HG-00330. GXD is supported by National Institute of Child Health and Human Development grant HD33745. The GO work at MGI is supported by NHGRI grant HG-02273.


002

Literature curation at the Zebrafish Information Network (ZFIN).

Ken Frazer. Zebrafish Information Network, 5291 University of Oregon, Eugene, OR 97403-5291 USA

The Zebrafish Information Network (ZFIN) serves as a centralized source for the curation and integration of zebrafish genetic, genomic, and developmental data. Information in ZFIN is derived from bulk uploads from expression and mapping research laboratories, collaborations established with other databases, limited user annotation, and from the published literature. Curation of the research literature is the most labor-intensive method of data acquisition, yet it provides a breadth and variety of information that cannot be obtained from the other data collection methods.

Literature curation at ZFIN begins with rigorous searches of online scientific literature resources utilizing a set of search terms to optimize results. Curators extract a variety of data including approved and unapproved gene and mutant nomenclature, orthology information, mapping data, and nucleotide sequence information. Descriptions of mutant phenotypes are written and annotated with keywords supplied by a controlled vocabulary. Relationships between genes and mutants are annotated and recorded in both gene and mutant records. All data entered into ZFIN is attributed to publications that are listed in the gene and mutant records. Gene and mutant nomenclature is approved by the zebrafish nomenclature committee prior to being entered into the database.


003

Literature curation at FlyBase-Cambridge.

Gillian Millburn. FlyBase (Cambridge). Department of Genetics, University of Cambridge, Downing Street, Cambridge, CB2 3EH, UK. gm119@gen.cam.ac.uk

FlyBase-Cambridge's curation responsibilities encompass bibliographic data, links between genes and sequence databases and genetic, mapping, phenotypic and function data. The focus of this talk will be literature curation of genetic, mapping, phenotypic and function data. Topics covered will include how we decide which papers to curate, what objects we curate from each paper, tools we use for curation, data tracking and how the data we curate gets to the public. I will finish with a discussion on how we plan to deal with new controlled vocabularies, new techniques and strategies for coping with the breadth of the Drosophila literature in the most efficient manner.


004

Genomic-sequence-based curation at FlyBase-Harvard

Beverly Matthews, FlyBase-Harvard

FlyBase-Harvard's curation responsibilities encompass 'molecular data,' including annotation of genomic sequence, curation of expression pattern information in wild type and in mutant backgrounds, and curation of molecular interaction data. The focus of this talk will be on strategies for genome annotation and the two software tools that we use. The Apollo tool is used for annotating gene structures based on bulk-loaded EST/cDNA data, BLASTX homology, and gene prediction. JAVASEAN is a tool used for literature-based curation, and is most useful for annotating additional features that can be mapped to the DNA sequence such as mutations, rescue fragments, transposon insertion sites and regulatory elements.


005

Literature curation at the C. elegans Genomic Database (Wormbase).

Wen Chen and Andrei Petcherski for the C. elegans Genomic Database, 1200 E. California Blvd, M/C 156-29, Pasadena, CA 91125, USA

Wormbase is a centralized resource for genetic, genomic, and other biological information pertinent primarily to C. elegans, but increasingly including data about other nematodes. The information in Wormbase comes from the genome sequencing centers, personal communications from researchers, meeting abstracts, and the primary research literature. Currently, there are over 7,000 peer-reviewed publications relevant to C. elegans. For about 5,500 of these, including all the papers for the last year, we have electronic and/or hard copies immediately available.

Our curation strategy is a two-step process. First, we perform a first-pass curation whereby a curator reads a paper and flags all the relevant information. Currently, the information is represented in 28 fields. Then, the specific information is directed to the curators with expertise in a particular information type who enter the data into Wormbase, or is deposited into a cgi database to be dealt with at a later point. The routine curation strategy is periodically supplemented with jamborees to extract specific types of information, like expression patterns, from the literature spanning several years.

In the future, we plan to supplement the human curation efforts with an information extraction system, Textpresso [Hans-Michael Muller, Eimear Kenny, and Paul Sternberg, personal communications], to extract certain types of information. An information retrieval system that will be the underlying backbone for the extraction is already available at http://www.textpresso.org/.


006

Phenotype Curation Effort in Gramene Database

Junjian Ni for Gramene Database, Cornell University, Ithaca NY and CSHL, Cold Spring Harbor, NY

In September 2002, Gramene (www.gramene.org) released a new database module comprised of classically identified phenotypic variants (marker genes) of rice. It is a curated resource providing collective information about publicly available mutant stocks in rice (Oryza sp.). It uses controlled vocabularies (ontologies) to provide descriptions of phenotypic variants and alleles associated with morphological, developmental and agronomically important phenotypes, variants of physiological or morphological characters, biochemical functions and isozymes. The current version of the database houses information on over 400 phenotypic variants curated from the published literature. The main entry point to the genes and alleles database is the "Phenotype search" link on the Gramene website. A researcher can search the database using a gene symbol, gene name or an accession number. The database also provides a full text search of the phenotype descriptions. The curation includes information about the allelic variants, the genetic background in which the alleles were observed, and the environmental conditions in which the alleles were assayed. Links to sequence, map and protein divisions are provided when feasible. We will demonstrate the tools and strategies developed for phenotype curation in Gramene. Current problems and limitations for phenotypic curation work will also be discussed during the meeting.

Funding for Gramene is provided by USDA CREES; Grant Number: 00-52100-9622 and USDA ARS Specific Cooperative Agreement; Grant Number: 58-1907-0-041


007

Literature Curation at the Rat Genome Database (RGD)

Susan K. Bromberg, Rat Genome Database, Bioinformatics Research Center, Medical College of Wisconsin, Milwaukee, WI 53226 USA

The goal of the Rat Genome Database is to collect, consolidate, and integrate rat genetic and genomic data. To do this we curate information on genes, quantitative trait loci (QTLs), microsatellite markers, rat strains, maps, and ESTs obtained from a number of different sources, including the literature, other databases, and research laboratories. While bulk curation provides large amounts of data, literature curation supplies the descriptive information needed to give depth to each record. The results and conclusions presented in papers are reported on the web page in a free-text format and divided into various topics such as gene function, expression, pathway, or strain characteristics. Additional descriptive information comes from manually curated GO terms.

Two types of literature curation efforts are underway at RGD. 1) General curation: newly published journal issues are reviewed for articles containing appropriate information on any RGD object. Articles are prioritized with the idea of adding breadth to the database. 2) Targeted curation: targeted searches are performed on both recent and older articles to select information on a particular topic. This method is followed for curation for the Disease Oriented Research Resource (DORR) project, which integrates disease and genomic information, exploiting the large body of rat physiology and disease research to study human disease.

We have been working on manual and automated methods to identify appropriate articles for both curation efforts and to prioritize the articles selected.


008

Literature and Functional Annotation at TIGR

Roger Smith, TIGR

The Institute for Genomic Research (TIGR) is involved in the annotation of a number of prokaryotic and eukaryotic genomes, and annotation strategies vary across the numerous projects. In Eukaryotic Annotation, specifically the reannotation of the Arabidopsis thaliana genome, a multipronged and systematic approach was undertaken in an effort to produce a high quality, thorough, complete and consistent annotation of the proteome. Novel algorithms and custom software interfaces are used to facilitate the generation of data and a centralized automated data management system (EGC) is utilized. A web-based interface developed at TIGR, MANATEE (available at http://manatee.sourceforge.net/), allows annotators to easily access the computationally derived data and add functional information to gene products. This comprehensive tool provides annotators with the best possible information to curate gene products based on functionally characterized protein matches. Gene information as well as HMM2, Prosite, Pfam, Interpro, BLAST hits, and much more are shown on a single page. Summaries are displayed on the main page and more in-depth information is available through direct links. This information is examined, and high quality functional assignments, such as GO classifications, are made based on associated literature citations. As the sequence of Arabidopsis was completed in late 2000 (The Arabidopsis Genome Initiative, 2000), a comprehensive set of gene models was available, making it possible to computationally cluster the predicted proteome into families. Annotation of individual gene products was then carried out in the context of these families. This method can highlight inconsistencies in the previous annotations and aids in achieving consistency and accuracy in the manual curation of both gene structure and function. Some of the tools and reannotation strategies utilized at TIGR will be discussed in further detail.


009

Literature Curation at the Saccharomyces Genome Database

Chandra Theesfeld, SGD, Stanford, CA.

The Saccharomyces Genome Database (SGD) provides the scientific community with the "official" Saccharomyces cerevisiae sequence, that of the strain S288C, and links to relevant information about the genes annotated to that sequence. In addition, operating under the guidelines agreed upon at national and international meetings of the yeast community, SGD is also the repository for standard yeast gene names. Each gene has a locus page from which all related information is linked, including chromosomal position, gene names and aliases, GO annotations, and literature. Each locus page includes a Literature Guide: a list of papers associated with that gene and organized into broad subject topics (e.g. Function/Process, Protein-Protein Interactions, Protein Structure, etc.), as assigned by SGD curators. Relevant publications are identified by weekly automated scripts that search electronically accessible fields in PubMed (e.g. titles, abstracts, and MeSH terms) for mention of Saccharomyces gene names or gene aliases and the words "Saccharomyces" or "cerevisiae." These papers are immediately associated with the genes and viewable by the users, but are initially listed under the topic, "Uncurated Papers." Subsequently, SGD curators read the abstract of each publication to ensure that papers are appropriately linked to genes, and to sort the paper into one or more of 30 broad subject topics. At that time, curators also note gene names/aliases, update information on the locus pages, and consider using the paper for GO annotations. Because SGD uses dbi-cgi programs for these tasks, the updated information is immediately available to our users. SGD currently contains approximately 22,000 curated papers.
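The matching rule described above (a gene name or alias, plus the word "Saccharomyces" or "cerevisiae", in the searchable text of a citation) can be sketched as follows. This is an illustrative sketch and not the SGD scripts: the function name, the name-to-locus mapping, and the choice of whole-word, case-insensitive matching are all assumptions.

```python
import re


def find_gene_mentions(text, names_to_locus):
    """Return loci whose name or alias occurs in text.

    names_to_locus maps every gene name and alias to its locus (assumed
    layout). Nothing is returned unless the text also mentions
    "Saccharomyces" or "cerevisiae", mirroring the rule in the abstract.
    """
    lowered = text.lower()
    if "saccharomyces" not in lowered and "cerevisiae" not in lowered:
        return set()
    hits = set()
    for name, locus in names_to_locus.items():
        # Whole-token match so that a short name is not found inside a longer word.
        pattern = r"(?<![A-Za-z0-9])" + re.escape(name) + r"(?![A-Za-z0-9])"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.add(locus)
    return hits
```

Papers matched this way would start under "Uncurated Papers" until a curator confirms the link, since a name match alone can be a false positive.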


010

Literature Curation at TAIR

Tanya Berardini for The Arabidopsis Information Resource (TAIR)

Literature curation at TAIR has been, and continues to be, an interesting and substantial challenge. Though TAIR is only three years old, the Arabidopsis literature has been increasing by leaps and bounds since 1990. This means that we have about 15000 "old" papers to curate, in addition to the continuous stream of newly published literature. Because of this large number, we have taken a gene-centric rather than paper-centric approach to literature curation.

Our efforts have been greatly facilitated by PubSearch, our literature curation software. Matches or hits are automatically generated between papers and genes and these hits are subsequently validated by a curator. The hit summary page also contains information on other non-gene terms (like GO terms, anatomy and developmental stage terms) that occur in the same title/abstract. What do we extract from the papers? Currently: Sequence information, aliases, GO annotation, expression patterns. In limited application: genetic interactions. In testing stages: alleles and phenotypes. In addition, we compose a description/summary of salient features about the gene/gene product for the gene detail page. A large part of our effort is expended in tracking down multiple names for the same gene and the same name for multiple genes.
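The naming problem mentioned above (several names for one gene, and one name shared by several genes) can be made concrete with two inverted indexes. This is a minimal sketch under assumed inputs, not part of PubSearch: the (locus, name) pair layout, the function names, and the case-insensitive comparison are hypothetical.

```python
from collections import defaultdict


def build_indexes(pairs):
    """Build locus -> names and name -> loci indexes from (locus, name) pairs."""
    names_by_locus = defaultdict(set)
    loci_by_name = defaultdict(set)
    for locus, name in pairs:
        key = name.strip().upper()  # assumption: names compare case-insensitively
        names_by_locus[locus].add(key)
        loci_by_name[key].add(locus)
    return names_by_locus, loci_by_name


def multi_named_loci(names_by_locus):
    """Loci known under more than one name (synonyms to reconcile)."""
    return {locus: sorted(names) for locus, names in names_by_locus.items() if len(names) > 1}


def ambiguous_names(loci_by_name):
    """Names attached to more than one locus (hits a curator must disambiguate)."""
    return {name: sorted(loci) for name, loci in loci_by_name.items() if len(loci) > 1}
```

Automatically generated paper-to-gene hits on an ambiguous name are the ones most in need of curator validation.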


011

Literature Curation at the MetaCyc Metabolic Pathway Database

Cindy Krieger, Bioinformatics Research Group, SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025

The MetaCyc database (MetaCyc) is a collection of metabolic pathways in a variety of microorganisms and plants. The goal of MetaCyc is to contain well-established pathways that have been experimentally elucidated and reported in the primary scientific literature. MetaCyc contains qualitative information on metabolic pathways, enzymatic reactions, enzymes, and chemical compounds as well as literature citations. MetaCyc is primarily manually curated from the scientific literature. However, some data is obtained by download, such as reactions from the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) and chemicals from the National Cancer Institute (NCI) chemical database.

This talk will summarize the curation procedures for MetaCyc, including literature search and review strategies, the type of data curated, and curation priorities.


012

Literature Curation at Maize[G]DB

Collaborators: Baran Sanford (Boulder, CO), Trent Seigfried (Ames, IA), Volker Brendel (Ames, IA)

Maize genetics literature curation has evolved over the past 76 years from the listing of literature for maize genetics and the 'zealand' extractions in the Maize Genetics Cooperation Newsletter, towards direct entry into a relational database (1991). Data entry has been supported by a curation tool that automatically checks and informs about nomenclature and synonyms, and also allows lookups based on other attributes, for example, GenBank accessions or Enzyme Commission (EC) numbers. The tool also posts data entries and updates to an audit trail table, along with the date. The main point of literature curation is to integrate in a systematic manner all hypothesis-driven and experimentally confirmed data that contribute to understanding the maize genome. Information extracted from the literature typically includes any new genes, alleles, inherited traits and phenotypes, map information, and gene function and expression information, including induction conditions, pathways, proteins, and motifs; links to major repositories are another aspect of literature curation. The curatorial staff is highly professional, minimally at the postdoctoral level, with genetics and molecular biology training, preferably in maize. During literature curation, curators consult authors regarding nomenclature issues, clarification of clone library details, and access to other available data not provided in the manuscript. All references are pre-loaded into the database electronically, with authors matched to persons already in the database, and hard copies of publications are acquired and maintained. The ongoing transition to MaizeGDB (Sybase->Oracle), while requiring a new curation interface, will at the same time result in enhanced biological integrity of the data.
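The data-entry behaviour described above (a synonym check before entry, and a dated audit trail row for each entry or update) can be sketched with an in-memory SQLite database standing in for the relational database. This is an illustration only: the table names, column names, and function names are hypothetical and do not reflect the actual MaizeDB schema or curation tool.

```python
import sqlite3
from datetime import date


def open_db():
    """Create a throwaway database with assumed gene, synonym and audit tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene (symbol TEXT PRIMARY KEY);
        CREATE TABLE synonym (alias TEXT PRIMARY KEY, symbol TEXT NOT NULL);
        CREATE TABLE audit_trail (action TEXT, symbol TEXT, entered_on TEXT);
        """
    )
    return conn


def enter_gene(conn, symbol, today=None):
    """Enter a gene symbol, warning on synonyms and logging to the audit trail."""
    today = today or date.today().isoformat()
    row = conn.execute(
        "SELECT symbol FROM synonym WHERE alias = ?", (symbol,)
    ).fetchone()
    if row:
        # Inform the curator instead of creating a duplicate record.
        return f"'{symbol}' is a synonym of '{row[0]}'"
    exists = conn.execute(
        "SELECT 1 FROM gene WHERE symbol = ?", (symbol,)
    ).fetchone()
    action = "update" if exists else "insert"
    if not exists:
        conn.execute("INSERT INTO gene (symbol) VALUES (?)", (symbol,))
    conn.execute(
        "INSERT INTO audit_trail (action, symbol, entered_on) VALUES (?, ?, ?)",
        (action, symbol, today),
    )
    conn.commit()
    return action
```

Keeping the audit row in the same transaction as the data change means the trail cannot drift out of step with the curated records.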