Curators' Meeting

Rat Genome Database,

Medical College of Wisconsin,

Milwaukee, Wisconsin.

Oct. 27th-28th, 2003

 

An Overview of Eukaryotic Annotation at TIGR

Roger Smith

 

• At TIGR, prokaryote and eukaryote curation share the same pipeline and tools

• Curation comprises sequence curation, computational analysis, and manual curation

• Database updates are done consistently

• EGC (eukaryotic genome control pipeline) identifies the areas needing update by predicting and processing the genomic sequence and splice variants

• Curators enter the initial information about homology, cellular localization, and proteins; if the paper has to be curated in depth, it then goes in for long-term curation

• Use an in-house web-based interface called Manatee for curating gene products, alignments, Pfam domains, etc.; it has links to SwissProt, sequence, PubMed, Prosite, and paralogous families

• Manatee is also linked to Annotation Station, which is used for manual inspection of gene structure

 

 

Organization and presentation of biological information in the Saccharomyces Genome Database

Maria C. Costanzo

 

• Clear and logical navigation paths and a consistent format throughout the database

• Each report page has consistent, centralized data/links and a few simple pictures

• Automatic literature loading from PubMed using MeSH terms; curators read all the published abstracts, and only about one quarter of the papers are read in full

• Constant updates as the literature is curated, which also leads to the addition/revision of descriptions and GO terms

• Have sequence tools where related protein info can be filled in

• Summary paragraphs give a detailed description of the locus

• A phenotype ontology is being developed

 

 

GeneDB: A Prokaryotic and Eukaryotic Genome Resource

Christiane Hertz-Fowler

 

• Has a joint sequencing project with TIGR; uses similar analyses

• Database of 16 organisms with finished and ongoing genome projects

• Sequence annotation curation integrates gene predictions, proteins, and functions

• Each organism has its own home page with basic information, location, curated annotation, and predicted peptide properties

• Uses an in-house analysis tool

• One curator per organism; annotations are curated in text format by integrating information from literature, public databases, and community feedback

• GO annotations, life cycle, enzymes, proteins, DNA, and predicted orthologs are curated

 

 

RegulonDB: Curation, Literature Search, Notation and Evidences about Transcriptional Regulation and Transcription Unit Organization in E. coli K-12

Gama-Castro S.,

 

• Focus on genes, products, promoters, terminators, transcription units, regulatory proteins, effectors (small molecules that bind to proteins), etc.

• Predictions are made regarding regulatory interactions and promoters

• Data are validated in the annotation forms and checked

• Curated information will be current in 2005

 

 

Integration of New Data into RGD: Quality Control and Data Submission Tools

Dean Pasko

 

• RGD has to integrate large datasets using informatic methods to incorporate data efficiently

• A bulk data pipeline was developed to handle the complexity of incoming data, gathering information on rat genes, QTLs, SSLPs, strains, traits, etc.

• As data go through the pipeline, output files/flags are generated. Conflicts mainly relate to nomenclature, sequence, aliases, and other attributes; these are reviewed and resolved by the curators

• Submission forms are used to capture data from ongoing literature curation. QC checks are done in the submission forms and again when the data go through the pipeline

• Updates/additions to annotations are done through notes

• Website is updated every two weeks
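The conflict flagging described above can be pictured with a small sketch. This is only an illustration under stated assumptions: the record fields (`symbol`, `aliases`, `sequence_id`), the `find_conflicts` helper, and the sample identifiers are hypothetical and are not RGD's actual pipeline or schema.

```python
from typing import Dict, List

# Hypothetical attribute names; the real RGD pipeline checks its own set of fields.
CHECKED_FIELDS = ("symbol", "aliases", "sequence_id")


def find_conflicts(existing: Dict[str, dict], incoming: List[dict]) -> List[dict]:
    """Return one flag per attribute where an incoming record disagrees
    with the record already stored under the same ID."""
    flags = []
    for record in incoming:
        current = existing.get(record["id"])
        if current is None:
            # New object: nothing to conflict with, so it can load directly.
            continue
        for field in CHECKED_FIELDS:
            if field in record and record[field] != current.get(field):
                flags.append({
                    "id": record["id"],
                    "field": field,
                    "stored": current.get(field),
                    "incoming": record[field],
                })
    return flags


# Made-up example data, for illustration only.
existing = {"GENE:1": {"symbol": "Abc1", "aliases": ["Abc"], "sequence_id": "SEQ0001"}}
incoming = [
    {"id": "GENE:1", "symbol": "Abc1a", "aliases": ["Abc"], "sequence_id": "SEQ0001"},
    {"id": "GENE:2", "symbol": "Xyz2", "aliases": [], "sequence_id": "SEQ0002"},
]

flags = find_conflicts(existing, incoming)
for flag in flags:
    print(flag)
```

In this toy run only the changed symbol is flagged; the flagged entries are what a curator would review and resolve.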

 

 

Map Curation on GrainGenes

Victoria Carollo

 

• Database has molecular and phenotypic information on wheat, barley, rye, and oats

• Curators do not use any specific tools

• Started from 10,000 unigenes; deletions are now being done on them

• Maps and mapping data are linked to the locus page

• Interactive maps are linked to the GrainGenes database

• Most info is on the probe page, which also links to external databases

• Curators contact authors to acquire raw data and extra info

• wEST SQL is an in-house database

• Barley bin maps are divided into 10 cM bins

• Users can upload their own data into it.
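As a small worked example of 10 cM binning: a marker's bin is the integer part of its map position divided by the bin width. The function name and the marker positions below are made up for illustration; how GrainGenes numbers or labels its bins is not stated in these notes.

```python
BIN_WIDTH_CM = 10.0  # bin width taken from the notes above


def bin_index(position_cm: float, bin_width: float = BIN_WIDTH_CM) -> int:
    """Return the zero-based bin that a map position (in cM) falls into."""
    if position_cm < 0:
        raise ValueError("map position must be non-negative")
    return int(position_cm // bin_width)


# Hypothetical marker positions in cM.
markers = {"markerA": 3.2, "markerB": 10.0, "markerC": 47.9}
bins = {name: bin_index(pos) for name, pos in markers.items()}
print(bins)
```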

 

 

Comparative Map Curation in Gramene Using CMap

Immanuel Yap 

 

• Gramene database covers rice, wheat, maize, barley, and rye

• Database can be searched for specific info

• Gramene has 4 defined CMap displays

• CMap is a tool set up for comparative maps; it includes the Cornell map, the Japanese rice map, and other physical maps

• Each map has unique features and belongs to a single map set

• A correspondence can be created between map sets and types

• Users can change the color, width, and relation in maps

• Plans to develop multiple feature aliases and generic attributes for all objects and types

 

 

Sequence Curation in dictyBase

P. Fey,

 

• Schema and locus pages are based on the SGD layout

• Sequence curation is done to add additional tracks to the genome browser; contradictory data are represented on a separate track

• Coordinates are given to the sequence

• A curator page is used to collect info; each page gets a new feature number

• Curators work directly on the website

• Changes that are made are tracked

 

 

Apollo: a genome annotation tool

Lynn Crosby

 

• A tool used to annotate genomic sequence

• Comprises two parts: a viewer and an editor

• Allows users to view large amounts of data effectively and quickly

• Annotations can be done at various levels

• Data types are color-coded

• All data go through the alignment program

• Has all EST data from the genome project

• Tool is still under development

 

 

Clustering MeSH Representations of Medical Literature

Craig Struble  

 

• A collection of abstracts from medical publications was taken and clustered

• Many approaches are used for document representation

• Descriptors and Qualifiers are used for clustering

• Two clear clusters were observed among the chosen papers: sequence-related and non-sequence-related

• These can be evaluated with the difference distance method

• The LSI/SVD method can be used to separate these papers

• The criteria used for clustering can be refined to represent different levels of results

• Classification can be based on different levels of the MeSH hierarchy
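One simple way to compare documents by their MeSH indexing is to treat each abstract as a bag of descriptors and measure cosine similarity. The sketch below is only an illustration of that idea; it is not the distance measure or the LSI/SVD procedure used in the talk, and the document IDs and descriptor assignments are invented.

```python
import math
from collections import Counter
from typing import Iterable


def cosine_similarity(a: Iterable[str], b: Iterable[str]) -> float:
    """Cosine similarity between two bags of MeSH descriptors."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Invented descriptor sets; real records carry many more terms plus qualifiers.
docs = {
    "doc1": ["Base Sequence", "Molecular Sequence Data", "Rats"],
    "doc2": ["Base Sequence", "Molecular Sequence Data", "Cloning, Molecular"],
    "doc3": ["Hypertension", "Blood Pressure", "Rats"],
}

sim_12 = cosine_similarity(docs["doc1"], docs["doc2"])
sim_13 = cosine_similarity(docs["doc1"], docs["doc3"])
print(round(sim_12, 3), round(sim_13, 3))
```

In this toy data doc1 is closer to doc2 than to doc3, which is the kind of separation a sequence-related versus non-sequence-related split reflects.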

 

 

Textpresso: An Information Retrieval and Extraction System for C. elegans Literature

Wen Chen (for Eimear Kenny)

 

System Specifications

Queries

Return

Target Users

 

Article classification -> keyword search -> query -> batch retrieval

 

Biological entities: "plugin dictionaries", specific

 

Actions, facts or circumstances that relate two entities: "common sense", partially generic

 

Auxiliary: generic

 

Text extraction pattern:  <gene><bracket><allele><bracket>
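
A pattern such as <gene><bracket><allele><bracket> can be approximated with a regular expression. The sketch below assumes the usual C. elegans naming style (gene names like `unc-22`, alleles like `e66`); the regular expression is an illustrative assumption and is not Textpresso's actual implementation, which relies on its category dictionaries.

```python
import re

# Assumed shapes: gene = 3-4 lowercase letters, hyphen, digits (e.g. unc-22);
# allele = 1-3 lowercase letters followed by digits (e.g. e66).
GENE_ALLELE = re.compile(r"\b([a-z]{3,4}-\d+)\(([a-z]{1,3}\d+)\)")


def extract_gene_alleles(text: str) -> list:
    """Return (gene, allele) pairs matching <gene>(<allele>)."""
    return GENE_ALLELE.findall(text)


sentence = "Both unc-22(e66) and lin-12(n137) animals were examined."
pairs = extract_gene_alleles(sentence)
print(pairs)
```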

 

Future work:

  1. Increase recall and precision
       - Anaphora resolution (5%-8%)
       - Synonym recognition
       - Searching in sub-sections of the paper (i.e. methods, results, etc.)
  2. Develop Textpresso Ontology
       - Integrating open source ontologies (MeSH, UMLS)
       - Pilot study of other MODs (currently SGD)
  3. Package and release software
  4. Develop Fact Extraction (next target: gene-gene interaction)

 

 

PubFetch: Collecting literature from multiple data sources

Vijay Narayanasamy

 

PubFetch

o         Interface between literature curation tools and online literature databases such as PubMed, Agricola, and Biosis

o         Returns data in PubMed MEDLINE format

o         Filters duplicates

o         Generic searching and retrieval of literature data from online literature data sources

 

How PubFetch works

  1. Search the LitDb for articles matching the query criteria (e.g. keywords, date, author, etc.) and retrieve the set of accession numbers (e.g. PMIDs) for matching references
  2. Retrieve the articles corresponding to the given accession numbers from the LitDb (e.g. "bring me the PubMed article for PMID 12345678")
  3. Articles are returned in PubMed-MEDLINE Display Format
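
The search-then-retrieve flow above, together with duplicate filtering, can be sketched as follows. Everything here is hypothetical: the `LitDb` class, its `search` and `fetch` methods, the duplicate key, and the sample records are invented for illustration and do not reflect PubFetch's real interface or the BioMOBY service definitions.

```python
from typing import Dict, List


class LitDb:
    """Stand-in for one online literature source (hypothetical)."""

    def __init__(self, name: str, records: Dict[str, dict]):
        self.name = name
        self.records = records  # accession number -> article fields

    def search(self, keyword: str) -> List[str]:
        """Step 1: return accession numbers of articles matching the query."""
        kw = keyword.lower()
        return [acc for acc, rec in self.records.items() if kw in rec["title"].lower()]

    def fetch(self, accessions: List[str]) -> List[dict]:
        """Step 2: return the articles for the given accession numbers."""
        return [dict(self.records[acc], accession=acc, source=self.name)
                for acc in accessions if acc in self.records]


def collect(sources: List[LitDb], keyword: str) -> List[dict]:
    """Query every source, then drop duplicates of the same article."""
    seen, unique = set(), []
    for db in sources:
        for article in db.fetch(db.search(keyword)):
            # Assumed duplicate key: normalized title plus year.
            key = (article["title"].strip().lower(), article["year"])
            if key not in seen:
                seen.add(key)
                unique.append(article)
    return unique


# Invented records; the same paper appears in both sources.
source_a = LitDb("SourceA", {"A1": {"title": "Rat QTL mapping", "year": 2003}})
source_b = LitDb("SourceB", {
    "B7": {"title": "rat QTL mapping ", "year": 2003},
    "B8": {"title": "Rat strain survey", "year": 2002},
})

results = collect([source_a, source_b], "rat")
print([(a["source"], a["accession"]) for a in results])
```

Keeping search and fetch as separate steps mirrors the accession-number handoff in the list above; converting the results to a common display format would be a further step not shown here.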

 

Core functionalities available as web services, following the BioMOBY service model, language independent

 

BioMOBY

A system through which a client interacts with multiple sources of biological data regardless of the underlying format or schema

 

RGD BioMOBY Services (in progress)

 

 

BioCreAtIvE: Critical Assessment of Information Extraction

Marc Colosimo (Mitre)

 

Goal: common benchmarks for the performance of natural language processing systems working on biomedical research literature

 

Task 1: entity extraction

  - assessing the ability of an automated system to identify the genes

 

Task 2: functional annotation of gene products

  - assignment of GO annotations to human proteins

 

Problems and limitations

  1. Text Source: Abstracts vs Full Article
  2. Different databases - different focus
  3. The gene nomenclature is constantly changing

 

 

Curatorial procedures at Mouse Genome Informatics, with an emphasis on expression data

Constance M. Smith

 

Gene Expression Database (GXD)

o        Obtained via manual curation

o        Embryonic expression data: where, when and what

o        Assay types:

  1. In situ:  RNA in situ, immunohistochemistry, reporter gene (knock in)
  2. Gels/blots:  RT-PCR, Northerns, Westerns, RNase protection, Nuclease S1
  3. cDNA source data

 

Emphasis on images: allowing users to analyze primary data

 

Allele details: description of phenotype using controlled vocabulary

 

 

Gene Expression Curation in WormBase

Wen J. Chen

 

Gene expression data

  1. Anatomical and temporal expression analysis
       - Reporter gene analysis (GFP, LacZ, ...)
       - Antibody staining
       - In situ hybridization
       - Northern, Western, RT-PCR on staged animals
  2. Microarray/SAGE
  3. Gene regulation
       - Gene expression in mutant/RNAi background
       - Expression influenced by temperature and chemicals

 

WormBase Literature Curation:

  1. First-Pass Curation -> Second Pass Curation
  2. Jamboree or Textpresso

 

Ontology

  1. Temporal
       - Developmental Life Stage, 69 terms
       - All gene expression curation, including expression pattern, microarray and gene regulation
  2. Anatomical
       - Anatomy, ~5,000 terms
       - Future curation on expression pattern and gene regulation
       - Updating old expression and gene regulation data

 

 

Biological Interaction Curation In FlyBase

Chihiro Yamada

 

Phenotype Curation

1.       Allele level

             Sequence variants; transgene constructs; RNAi experiments

             Annotated with: Phenotypic class CV; Bodypart CV; Free text

2.       Gene level interaction

Linked to allele-level interaction statements or based on author statements

3.       GO IPI and IGI

246 genes with IGI lines, 126 genes with IPI lines; 458 IGI and 228 IPI lines in total

4.       Molecular interactions

(a) 2 types of entity:

Objects (~550 so far, e.g. Notch intracellular domain)

Events (~500 so far, e.g. "Notchless binds to the Notch intracellular domain")

- Protein/protein

- Protein/DNA

- Protein/RNA

(b) Curated from literature

(c) Objects mapped to the genome

(d) Annotated with CVs

(e) Location/stage/GO

(f) Plans to incorporate large scale datasets: e.g. 2-Hybrid screens

 

We need to warn computational biologists:

  1. Curation methods are necessarily based on experimentation
  2.  Interpretation of these data must be based on an understanding of the limitations of experimental data

 

 

Mutant Manifests: toward a zebrafish phenotype ontology

David Fashena

 

Goal: to facilitate cross-species analysis of gene function in embryonic development

 

We want to annotate mutant phenotypes to understand not just whether a gene affects an anatomical structure, but how different mutations affect specific aspects of a structure. The sorts of cross-species queries we would like to facilitate include:

 

o       Mutations in which orthologous genes result in vertebral duplications?

o       Mutations in which genes result in over-production of spinal motoneurons?

o       What is the frequency of mutations with malformed inner ears that also show malformed kidneys?

o       What frequency of mutants with increased life-spans also have decreased heart-rates?

 

Chosen format: PATO (Phenotype and Trait Ontology)

 

Represent any phenotype as:

        ENTITY --has a-- ATTRIBUTE --has a-- VALUE

 

Entity can be an anatomical structure, process, or GO term.

Attributes and values come from a species-independent controlled vocabulary.

 

Difficulties and challenges:

 

1.      Values of structural attributes not in PATO:

"sickle-cell like", "resembling a raw prawn"

2.      Translation of complex text descriptions into concise PATO format

e.g. "retinal axons grow halfway to their normal target in the optic tectum, then turn and make ectopic synapses in the epiphysis"

3.      Whether to annotate a structure or the processes that form a structure

e.g. "disorganized fin stripes"

4.      Uniformity of curation
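
The ENTITY/ATTRIBUTE/VALUE form described above can be pictured as a simple record. The sketch below is an illustration only: the class name, the plain-text labels standing in for ontology terms, and the example annotations are assumptions, and real annotations would use term IDs from the relevant ontologies.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PhenotypeAnnotation:
    """One phenotype statement: ENTITY --has a-- ATTRIBUTE --has a-- VALUE."""
    entity: str      # anatomical structure, process, or GO term
    attribute: str   # species-independent attribute
    value: str       # species-independent value

    def as_statement(self) -> str:
        return f"{self.entity} --has a-- {self.attribute} --has a-- {self.value}"


def with_entity(annotations: List[PhenotypeAnnotation], entity: str) -> List[PhenotypeAnnotation]:
    """Select annotations about one entity, e.g. to compare across species."""
    return [a for a in annotations if a.entity == entity]


# Hypothetical annotations written with plain labels instead of ontology IDs.
annotations = [
    PhenotypeAnnotation("spinal motoneuron", "number", "increased"),
    PhenotypeAnnotation("heart", "rate", "decreased"),
]

hits = with_entity(annotations, "heart")
print(hits[0].as_statement())
```

Free-text values such as "resembling a raw prawn" have no slot in a record like this, which is one way to see the first difficulty listed above.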

 

 

Community Curation at MaizeGDB

Carolyn J. Lawrence

 

Why community curators?

  1. Team consists of only four full-time members.
  2. The maize community is a tight-knit group of researchers who go out of their way to work together (i.e., community curation COULD work with this group).

 

Data entered by community members

  1. Areas lacking content can be more carefully curated by MaizeGDB
  2. If data are lacking in some area, will ask the researchers to nominate an individual to make data become available

 

 

Community Interactions: Feedback, Support and Curation

Eva Huala, TAIR

 

Community Feedback by email

Uses the software Jitterbug to track user emails. Typical emails:

  1. want data
  2. don't understand this tool/data
  3. fix/improve tool
  4. update annotation
  5. can't find info
  6. Login/registration questions

 

Community Data Submission

  1. Gene Family
  2. Gene class symbol
  3. Person/Lab info
  4. Seed and DNA stock info (through ABRC)
  5. Microarray data
  6. Gene structure/function
  7. Marker data
  8. Functional genomics projects

 

Gene family submission was a big success.

 

Community Outreach

  1. Education and Outreach page
  2. Workshops

 

TAIR attends most international meetings and has been hosting workshops.

 

 

General Discussions

Future Meetings

L. Stein proposed a joint GMOD/curator meeting: a GMOD meeting with sessions for bioinformatics, databases, curation, and ontologies.

 

Small group discussions for BOFs

 

Time for hands-on demos of software tools

                  Tools transferable to different databases - generic

                  Get other tools dev. earlier in development

                  Curator/developer/bioinformatics communication

 

Example of tools on GMOD - sample implementation of tools, etc.

 

Best time to hold meeting?

 

How often?

 

Make better use of the biocurator mailing list for communication - biocurator.org

 

Sample comments from grant review boards

(1)          Balance between stability and speed of updating the database (WormBase)

(2)          More understanding of the significance of GO annotation (MGI)

(3)          Balance between too much and not enough information on the home page (TAIR)

(4)          Uniform colors on the home page (TAIR)

(5)          Obtain user feedback by watching a user using the web page (FlyBase)

(6)          Pictures of mutants are very desirable (FlyBase, WormBase, Gramene)

(7)          Community definition:

Everyone registered (WormBase), obtained from the literature because they have all papers and use the author names. Have full-time staff to keep track. Get registration info from meetings.

 

How to get better community involvement in curation?

 

(1)   Journal requirement for species names on articles, standardized submission of gene names, database accession IDs, strain IDs, and ontology terms and IDs

(2)   Species and sequence IDs can be put in the "Materials and Methods" section of a paper. Have links to database sites

(3)   Joint letter to journals (Nature yes, Dev Biol encourage) requesting the species name in the title, abstract, or both; joint publication describing nomenclature, curation, and the benefits of working with MODs

(4)   Switch in emphasis from ontologies to free text (WormBase)

(5)   Second-party (curator) submission of unpublished sequence information

(6)   Encourage users to submit their cDNA sequences to GenBank, even if it's a repeat of something that's already there (when the sequence is already on a BAC, people don't resubmit)