Curators' Meeting
Rat Genome Database,
Medical College of Wisconsin,
Milwaukee, Wisconsin.
Oct. 27th-28th, 2003
An Overview of Eukaryotic Annotation at TIGR
Roger Smith
o At TIGR, prokaryote and eukaryote curation share the same pipeline and tools
o Curation comprises sequence curation, computational analysis and manual curation
o Database updates are done consistently
o EGC (eukaryotic genome control pipeline) identifies the areas to update by predicting and processing the genomic sequence and splice variants
o Curators enter the initial information about homology, cellular localization, and proteins; if the paper has to be curated in depth, it then goes in for long-term curation
o Curators use an in-house web-based interface called Manatee for curating gene products, alignments, Pfam domains, etc.; it has links to SwissProt, sequence, PubMed, Prosite, and paralogous families
o Manatee is also linked to Annotation Station, which is used for manual inspection of the gene structure
Organization and presentation of biological information in the Saccharomyces Genome Database
Maria C. Costanzo
o Clear and logical navigation paths and a consistent format throughout the database
o Each report page has consistent, centralized data/links and a few simple pictures
o Automatic literature loading from PubMed using MeSH terms; curators read all the published abstracts, and only about one quarter of the papers are read in full
o Constant updates as the literature is curated, which also leads to the addition/revision of descriptions and GO terms
o Sequence tools are available where related protein info can be filled in
o Summary paragraphs give a detailed description of the locus
o A phenotype ontology is being developed
GeneDB: A Prokaryotic and Eukaryotic Genome Resource
Christiane Hertz-Fowler
o Has a joint sequencing project with TIGR; uses similar analyses
o Database of 16 organisms with finished and ongoing genome projects
o Sequence annotation curation integrates gene predictions, proteins and functions
o Each organism has its own home page with the basic information, location, curated annotation and predicted peptide properties
o Uses an in-house analysis tool
o One curator per organism; annotations are curated in text format by integrating information from literature, public databases, and community feedback
o GO annotations, life cycle, enzymes, proteins, DNA and predicted orthologs are curated
RegulonDB: Curation, Literature Search, Notation and Evidences about Transcriptional Regulation and Transcription Unit Organization in E. coli K-12
Gama-Castro S.,
o Focus on genes, products, promoters, terminators, transcription units, regulatory proteins, effectors (small molecules which bind to proteins), etc.
o Predictions are made regarding regulatory interactions and promoters
o Data is validated in the annotation forms and is checked
o Curated information will be current in 2005
Integration of New Data into RGD: Quality Control and Data Submission Tools
Dean Pasko
o RGD has to integrate large datasets by informatic methods to incorporate data efficiently
o A bulk data pipeline was developed to handle the complexity of incoming data, gathering information on rat genes, QTLs, SSLPs, strains, traits, etc.
o When the data goes through the pipeline, output files/flags are generated. Conflicts are mainly related to nomenclature, sequence, alias and other attributes, and are reviewed and resolved by the curators
o Submission forms are used to get data from ongoing literature curation. QC checks are done in the submission forms and again when the data goes through the pipeline
o Updates/additions to annotations are done through notes
o The website is updated every two weeks
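The conflict-flagging step of the bulk data pipeline might look roughly like the sketch below. It is only an illustration: the record layout, the field names (`symbol`, `sequence_id`, `aliases`) and the function `flag_conflicts` are assumptions chosen to mirror the conflict types mentioned in the talk, not RGD's actual schema or code.

```python
from dataclasses import dataclass

# Hypothetical attribute names, mirroring the conflict types in the notes
# (nomenclature, sequence, alias).
CHECKED_FIELDS = ("symbol", "sequence_id", "aliases")


@dataclass
class Flag:
    """One disagreement between a stored record and an incoming one."""
    record_id: str
    field: str
    stored: object
    incoming: object


def flag_conflicts(stored, incoming):
    """Return one Flag per checked field whose values disagree.

    Both arguments map a record ID to a dict of attributes. Records that
    are new (absent from `stored`) produce no flags.
    """
    flags = []
    for record_id, new_attrs in incoming.items():
        old_attrs = stored.get(record_id)
        if old_attrs is None:
            continue
        for field in CHECKED_FIELDS:
            if field not in old_attrs or field not in new_attrs:
                continue
            if old_attrs[field] != new_attrs[field]:
                flags.append(Flag(record_id, field, old_attrs[field], new_attrs[field]))
    return flags


# Made-up records for illustration only.
stored = {"G1": {"symbol": "Abc1", "sequence_id": "S100", "aliases": {"abc"}}}
incoming = {
    "G1": {"symbol": "Abc1a", "sequence_id": "S100", "aliases": {"abc"}},
    "G2": {"symbol": "Xyz2", "sequence_id": "S200", "aliases": set()},
}
for flag in flag_conflicts(stored, incoming):
    print(flag.record_id, flag.field, flag.stored, "->", flag.incoming)
```

In this sketch the flags are simply returned as a list; the talk describes them as output files that curators review and resolve by hand.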
Map Curation on GrainGenes
Victoria Carollo
o Database has molecular and phenotypic information on wheat, barley, rye and oats
o Curators do not use any specific tools
o Started from 10,000 Unigenes; now doing deletions on them
o Maps and mapping data are linked to the locus page
o Interactive maps are linked to the GrainGenes database
o Most info is on the probe page, including links to external databases
o Curators contact authors to acquire raw data and extra info
o wEST SQL is an in-house database
o Barley bin maps are divided into 10 cM bins
o Users can upload their own data into it.
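As a small illustration of the 10 cM binning mentioned for the barley bin maps, the sketch below assigns markers to bins by map position. The marker names, the zero-based bin numbering and the rule that a marker sitting exactly on a boundary falls into the higher bin are all assumptions; the notes do not say how GrainGenes handles these cases.

```python
from collections import defaultdict

BIN_WIDTH_CM = 10.0  # bin width stated in the notes


def bin_index(position_cm, bin_width=BIN_WIDTH_CM):
    """Return the zero-based bin for a map position given in cM."""
    if position_cm < 0:
        raise ValueError("map positions must be non-negative")
    return int(position_cm // bin_width)


def bin_markers(markers):
    """Group marker names by bin; `markers` maps name -> position in cM."""
    bins = defaultdict(list)
    for name, position in markers.items():
        bins[bin_index(position)].append(name)
    return {index: sorted(names) for index, names in sorted(bins.items())}


# Hypothetical markers on one chromosome.
markers = {"mk1": 3.2, "mk2": 9.9, "mk3": 10.0, "mk4": 47.5}
print(bin_markers(markers))
```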
Comparative Map Curation in Gramene Using CMap
Immanuel Yap
o The Gramene database covers rice, wheat, maize, barley and rye
o The database can be searched for specific info
o Gramene has 4 defined CMap displays
o CMap is a tool set up for comparative maps, which comprise the Cornell map, the Japanese rice map and other physical maps.
o Each map has unique features and is unique to a map set
o A correspondence can be created between map sets and types
o Users can change the color, width and relation in maps
o Plans to develop multiple feature aliases and generic attributes for all objects and types
Sequence Curation in dictyBase
P. Fey,
o Schema and the locus pages are based on the SGD layout
o Sequence curation is done to add additional tracks to the genome browser; contradictory data is represented on a separate track
o Coordinates are given to the sequence
o A curator page is used to collect info; each page gets a new feature number
o Curators work directly on the website
o The changes that are made are tracked
Apollo: a genome annotation tool
Lynn Crosby
o A tool used to annotate the genomic sequence
o Comprises two types of annotation: viewer and editor
o Allows users to view large amounts of data effectively and quickly
o Annotations can be done on various levels
o Data types are color-coded
o All data goes through the alignment program
o Has all EST data from the genome project
o Tool still under development
Clustering MeSH Representations of Medical Literature
Craig Struble
o A collection of abstracts from medical-related publications was taken and clustered
o Many approaches are used for document representation
o Descriptors and Qualifiers are used for clustering
o Two clear clusters were observed among the chosen papers: sequence related and non-sequence related
o These can be evaluated with the difference distance method
o The LSI/SVD method can be used to separate these papers
o The criteria used for clustering can be refined to represent different levels of results
o Classification can be based on different levels of the MeSH hierarchy
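To make the idea of clustering papers by their MeSH descriptors concrete, here is a minimal sketch. It does not reproduce the speaker's method: it swaps in Jaccard distance between descriptor sets and a simple single-linkage threshold grouping, instead of the difference distance or LSI/SVD approaches named in the talk. The paper IDs, descriptor assignments and the 0.6 threshold are made up for illustration.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of MeSH descriptors."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def threshold_clusters(docs, max_distance):
    """Single-linkage grouping: a paper joins every cluster that contains
    at least one paper within `max_distance`, merging those clusters."""
    clusters = []
    for doc_id, terms in docs.items():
        linked = [
            cluster for cluster in clusters
            if any(jaccard_distance(terms, docs[other]) <= max_distance
                   for other in cluster)
        ]
        merged = [doc_id]
        for cluster in linked:
            merged.extend(cluster)
            clusters.remove(cluster)
        clusters.append(merged)
    return [sorted(cluster) for cluster in clusters]


# Hypothetical papers, each represented by a set of MeSH descriptors.
docs = {
    "p1": {"Base Sequence", "Molecular Sequence Data", "Rats"},
    "p2": {"Base Sequence", "Molecular Sequence Data", "Mice"},
    "p3": {"Hypertension", "Blood Pressure", "Rats"},
    "p4": {"Hypertension", "Blood Pressure", "Humans"},
}
print(threshold_clusters(docs, max_distance=0.6))
```

With these toy inputs the sequence-related papers (p1, p2) and the non-sequence-related papers (p3, p4) end up in separate groups, because the only descriptor shared across the groups ("Rats") is not enough to bring p1 and p3 within the threshold.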
Textpresso: An Information Retrieval and Extraction System for C. elegans Literature
Wen Chen (for Eimear Kenny)
System Specifications
Queries
Return
Target Users
Article classification -> keyword search -> query -> batch retrieval
Biological entities: "plugin dictionaries", specific
Actions, facts or circumstances that relate two entities: "common sense", partially generic
Auxiliary: generic
Text extraction pattern: <gene><bracket><allele><bracket>
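The gene/bracket/allele/bracket extraction pattern above can be approximated with a regular expression, as in the sketch below. This is a stand-in, not Textpresso's implementation: the system described in the talk matches tagged categories, whereas this example assumes simplified C. elegans naming (a gene as three or four lowercase letters, a hyphen and a number; an allele as one to three lowercase letters followed by digits) and parentheses as the brackets.

```python
import re

# Simplified, assumed naming rules: gene "unc-22", allele "e66".
GENE_ALLELE = re.compile(
    r"\b(?P<gene>[a-z]{3,4}-\d+)\((?P<allele>[a-z]{1,3}\d+)\)"
)


def extract_gene_alleles(text):
    """Return (gene, allele) pairs for every gene(allele) mention in text."""
    return [(m.group("gene"), m.group("allele"))
            for m in GENE_ALLELE.finditer(text)]


sentence = "Both unc-22(e66) and lin-12(n137) animals were scored."
print(extract_gene_alleles(sentence))
```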
Future work:
Anaphora resolution (5%-8%)
Synonym recognition
Searching in sub-sections of the paper (i.e. method, results etc)
Integrating open source ontologies (MeSH, UMLS)
Pilot study of other MODs (currently SGD)
PubFetch: Collecting literature from multiple data sources
Vijay Narayanasamy
o Interface between the literature curation tools and the online literature databases, such as PubMed, Agricola, Biosis
o Returns data in PubMed MEDLINE format
o Filters duplicates
o Generic searching and retrieval of literature data from online literature data sources
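The duplicate-filtering step could be sketched as follows. The notes do not say how PubFetch decides that two records are the same paper, so the matching rule here (normalized title plus publication year) is an assumption, as are the record fields and the example titles.

```python
def record_key(record):
    """Build a comparison key from the title and year.

    The title is lowercased and stripped of everything except letters
    and digits, so differences in case, spacing and punctuation between
    sources do not matter.
    """
    title = "".join(ch for ch in record["title"].lower() if ch.isalnum())
    return (title, record["year"])


def filter_duplicates(records):
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = record_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical results for the same query from two sources.
records = [
    {"source": "PubMed", "title": "A rat QTL map.", "year": 2003},
    {"source": "Agricola", "title": "A Rat QTL Map", "year": 2003},
    {"source": "Biosis", "title": "Barley bin maps", "year": 2002},
]
for record in filter_duplicates(records):
    print(record["source"], "-", record["title"])
```

A title-and-year key is deliberately crude; a real service would probably also compare identifiers such as PubMed IDs when both records carry them.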
How does PubFetch work?
Core functionalities available as web services, following the BioMOBY service model, language independent
A system through which a client interacts with multiple sources of biological data regardless of the underlying format or schema
RGD BioMOBY Services (in progress)
Marc Colosimo (Mitre)
Goal: common benchmarks for the performance of natural language processing systems working on biomedical research literature
Task 1: entity extraction
- assessing the ability of an automated system to identify the genes
- assignment of GO annotations to human proteins
Problems and limitations
Curatorial procedures at Mouse Genome Informatics, with an emphasis on expression data
Constance M. Smith
Gene Expression Database (GXD)
o Obtained via manual curation
o Embryonic expression data: where, when and what
o Assay types:
Emphasis on images: allowing users to analyze primary data
Allele details: description of phenotype using controlled vocabulary
Gene Expression Curation in WormBase
Wen J. Chen
Reporter gene analysis (GFP, LacZ ...)
Antibody staining
In situ hybridization
Northern, Western, RT PCR on staged animals
Gene expression in mutant/RNAi background
Expression influenced by temperature and chemicals
WormBase Literature Curation:
1. Temporal
Developmental Life Stage, 69 terms
All gene expression curation, including expression pattern, microarray and gene regulation
Anatomy, ~5,000 terms
Future curation on expression pattern and gene regulation
Updating Old expression and gene regulation data
Chihiro Yamada
Phenotype Curation
1. Allele level
Sequence variants; transgene constructs; RNAi experiments
Annotated with: Phenotypic class CV; Bodypart CV; Free text
2. Gene level interaction
Linked to allele-level interaction statements or based on author statements
3. GO IPI and IGI
246 genes with IGI lines, 126 genes with IPI lines; 458 IGI and 228 IPI lines in total
4. Molecular interactions
(a) 2 types of entity:
Objects (~550 so far, e.g. Notch intracellular domain)
Events (~500 so far, e.g. "Notchless binds to the Notch intracellular domain")
- Protein/protein
- Protein/DNA
- Protein/RNA
(b) Curated from literature
(c) Objects mapped to the genome
(d) Annotated with CVs
(e) Location/stage/GO
(f) Plans to incorporate large scale datasets: e.g. 2-Hybrid screens
We need to warn computational biologists:
Mutant Manifests: toward a zebrafish phenotype ontology
Goal: to facilitate cross-species analysis of gene function in embryonic development
We want to annotate mutant phenotypes to understand not just whether a gene affects an anatomical structure, but how different mutations affect specific aspects of a structure. The sorts of cross-species queries we would like to facilitate include:
o Mutations in which orthologous genes result in vertebral duplications?
o Mutations in which genes result in over-production of spinal motoneurons?
o What frequency of mutations with malformed inner ears also have malformed kidneys?
o What frequency of mutants with increased life-spans also have decreased heart-rates?
Chosen format: PATO (Phenotype and Trait Ontology)
Represent any phenotype as:
ENTITY --has a-- ATTRIBUTE --has a-- VALUE
Entity can be an anatomical structure, process, or GO term.
Attributes and values come from a species-independent controlled vocabulary.
Difficulties and challenges:
1. Values of structural attributes not in PATO: "sickle-cell like", "resembling a raw prawn"
2. Translation of complex text descriptions into concise PATO format, e.g. "retinal axons grow halfway to their normal target in the optic tectum, then turn and make ectopic synapses in the epiphysis"
3. Whether to annotate a structure or the processes that form a structure, e.g. "disorganized fin stripes"
4. Uniformity of curation
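The ENTITY --has a-- ATTRIBUTE --has a-- VALUE representation described for PATO can be sketched as a small data structure with a lookup in the style of the cross-species queries listed in the talk. The mutant names and the entity, attribute and value strings below are invented for illustration and are not real PATO or anatomy ontology terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phenotype:
    """One phenotype statement in entity/attribute/value form."""
    entity: str
    attribute: str
    value: str

    def __str__(self):
        return f"{self.entity} --has a-- {self.attribute} --has a-- {self.value}"


# Hypothetical annotations: mutant name -> list of phenotype statements.
annotations = {
    "mutant_a": [Phenotype("vertebra", "number", "increased")],
    "mutant_b": [
        Phenotype("spinal motoneuron", "number", "increased"),
        Phenotype("heart", "rate", "decreased"),
    ],
}


def mutants_with(annotations, entity, attribute, value):
    """Return the sorted names of mutants annotated with the given triple."""
    target = Phenotype(entity, attribute, value)
    return sorted(name for name, phenotypes in annotations.items()
                  if target in phenotypes)


print(annotations["mutant_a"][0])
print(mutants_with(annotations, "heart", "rate", "decreased"))
```

Because every statement has the same three-part shape, a query is an exact match on a triple. This is also why free-text values such as "resembling a raw prawn" are hard to accommodate.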
Community Curation at MaizeGDB
Carolyn J. Lawrence
Why community curators?
Data entered by community members
Uses the software Jitterbug to track user emails. Typical emails:
Gene family submission was a big success.
TAIR attends most international meetings and has been hosting workshops.
L. Stein had proposed a joint GMOD/curator meeting: a GMOD meeting with sessions for bioinformatics, databases, curation, and ontologies.
Small group discussions for BOFs
Time for hands-on demos of software tools
Tools transferable to different databases - generic
Get other tools dev. earlier in development
Curator/developer/bioinformatics communication
Examples of tools on GMOD - sample implementations of tools, etc.
Best time to hold meeting?
How often?
Make better use of biocurator mail list for communication - biocurator.org
Sample comments from grant review boards
(1) Balance between stability vs. speed of updating database (WormBase)
(2) More understanding on significance of GO annotation (MGI)
(3) Balance between too much vs. not enough information on home page (TAIR)
(4) Uniform colors on home page (TAIR)
(5) Obtain user feedback by watching a user using the web page (FlyBase)
(6) Pictures of mutants are very desirable (FlyBase, WormBase, Gramene)
(7) Community definition:
Everyone registered (WormBase); names are obtained from the literature, because all papers are on hand and the author names are used. Full-time staff keep track. Registration info is collected at meetings.
(1) Journal requirement for species names on articles, standardized submission of gene names, database accession IDs, strain IDs, and Ontology terms and IDs
(2) Species and sequence ID can be put in the "methods and materials" section of a paper. Have links to database sites
(3) Joint letter to journals (Nature: yes; Dev Biol: encourages) to request species name in title, abstract, or both; joint publication describing nomenclature, curation, and benefits of working with MODs
(4) Switch in emphasis from ontologies to free text (WormBase)
(5) Second-party (curator) submission of unpublished sequence information
(6) Encourage users to submit their cDNA sequences to GenBank, even if it's a repeat of something that's already there (when a sequence is already on a BAC, people don't resubmit)