Tag Archives: genome annotation

Ncrassa v5 annotation released

As an update to the previous post, the N. crassa annotation has been updated to version 5 on the Broad Institute website. Previously the data for this update was not yet available, but as of 8-Mar-2011 it is. The assembly hasn't changed, but the annotation is updated and includes fixes for improperly renamed locus names. On the N. crassa genome site you can find files tracking the history of each locus through these versions, so you can determine whether a locus name was improperly changed in the past. This should be rectified in the currently released annotation, and I definitely encourage you to take it for a spin and report back to the Broad Institute if you have any questions.

Neurospora annotation update (v5)

Here is a message from the Broad Institute about a gene annotation update that was made recently in response to an issue that was revealed in the June 2010 release.  This new version is called V5 and should be on its way to GenBank.

Dear Neurospora scientists,

Recently we discovered an issue with the way locus tags were assigned
to our most recent Neurospora gene set, released publicly on the Broad
website in June of 2010. Many genes in this gene set have mismatched
locus numbers compared to the same genes released in February 2010.
Adding to the confusion, both releases were labeled version 4.

To remedy this we have recalled the June locus numbers and released a
new, version 5 gene set. Genes in this set have been numbered to
preserve historical locus numbers (back to the original genbank
release) as much as possible.

Folks who call their favorite genes by their v1, v2 or v3 numbers can
search for them on our web page, which will map them to v5
automatically and accurately. The same will work for most v4 numbers.
Unfortunately, 863 genes have different locus tags in the two v4
releases. If you search for one of them, you will get two hits - the
v5 gene that the February edition mapped to, and the v5 gene that the
June edition mapped to.

Two examples to clarify:

A. Suppose you search for NCU11713.4 on our web page. This query will
retrieve two genes, NCU11688.5 and NCU11713.5. The gene which in the
February release was called NCU11713.4 is the same as NCU11688.5,
while the gene labeled NCU11713.4 in June is the same as NCU11713.5.

B. Searching for NCU11324.4 yields but one hit because that gene, like
most genes, was consistently numbered between the two releases labeled
4.

If you are not sure when you downloaded your genes, the following may
help. If you see any of these locus numbers in your gene set:

NCU00129.4, NCU00457.4, NCU00499.4, NCU00556.4, NCU00627.4,
NCU00685.4, NCU00768.4, NCU00856.4, NCU00986.4, NCU01064.4,
NCU01065.4, NCU01282.4, NCU01299.4, NCU01300.4, NCU01483.4,
NCU01559.4, NCU01560.4, NCU01610.4, NCU01611.4, NCU01664.4,
NCU01665.4, NCU01871.4, NCU01903.4, NCU02200.4, NCU02259.4,
NCU02666.4, NCU02758.4, NCU02837.4, NCU02998.4, NCU03047.4,
NCU03206.4, NCU03773.4, NCU04239.4, NCU04240.4, NCU04518.4,
NCU04519.4, NCU04710.4, NCU04711.4, NCU05275.4, NCU05512.4,
NCU05776.4, NCU06013.4, NCU06370.4, NCU06732.4, NCU07107.4,
NCU07259.4, NCU07260.4, NCU07301.4, NCU07405.4, NCU07856.4,
NCU07857.4, NCU08090.4, NCU08182.4, NCU08323.4, NCU08332.4,
NCU09085.4, NCU09256.4, NCU09257.4, NCU09998.4, NCU10166.4,
NCU10574.4, NCU11040.4, NCU11240.4, NCU11253.4, NCU11376.4,
NCU11390.4, NCU11393.4

then your genes are from the February 2010 gene set. However, if you see

NCU00082.4, NCU00083.4, NCU00084.4, NCU00085.4, NCU00516.4,
NCU01819.4, NCU04299.4, NCU04300.4, NCU04301.4, NCU04302.4,
NCU04303.4, NCU04304.4, NCU04305.4, NCU05000.4, NCU05111.4,
NCU05112.4, NCU05113.4, NCU05114.4, NCU05115.4, NCU05116.4,
NCU05448.4, NCU05452.4, NCU06667.4, NCU07323.4, NCU09066.4,
NCU10179.4, NCU10301.4, NCU10379.4, NCU10383.4, NCU10753.4,
NCU10866.4, NCU10914.4, NCU11068.4, NCU11182.4, NCU12157.4,
NCU12158.4, NCU12159.4, NCU12160.4, NCU12161.4, NCU12162.4,
NCU12163.4, NCU12164.4, NCU12165.4, NCU12166.4, NCU12167.4,
NCU12168.4, NCU12169.4, NCU12170.4, NCU12171.4, NCU12172.4,
NCU12173.4, NCU12174.4, NCU12175.4, NCU12176.4, NCU12177.4,
NCU12178.4, NCU12179.4, NCU12180.4, NCU12181.4, NCU12182.4,
NCU12183.4, NCU12184.4, NCU12185.4, NCU12186.4, NCU12187.4, NCU12188.4

then your genes are from the June 2010 release.

Attached please find five mapping tables which can be used to migrate
locus numbers from any of the previous releases to the latest version
5 locus tags (linked below).

We apologize for any confusion this may cause.
Love,
The Broad Institute

I’ve also uploaded the locus update files, which map between versions of the annotation.
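If you'd rather script this than use the web search, here is a rough Python sketch of the two checks described in the letter: guessing which v4 release a gene list came from, and looking up the v5 candidates for a v4 ID. The marker sets are only a small subset of the lists in the letter, and the two lookup dictionaries hold just the two examples from the letter. The function names and the dictionary layout are my own assumptions and do not reflect the actual format of the mapping files.

```python
# Marker loci copied from the letter above (small subset; use the full lists in practice).
FEB_2010_MARKERS = {"NCU00129.4", "NCU00457.4", "NCU00499.4"}
JUNE_2010_MARKERS = {"NCU00082.4", "NCU00083.4", "NCU12188.4"}

# Hypothetical in-memory lookups, one per v4 release. NCU11713 is example A
# from the letter. For example B the letter only says there is a single hit;
# the v5 ID NCU11324.5 is assumed here for illustration.
FEB_V4_TO_V5 = {"NCU11713.4": "NCU11688.5", "NCU11324.4": "NCU11324.5"}
JUNE_V4_TO_V5 = {"NCU11713.4": "NCU11713.5", "NCU11324.4": "NCU11324.5"}


def guess_v4_release(locus_ids):
    """Guess which v4 release a set of locus IDs came from, using marker loci."""
    ids = set(locus_ids)
    has_feb = bool(ids & FEB_2010_MARKERS)
    has_june = bool(ids & JUNE_2010_MARKERS)
    if has_feb and has_june:
        return "mixed"
    if has_feb:
        return "February 2010"
    if has_june:
        return "June 2010"
    return "unknown"


def v5_candidates(v4_id):
    """Return the distinct v5 IDs a v4 ID could refer to (one or two hits)."""
    hits = []
    for table in (FEB_V4_TO_V5, JUNE_V4_TO_V5):
        v5_id = table.get(v4_id)
        if v5_id is not None and v5_id not in hits:
            hits.append(v5_id)
    return hits


if __name__ == "__main__":
    print(guess_v4_release(["NCU00129.4", "NCU11713.4"]))
    print(v5_candidates("NCU11713.4"))
    print(v5_candidates("NCU11324.4"))
```

If you already know which release your IDs came from, you only need the lookup for that release; the two-table lookup is there to mimic the "two hits" behavior of the web search for the ambiguous genes.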

Yes, Ecology can improve Genomics

Few organisms are as well understood at the genetic level as Saccharomyces cerevisiae. Given that there are more yeast geneticists than yeast genes and exemplary resources for the community (largely a result of the community's size), this comes as no surprise. What is curious is the large number of yeast genes that we've been unable to characterize. Of the ~6000 genes currently identified in the yeast genome, 1253 have no verified function (for the uninclined, this is roughly 21% of the yeast proteome). Egads! If we can’t figure this out in yeast, what hope do we have in non-model organisms? Lourdes Peña-Castillo and Timothy R. Hughes discuss this curious observation and its cause in their report in Genetics.

Fusarium graminearum genome published

The genome of the wheat and cereal pathogen Fusarium graminearum was published in Science this week in an article entitled “The Fusarium graminearum Genome Reveals a Link Between Localized Polymorphism and Pathogen Specialization”. The project was a collaboration of many different Fusarium research groups. The genome sequencing was spearheaded by the Broad Institute of MIT and Harvard and is part of a larger project to sequence several different species of Fusarium. The group also sequenced a second strain in order to identify polymorphisms.

Some of the key findings

  • The presence of Repeat-Induced Point mutation (RIP) has likely limited the amount of repetitive and duplicated sequence in the genome
  • Most of the genes unique to F. graminearum (and thus not present in 4 other Fusarium spp. genomes) are found near the telomeres
  • Between the sequenced strains SNP density ranged from 0 to 17.5 polymorphisms per kb.
  • Some of the genes expressed uniquely during plant infection (408 total) include known virulence factors and many plant cell-wall degrading enzymes.
  • The genes showing some of the highest SNP diversity tended to be unique to Fusarium and often unique to F. graminearum
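For anyone wanting to compute a similar per-kb SNP density from their own strain comparison, here is a minimal sketch. It assumes you already have a list of 1-based SNP positions on a single contig. The function name and the 1 kb window size are my own choices for illustration, not the method or parameters used in the paper.

```python
def snp_density_per_kb(snp_positions, seq_length, window=1000):
    """Return SNPs per kb for consecutive, non-overlapping windows.

    snp_positions: 1-based SNP coordinates on one sequence.
    seq_length: length of that sequence in bp.
    window: window size in bp (an arbitrary choice here).
    """
    n_windows = (seq_length + window - 1) // window  # ceiling division
    counts = [0] * n_windows
    for pos in snp_positions:
        if 1 <= pos <= seq_length:  # ignore out-of-range coordinates
            counts[(pos - 1) // window] += 1

    densities = []
    for i, count in enumerate(counts):
        # The last window may be shorter than the others, so scale by its true width.
        width = min(window, seq_length - i * window)
        densities.append(count * 1000 / width)
    return densities


if __name__ == "__main__":
    # Toy data: three SNPs on a 2 kb sequence.
    print(snp_density_per_kb([10, 500, 1500], 2000))
```

Scaling the final window by its real width keeps a short tail of the contig from looking artificially SNP-poor.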

Yeast genome: Known knowns, and known unknowns

From Genetics this week, a review discusses Why are there still 1000 Uncharacterized Yeast genes? Poor yeast: so many of its genes have no known function, while S. pombe has nearly 100% coverage in functional annotation. I’ll also point out that the 1000 genes refers to protein-coding genes, not ncRNA genes, which may mean that there is a lot more that is unknown.

I think this sentence from the abstract hits the nail on the head.

Notably, the uncharacterized gene set is highly enriched for genes whose only homologs are in other fungi. Achieving a full catalog of yeast gene functions may require a greater focus on the life of yeast outside the laboratory.

Orthology detection software

A paper in PLoS One, Assessing Performance of Orthology Detection Strategies Applied to Eukaryotic Genomes, reports a new approach to assessing the performance of automated orthology detection. These authors also wrote OrthoMCL (2006 DB paper, 2003 algorithm paper), which uses MCL to build orthologous gene families. The authors discuss the trade-offs between highly sensitive and specific tree-based methods, the fast but less sensitive Best-Reciprocal-Hits approaches built on BLAST or FASTA, and some of the hybrid approaches. The authors employ Latent Class Analysis (LCA) to aid in “evaluation and optimization of a comprehensive set of orthology detection methods, providing a guide for selecting methods and appropriate parameters”. LCA is also the statistical basis for feature choice when combining gene predictions into a single set of gene calls in GLEAN, written by many of the same authors, including Aaron Mackey.
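To make the Best-Reciprocal-Hits idea concrete, here is a bare-bones sketch. It assumes the BLAST (or FASTA) results have already been parsed into (query, subject, bitscore) tuples, one list per search direction. The function names and the toy data are hypothetical, and this is the simplest form of the idea, not the procedure evaluated in the paper or the OrthoMCL algorithm.

```python
def best_hits(hits):
    """Map each query to its single highest-scoring subject.

    hits: iterable of (query, subject, bitscore) tuples.
    Ties keep the first hit seen.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    # Toy example: a2's best hit is b2, but b2 prefers a1, so only a1/b1 pair up.
    a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 150.0), ("a2", "b2", 90.0)]
    b_vs_a = [("b1", "a1", 198.0), ("b2", "a1", 140.0), ("b2", "a2", 95.0)]
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

The toy data also shows why the approach is less sensitive: a2 and b2 get no partner at all, even though a recent duplication could make both of them legitimate co-orthologs, which is the kind of case tree-based and clustering methods try to recover.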

I’ve been reading a lot of orthology and gene tree-species tree reconciliation papers lately; some are listed by Ian Holmes’s group, and some of the software is listed on the BioPerl site. This also follows on from our Phyloinformatics hackathon work, which we are trying to formalize in more documentation for phyloinformatics pipelines to support some of the described use cases. I’m also applying some of this to a tutorial I’m teaching at ISMB2007 this summer.

That was a lot of work

I’ve never worked with Magnaporthe grisea, the fungus responsible for rice blast, one of the most devastating crop diseases, but I do know that its life cycle is complicated and that knocking out roughly 61% of the genes in the genome and evaluating the mutant phenotypes to infer gene function is not trivial. In their recent letter to Nature, Jeon et al. did what many of us have dreamed of doing in our fungus of interest: manipulate every gene to find those that contribute to a phenotype of interest.

In their study, the authors looked for pathogenicity genes. Interestingly, defects in appressorium formation and conidiation had the strongest correlation with defects in pathogenicity, suggesting that these two developmental stages are crucial for virulence. Ultimately, the authors identify 203 loci involved in pathogenicity, the majority of which have no homologous hits in the sequence databases and no clearly enriched GO functions. Impressively, this constitutes the largest unbiased list of pathogenicity genes identified for a single species (though some of us, I’m sure, may have a problem with the term “unbiased”).

If you’d like to play with their data, the authors have made it available in their ATMT Database.

Approaching 100% coverage for GO assignments in S.pombe

A paper by Martin Aslett and Val Wood indicates that the fission yeast community is approaching 100% coverage, with a GO annotation for nearly every gene in the S. pombe genome. Among the fungi, only Ashbya gossypii has a smaller genome (see a recent paper on the Ashbya annotation database), and it doesn’t yet have complete GO coverage. This is quite remarkable and a great dataset for studies in S. pombe and all fungi.

S. pombe taken from Paul Young’s site

My quick gene predictions in a closely related species, S. japonicus, yield more than twice as many genes as S. pombe has (but this may be over-prediction by the ab initio predictors). Compared with many other fungi, S. pombe has a streamlined and reduced genome, and this reduction probably occurred independently from the reduction in the Hemiascomycetes.