Discussion:
[Gmod-gbrowse] Search on load_id attribute (Ensembl genomes)?
Sam Hokin
2015-03-04 04:34:39 UTC
Hi, folks, just joined the list after spending a lot of time with GBrowse the past few months. My question: Is there any way to get
the GBrowse search to match on the load_id attribute?

Annotated genes that I get from the Ensembl plant database (I'm working with the maize AGPv3 genome, aka B73_RefGen_v3) do not have
a Name attribute in the GFF3 file. Here's a typical line from Zea_mays.AGPv3.24.gff3 (swapping spaces for tabs):

10 ensembl gene 28054052 28054433 . + . ID=gene:GRMZM5G846142;assembly_name=AGPv3;biotype=protein_coding;description=Uncharacterized protein [Source:UniProtKB/TrEMBL%3BAcc:K7TYI5];logic_name=genebuilder;version=1

A lot of wonderful info there in the attributes, but no Name is provided. This gets loaded into my database (using
bp_seqfeature_load.pl) with load_id=gene:GRMZM5G846142 and no Name. If one searches for "GRMZM5G846142" or "gene:GRMZM5G846142" in
the browser, one gets Not Found. (Oddly, Ensembl does provide a Name attribute for exons, but not for genes, transcripts or CDS, at
least in the current version. And by the way, the color-coded GFF3 parent-child relationship display works wonderfully with those!)
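
To make the missing attribute concrete, here is a minimal Python sketch that splits column 9 of the example line and checks which tags are present. It is not part of GBrowse or BioPerl; the function name parse_gff3_attributes is made up for illustration, and it assumes tab-separated columns with percent-encoded values (it does not split comma-separated multi-values).

```python
from urllib.parse import unquote

def parse_gff3_attributes(line):
    """Return column 9 of a tab-separated GFF3 feature line as a dict."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("expected 9 tab-separated columns")
    attributes = {}
    for pair in columns[8].split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, _, value = pair.partition("=")
        attributes[tag] = unquote(value)  # decode escapes such as %3B
    return attributes

# The example gene line from above, with real tabs between the columns.
example = "\t".join([
    "10", "ensembl", "gene", "28054052", "28054433", ".", "+", ".",
    "ID=gene:GRMZM5G846142;assembly_name=AGPv3;biotype=protein_coding;"
    "description=Uncharacterized protein [Source:UniProtKB/TrEMBL%3BAcc:K7TYI5];"
    "logic_name=genebuilder;version=1",
])

attrs = parse_gff3_attributes(example)
print(attrs["ID"])       # gene:GRMZM5G846142
print("Name" in attrs)   # False
print("Alias" in attrs)  # False
```

The only identifier on the gene line is the ID tag, which the loader stores as load_id.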

Yes, I can futz with the GFF file, or run a SQL script to insert a record in name for all load_id records in attributes that don't
have one in name, and all sorts of things, but I thought maybe there's a way to avoid that in a conf file. Ensembl is a pretty big
source of genome data, so I'd think this would come up a lot.
Timothy Parnell
2015-03-05 18:03:58 UTC
Hi Sam,

Have you taken a look at the search options described here? http://gmod.org/wiki/GBrowse_2.0_HOWTO#Database_Search_Options

With that said, I don’t think load_id is automatically searched. The GFF ID attribute is used primarily to link child and parent features (well, as far as the BioPerl loader/parser is concerned). Once the GFF file has been parsed by the adaptor upon loading and the relationships established, the ID is all but forgotten, being stored only as the load_id attribute. The Bio::DB::SeqFeature::Store adaptor, which is what GBrowse typically uses, does not provide built-in methods to search for this attribute specifically (although it is possible using the Perl API and a custom, modified feature search).

The GBrowse search field is, I believe, limited to display_name, feature type, and Note (description) searches (I’ve tried spelunking in the code, but it’s pretty deep without a lot of documentation, so I may not be completely right).

I have always modified my GFF3 files to duplicate the ID attribute as the Name or Alias attribute to avoid this issue. It’s unfortunate that the ID is sometimes treated as equal to display_name but not universally so.
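
As a rough illustration of that kind of GFF3 preprocessing, here is a minimal Python sketch, not the script I actually use. The function name add_name_from_id and the strip_prefix option are hypothetical. It assumes tab-separated feature lines and appends Name=<ID> only when an ID is present and no Name exists; everything else passes through unchanged.

```python
def add_name_from_id(line, strip_prefix=False):
    """Append Name=<ID> to a GFF3 feature line that has an ID but no Name."""
    if line.startswith("#") or not line.strip():
        return line  # directives, comments, blank lines
    body = line.rstrip("\n")
    newline = line[len(body):]
    columns = body.split("\t")
    if len(columns) != 9:
        return line  # not a feature line (e.g. FASTA section)
    pairs = [p for p in columns[8].split(";") if p]
    tags = dict(p.partition("=")[::2] for p in pairs)
    if "ID" not in tags or "Name" in tags:
        return line
    name = tags["ID"]
    if strip_prefix and ":" in name:
        # Assumption: Ensembl-style "gene:XYZ" IDs; keep only "XYZ".
        name = name.split(":", 1)[1]
    pairs.append("Name=" + name)
    columns[8] = ";".join(pairs)
    return "\t".join(columns) + newline

# Sam's example gene line, with real tabs between the columns.
example = "\t".join([
    "10", "ensembl", "gene", "28054052", "28054433", ".", "+", ".",
    "ID=gene:GRMZM5G846142;assembly_name=AGPv3;biotype=protein_coding;"
    "description=Uncharacterized protein [Source:UniProtKB/TrEMBL%3BAcc:K7TYI5];"
    "logic_name=genebuilder;version=1",
]) + "\n"

print(add_name_from_id(example), end="")
```

Applying the function to every line of the file before running bp_seqfeature_load.pl gives each gene, transcript and CDS a Name. Passing strip_prefix=True would store GRMZM5G846142 rather than gene:GRMZM5G846142, if that is the form users are expected to type.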

Good luck,
Tim
