Discussion:
[Gmod-gbrowse] GBrowse is slow
Samy Jeevan Karloss Antony
2014-05-13 08:07:31 UTC
Permalink
Hi there,

I have installed GBrowse. I am using Chado and PostgreSQL.

It became very slow after I loaded a GFF3 file of ~80 MB, which contains gene, mRNA, CDS, 3' UTR, and 5' UTR features.

It's running on 6 CPUs with 12 GB RAM (a test server).

I have a few questions:

1. Are that much CPU and RAM enough for this data size?

2. The feature table is ~1.2 GB and the whole database is 6 GB. I suspect some of my old test data was not deleted properly. Is there a way to check that, or to delete all instances of the test data loaded into the database?

3. What is the actual Postgres query that runs in the background when you search for a gene in the browser? I'd like to check whether my database is slow.

Thanks,
Jeevan
Scott Cain
2014-05-13 14:00:35 UTC
Permalink
Hi Jeevan,

Is there a reason you're using Chado? It is a much slower option for
driving GBrowse than Bio::DB::SeqFeature::Store. Even if you want "live"
data for one track, I'd suggest setting up a SeqFeature::Store database for
the rest of the tracks and getting just the live data track from Chado.

Also, using FastCGI will speed up GBrowse regardless of which database
adaptor you're using, but it makes a particularly big difference for
Chado: there is some start-up overhead that FastCGI caches for you,
so queries after the first are noticeably faster.

That said, there are options for tuning a PostgreSQL database to get better
performance, since it isn't particularly tuned out of the box. Have you
looked at this for instance:

https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

In the past (i.e., pre-Postgres 9), I've had a lot of success adjusting the
shared_buffers parameter, but there are other tuning parameters as well.
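
As a rough illustration (the values below are assumptions for a 12 GB machine, not recommendations from this thread; always test against your own workload), the relevant postgresql.conf settings look like:

```
# postgresql.conf -- illustrative values only
shared_buffers = 2GB          # often suggested at roughly 25% of RAM
effective_cache_size = 8GB    # planner hint: shared_buffers + OS file cache
work_mem = 32MB               # per-sort/per-hash memory for each query
maintenance_work_mem = 512MB  # speeds up index builds during bulk loads
```

After editing, restart PostgreSQL (shared_buffers cannot be changed with a reload alone).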

Now about your questions:

1. Are the CPU/RAM enough? Probably.

2. While it's possible to delete feature data via GFF, it is slow and not
perfect. Typically, what I do when I need to do something like that is
identify the first feature_id of the "real" data and then delete everything
below it:

DELETE FROM feature WHERE feature_id < ?

That will cause a cascading delete, which removes not only the feature
rows but all data in linked tables as well. That said, I'm not sure the
increase in database size is that out of the ordinary: several indexes get
built when you load data, so disk usage is multiplied by a fair bit
(though I don't know off-hand by how much). Additionally, the number and
size of the ontologies you have loaded will have an effect as well.
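
As a sketch of how you might check where the space went and where your old test data sits (the size functions are standard PostgreSQL; grouping by organism_id assumes your test data and real data were loaded under different organisms, which may not hold for your load):

```sql
-- How much of the 6 GB is the feature table vs. its indexes?
SELECT pg_size_pretty(pg_relation_size('feature'))       AS table_size,
       pg_size_pretty(pg_indexes_size('feature'))        AS index_size,
       pg_size_pretty(pg_total_relation_size('feature')) AS total_size;

-- feature_id ranges per organism, to locate leftover test data
SELECT organism_id, min(feature_id), max(feature_id), count(*)
FROM feature
GROUP BY organism_id
ORDER BY organism_id;
```

After a large DELETE, a `VACUUM ANALYZE feature;` is needed before the on-disk size and planner statistics reflect the change.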

3. The long running query will look something like this:

select child.feature_id, child.name, child.type_id, child.uniquename,
       parent.name as pname, child.is_obsolete, childloc.fmin, childloc.fmax,
       childloc.strand, childloc.locgroup, childloc.phase,
       COALESCE(af.significance, af.identity, af.normscore, af.rawscore) as score,
       childloc.srcfeature_id
from feature as parent
inner join feature_relationship as fr0
        on (parent.feature_id = fr0.object_id)
inner join feature as child
        on (child.feature_id = fr0.subject_id)
inner join featureloc as childloc
        on (child.feature_id = childloc.feature_id)
left join analysisfeature as af
        on (child.feature_id = af.feature_id)
inner join featureloc as parentloc
        on (parent.feature_id = parentloc.feature_id)
where parent.feature_id = $parent_id
  and childloc.rank = 0
  and fr0.type_id in ($partof)
  and childloc.srcfeature_id = $refclass_feature_id
  and parentloc.srcfeature_id = $refclass_feature_id
  and child.type_id in ($typelist)

where you'd need to substitute these values:

$partof: a comma-separated list of cvterm_ids of the part_of and
derived_from terms in the relationship ontology.
$typelist: a comma-separated list of cvterm_ids of the Sequence Ontology
terms to search for.
$parent_id: the feature_id of the parent feature (typically a gene or
match feature).
$refclass_feature_id: the feature_id of the reference feature (i.e., the
chromosome).
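
For example, the cvterm_ids for $partof can be looked up like this (a sketch; the cv name 'relationship' is an assumption about how the relationship ontology was loaded in your instance):

```sql
-- Look up cvterm_ids to substitute into $partof
SELECT cvterm_id, name
FROM cvterm
WHERE name IN ('part_of', 'derived_from')
  AND cv_id = (SELECT cv_id FROM cv WHERE name = 'relationship');
```

With real ids substituted in, prefixing the big query with `EXPLAIN ANALYZE` in psql shows where the time goes; sequential scans on featureloc or feature_relationship in that output usually indicate missing or unused indexes.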

This query has to run for each feature that might have subparts, so for a
given view in GBrowse, it could run a hundred times or more. This is the
main reason the Chado adaptor doesn't scale well.

Scott



_______________________________________________
Gmod-gbrowse mailing list
https://lists.sourceforge.net/lists/listinfo/gmod-gbrowse
--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)               216-392-3087
Ontario Institute for Cancer Research
Samy Jeevan Karloss Antony
2014-05-16 09:04:41 UTC
Permalink
Hi Scott, thanks.

I installed FastCGI.

And I moved from Chado+Postgres to MySQL. The result is amazing: it's very fast now. Thanks again.

Jeevan

