Dear JP,
On Tue, May 3, 2011 at 6:29 PM, JP <[email protected]> wrote:
>
> Based on an rdkit post I read over the warm weekend I set myself to
> have a look at my rdkit based queries (and ways to speed them up)...
>
> But first some details:
>
> Postgres:
> PostgreSQL 9.0.3 on x86_64-unknown-linux-gnu, compiled by GCC gcc
> (GCC) 4.5.2, 64-bit
>
> RDKit (DB Cartridge):
> v. 0.20.0
>
The other important question is how much memory you have and what
filesystem postgres is using for the database (local or network).
> # of Molecules
> 8,432,896
>
> Database table (ligand)
> Table "public.ligand"
> Column | Type | Modifiers
> ------------+-----------------------+-----------------------------------------------------
> id | integer | not null default
> nextval('ligand_id_seq'::regclass)
> supplierid | character varying(50) |
> smiles | text |
> rdkitmol | mol |
> pairbv | bfp |
> torsionbv | bfp |
> morganbv | bfp |
> amw | real |
> mollogp | real |
> hba | integer |
> hbd | integer |
> atoms | integer |
> hvyatoms | integer |
> Indexes:
> "ligand_pkey" PRIMARY KEY, btree (id)
> "idx_ligand_morganbv" gist (morganbv)
> "idx_ligand_pairbv" gist (pairbv)
> "idx_ligand_rdkitmol" gist (rdkitmol)
> "idx_ligand_torsionbv" gist (torsionbv)
>
>
> I cannot explain why the following queries:
>
> db=# select count(*) from ligand where rdkitmol@>'c1cccc2c1nncc2' ;
> count
> -------
> 2942
> (1 row)
>
> Time: 193973.253 ms
> db=# select count(*) from ligand where
> morganbv%morganbv_fp('c1cccc2c1nncc2',2);
> count
> -------
> 8
> (1 row)
>
> Time: 400138.989 ms
The performance is critically dependent on whether or not the indices
are in memory. If you've just started the database or if there's been
a lot of non-postgres activity on the machine since the last time you
used it, it can take a long time to load the index from disk to
memory. Once it has been loaded, things should go faster.
My emolecules database requires about 1 GB for each of the indices:
emolecules=# \di+ molidx;
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+--------+-------+----------+-------+---------+-------------
public | molidx | index | glandrum | mols | 1049 MB |
(1 row)
emolecules=# \di+ mfp2idx;
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+---------+-------+----------+-------+--------+-------------
public | mfp2idx | index | glandrum | fps | 970 MB |
(1 row)
>
> Take so long... these are orders of magnitude larger than timings
> reported in http://code.google.com/p/rdkit/wiki/DatabaseCreation2
> And my database in "only" roughly 50% larger (8M instead of the puny
> 5M in emolecules).
>
> When I do an "explain" on these queries (to make sure the indices are
> being used), I get:
>
> db=# explain select count(*) from ligand where rdkitmol@>'c1cccc2c1nncc2' ;
> QUERY PLAN
> ----------------------------------------------------------------------------------------------
> Aggregate (cost=34850.44..34850.45 rows=1 width=0)
> -> Bitmap Heap Scan on ligand (cost=2667.36..34829.36 rows=8433 width=0)
> Recheck Cond: (rdkitmol @> 'c1cc2c(nncc2)cc1'::mol)
> -> Bitmap Index Scan on idx_ligand_rdkitmol
> (cost=0.00..2665.25 rows=8433 width=0)
> Index Cond: (rdkitmol @> 'c1cc2c(nncc2)cc1'::mol)
> (5 rows)
>
> db=# explain select count(*) from ligand where
> morganbv%morganbv_fp('c1cccc2c1nncc2',2);
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=33290.88..33290.89 rows=1 width=0)
> -> Bitmap Heap Scan on ligand (cost=918.05..33269.79 rows=8433 width=0)
> Recheck Cond: (morganbv %
> '\\xe0ffffff00040000130000007e00108444d20e3c40042af90238d0080a0c3462c2'::bfp)
> -> Bitmap Index Scan on idx_ligand_morganbv
> (cost=0.00..915.94 rows=8433 width=0)
> Index Cond: (morganbv %
> '\\xe0ffffff00040000130000007e00108444d20e3c40042af90238d0080a0c3462c2'::bfp)
> (5 rows)
>
> Looks good no?
> Am I missing something? Or is this the fastest my search can go at?
> Supposedly the fingerprints search is just doing some ~8M binary
> operations no? Why does this take so long?
> Ideas, anyone?
This all looks fine.
The experiments to try are:
1) do a second query and see if that's faster
2) try a smaller table (start with 1 or 2 million) and see how that performs
Best Regards,
-greg
------------------------------------------------------------------------------
WhatsUp Gold - Download Free Network Management Software
The most intuitive, comprehensive, and cost-effective network
management toolset available today. Delivers lowest initial
acquisition cost and overall TCO of any competing solution.
http://p.sf.net/sfu/whatsupgold-sd
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss