Hi there all (and Greg, since he is likely to answer this),
Based on an rdkit post I read over the warm weekend I set myself to
have a look at my rdkit based queries (and ways to speed them up)...
But first some details:
Postgres:
PostgreSQL 9.0.3 on x86_64-unknown-linux-gnu, compiled by GCC gcc
(GCC) 4.5.2, 64-bit
RDKit (DB Cartridge):
v. 0.20.0
# of Molecules
8,432,896
Database table (ligand)
Table "public.ligand"
Column | Type | Modifiers
------------+-----------------------+-----------------------------------------------------
id | integer | not null default
nextval('ligand_id_seq'::regclass)
supplierid | character varying(50) |
smiles | text |
rdkitmol | mol |
pairbv | bfp |
torsionbv | bfp |
morganbv | bfp |
amw | real |
mollogp | real |
hba | integer |
hbd | integer |
atoms | integer |
hvyatoms | integer |
Indexes:
"ligand_pkey" PRIMARY KEY, btree (id)
"idx_ligand_morganbv" gist (morganbv)
"idx_ligand_pairbv" gist (pairbv)
"idx_ligand_rdkitmol" gist (rdkitmol)
"idx_ligand_torsionbv" gist (torsionbv)
I cannot explain why the following queries:
db=# select count(*) from ligand where rdkitmol@>'c1cccc2c1nncc2' ;
count
-------
2942
(1 row)
Time: 193973.253 ms
db=# select count(*) from ligand where morganbv%morganbv_fp('c1cccc2c1nncc2',2);
count
-------
8
(1 row)
Time: 400138.989 ms
Take so long... these are orders of magnitude larger than timings
reported in http://code.google.com/p/rdkit/wiki/DatabaseCreation2
And my database in "only" roughly 50% larger (8M instead of the puny
5M in emolecules).
When I do an "explain" on these queries (to make sure the indices are
being used), I get:
db=# explain select count(*) from ligand where rdkitmol@>'c1cccc2c1nncc2' ;
QUERY PLAN
----------------------------------------------------------------------------------------------
Aggregate (cost=34850.44..34850.45 rows=1 width=0)
-> Bitmap Heap Scan on ligand (cost=2667.36..34829.36 rows=8433 width=0)
Recheck Cond: (rdkitmol @> 'c1cc2c(nncc2)cc1'::mol)
-> Bitmap Index Scan on idx_ligand_rdkitmol
(cost=0.00..2665.25 rows=8433 width=0)
Index Cond: (rdkitmol @> 'c1cc2c(nncc2)cc1'::mol)
(5 rows)
db=# explain select count(*) from ligand where
morganbv%morganbv_fp('c1cccc2c1nncc2',2);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Aggregate (cost=33290.88..33290.89 rows=1 width=0)
-> Bitmap Heap Scan on ligand (cost=918.05..33269.79 rows=8433 width=0)
Recheck Cond: (morganbv %
'\\xe0ffffff00040000130000007e00108444d20e3c40042af90238d0080a0c3462c2'::bfp)
-> Bitmap Index Scan on idx_ligand_morganbv
(cost=0.00..915.94 rows=8433 width=0)
Index Cond: (morganbv %
'\\xe0ffffff00040000130000007e00108444d20e3c40042af90238d0080a0c3462c2'::bfp)
(5 rows)
Looks good no?
Am I missing something? Or is this the fastest my search can go at?
Supposedly the fingerprints search is just doing some ~8M binary
operations no? Why does this take so long?
Ideas, anyone?
Many Thanks
JP
------------------------------------------------------------------------------
WhatsUp Gold - Download Free Network Management Software
The most intuitive, comprehensive, and cost-effective network
management toolset available today. Delivers lowest initial
acquisition cost and overall TCO of any competing solution.
http://p.sf.net/sfu/whatsupgold-sd
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss