Re: [Rdkit-discuss] Speeding up database queries...

Greg Landrum Tue, 03 May 2011 20:26:58 -0700

Dear JP,

On Tue, May 3, 2011 at 6:29 PM, JP <[email protected]> wrote:
>
> Based on an rdkit post I read over the warm weekend I set myself to
> have a look at my rdkit based queries (and ways to speed them up)...
>
> But first some details:
>
> Postgres:
> PostgreSQL 9.0.3 on x86_64-unknown-linux-gnu, compiled by GCC gcc
> (GCC) 4.5.2, 64-bit
>
> RDKit (DB Cartridge):
> v. 0.20.0
>


The other important question is how much memory you have and what
filesystem postgres is using for the database (local or network).

> # of Molecules
> 8,432,896
>
> Database table (ligand)
>                                  Table "public.ligand"
>   Column   |         Type          |                      Modifiers
> ------------+-----------------------+-----------------------------------------------------
>  id         | integer               | not null default
> nextval('ligand_id_seq'::regclass)
>  supplierid | character varying(50) |
>  smiles     | text                  |
>  rdkitmol   | mol                   |
>  pairbv     | bfp                   |
>  torsionbv  | bfp                   |
>  morganbv   | bfp                   |
>  amw        | real                  |
>  mollogp    | real                  |
>  hba        | integer               |
>  hbd        | integer               |
>  atoms      | integer               |
>  hvyatoms   | integer               |
> Indexes:
>    "ligand_pkey" PRIMARY KEY, btree (id)
>    "idx_ligand_morganbv" gist (morganbv)
>    "idx_ligand_pairbv" gist (pairbv)
>    "idx_ligand_rdkitmol" gist (rdkitmol)
>    "idx_ligand_torsionbv" gist (torsionbv)
>
>
> I cannot explain why the following queries:
>
> db=# select count(*) from ligand where rdkitmol@>'c1cccc2c1nncc2' ;
>  count
> -------
>  2942
> (1 row)
>
> Time: 193973.253 ms
> db=# select count(*) from ligand where 
> morganbv%morganbv_fp('c1cccc2c1nncc2',2);
>  count
> -------
>     8
> (1 row)
>
> Time: 400138.989 ms

The performance is critically dependent on whether or not the indices
are in memory. If you've just started the database or if there's been
a lot of non-postgres activity on the machine since the last time you
used it, it can take a long time to load the index from disk to
memory. Once it has been loaded, things should go faster.

My emolecules database requires about 1 GB for each of the indices:

emolecules=# \di+ molidx;
                         List of relations
 Schema |  Name  | Type  |  Owner   | Table |  Size   | Description
--------+--------+-------+----------+-------+---------+-------------
 public | molidx | index | glandrum | mols  | 1049 MB |
(1 row)
emolecules=# \di+ mfp2idx;
                         List of relations
 Schema |  Name   | Type  |  Owner   | Table |  Size  | Description
--------+---------+-------+----------+-------+--------+-------------
 public | mfp2idx | index | glandrum | fps   | 970 MB |
(1 row)


>
> Take so long... these are orders of magnitude larger than timings
> reported in http://code.google.com/p/rdkit/wiki/DatabaseCreation2
> And my database in "only" roughly 50% larger (8M instead of the puny
> 5M in emolecules).
>
> When I do an "explain" on these queries (to make sure the indices are
> being used), I get:
>
> db=# explain select count(*) from ligand where rdkitmol@>'c1cccc2c1nncc2' ;
>                                          QUERY PLAN
> ----------------------------------------------------------------------------------------------
>  Aggregate  (cost=34850.44..34850.45 rows=1 width=0)
>   ->  Bitmap Heap Scan on ligand  (cost=2667.36..34829.36 rows=8433 width=0)
>         Recheck Cond: (rdkitmol @> 'c1cc2c(nncc2)cc1'::mol)
>         ->  Bitmap Index Scan on idx_ligand_rdkitmol
> (cost=0.00..2665.25 rows=8433 width=0)
>               Index Cond: (rdkitmol @> 'c1cc2c(nncc2)cc1'::mol)
> (5 rows)
>
> db=# explain select count(*) from ligand where
> morganbv%morganbv_fp('c1cccc2c1nncc2',2);
>                                                     QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=33290.88..33290.89 rows=1 width=0)
>   ->  Bitmap Heap Scan on ligand  (cost=918.05..33269.79 rows=8433 width=0)
>         Recheck Cond: (morganbv %
> '\\xe0ffffff00040000130000007e00108444d20e3c40042af90238d0080a0c3462c2'::bfp)
>         ->  Bitmap Index Scan on idx_ligand_morganbv
> (cost=0.00..915.94 rows=8433 width=0)
>               Index Cond: (morganbv %
> '\\xe0ffffff00040000130000007e00108444d20e3c40042af90238d0080a0c3462c2'::bfp)
> (5 rows)
>
> Looks good no?
> Am I missing something?  Or is this the fastest my search can go at?
> Supposedly the fingerprints search is just doing some ~8M binary
> operations no?  Why does this take so long?
> Ideas, anyone?

This all looks fine.

The experiments to try are:
1) do a second query and see if that's faster
2) try a smaller table (start with 1 or 2 million) and see how that performs

Best Regards,
-greg

------------------------------------------------------------------------------
WhatsUp Gold - Download Free Network Management Software
The most intuitive, comprehensive, and cost-effective network 
management toolset available today.  Delivers lowest initial 
acquisition cost and overall TCO of any competing solution.
http://p.sf.net/sfu/whatsupgold-sd
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] Speeding up database queries...

Reply via email to