On Sat, Nov 27, 2010 at 9:47 AM, [email protected] <[email protected]> wrote:
> On Sat, Nov 27, 2010 at 6:49 AM, Greg Landrum <[email protected]> wrote:
>> At the moment there isn't a particularly satisfying way of doing an
>> equality search aside from adding a smiles column to the database and
>> just doing a straight equality search on that.
>
> Ok.
>
>> To that end it's probably useful to know that the smiles generated by
>> the cartridge when you convert a molecule to text is canonical.
>
> If I'm not getting fooled, it seems the structure is also stored in
> canonical format; e.g., if I store:
>
> 'COc(cc1)ccc1C#N'
>
> and then run "select * from molecules;", I get back 'COc1ccc(C#N)cc1'

it's not quite that straightforward. The molecules are stored in a
blob column in the standard RDKit binary form (what you get when you
use the .ToBinary() method in Python). The cartridge provides a rule
that can cast from this type to a string; this is done by loading the
binary object and then generating canonical smiles from it.
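To make that cast concrete, here is a rough Python sketch of the same round trip done outside the database. It assumes RDKit is importable from Python and is only meant to mirror the description above, not to show the cartridge's actual code path; the variable names are mine.

```python
from rdkit import Chem

# The non-canonical SMILES from earlier in the thread.
mol = Chem.MolFromSmiles('COc(cc1)ccc1C#N')

# Serialize to RDKit's binary form (the form described above for the blob column).
blob = mol.ToBinary()

# Rebuild a molecule from the bytes, then generate canonical SMILES from it.
restored = Chem.Mol(blob)
canonical = Chem.MolToSmiles(restored)

# Canonicalizing the other spelling quoted in the thread should give the same string.
other = Chem.MolToSmiles(Chem.MolFromSmiles('COc1ccc(C#N)cc1'))
print(canonical == other)
```

The point is that the text you see is regenerated from the binary each time, which matters for the timings further down.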

> If this is correct I should be able to search with the "=" operator
> directly, provided I prepare the query smiles with Chem.CanonSmiles,
> right?

You actually don't even have to do that, simply doing
'COc(cc1)ccc1C#N'::mol::text will give you the canonical smiles.

> That would avoid adding a specific smiles column.

yeah, but it is, unfortunately, very expensive since the canonical
smiles will be generated for each database molecule at query time.
Here's an example I just ran querying the chembl example database
(214K rows):

chembl=# select * from mols where m<@'CC(=O)c1ccc2c(c1)C(=O)C(=O)N2C'
and m@>'CC(=O)c1ccc2c(c1)C(=O)C(=O)N2' and
m::text='CC(=O)c1ccc2c(c1)C(=O)C(=O)N2C'::mol::text;
 regno  |               m
--------+--------------------------------
 246028 | CC(=O)c1ccc2c(c1)C(=O)C(=O)N2C
(1 row)

Time: 34.449 ms

chembl=# select * from mols where
m::text='CC(=O)c1ccc2c(c1)C(=O)C(=O)N2C'::mol::text;

 regno  |               m
--------+--------------------------------
 246028 | CC(=O)c1ccc2c(c1)C(=O)C(=O)N2C
(1 row)

Time: 175219.805 ms

I think the argument against the second query is clear. :-)
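For contrast, the separate smiles column idea from the start of the thread amounts to an ordinary indexed string lookup. The sketch below is only a stand-in: it uses Python's built-in sqlite3 instead of PostgreSQL with the cartridge, the index name and the second row are hypothetical, and it assumes the stored strings and the query string have already been canonicalized the same way.

```python
import sqlite3

# In-memory stand-in database; NOT PostgreSQL and NOT the RDKit cartridge.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE mols (regno INTEGER PRIMARY KEY, smiles TEXT)')

# Index on the precomputed canonical smiles column (hypothetical name).
conn.execute('CREATE INDEX mols_smiles_idx ON mols (smiles)')

rows = [
    (246028, 'CC(=O)c1ccc2c(c1)C(=O)C(=O)N2C'),  # row from the example above
    (1, 'COc1ccc(C#N)cc1'),                      # hypothetical extra row
]
conn.executemany('INSERT INTO mols VALUES (?, ?)', rows)

# The query smiles is assumed to be canonical already, so plain "=" suffices.
query = 'CC(=O)c1ccc2c(c1)C(=O)C(=O)N2C'
hits = conn.execute(
    'SELECT regno FROM mols WHERE smiles = ?', (query,)
).fetchall()
print(hits)
```

Nothing is canonicalized at query time here, which is the whole difference from the slow query above; the price is keeping the extra column in sync with the molecule column.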

>>
>> Without adding the smiles column, another option that should be
>> correct, though it's somewhat ugly, is:
>> select * from mols where m<@'CC(=O)c1ccc2c(c1)C(=O)C(=O)N2C' and
>> m@>'CC(=O)c1ccc2c(c1)C(=O)C(=O)N2' and
>> m::text='CC(=O)c1ccc2c(c1)C(=O)C(=O)N2C'::mol::text;
>>
>> If the molecule column is indexed, this will use the index so it's
>> actually reasonably efficient. If you don't care about stereochemistry
>> you can leave the last bit (SMILES comparison) out.
>>
>
> Yeah, ugly but I just tried and it actually works.

glad to hear it.

>> Having a less ugly way of doing equality querying would be useful;
>> that would be a good feature request.
>
> Ok, so where should I report it ? ;-)

now that's an easy one:
http://sourceforge.net/tracker/?group_id=160139&atid=814653

-greg

_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
