On Nov 14, 2011, at 1:47 PM, Ernst-Georg Schmid wrote:
>> Since an unfolded FP2 is 1024 bits long (1021 
>> actually used) it doesn't fit into the largest integer datatype of 
>> MySQL, UNSIGNED BIGINT which is 2^64. So you either have to store it
>> in a BLOB, but then you have to deal with BLOB input/output and cannot
>> use the database's own bit operators but have to develop your own,
>> like Mychem does.

In my cyclops-mysql package
  http://www.dalkescientific.com/writings/diary/archive/2010/10/03/cyclops_mysql_jquery_and_marvin.html

I store the fingerprints as a hex-encoded string. This obviously
takes up twice as much space as a denser blob encoding, but has
the advantage that you can look at it if needed.

It also means I don't have to worry about database int sizes or
int operations. In fact, since I was using the PubChem fingerprints,
of size 881 bits, I could put the result in a string of size 221.
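
To spell out that size: each hex digit carries 4 bits, so an 881-bit
fingerprint needs 881/4 rounded up, which is 221 digits. A minimal
Python sketch of the arithmetic (the name hex_length is made up for
this illustration and is not part of cyclops-mysql):

```python
def hex_length(num_bits):
    # Each hex digit encodes 4 bits; round up so a partial
    # final group of bits still gets its own digit.
    return (num_bits + 3) // 4

print(hex_length(881))   # 221, the PubChem fingerprint case
print(hex_length(1024))  # 256, an unfolded 1024-bit fingerprint
```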

I, like Mychem, wrote my own popcount and Tanimoto routines for
working with hex encoded fingerprints.
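
For readers who have not seen it done, here is roughly what popcount
and Tanimoto look like when the fingerprint is a hex string. This is
only a pure-Python sketch of the idea, not the code used inside MySQL
by either cyclops-mysql or Mychem. The function names, and the choice
to return 0.0 when both fingerprints are empty, are assumptions made
for the example.

```python
# Number of set bits for each hex digit, '0' through 'f'.
_NIBBLE_POPCOUNT = {format(i, "x"): bin(i).count("1") for i in range(16)}

def hex_popcount(fp):
    # Sum the per-digit bit counts; accepts upper- or lowercase hex.
    return sum(_NIBBLE_POPCOUNT[c] for c in fp.lower())

def hex_tanimoto(fp1, fp2):
    # Tanimoto = popcount(A & B) / popcount(A | B),
    # defined here only for fingerprints of equal length.
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    a = int(fp1, 16)
    b = int(fp2, 16)
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # assumed convention for two all-zero fingerprints
    return bin(a & b).count("1") / union

print(hex_popcount("ff00"))      # 8
print(hex_tanimoto("f0", "30"))  # 0.5
```

The lookup table works one hex digit at a time, which is the same
trick that carries over to a C or SQL implementation where arbitrary
precision integers are not available.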


On Nov 15, 2011, at 1:24 PM, Jérôme Pansanel wrote:
> When using blob, a tanimoto search against 1M compounds takes less than
> 2s with Mychem on a simple desktop.

I reported on my performance numbers in
  http://dalkescientific.com/writings/CUP2009.pdf

On my laptop in 2009 I did 130,000 Tanimoto tests per second.

Looking at the performance numbers on this laptop, with my newest
code base, it's about 275,000 per second for 4096-bit fingerprints.
A 1024-bit fingerprint should be about 4x faster, or roughly 1.1
million per second, so it's about the same performance as what
Jérôme reports.

I bring this up to suggest that hex encoding, while it may seem
slow compared to a byte blob or several integer columns, has
the advantage of being readable and easily importable into
other software. And it's not appreciably slower.

On Nov 14, 2011, at 1:47 PM, Ernst-Georg Schmid wrote:
>> I doubt the use of MD5ed canonical SMILES for exact searching.
>> Certainly this works, but why not use the InChI-Key for better data 
>> interoperability?

 - SMILES can interoperate with many more tools
 - Perhaps the developers have a preferred charge form or tautomer form?
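
For context, the MD5 approach under discussion amounts to hashing the
canonical SMILES string and indexing the fixed-width digest. A minimal
Python sketch follows; the function name smiles_key is hypothetical,
and it assumes the SMILES was already canonicalized, always by the
same toolkit, since two different spellings of the same molecule hash
to different keys.

```python
import hashlib

def smiles_key(canonical_smiles):
    # MD5 of the UTF-8 bytes, returned as a 32-character hex digest
    # that fits a fixed-width, indexed column for exact-match lookup.
    return hashlib.md5(canonical_smiles.encode("utf-8")).hexdigest()

# Ethanol written two ways: without canonicalization the keys differ.
print(smiles_key("CCO") == smiles_key("OCC"))  # False
```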

>> And we are slowly leaving 'openbabel-discuss' towards
>> 'how-to-build-a-chemical-database-discuss'. :-)

I think that would be a good meeting topic someday. :)



                                Andrew
                                da...@dalkescientific.com



_______________________________________________
OpenBabel-discuss mailing list
OpenBabel-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss
