Hi, > Hello, > > I'd say that the reason for choosing this storage method was a > technical decision. Since an unfolded FP2 is 1024 bits long (1021 > actually used) it doesn't fit into the largest integer datatype of > MySQL, UNSIGNED BIGINT which is 2^64. So you either have to store it > in a BLOB, but then you have to deal with BLOB input/output and cannot > use the database's own bit operators but have to develop your own, > like Mychem does.
In fact, the data stored in the database is longer than 1024, as the length of the binary string is stored as an unsigned long int (64 bits) before the fingerprint. So, a fp2 object is 136 bytes long. The binary string is stored in a blob, but can also be stored in a varbinary (max 255 bytes) When using blob, a tanimoto search against 1M compounds takes less than 2s with Mychem on a simple desktop. When using varbinary, the query takes around 1.6s. Andrew Dalke sent me an email describing faster algorithms for computing Tanimoto scores. Some of them are 20 times faster. I will implement one of them in the next release of Mychem. Having only one field for such a small object (136 bytes) is certainly the simplest and the most optimizable way. > So they decided to split the fingerprint into chunks that can be > stored in MySQL native datatypes, which gives 32 unsigned 32 bit > integers (why they store them in BIGINT columns and waste storage > space escapes me, maybe because MySQL's bit operators always use > BIGINT internally). As you can see on page 2945, this has one > advantage: columns that are 0 (i.e. have no bits set) can be > completely omitted in the query. But the resulting SQL is somewhat > ugly, with different columns of the mol_fp table used, depending on > the query molecule. I guess that's why they have written the > gen_search_sql() function that generates the correct query SQL string > for you. > > Still, this is a linear scan on the mol_fp table but one second for > searching benzene (constrained to 900 results) on 10^6 molecules is > ok. Keeping the query as close to the database as possible is a > viable design decision, but without knowing the plans the optimizer > makes out of those queries and knowing the typical queries this system > should handle, nobody can say if that strategy is 'best'. Or how this > system will scale under multiuser load btw. All I can say is that the > SQL you have to use is ok for use in a program but not if you want to > search with a manually typed SQL. ;-> > > Maybe Jerome can comment more on this. > > Also, as seen on page 2946, the matching itself is done outside MySQL > with pybel. So this is not a fully integrated system, it's a me-too > of MolDB4, done a bit different. > > I doubt the use of MD5ed canonical SMILES for exact searching. > Certainly this works, but why not use the InChI-Key for better data > interoperability? InChI-Key are exactly done for exact searching. > And we are slowly leaving 'openbabel-discuss' towards > 'how-to-build-a-chemical-database-discuss'. :-) It is also an interesting topics. It was also one of the goal for the Chemisql project. > Best regards, > > Ernst-Georg Best regards, Jerome ------------------------------------------------------------------------------ RSA(R) Conference 2012 Save $700 by Nov 18 Register now http://p.sf.net/sfu/rsa-sfdev2dev1 _______________________________________________ OpenBabel-discuss mailing list OpenBabel-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/openbabel-discuss