Very, very interesting. Thank you very much indeed for all the information.

Best regards.
> Hello,
>
> I'd say that the reason for choosing this storage method was a technical
> decision. Since an unfolded FP2 is 1024 bits long (1021 actually used), it
> doesn't fit into the largest integer datatype of MySQL, BIGINT UNSIGNED,
> which holds only 64 bits (values up to 2^64 - 1). You could store it in a
> BLOB, but then you have to deal with BLOB input/output and cannot use the
> database's own bit operators; you have to develop your own, as Mychem does.
>
> So they decided to split the fingerprint into chunks that can be stored in
> MySQL native datatypes, which gives 32 unsigned 32-bit integers (why they
> store them in BIGINT columns and waste storage space escapes me, maybe
> because MySQL's bit operators always use BIGINT internally). As you can
> see on page 2945, this has one advantage: columns that are 0 (i.e. have no
> bits set) can be completely omitted from the query. But the resulting SQL
> is somewhat ugly, with different columns of the mol_fp table used depending
> on the query molecule. I guess that's why they have written the
> gen_search_sql() function, which generates the correct query SQL string
> for you.
>
> Still, this is a linear scan on the mol_fp table, but one second for
> searching benzene (constrained to 900 results) on 10^6 molecules is ok.
> Keeping the query as close to the database as possible is a viable design
> decision, but without knowing the plans the optimizer makes out of those
> queries, and the typical queries this system should handle, nobody can
> say whether that strategy is 'best'. Or how this system will scale under
> multiuser load, btw. All I can say is that the SQL you have to use is ok
> for use in a program, but not if you want to search with manually typed
> SQL. ;->
>
> Maybe Jerome can comment more on this.
>
> Also, as seen on page 2946, the matching itself is done outside MySQL with
> pybel. So this is not a fully integrated system; it's a me-too of MolDB4,
> done a bit differently.
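For the archives, here is a minimal Python sketch of the chunking idea as I
understand it from the description above. It is not the paper's code: the
table name mol_fp is taken from the quoted text, but the column names
fp0 ... fp31, the id column mol_id and both function names are my own
placeholders, and this is not their gen_search_sql(). The %s placeholders
assume the parameter style used by the usual Python MySQL drivers.

```python
CHUNK_BITS = 32   # width of one stored chunk
N_CHUNKS = 32     # 32 x 32 bits = 1024-bit fingerprint


def split_fingerprint(fp, chunk_bits=CHUNK_BITS, n_chunks=N_CHUNKS):
    """Split a fingerprint (held as a Python int) into fixed-width chunks.

    Chunk 0 holds the least significant bits.
    """
    if fp < 0 or fp >> (chunk_bits * n_chunks):
        raise ValueError("fingerprint does not fit into the chunk layout")
    mask = (1 << chunk_bits) - 1
    return [(fp >> (i * chunk_bits)) & mask for i in range(n_chunks)]


def build_screen_sql(chunks, table="mol_fp", id_column="mol_id",
                     column_prefix="fp"):
    """Build a substructure pre-screen query from the query's chunks.

    A row can only match if every bit set in the query is also set in the
    row, i.e. (column & chunk) = chunk. Chunks that are 0 impose no
    constraint, so their columns are left out of the WHERE clause.
    All table and column names here are hypothetical.
    """
    conditions = []
    params = []
    for i, chunk in enumerate(chunks):
        if chunk == 0:
            continue  # no bits set: nothing to test in this column
        conditions.append(f"({column_prefix}{i} & %s) = %s")
        params.extend([chunk, chunk])
    where = " AND ".join(conditions) if conditions else "1 = 1"
    return f"SELECT {id_column} FROM {table} WHERE {where}", params


if __name__ == "__main__":
    # Toy fingerprint with three bits set, spread over three chunks.
    query_fp = (1 << 0) | (1 << 33) | (1 << 1020)
    sql, params = build_screen_sql(split_fingerprint(query_fp))
    print(sql)
    print(params)
```

With a sparse query fingerprint only a handful of the 32 columns end up in
the WHERE clause, which is the advantage mentioned above. It also shows why
the SQL text changes from one query molecule to the next, and why you would
want a generator function rather than typing it by hand. The rows returned
are only candidates; the actual matching still has to be done afterwards.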
>
> I have my doubts about using MD5-hashed canonical SMILES for exact
> searching. It certainly works, but why not use the InChIKey for better
> data interoperability?
>
> And we are slowly drifting from 'openbabel-discuss' towards
> 'how-to-build-a-chemical-database-discuss'. :-)
>
> Best regards,
>
> Ernst-Georg

_______________________________________________
OpenBabel-discuss mailing list
OpenBabel-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss