On Jan 5, 2010, at 11:55 PM, Chris Swain wrote: > Just came across this paper in a journal I don't normally read. > > http://www.almob.org/content/5/1/9
I personally wish for some reference benchmark for Tanimoto search performance. My own tests are that linear searching with simple range exclusion based on bit counts is very fast. Fast enough that I'm limited by disk bandwidth. Their implementation for linear search is QueryResult result = new QueryResult(query, minTanimoto); for(FingerprintWithName fingerprint : fingerprints) result.addPotentialHit(fingerprint); return result; where public void addPotentialHit(FingerprintWithName fingerprint) throws CDKException { calculations++; if(isPotentialHit(fingerprint)) { double tanimoto = query.tanimoto(fingerprint); if(tanimoto >= minTanimoto) result.add(fingerprint); } } and public boolean isPotentialHit(FingerprintWithName fingerprint) { return !(query.isTanimotoDefinetlyLessThan(fingerprint, minTanimoto)); } where query.isTanimotoDefinetlyLessThan for that case returns false and public double tanimoto(FingerprintWithName other) throws CDKException { double t = Tanimoto.calculate(fingerprint, other.getFingerprint()); return t; } and Tanimoto.calculate comes from CDK, which implements it as public static float calculate(BitSet bitset1, BitSet bitset2) throws CDKException { float _bitset1_cardinality = bitset1.cardinality(); float _bitset2_cardinality = bitset2.cardinality(); if (bitset1.size() != bitset2.size()) { throw new CDKException("Bisets must have the same bit length"); } BitSet one_and_two = (BitSet)bitset1.clone(); one_and_two.and(bitset2); float _common_bit_count = one_and_two.cardinality(); return _common_bit_count/(_bitset1_cardinality + _bitset2_cardinality - _common_bit_count); } I wrote all this to show the number of functions calls in the implementation, plus memory allocations and deallocations for the "bitset1.clone()" call (!). Plus it recalculates the bitset1 cardinality for every single call. There should be at least a factor of 3 performance increase in here just be cleaning up the code. On top of that, CDK uses an "open" bitset, meaning that methods are virtual and you're expected to call the API functions to reach the underlying data. This class claims 2-3x better performance. https://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/util/OpenBitSet.java Combine those and the linear search code is likely the fastest for the test set. That's all just handwaving of course. ;) Andrew da...@dalkescientific.com ------------------------------------------------------------------------------ This SF.Net email is sponsored by the Verizon Developer Community Take advantage of Verizon's best-in-class app development support A streamlined, 14 day to market process makes app distribution fast and easy Join now and get one step closer to millions of Verizon customers http://p.sf.net/sfu/verizon-dev2dev _______________________________________________ OpenBabel-discuss mailing list OpenBabel-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/openbabel-discuss