On Jan 5, 2010, at 11:55 PM, Chris Swain wrote:
> Just came across this paper in a journal I don't normally read.
> 
> http://www.almob.org/content/5/1/9

I personally wish for some reference benchmark for Tanimoto
search performance. My own tests are that linear searching
with simple range exclusion based on bit counts is very fast.
Fast enough that I'm limited by disk bandwidth.

Their implementation for linear search is

        QueryResult result = new QueryResult(query, minTanimoto);
        for(FingerprintWithName fingerprint : fingerprints)
            result.addPotentialHit(fingerprint);
        return result;

where

    public void addPotentialHit(FingerprintWithName fingerprint) throws 
CDKException
    {
        calculations++;

        if(isPotentialHit(fingerprint))
        {
                double tanimoto = query.tanimoto(fingerprint);
                if(tanimoto >= minTanimoto)
                        result.add(fingerprint);
        }
    }

and

    public boolean isPotentialHit(FingerprintWithName fingerprint)
    {
        return !(query.isTanimotoDefinetlyLessThan(fingerprint, minTanimoto));
    }

where query.isTanimotoDefinetlyLessThan for that case returns false

and 

        public double tanimoto(FingerprintWithName other) throws CDKException
        {
                double t = Tanimoto.calculate(fingerprint, 
other.getFingerprint());
                return t;
        }

and Tanimoto.calculate comes from CDK, which implements it as

    public static float calculate(BitSet bitset1, BitSet bitset2) throws 
CDKException
    {
        float _bitset1_cardinality = bitset1.cardinality();
        float _bitset2_cardinality = bitset2.cardinality();
        if (bitset1.size() != bitset2.size()) {
            throw new CDKException("Bisets must have the same bit length");
        }
        BitSet one_and_two = (BitSet)bitset1.clone();
        one_and_two.and(bitset2);
        float _common_bit_count = one_and_two.cardinality();
        return _common_bit_count/(_bitset1_cardinality + _bitset2_cardinality - 
_common_bit_count);
    }


I wrote all this to show the number of functions calls in
the implementation, plus memory allocations and deallocations
for the "bitset1.clone()" call (!).

Plus it recalculates the bitset1 cardinality for every
single call.

There should be at least a factor of 3 performance increase
in here just be cleaning up the code.

On top of that, CDK uses an "open" bitset, meaning that methods
are virtual and you're expected to call the API functions to
reach the underlying data. This class claims 2-3x better performance.

https://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/util/OpenBitSet.java

Combine those and the linear search code is likely the fastest
for the test set.

That's all just handwaving of course. ;)


                                Andrew
                                da...@dalkescientific.com



------------------------------------------------------------------------------
This SF.Net email is sponsored by the Verizon Developer Community
Take advantage of Verizon's best-in-class app development support
A streamlined, 14 day to market process makes app distribution fast and easy
Join now and get one step closer to millions of Verizon customers
http://p.sf.net/sfu/verizon-dev2dev 
_______________________________________________
OpenBabel-discuss mailing list
OpenBabel-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss

Reply via email to