On 24/06/2010 23:37, Rob Soe wrote:
> Hi all,
> I have two lists of molecules. The first list contains testing molecules
> (test.sdf) and the second contains training molecules (train.sdf).
> I would like to compare each test molecule to all the training molecules
> and calculate a corresponding Tanimoto similarity score. (I implemented
> it in my code and it is super slow as it is O(n^2).) I use the
> GetFingerprint() and Tanimoto() functions for this purpose. After the
> comparison, I pick the k most similar molecules to the test
> molecule and predict something for the test molecule. I am trying to
> make things a bit faster.
> So I tried the following babel command:
>
>   babel  test_mol.smi  train_mols.sdf -ofpt
>
> But the problem is that I have to manually replace the SMILES of the
> testing molecule every time I run the command. In other words, the
> command only compares the first molecule in 'test_mol.smi' to all
> the molecules in 'train_mols.sdf'. So even if I have a list of testing
> molecules in 'test_mol.smi', only the first one in the list is compared
> to the training molecules.
> Is there a way or trick I can use so that I can compare all my testing
> molecules to the training molecules and get a list of Tc scores?
> Thanks so much for your help!

I think you are going to have to use some kind of programming - Python 
scripting or even C++ - to do things that the command line doesn't 
provide. And working with two sets of molecules is something it does 
not currently do.
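
As an illustration of the scripting route, here is a minimal Python sketch, not a tested recipe. The ranking functions are plain Python and work on fingerprints given as sets of on-bit positions. The loader assumes Open Babel's Python bindings (pybel) are installed and that molecule titles are unique and non-empty; the function names are made up for this example.

```python
import heapq


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def nearest_training(test_fps, train_fps, k=10):
    """For each test molecule, return its k most similar training molecules.

    Both arguments map a molecule title to a set of on-bit positions.
    The result maps each test title to a list of (score, training title),
    best match first.
    """
    result = {}
    for test_name, test_bits in test_fps.items():
        scored = ((tanimoto(test_bits, bits), name) for name, bits in train_fps.items())
        result[test_name] = heapq.nlargest(k, scored)
    return result


def load_fingerprints(path, fmt="sdf", fptype="FP2"):
    """Read a molecule file with pybel and return {title: set of on-bits}.

    Assumption: titles are unique; duplicates would overwrite each other.
    """
    from openbabel import pybel  # older releases: `import pybel`
    return {mol.title: set(mol.calcfp(fptype).bits) for mol in pybel.readfile(fmt, path)}
```

Something like nearest_training(load_fingerprints("test.sdf"), load_fingerprints("train.sdf"), k=10) would then give the ten nearest training molecules for every test molecule in one run. It is still O(n^2) comparisons, but each fingerprint is calculated only once.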

If you are interested in only the closest matches to a test molecule, 
the following may be a cleaner way of doing it.
Make a fast search index from the training set:
   babel train.sdf -ofs
For a given test molecule, find, say, the 10 training molecules with 
the largest Tanimoto coefficients to it:
   babel train.fs -S test_mol.sdf -aat10 result.smi
The second 'a' adds the Tanimoto coefficient to each result molecule's 
title.

You will still have to swap in each test molecule, or its file name, by 
hand. If there are not too many of them, you can split test.sdf:
   babel test.sdf test_mol.sdf -m
This puts each molecule into a separate file (test_mol1.sdf, 
test_mol2.sdf, ...). You could then iterate over those files using 
shell or batch scripting.
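
The iteration could equally be done from Python. The sketch below only wraps the fast search command given above, once per split file. It assumes that babel is on the PATH and that the split files match test_mol*.sdf; the helper names and the _hits.smi output naming are invented for the example.

```python
import glob
import subprocess


def search_command(index, query, k, output):
    """Build the fast-search command from the message above for one query file."""
    return ["babel", index, "-S", query, f"-aat{k}", output]


def run_all(index, pattern="test_mol*.sdf", k=10, dry_run=False):
    """Run one search per file matching `pattern`; return the commands used.

    With dry_run=True the commands are only built, not executed.
    """
    commands = []
    for query in sorted(glob.glob(pattern)):
        # Hypothetical naming: test_mol1.sdf -> test_mol1_hits.smi
        output = query.rsplit(".", 1)[0] + "_hits.smi"
        cmd = search_command(index, query, k, output)
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands
```

Call run_all() with the name of the .fs index built earlier. Passing dry_run=True first is a cheap way to check the generated commands before anything is executed.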

This isn't any help to you, but it happens that I am currently writing 
code to prepare an N by N matrix of Tanimoto coefficients (on the way 
to selecting a diverse set of molecules). As others have said, in 
making this matrix you can't beat O(N^2), but you can make each 
Tanimoto calculation faster. Normally each Tanimoto requires two bit 
counts (the intersection and the union), but only one (the 
intersection) is needed if the bit count of each fingerprint is known, 
because |A or B| = |A| + |B| - |A and B|. So the per-fingerprint 
counts are pre-calculated, which is O(N). I'm also using a 16 bit 
look-up table for these bit counts, which is faster than the way 
currently in OB.
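
To make the idea concrete, here is a rough Python sketch of the two tricks described: per-fingerprint bit counts calculated once, and a 16 bit look-up table for counting bits. It is not the code mentioned above and not what OB does; it assumes each fingerprint is a sequence of 32 bit words, all of the same length, and the names are invented.

```python
# 16-bit look-up table: POPCOUNT16[v] is the number of set bits in v (0 <= v < 65536).
POPCOUNT16 = [0] * 65536
for v in range(1, 65536):
    POPCOUNT16[v] = POPCOUNT16[v >> 1] + (v & 1)


def popcount(words):
    """Count set bits in a fingerprint stored as a sequence of 32-bit words."""
    return sum(POPCOUNT16[w & 0xFFFF] + POPCOUNT16[(w >> 16) & 0xFFFF] for w in words)


def tanimoto_matrix(fps):
    """N x N Tanimoto matrix using one bit count per pair.

    The counts |A| are computed once per fingerprint (O(N)). Each pair then
    needs only |A and B|, since |A or B| = |A| + |B| - |A and B|.
    """
    counts = [popcount(fp) for fp in fps]
    n = len(fps)
    # The diagonal is set to 1.0 by convention.
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            common = popcount(a & b for a, b in zip(fps[i], fps[j]))
            union = counts[i] + counts[j] - common
            matrix[i][j] = matrix[j][i] = common / union if union else 0.0
    return matrix
```

The matrix is symmetric, so only the upper triangle is calculated and then mirrored. That halves the work but does not change the O(N^2) behaviour.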

Chris

