Hi Jochen,

On Mar 3, 2012, at 2:59 AM, Jochen Schreiber wrote:
> My problem now is that babel only one file expected and i have the complete 
> pubchem compound sdf files with over 2000 sdf files.
> 
> If i concat all sdf files to one file it has over 187 GB and the index will 
> be over 2GB.

You might be interested in chemfp ( http://code.google.com/p/chem-fingerprints/ ).

I've been doing some work recently to see if it will handle all of PubChem.

It doesn't. The problem is that it also has a 2GB limit. On the mailing list I 
posted one workaround. I've put a copy at the end of this email.

With that workaround I'm able to do k-nearest Tanimoto searches with subsecond 
performance. chemfp uses a number of optimizations that aren't in the 
OpenBabel search code.
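
For reference, the Tanimoto score between two bit fingerprints is the number of bits set in both divided by the number of bits set in either. Here is a minimal sketch that stores each fingerprint as a plain Python integer. The function name tanimoto is only for illustration and is not part of chemfp, and returning 0.0 for two all-zero fingerprints is an assumption made here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints stored as integers."""
    # Bits set in either fingerprint
    union = bin(fp_a | fp_b).count("1")
    if union == 0:
        # Assumption: two empty fingerprints score 0.0
        return 0.0
    # Bits set in both fingerprints
    common = bin(fp_a & fp_b).count("1")
    return common / float(union)

# 0b1101 and 0b1011 share 2 bits, and 4 bits are set in total
score = tanimoto(0b1101, 0b1011)
```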

                                Andrew
                                da...@dalkescientific.com



import chemfp
from chemfp import fps_search

class MultiArena(object):
   def __init__(self, arenas):
       self.arenas = arenas
       self._size = sum(map(len, arenas))

   def __len__(self):
       return self._size

   def __iter__(self):
       for arena in self.arenas:
           for x in arena:
               yield x

   def __getitem__(self, i):
       assert i >= 0, i
       for arena in self.arenas:
           if i >= len(arena):
               i -= len(arena)
           else:
               return arena[i]
       raise IndexError(i)

   def count_tanimoto_hits_fp(self, fp, threshold=0.7):
       return sum(arena.count_tanimoto_hits_fp(fp, threshold)
                  for arena in self.arenas)

   def knearest_tanimoto_search_fp(self, fp, k=3, threshold=0.7):
       # The only way to merge the k-nearest values is to extract
       # everything and turn them into FPSSearchResults
       search_results = []
       for arena in self.arenas:
           arena_results = arena.knearest_tanimoto_search_fp(fp, k, threshold)
           search_results.extend(arena_results.get_ids_and_scores())
       # Always sort so the merged hits are best-first, then keep the top k
       search_results.sort(key=lambda x: x[1], reverse=True)
       search_results = search_results[:k]
       # zip(*[]) can't be unpacked into two names, so handle "no hits"
       if search_results:
           ids, scores = zip(*search_results)
       else:
           ids, scores = (), ()
       return fps_search.FPSSearchResult(ids, scores)

   def threshold_tanimoto_search_fp(self, fp, threshold=0.7):
       search_results = []
       for arena in self.arenas:
           arena_results = arena.threshold_tanimoto_search_fp(fp, threshold)
           search_results.extend(arena_results.get_ids_and_scores())
       # zip(*[]) can't be unpacked into two names, so handle "no hits"
       if search_results:
           ids, scores = zip(*search_results)
       else:
           ids, scores = (), ()
       return fps_search.FPSSearchResult(ids, scores)

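The merge step in knearest_tanimoto_search_fp doesn't depend on chemfp. Any hit that belongs in the overall top k must also be in the top k of its own arena, so it is enough to pool the per-arena hits and keep the k best. Here is a minimal standalone sketch, assuming each arena's hits are a list of (id, score) pairs. The name merge_knearest and the sample ids and scores are made up for illustration.

```python
import heapq
from itertools import chain

def merge_knearest(per_arena_hits, k):
    """Merge per-arena (id, score) lists into the k best hits overall."""
    # Pool the hits from every arena without copying them first
    all_hits = chain.from_iterable(per_arena_hits)
    # nlargest returns at most k items, sorted from highest score down
    return heapq.nlargest(k, all_hits, key=lambda pair: pair[1])

# Hypothetical hits from two arenas
evens_hits = [("CID2", 0.91), ("CID4", 0.75)]
odds_hits = [("CID1", 0.88), ("CID3", 0.95)]
top3 = merge_knearest([evens_hits, odds_hits], 3)
```

With a large k, sorting the pooled list as the class above does is just as reasonable; heapq.nlargest mainly helps when k is small compared to the number of pooled hits.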

With that in place I can do

import itertools
import random

# 32 245 089 records
fps = chemfp.open("tree.fps")
evens = chemfp.load_fingerprints(itertools.islice(fps, 0, None, 2),
                                 fps.metadata)
print "Loaded evens"
fps = chemfp.open("tree.fps")
odds = chemfp.load_fingerprints(itertools.islice(fps, 1, None, 2), fps.metadata)
print "Loaded odds"

arena = MultiArena([evens, odds])

id, fp = random.choice(arena)
hits = arena.knearest_tanimoto_search_fp(fp, k=100, threshold=0.0)

print len(hits), "hits"
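
The even/odd split relies on itertools.islice with a step of 2. Because islice consumes the iterator it is given, the file has to be opened once for each half. Here is a small sketch of the same slicing on a stand-in list of records; the record names are made up, and a real run would pass the chemfp reader instead.

```python
import itertools

# Stand-in for the fingerprint records in the file
records = ["rec%d" % i for i in range(7)]

# Each slice gets its own fresh iterator, like reopening the file
evens = list(itertools.islice(iter(records), 0, None, 2))
odds = list(itertools.islice(iter(records), 1, None, 2))
```

Every record lands in exactly one of the two halves, and the halves differ in size by at most one.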


_______________________________________________
OpenBabel-discuss mailing list
OpenBabel-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss
