Hi Jochen,

On Mar 3, 2012, at 2:59 AM, Jochen Schreiber wrote:
> My problem now is that babel only one file expected and i have the complete
> pubchem compound sdf files with over 2000 sdf files.
>
> If i concat all sdf files to one file it has over 187 GB and the index will
> be over 2GB.

You might be interested in chemfp ( http://code.google.com/p/chem-fingerprints/ ).
I've been doing some work recently to see if it will handle all of PubChem.
It doesn't. The problem is that it also has a 2GB limit.

On the mailing list I posted one workaround. I've put a copy at the end
of this email. With it I'm able to do k-nearest Tanimoto searches with
subsecond performance. chemfp uses a number of optimizations that aren't
in the OpenBabel search code.

				Andrew
				da...@dalkescientific.com


import chemfp
from chemfp import fps_search

class MultiArena(object):
    # Presents several arenas as one, so each arena can stay under the 2GB limit
    def __init__(self, arenas):
        self.arenas = arenas
        self._size = sum(map(len, arenas))

    def __len__(self):
        return self._size

    def __iter__(self):
        for arena in self.arenas:
            for x in arena:
                yield x

    def __getitem__(self, i):
        assert i >= 0, i
        # Walk the arenas, shifting the index past each one that is too short
        for arena in self.arenas:
            if i >= len(arena):
                i -= len(arena)
            else:
                return arena[i]
        raise IndexError(i)

    def count_tanimoto_hits_fp(self, fp, threshold=0.7):
        return sum(arena.count_tanimoto_hits_fp(fp, threshold)
                   for arena in self.arenas)

    def knearest_tanimoto_search_fp(self, fp, k=3, threshold=0.7):
        # The only way to merge the k-nearest values is to extract
        # everything and turn them into FPSSearchResults
        search_results = []
        for arena in self.arenas:
            arena_results = arena.knearest_tanimoto_search_fp(fp, k, threshold)
            search_results.extend(arena_results.get_ids_and_scores())
            # Always re-sort so the merged hits stay in decreasing score
            # order, even when there are fewer than k of them
            search_results.sort(key=lambda x: x[1], reverse=True)
            search_results = search_results[:k]
        # zip(*[]) cannot be unpacked into two names, so handle "no hits"
        ids, scores = zip(*search_results) if search_results else ((), ())
        return fps_search.FPSSearchResult(ids, scores)

    def threshold_tanimoto_search_fp(self, fp, threshold=0.7):
        search_results = []
        for arena in self.arenas:
            arena_results = arena.threshold_tanimoto_search_fp(fp, threshold)
            search_results.extend(arena_results.get_ids_and_scores())
        # zip(*[]) cannot be unpacked into two names, so handle "no hits"
        ids, scores = zip(*search_results) if search_results else ((), ())
        return fps_search.FPSSearchResult(ids, scores)


With that in place I can do


import itertools
import random

# 32 245 089 records
# Load every other record into its own arena
fps = chemfp.open("tree.fps")
evens = chemfp.load_fingerprints(itertools.islice(fps, 0, None, 2), fps.metadata)
print("Loaded evens")

fps = chemfp.open("tree.fps")
odds = chemfp.load_fingerprints(itertools.islice(fps, 1, None, 2), fps.metadata)
print("Loaded odds")

arena = MultiArena([evens, odds])

# Pick a random record as the query
id, fp = random.choice(arena)
hits = arena.knearest_tanimoto_search_fp(fp, k=100, threshold=0.0)
print("%d hits" % len(hits))

_______________________________________________
OpenBabel-discuss mailing list
OpenBabel-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss
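
A note on the merge step in knearest_tanimoto_search_fp in the code above: the idea can be tried out without chemfp installed. The following is only a minimal sketch, assuming each arena's hits are already available as a list of (id, score) pairs. The name merge_knearest and the sample ids and scores are made up for illustration and are not part of chemfp.

```python
import heapq


def merge_knearest(per_arena_hits, k, threshold=0.0):
    """Merge per-arena (id, score) hit lists into the overall top k.

    Hypothetical helper for illustration only; not a chemfp function.
    """
    # Flatten all arenas into one stream of candidates at or above threshold
    candidates = (
        (hit_id, score)
        for hits in per_arena_hits
        for hit_id, score in hits
        if score >= threshold
    )
    # heapq.nlargest returns at most k items, highest score first
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


# Made-up sample data standing in for the results of two arenas
evens_hits = [("CID2", 0.91), ("CID4", 0.75), ("CID6", 0.40)]
odds_hits = [("CID1", 0.88), ("CID3", 0.80)]

top3 = merge_knearest([evens_hits, odds_hits], k=3)
print(top3)
```

Since each arena returns at most k hits, the merge only ever looks at k times the number of arenas candidates, so it stays cheap however large the arenas are.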