Dear Greg,

In your note below you talk about saving a molecule in a binary format. Do you mean a fingerprint? In that case you wouldn't be able to perform SMARTS matches, right? At most you could do approximate Tversky similarity calculations, and only if your SMARTS is also a valid SMILES.

Thanks,
Gonzalo

-----Original Message-----
From: Greg Landrum [mailto:[email protected]]
Sent: 24 July 2012 16:56
To: Gonzalo Colmenarejo-Sanchez
Cc: [email protected]
Subject: Re: [Rdkit-discuss] speed of SMARTS matches calculations

On Tue, Jul 24, 2012 at 4:38 PM, Gonzalo Colmenarejo-Sanchez <[email protected]> wrote:
>
> Sorry I can't share the SMILES and SMARTS, they are proprietary.

yeah, I kind of figured that would be the case. :-)

> If you can send me your structures I can test them with my program.

The scripts and data for the benchmarking are all in $RDBASE/Regress

> I double loop in the building of molecules and queries; the actual code is
> this:
>
> for (i = 0; i < numsmi; i++)
> {
>   mol = SmilesToMol(smiles[i].smiles);
>   numsims = 0;
>   fprintf(fpout, "%s,", smiles[i].smiles);
>   fprintf(stdout, "%d\n", i);
>   for (j = 0; j < numsma; j++)
>   {
>     pattern = SmartsToMol(smarts[j].smarts);
>     matchesfound = SubstructMatch(*mol,*pattern,matches, false, false);
>     if (matchesfound == true)
>     {
>       numsims = numsims + 1;
>       if (numsims == 1) fprintf(fpout, "%s\n", smarts[j].smarts);
>       else fprintf(fpout, "%s,%s\n", smiles[i].smiles, smarts[j].smarts);
>     }
>     delete pattern;
>   }
>   if (numsims == 0) fprintf(fpout, "\n");
>   delete mol;
> }
>
>
> The same double loop structure is used in the DL program. I could build the
> molecules and queries at once as you suggest but I'm kind of testing my
> typical situation that involves millions of molecules - not sure if that many
> of molecules can be stored in memory.
>

The above is ok w.r.t. the molecules: each molecule is only constructed once.[1]
Your SMARTS queries are, on the other hand, being constructed over and over again. You would probably see some speedup by building the query molecules outside the molecule loop and just using those inside the loop.

-greg

[1] Note: if you have a set of molecules you process over and over again, there are some time-saving tricks for working with them.
One is to process them once and then save them in binary form, the other is to process them once, output the RDKit canonical SMILES, and then rebuild molecules from that using only partial sanitization.
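
The suggestion above, building the query molecules once outside the molecule loop, can be sketched roughly as follows. This is only an illustration of the loop structure, not RDKit code: `Pattern`, `buildPattern`, `matches`, `matchAll`, and the counter `g_patternsBuilt` are hypothetical stand-ins invented for this sketch. In the program quoted in the thread, their roles would be played by the object returned from `SmartsToMol` and by `SubstructMatch`. The placeholder match is a plain substring search and does no chemistry.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed query; in the thread this role is
// played by the object returned from SmartsToMol.
struct Pattern {
  std::string text;
};

// Counts how often a pattern is built, to make the saving visible.
static std::size_t g_patternsBuilt = 0;

// Placeholder parser (assumption): stores the text and bumps the counter.
Pattern buildPattern(const std::string &smarts) {
  ++g_patternsBuilt;
  return Pattern{smarts};
}

// Placeholder match (assumption): plain substring search, NOT real
// substructure matching; it only keeps the sketch self-contained.
bool matches(const std::string &mol, const Pattern &p) {
  return mol.find(p.text) != std::string::npos;
}

// Build every pattern once, then reuse the parsed patterns for each molecule.
// Returns, for each molecule, the indices of the patterns that matched.
std::vector<std::vector<std::size_t>> matchAll(
    const std::vector<std::string> &smiles,
    const std::vector<std::string> &smarts) {
  std::vector<Pattern> patterns;
  patterns.reserve(smarts.size());
  for (const auto &s : smarts) patterns.push_back(buildPattern(s));

  std::vector<std::vector<std::size_t>> hits(smiles.size());
  for (std::size_t i = 0; i < smiles.size(); ++i) {
    // Molecules are still handled one at a time, so only the (usually much
    // smaller) set of patterns has to stay in memory.
    for (std::size_t j = 0; j < patterns.size(); ++j) {
      if (matches(smiles[i], patterns[j])) hits[i].push_back(j);
    }
  }
  return hits;
}
```

With this structure the number of pattern constructions equals the number of SMARTS, rather than the number of SMARTS times the number of molecules, while the molecules themselves are still processed one by one.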

