You might be better off using a web service like ChemSpider to do the matching for you <http://www.chemspider.com/AboutServices.aspx?>. Identifying synonyms by comparing the name strings is probably optimistic unless the names are exact matches.
Here's some Python code that seems to make it pretty easy: https://github.com/mcs07/ChemSpiPy. Search the names, extract the InChI for the best match, and then match the two databases in R via the InChI. It might require some fixing by hand afterwards.

HTH,

Jason Law

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Zsurzsa Laszlo
Sent: Wednesday, July 03, 2013 7:28 AM
To: r-help@r-project.org
Subject: [R] String based chemical name identification

The problem is the following: I have two big databases. The first one looks like this:

2-Methyl-4-trimethylsilyloxyoct-5-yne
Benzoic acid, methyl ester
Benzoic acid, 2-methyl-, methyl ester
Acetic acid, phenylmethyl ester
2,7-Dimethyl-4-trimethylsilyloxyoct-7-en-5-yne
etc.

The second one looks like this:

Name: D-Tagatose 1,6-bisphosphate
Name: 1-Phosphatidyl-D-myo-inositol;: 1-Phosphatidyl-1D-myo-inositol;: 1-Phosphatidyl-myo-inositol;: Phosphatidyl-1D-myo-inositol;: (3-Phosphatidyl)-1-D-inositol;: 1,2-Diacyl-sn-glycero-3-phosphoinositol;: Phosphatidylinositol
Name: Androstenedione;: Androst-4-ene-3,17-dione;: 4-Androstene-3,17-dione
Name: Spermine;: N,N'-Bis(3-aminopropyl)-1,4-butanediamine
Name: H+;: Hydron
Name: 3-Iodo-L-tyrosine
etc.

Both of them have more than 3000 lines. Matching their names by hand is not an option because I don't know chemistry.

*Possible solution I came up with*: Go through all the names of the first database and try to match each one against the other database. I'm using the *regexec* and *strsplit* functions for the matching. Basically, I split the name into small chunks and try to get some hit in the other database.

I can supply code if needed, but I did not want to spam in the first mail.

Any solution is welcome! It can be in pseudo-code or in any kind of logical argument. It does not matter.

Laszlo-Andras Zsurzsa
Msc.
Informatics, Technical University Munchen

[[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
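
For reference, the exact-match baseline mentioned in the reply could be sketched roughly as follows. This is only an illustration and not code from either poster. It assumes the second database really uses the "Name: a;: b;: c" layout quoted above, and the function names (normalize, parse_synonym_line, build_index, match_names) are made up for this sketch. The lowercase query "androst-4-ene-3,17-dione" is an invented test input, not a line from the first database. Names that do not match exactly after normalization come back as None; those are the ones that would need an identifier such as InChI, or fixing by hand.

```python
import re


def normalize(name):
    """Lowercase and collapse whitespace so trivial spelling differences compare equal."""
    return re.sub(r"\s+", " ", name).strip().lower()


def parse_synonym_line(line):
    """Split one 'Name: a;: b;: c' record into its list of synonyms."""
    body = line.strip()
    if body.startswith("Name:"):
        body = body[len("Name:"):]
    return [s.strip() for s in body.split(";:") if s.strip()]


def build_index(synonym_lines):
    """Map every normalized synonym to the first (primary) name of its record."""
    index = {}
    for line in synonym_lines:
        synonyms = parse_synonym_line(line)
        if not synonyms:
            continue
        primary = synonyms[0]
        for synonym in synonyms:
            # setdefault keeps the first record if a synonym appears twice
            index.setdefault(normalize(synonym), primary)
    return index


def match_names(names, index):
    """Look up each name; None means no exact match after normalization."""
    return {name: index.get(normalize(name)) for name in names}


# A few records in the layout of the second database (taken from the question).
second_db = [
    "Name: Androstenedione;: Androst-4-ene-3,17-dione;: 4-Androstene-3,17-dione",
    "Name: Spermine;: N,N'-Bis(3-aminopropyl)-1,4-butanediamine",
    "Name: H+;: Hydron",
]

# Query names: the first is an invented lowercase variant, the second comes
# from the first database in the question, the third is a synonym of H+.
first_db = ["androst-4-ene-3,17-dione", "Benzoic acid, methyl ester", "Hydron"]

index = build_index(second_db)
matches = match_names(first_db, index)
for name, hit in matches.items():
    print(name, "->", hit)
```

The same dictionary lookup would apply to an identifier-based join: once each name has been resolved to an InChI, the InChI string takes the place of the normalized name as the key.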