This is related to my last post (see: http://groups.google.com/group/comp.lang.python/browse_thread/thread/c333cbbb5d496584/998af2bb2ca10e88#998af2bb2ca10e88)
I have a text file with an EINECS number, a CAS number, a Chemical Name, and a Chemical Formula, always in this order. However, I realized as I ran my script that I had entries like

274-989-4 70892-58-9 diazotovaná kyselina 4- aminobenzénsulfónová, kopulovaná s farbiarskym moruovým (Chlorophora tinctoria) extraktom, komplexy so elezom komplexy eleza s produktami kopulácie diazotovanej kyseliny 4- aminobenzénsulfónovej s látkou registrovanou v Indexe farieb pod identifika ným íslom Indexu farieb, C.I. 75240.

which become

274-989-4|70892-58-9|diazotovaná kyselina 4- aminobenzénsulfónová, kopulovaná s farbiarskym moruovým (Chlorophora tinctoria) extraktom, komplexy so elezom komplexy eleza s produktami kopulácie diazotovanej kyseliny 4- aminobenzénsulfónovej s látkou registrovanou v Indexe farieb pod identifika ným íslom Indexu farieb, C.I.|75240.

The "C.I. 75240" is not a chemical formula; this entry simply does not have one. So I want to add a regular expression, used in an if statement, so that when an entry has no chemical formula the script moves on instead of treating the last token as the formula. However, I must be getting confused by the regular expression tutorials I've been reading. Any ideas?

Original Code:

#For text files in a directory...
#Analyzes a randomly organized UTF8 document with EINECS, CAS, Chemical, and Chemical Formula
#into a document structured as EINECS|CAS|Chemical|Chemical Formula.

import os
import codecs
import re

path = "C:\\text_samples\\text"            #folder with all text files
path2 = "C:\\text_samples\\text\\output"   #output of all text files

NR_RE = re.compile(r'^\d+-\d+-\d+$')  #pattern for EINECS number

def iter_elements(tokens):
    product = []
    for tok in tokens:
        #a new number after a complete entry starts the next entry
        if NR_RE.match(tok) and len(product) >= 4:
            product[2:-1] = [' '.join(product[2:-1])]
            yield product
            product = []
        product.append(tok)
    #the last entry has no following number, so join its name here
    if len(product) >= 4:
        product[2:-1] = [' '.join(product[2:-1])]
    if product:
        yield product

for text in os.listdir(path):
    input_text = os.path.join(path, text)
    output_text = os.path.join(path2, text)
    if not os.path.isfile(input_text):
        continue  #skip the output folder, which sits inside path
    infile = codecs.open(input_text, 'r', 'utf8')
    outfile = codecs.open(output_text, 'w', 'utf8')
    tokens = infile.read().split()
    for element in iter_elements(tokens):
        #print '|'.join(element)
        outfile.write('|'.join(element))
        outfile.write("\r\n")
    infile.close()
    outfile.close()

On Oct 23, 5:03 pm, Paul McGuire <[EMAIL PROTECTED]> wrote:
> On Oct 22, 5:29 pm, [EMAIL PROTECTED] wrote:
>
> > Hi,
>
> > I'm trying to learn regular expressions, but I am having trouble with
> > this. I want to search a document that has mixed data; however, the
> > last line of every entry has something like C5H4N4O3 or CH5N3.ClH.
> > All of the letters are upper case and there will always be numbers and
> > possibly one .
>
> > However below only gave me none.
>
> > import os, codecs, re
>
> > text = 'C:\\text_samples\\sample.txt'
> > text = codecs.open(text,'r','utf-8')
>
> > test = re.compile('\u+\d+\.')
>
> > for line in text:
> >     print test.search(line)
>
> If those are chemical symbols, then I guarantee that there will be
> lower case letters in the expression (like the "l" in "ClH").
>
> -- Paul
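
For what it's worth, here is a rough sketch of the kind of check described above, not a definitive answer. It assumes a formula is made of element symbols (an upper-case letter, an optional lower-case letter, an optional count), with "." allowed between parts as in CH5N3.ClH. The names FORMULA_RE, looks_like_formula and build_record are made up for illustration and are not part of the original script. If the last token of an entry does not look like a formula, it stays in the chemical name and the formula field is left empty.

```python
import re

# Assumption: element symbols with optional counts, "." allowed between parts.
# This is only a heuristic; a short capitalized word such as "In" would also pass.
FORMULA_RE = re.compile(r'^(?:[A-Z][a-z]?\d*)+(?:\.(?:[A-Z][a-z]?\d*)+)*$')

def looks_like_formula(token):
    """Return True if the token has the shape of a chemical formula."""
    return FORMULA_RE.match(token) is not None

def build_record(tokens):
    """tokens: EINECS, CAS, the words of the name, then an optional formula."""
    einecs, cas, rest = tokens[0], tokens[1], tokens[2:]
    if rest and looks_like_formula(rest[-1]):
        name, formula = rest[:-1], rest[-1]
    else:
        # No formula found: keep every remaining token in the name.
        name, formula = rest, ''
    return [einecs, cas, ' '.join(name), formula]
```

In iter_elements, something like build_record(product) could stand in for the product[2:-1] slice assignment, so an entry ending in "C.I. 75240." would be written with an empty last field.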