On Oct 15, 12:20 pm, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
> On Mon, 15 Oct 2007 10:47:16 +0000, patrick.waldo wrote:
> > my sample input file looks like this (not organized, as you see it):
> > 200-720-7 69-93-2
> > kyselina mocová C5H4N4O3
> >
> > 200-001-8 50-00-0
> > formaldehyd CH2O
> >
> > 200-002-3
> > 50-01-1
> > guanidínium-chlorid CH5N3.ClH
> >
> > etc...
>
> That's quite irregular so it is not that straightforward.  One way is to
> split everything into words, start a record by taking the first two
> elements and then look for the start of the next record that looks like
> three numbers concatenated by '-' characters.  Quick and dirty hack:
>
> import codecs
> import re
>
> NR_RE = re.compile(r'^\d+-\d+-\d+$')
>
> def iter_elements(tokens):
>     tokens = iter(tokens)
>     try:
>         nr_a = tokens.next()
>         while True:
>             nr_b = tokens.next()
>             items = list()
>             for item in tokens:
>                 if NR_RE.match(item):
>                     yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])
>                     nr_a = item
>                     break
>                 else:
>                     items.append(item)
>     except StopIteration:
>         yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])

Maybe this is a bit more readable?

def iter_elements(tokens):
    chem = []
    for tok in tokens:
        if NR_RE.match(tok) and len(chem) >= 4:
            chem[2:-1] = [' '.join(chem[2:-1])]
            yield chem
            chem = []
        chem.append(tok)
    yield chem

-- 
Paul Hankin
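
One thing to watch with the shorter generator: the final `yield chem` runs without the join step, so a last record with a multi-word name comes out with the name still split, and empty input yields an empty list. Below is a rough sketch of the same idea with those two cases handled, written for Python 3. The `finish` helper and the `SAMPLE` string are my own additions for illustration, and it assumes every record has two numbers, a name and a formula, in that order.

```python
import re

# Same pattern as in the quoted post: three numbers joined by '-'.
NR_RE = re.compile(r'^\d+-\d+-\d+$')


def finish(chem):
    """Join the name words so a record is [nr_a, nr_b, name, formula]."""
    if len(chem) < 4:
        return chem  # incomplete record: passed through unchanged
    return chem[:2] + [' '.join(chem[2:-1]), chem[-1]]


def iter_elements(tokens):
    chem = []
    for tok in tokens:
        # A number-like token starts a new record only once the current
        # one already holds two numbers, a name and a formula.
        if NR_RE.match(tok) and len(chem) >= 4:
            yield finish(chem)
            chem = []
        chem.append(tok)
    if chem:  # nothing to yield for empty input
        yield finish(chem)


# Sample data taken from the original question.
SAMPLE = """\
200-720-7 69-93-2
kyselina mocová C5H4N4O3

200-001-8 50-00-0
formaldehyd CH2O

200-002-3
50-01-1
guanidínium-chlorid CH5N3.ClH
"""

records = list(iter_elements(SAMPLE.split()))
for record in records:
    print(record)
```

Splitting on whitespace first is what makes the irregular line breaks in the input irrelevant; only the token order matters.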