On Mon, 15 Oct 2007 10:47:16 +0000, patrick.waldo wrote:

> my sample input file looks like this (not organized, as you see it):
> 200-720-7 69-93-2
> kyselina mocová C5H4N4O3
>
> 200-001-8 50-00-0
> formaldehyd CH2O
>
> 200-002-3
> 50-01-1
> guanidínium-chlorid CH5N3.ClH
>
> etc...

That's quite irregular, so it is not that straightforward.  One way is to
split everything into words, start a record by taking the first two
elements, and then look for the start of the next record: a token that
looks like three numbers joined by '-' characters.  Of the words in
between, the last one is the formula and the rest make up the name.

Quick and dirty hack:

    import codecs
    import re

    # A record number: three groups of digits joined by '-'.
    NR_RE = re.compile(r'^\d+-\d+-\d+$')


    def iter_elements(tokens):
        # Assumes well-formed input: every record has two numbers
        # followed by at least one more word.
        tokens = iter(tokens)
        try:
            nr_a = next(tokens)
            while True:
                # The second number is taken unconditionally; it matches
                # NR_RE too and must not be mistaken for a new record.
                nr_b = next(tokens)
                items = list()
                for item in tokens:
                    if NR_RE.match(item):
                        # Next record starts here: emit the finished one.
                        yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])
                        nr_a = item
                        break
                    else:
                        items.append(item)
        except StopIteration:
            # Tokens ran out: emit the last record.
            yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])


    def main():
        in_file = codecs.open('test.txt', 'r', 'utf-8')
        tokens = in_file.read().split()
        in_file.close()
        for element in iter_elements(tokens):
            print('|'.join(element))


    if __name__ == '__main__':
        main()

Ciao,
	Marc 'BlackJack' Rintsch
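
To see what the token-splitting approach described above does with the
sample from the question, here is a minimal sketch that can be run without
an input file.  It is not the code from the post: it is written for
Python 3, uses an index-based loop instead of a generator, and the names
`SAMPLE` and `parse_records` are made up for illustration.  It assumes
every record has two numbers followed by at least a name and a formula,
and it silently skips anything that does not fit.

```python
import re

# Three groups of digits joined by '-', e.g. '200-720-7' or '69-93-2'.
NR_RE = re.compile(r'^\d+-\d+-\d+$')

# Inline copy of the sample from the question (hypothetical stand-in
# for the contents of the input file).
SAMPLE = """200-720-7 69-93-2
kyselina mocová C5H4N4O3

200-001-8 50-00-0
formaldehyd CH2O

200-002-3
50-01-1
guanidínium-chlorid CH5N3.ClH
"""


def parse_records(text):
    """Return a list of (nr_a, nr_b, name, formula) tuples."""
    tokens = text.split()
    records = []
    i = 0
    while i + 1 < len(tokens):
        # A record starts with two numbers, taken unconditionally.
        nr_a, nr_b = tokens[i], tokens[i + 1]
        i += 2
        # Collect words until the next number-like token or the end.
        items = []
        while i < len(tokens) and not NR_RE.match(tokens[i]):
            items.append(tokens[i])
            i += 1
        if items:
            # Last word is the formula, the rest is the name.
            records.append((nr_a, nr_b, ' '.join(items[:-1]), items[-1]))
    return records


if __name__ == '__main__':
    for record in parse_records(SAMPLE):
        print('|'.join(record))
```

For the three sample records this prints:

    200-720-7|69-93-2|kyselina mocová|C5H4N4O3
    200-001-8|50-00-0|formaldehyd|CH2O
    200-002-3|50-01-1|guanidínium-chlorid|CH5N3.ClH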