On Sep 17, 12:32 pm, "Jadhav, Alok" <alok.jad...@credit-suisse.com> wrote:
> - As you have seen, the line separator is not '\n' but its '|\n'.
> Sometimes the data itself has '\n' characters in the middle of the line
> and only way to find true end of the line is that previous character
> should be a bar '|'. I was not able specify end of line using
> readlines() function, but I could do it using split() function.
> (One hack would be to readlines and combine them until I find '|\n'. is
> there a cleaner way to do this?)

You can use a generator to take care of your readlines requirements:

    def readlines(f):
        # Collect physical lines until one ends with the real
        # terminator '|\n', then yield them joined as one "line".
        lines = []
        while True:
            line = f.readline()
            if not line:
                break
            lines.append(line)
            if line.endswith('|\n'):
                yield ''.join(lines)
                lines = []
        # Don't silently drop trailing data that lacks a final '|\n'.
        if lines:
            yield ''.join(lines)

> - Reading whole file at once and processing line by line was must
> faster. Though speed is not of very important issue here but I think the
> tie it took to parse complete file was reduced to one third of original
> time.

The readlines generator above reads from the file until it has a
complete "line" by your definition, then yields it. If you don't need
the entire file in memory for the end result, you can process each
"line" one at a time and do whatever you need with it before asking
for the next.

    with open('infile.txt', 'r') as infile:
        for line in readlines(infile):
            ...

Generators are a very efficient way of processing large amounts of
data. You can chain them together very easily:

    real_lines = readlines(infile)
    marker_lines = (l for l in real_lines if l.startswith('#'))
    every_second_marker = (l for i, l in enumerate(marker_lines)
                           if (i + 1) % 2 == 0)
    for l in every_second_marker:
        some_function(l)

The real_lines generator returns your definition of a line. The
marker_lines generator filters out everything that doesn't start with
#, while every_second_marker keeps only every second one of those.
(Yes, these could all be written as a single generator, but this is
very useful for more complex pipelines.)

The big advantage of this approach is that nothing is read from the
file until the final loop starts pulling items through the chain, and
because of the way the generators are chained together, only one of
your "lines" needs to be in memory at any given time.
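
If you want to try the chaining idea without touching your real file, here is a small self-contained sketch. The function name read_records, its terminator parameter, and the sample data are all made up for illustration; io.StringIO stands in for the open file. It is one way to arrange things under those assumptions, not the only one.

```python
import io


def read_records(f, terminator='|\n'):
    """Yield logical records from f, each ending in `terminator`.

    Hypothetical helper for illustration: physical lines are joined
    until one ends with the terminator.
    """
    parts = []
    for line in f:
        parts.append(line)
        if line.endswith(terminator):
            yield ''.join(parts)
            parts = []
    # Hand back trailing data that has no final terminator.
    if parts:
        yield ''.join(parts)


# Invented sample data: the third record contains an embedded '\n'.
sample = io.StringIO(
    "#a|1|\n"
    "plain|2|\n"
    "#b with\nembedded newline|3|\n"
    "#c|4|\n"
    "#d|5|\n"
)

real_lines = read_records(sample)
marker_lines = (l for l in real_lines if l.startswith('#'))
every_second_marker = (l for i, l in enumerate(marker_lines)
                       if (i + 1) % 2 == 0)

# Building the chain reads nothing; the stream is still at offset 0.
position_before = sample.tell()

# Consuming the last generator pulls records through one at a time.
result = list(every_second_marker)
```

Here result ends up holding the second and fourth records that start with '#', and position_before is still 0 because no generator had been asked for anything at that point.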