Thanks for your valuable inputs. This is very helpful.
-----Original Message-----
From: Python-list [mailto:python-list-bounces+alok.jadhav=credit-suisse....@python.org] On Behalf Of Dave Angel
Sent: Monday, September 17, 2012 6:47 PM
To: alex23
Cc: python-list@python.org
Subject: Re: Python garbage collector/memory manager behaving strangely

On 09/16/2012 11:25 PM, alex23 wrote:
> On Sep 17, 12:32 pm, "Jadhav, Alok" <alok.jad...@credit-suisse.com> wrote:
>> - As you have seen, the line separator is not '\n' but '|\n'.
>> Sometimes the data itself has '\n' characters in the middle of the
>> line, and the only way to find the true end of a line is that the
>> previous character should be a bar '|'. I was not able to specify the
>> end of line using readlines(), but I could do it using split().
>> (One hack would be to readlines and combine them until I find '|\n'.
>> Is there a cleaner way to do this?)
>
> You can use a generator to take care of your readlines requirements:
>
>     def readlines(f):
>         lines = []
>         while "f is not empty":
>             line = f.readline()
>             if not line: break
>             if len(line) > 2 and line[-2:] == '|\n':
>                 lines.append(line)
>                 yield ''.join(lines)
>                 lines = []
>             else:
>                 lines.append(line)

There are a few changes I'd make:
  - I'd change the name to something else, so as not to shadow the
    built-in, and to make it clear in the caller's code that it's not
    the built-in one.
  - I'd replace that compound if statement with:  if line.endswith("|\n"):
  - I'd add a comment saying that partial lines at the end of the file
    are ignored.

>> - Reading the whole file at once and processing it line by line was
>> much faster. Though speed is not a very important issue here, the
>> time it took to parse the complete file was reduced to one third of
>> the original time.

You don't say what it was faster than. Chances are you went to the
other extreme, of doing a read() of 1 byte at a time. Use Alex's
approach of a generator which in turn uses the file's readline() method.
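For concreteness, here is a sketch of Alex's generator with Dave's three suggestions applied. The name read_records, the io.StringIO sample data, and the for-loop form (equivalent to the readline() loop, since iterating a text file also splits on '\n') are illustrative choices, not part of the original posts:

```python
import io

def read_records(f):
    """Yield logical records terminated by '|\n'.

    A record may contain embedded '\n' characters; it only ends when a
    physical line ends with '|\n'. A partial record at the end of the
    file (one with no '|\n' terminator) is silently ignored.
    """
    lines = []
    for line in f:
        lines.append(line)
        if line.endswith('|\n'):
            yield ''.join(lines)
            lines = []

# The second record spans two physical lines; the trailing 'd' has no
# terminator, so it is dropped.
data = io.StringIO('a|\nb\nc|\nd')
print(list(read_records(data)))  # ['a|\n', 'b\nc|\n']
```

Opening the real file in binary mode and matching b'|\n' would be safer if the data's embedded '\n' bytes might collide with the platform's newline translation.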
> With the readlines generator above, it'll read lines from the file
> until it has a complete "line" by your requirement, at which point
> it'll yield it. If you don't need the entire file in memory for the
> end result, you'll be able to process each "line" one at a time and
> perform whatever you need against it before asking for the next.
>
>     with open(u'infile.txt','r') as infile:
>         for line in readlines(infile):
>             ...
>
> Generators are a very efficient way of processing large amounts of
> data. You can chain them together very easily:
>
>     real_lines = readlines(infile)
>     marker_lines = (l for l in real_lines if l.startswith('#'))
>     every_second_marker = (l for i,l in enumerate(marker_lines)
>                            if (i+1) % 2 == 0)
>     map(some_function, every_second_marker)
>
> The real_lines generator returns your definition of a line. The
> marker_lines generator filters out everything that doesn't start with
> #, while every_second_marker returns only half of those. (Yes, these
> could all be written as a single generator, but this is very useful
> for more complex pipelines.)
>
> The big advantage of this approach is that nothing is read from the
> file into memory until map is called, and given the way they're
> chained together, only one of your lines should be in memory at any
> given time.

-- 
DaveA
-- 
http://mail.python.org/mailman/listinfo/python-list
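The pipeline alex23 describes can be run end to end with an in-memory file. The record generator, the sample data, and driving the pipeline with list() instead of map() are illustrative assumptions; the chaining itself is exactly the structure from the post:

```python
import io

def read_records(f):
    """Yield logical records terminated by '|\n' (per the thread above)."""
    lines = []
    for line in f:
        lines.append(line)
        if line.endswith('|\n'):
            yield ''.join(lines)
            lines = []

infile = io.StringIO('#a|\nx|\n#b|\n#c|\n#d|\n')

# Each stage is lazy: defining these reads nothing from the file yet.
real_lines = read_records(infile)
marker_lines = (l for l in real_lines if l.startswith('#'))
every_second_marker = (l for i, l in enumerate(marker_lines)
                       if (i + 1) % 2 == 0)

# Consuming the final generator pulls records through the whole chain,
# one at a time; only one record is in memory at any given moment.
print(list(every_second_marker))  # ['#b|\n', '#d|\n']
```

One caveat for readers running this today: the post dates from 2012, so its closing `map(some_function, ...)` assumes Python 2, where map is eager; in Python 3, map is itself lazy and something must still consume it (e.g. a for loop or list()) for any reading to happen.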