Roy Smith wrote:

>> Dave Angel <da...@davea.name> wrote (and I agreed with):
>>> I'd suggest you open the file twice, and get two file objects. Then you
>>> can iterate over them independently.
>
> On Sep 18, 2013, at 9:09 AM, Oscar Benjamin wrote:
>> There's no need to use OS resources by opening the file twice or to
>> screw up the IO caching with seek().
>
> There's no reason NOT to use OS resources. That's what the OS is there
> for; to make life easier on application programmers. Opening a file twice
> costs almost nothing. File descriptors are almost as cheap as whitespace.
>
>> Peter's version holds just as many lines as is necessary in an
>> internal Python buffer and performs the minimum possible
>> amount of IO.
>
> I believe by "Peter's version", you're talking about:
>
>> from itertools import islice, tee
>>
>> with open("tmp.txt") as f:
>>     while True:
>>         for outer in f:
>>             print outer,
>>             if "*" in outer:
>>                 f, g = tee(f)
>>                 for inner in islice(g, 3):
>>                     print " ", inner,
>>                 del g # a good idea in the general case
>>                 break
>>         else:
>>             break
>
> There's this note from
> http://docs.python.org/2.7/library/itertools.html#itertools.tee:
>
>> This itertool may require significant auxiliary storage (depending on how
>> much temporary data needs to be stored). In general, if one iterator uses
>> most or all of the data before another iterator starts, it is faster to
>> use list() instead of tee().
>
> I have no idea how that interacts with the pattern above where you call
> tee() serially.
As I understand it, the above says that

items = infinite()
a, b = tee(items)
for item in islice(a, 1000):
    pass
for pair in izip(a, b):
    pass

stores 1000 items and can go on forever, but

items = infinite()
a, b = tee(items)
for item in a:
    pass

will consume unbounded memory, and that if items is finite, using a list
instead of tee is more efficient. The documentation says nothing about

items = infinite()
a, b = tee(items)
del a
for item in b:
    pass

so you have to trust Mr Hettinger or come up with a test case...

> You're basically doing
>
> with open("my_file") as f:
>     while True:
>         f, g = tee(f)
>
> Are all of those g's just hanging around, eating up memory, while waiting
> to be garbage collected?

I have no idea. I'd say you've just devised a nice test to find out ;)

> But I do know that no such
> problems exist with the two file descriptor versions.

The trade-offs are different. My version works with arbitrary iterators
(think stdin), but will consume unbounded amounts of memory when the
inner loop doesn't stop.
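For completeness, here's roughly what I think the two-file-objects version
Dave suggested would look like (an untested sketch; tmp.txt, the "*" marker
and the three-line preview are carried over from my example above). Note
that you have to read with readline() rather than iterate over f directly,
because the read-ahead buffer used by plain file iteration makes tell()
unreliable:

from itertools import islice

with open("tmp.txt") as f, open("tmp.txt") as g:
    # readline()-based iteration keeps f.tell() accurate; with plain
    # "for outer in f" the read-ahead buffer would make tell() lie.
    for outer in iter(f.readline, ""):
        print outer,
        if "*" in outer:
            g.seek(f.tell())  # sync the second file object to f's position
            for inner in islice(iter(g.readline, ""), 3):
                print " ", inner,

No bookkeeping in Python, but it pays with a second file descriptor and a
seek() per match -- which is the cost Oscar was objecting to.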