On 19 September 2013 08:23, Peter Otten <__pete...@web.de> wrote: > Roy Smith wrote: >> >> I believe by "Peter's version", you're talking about: >> >>> from itertools import islice, tee >>> >>> with open("tmp.txt") as f: >>> while True: >>> for outer in f: >>> print outer, >>> if "*" in outer: >>> f, g = tee(f) >>> for inner in islice(g, 3): >>> print " ", inner, > del g # a good idea in the general case >>> break >>> else: >>> break >> >> There's this note from >> http://docs.python.org/2.7/library/itertools.html#itertools.tee: >> >>> This itertool may require significant auxiliary storage (depending on how >>> much temporary data needs to be stored). In general, if one iterator uses >>> most or all of the data before another iterator starts, it is faster to >>> use list() instead of tee().
This is referring to the case where your two iterators get out of sync by a long way. If you only consume 3 extra items it will just store those 3 items in a list. >> I have no idea how that interacts with the pattern above where you call >> tee() serially. Fair point. > > As I understand it the above says that > > items = infinite() > a, b = tee(items) > for item in islice(a, 1000): > pass > for pair in izip(a, b): > pass > > stores 1000 items and can go on forever, but > > items = infinite() > a, b = tee(items) > for item in a: > pass > > will consume unbounded memory and that if items is finite using a list > instead of tee is more efficient. The documentation says nothing about > > items = infinite() > a, b = tee(items) > del a > for item in b: > pass > > so you have to trust Mr Hettinger or come up with a test case... > >> You're basically doing >> >> with open("my_file") as f: >> while True: >> f, g = tee(f) >> >> Are all of those g's just hanging around, eating up memory, while waiting >> to be garbage collected? I have no idea. > > I'd say you've just devised a nice test to find out ;) $ cat tee.py #!/usr/bin/env python import sys from itertools import tee items = iter(range(int(sys.argv[1]))) while True: for x in items: items, discard = tee(items) break else: break print(x) $ time py -3.3 ./tee.py 100000000 99999999 real 1m47.711s user 0m0.015s sys 0m0.000s While running the above python.exe was using 6MB of memory (according to Task Manager). I believe this is because tee() works as follows (which I made up but it's how I imagine it). When you call tee(iterator) it creates two _tee objects and one _teelist object. The _teelist object stores all of the items that have been seen by only one of _tee1 and _tee2, a reference to iterator and a flag indicating which _tee object has seen more items. When say _tee2 is deallocated the _teelist becomes singly owned and no longer needs to ever accumulate items (so it doesn't). So the dereferenced discard will not cause an arbitrary growth in memory usage. There is a separate problem which is that if you call tee() multiple times then you end up with a chain of tees and each next call would go through each one of them. This would cause a linear growth in the time taken to call next() leading to quadratic time performance overall. However, this does not occur with the script I showed above. In principle it's possible for a _tee object to realise that there is a chain of singly owned _tee and _teelist objects and bypass them calling next() on the original iterator but I don't know if this is what happens. However, when I ran the above script on Python 2.7 it did consume massive amounts of memory (1.6GB) and ran slower so maybe this depends on optimisations that were introduced in 3.x. Here's an alternate iterator recipe that doesn't depend on these optimisations: from itertools import islice from collections import deque class Peekable(object): def __init__(self, iterable): self.iterator = iter(iterable) self.peeked = deque() def __iter__(self): while True: while self.peeked: yield self.peeked.popleft() yield next(self.iterator) def peek(self): for p in self.peeked: yield p for val in self.iterator: self.peeked.append(val) yield val with open("tmp.txt") as f: f = Peekable(f) for outer in f: print outer, if "*" in outer: for inner in islice(f.peek(), 3): print " ", inner, Oscar -- https://mail.python.org/mailman/listinfo/python-list