I just ran into a surprising (to me) issue with list() on an iterable object.
My object represents an MDAT box in an MP4 file: it is the ludicrously large data box containing the raw audiovideo data; for a TV episode it is often about 2GB and a movie is often 4GB to 6GB. For obvious reasons, I do not always want to load that into memory, or even read the data part at all when scanning an MP4 file, for example to recite its metadata. So my parser has a "skip" mode where it seeks straight past the data, but makes a note of its length in bytes. All good. That length is presented via the object's __len__ method, because I want to know that length later and this is a subclass of a suite of things which return their length in bytes this way. So, to my problem: I've got a walk method which traverses the hierarchy of boxes in the MP4 file. Until some minutes ago, it looked like this: def walk(self): subboxes = list(self) yield self, subboxes for subbox in subboxes: if isinstance(subbox, Box): yield from subbox.walk() somewhat like os.walk does for a file tree. I noticed that it was stalling, and investigation revealed it was stalling at this line: subboxes = list(self) when doing the MDAT box. That box (a) has no subboxes at all and (b) has a very large __len__ value. BUT... It also has a __iter__ value, which like any Box iterates over the subboxes. For MDAT that is implemented like this: def __iter__(self): yield from () What I was expecting was pretty much instant construction of an empty list. What I was getting was a very time consuming (10 seconds or more) construction of an empty list. I believe that this is because list() tries to preallocate storage. I _infer_ from the docs that this is done maybe using operator.length_hint, which in turn consults "the actual length of the object" (meaning __len__ for me?), then __length_hint__, then defaults to 0. I've changed my walk function like so: def walk(self): subboxes = [] for subbox in self: subboxes.append(subbox) ##subboxes = list(self) and commented out the former list(self) incantation. This is very fast, because it makes an empty list and then appends nothing to it. And for your typical movie file this is fine, because there are never _very_ many immediate subboxes anyway. But is there a cleaner way to do this? I'd like to go back to my former list(self) incantation, and modify the MDAT box class to arrange something efficient. Setting __length_hint__ didn't help: returning NotImplemeneted or 0 had no effect, because presumably __len__ was consulted first. Any suggestions? My current approach feels rather hacky. I'm already leaning towards making __len__ return the number of subboxes to match the iterator, especially as on reflection not all my subclasses are consistent about __len__ meaning the length of their binary form; I'm probably going to have to fix that - some subclasses are actually namedtuples where __len__ would be the field count. Ugh. Still, thoughts? I'm interested in any approaches that would have let me make list() fast while keeping __len__==binary_length. I'm accepting that __len__ != len(__iter__) is a bad idea now, though. Cheers, Cameron Simpson <c...@cskk.id.au> -- https://mail.python.org/mailman/listinfo/python-list