On 28/02/2021 01:17, Cameron Simpson wrote:
I just ran into a surprising (to me) issue with list() on an iterable
object.

My object represents an MDAT box in an MP4 file: it is the ludicrously
large data box containing the raw audiovideo data; for a TV episode it
is often about 2GB and a movie is often 4GB to 6GB. For obvious reasons,
I do not always want to load that into memory, or even read the data
part at all when scanning an MP4 file, for example to recite its
metadata.

So my parser has a "skip" mode where it seeks straight past the data,
but makes a note of its length in bytes. All good.

That length is presented via the object's __len__ method, because I want
to know that length later and this is a subclass of a suite of things
which return their length in bytes this way.

So, to my problem:

I've got a walk method which traverses the hierarchy of boxes in the MP4
file. Until some minutes ago, it looked like this:

   def walk(self):
     subboxes = list(self)
     yield self, subboxes
     for subbox in subboxes:
       if isinstance(subbox, Box):
         yield from subbox.walk()

somewhat like os.walk does for a file tree.

I noticed that it was stalling, and investigation revealed it was
stalling at this line:

     subboxes = list(self)

when doing the MDAT box. That box (a) has no subboxes at all and (b) has
a very large __len__ value.

BUT... It also has a __iter__ value, which like any Box iterates over
the subboxes. For MDAT that is implemented like this:

     def __iter__(self):
         yield from ()

What I was expecting was pretty much instant construction of an empty
list. What I was getting was a very time consuming (10 seconds or more)
construction of an empty list.

I believe that this is because list() tries to preallocate storage. I
_infer_ from the docs that this is done maybe using
operator.length_hint, which in turn consults "the actual length of the
object" (meaning __len__ for me?), then __length_hint__, then defaults
to 0.

I've changed my walk function like so:

   def walk(self):
     subboxes = []
     for subbox in self:
       subboxes.append(subbox)
     ##subboxes = list(self)

list(iter(self))

should work, too. It may be faster than the explicit loop, but also
defeats the list allocation optimization.

and commented out the former list(self) incantation. This is very fast,
because it makes an empty list and then appends nothing to it. And for
your typical movie file this is fine, because there are never _very_
many immediate subboxes anyway.

But is there a cleaner way to do this?

I'd like to go back to my former list(self) incantation, and modify the
MDAT box class to arrange something efficient. Setting __length_hint__
didn't help: returning NotImplemeneted or 0 had no effect, because
presumably __len__ was consulted first.

Any suggestions? My current approach feels rather hacky.

I'm already leaning towards making __len__ return the number of subboxes
to match the iterator, especially as on reflection not all my subclasses
are consistent about __len__ meaning the length of their binary form;
I'm probably going to have to fix that - some subclasses are actually
namedtuples where __len__ would be the field count. Ugh.

Still, thoughts? I'm interested in any approaches that would have let me
make list() fast while keeping __len__==binary_length.

I'm accepting that __len__ != len(__iter__) is a bad idea now, though.

Indeed. I see how that train wreck happened -- but the weirdness is not
the list behavior.

Maybe you can capture the intended behavior of your class with two
classes, a MyIterable without length that can be converted into MyList
as needed.


--
https://mail.python.org/mailman/listinfo/python-list

Reply via email to