On Thu, Jan 21 2021 at 08:22:08 AM, Frank Millman <fr...@chagford.com> wrote:
> Hi all
>
> This question is mostly to satisfy my curiosity.
>
> In my app I use xml to represent certain objects, such as form
> definitions and process definitions.
>
> They are stored in a database. I use etree.tostring() when storing
> them and etree.fromstring() when reading them back. They can be quite
> large, so I use gzip to compress them before storing them as a blob.
>
> The sequence of events when reading them back is -
> - select gzip'd data from database
> - run gzip.decompress() to convert to a string
> - run etree.fromstring() to convert to an etree object
>
> I was wondering if I could avoid having the unzipped string in memory,
> and create the etree object directly from the gzip'd data. I came up
> with this -
>
> - select gzip'd data from database
> - create a BytesIO object - fd = io.BytesIO(data)
> - use gzip to open the object - gf = gzip.open(fd)
> - run etree.parse(gf) to convert to an etree object
>
> It works.
>
> But I don't know what goes on under the hood, so I don't know if this
> achieves anything. If any of the steps involves decompressing the data
> and storing the entire string in memory, I may as well stick to my
> present approach.
>
> Any thoughts?
>

etree.parse builds the complete tree, so the entire uncompressed content
ends up in memory regardless of how you supply the input. If your question
is whether you can avoid holding an extra copy in memory, take a look at
the ElementTree code in
https://github.com/python/cpython/blob/3.9/Lib/xml/etree/ElementTree.py
(linked from the documentation of the library module). The parse method
appears to read 64k at a time from the underlying stream, so using the
gzip.open stream instead of gzip.decompress should limit the duplicated
data being held in memory.

It is possible to use the XMLPullParser or iterparse features of etree to
parse XML incrementally without ever holding the entire content in memory,
provided you discard elements once you have processed them. But that will
not give you a complete ElementTree object, and might not be feasible
without rewriting the rest of the code.

--
regards,
kushal
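
P.S. In case it helps, here is a minimal sketch of the two approaches discussed above. The function names and the sample XML are made up for illustration; in the real app the blob would come from the database. The first function is the gzip.open + etree.parse sequence from the question, and the second uses iterparse and clears each element after looking at it, so it never builds the full tree.

```python
import gzip
import io
import xml.etree.ElementTree as etree


def parse_gzipped(blob: bytes) -> etree.ElementTree:
    """Build a full tree, decompressing as the parser reads
    rather than decompressing the whole blob up front."""
    with gzip.open(io.BytesIO(blob)) as gf:
        return etree.parse(gf)


def count_tags_gzipped(blob: bytes, tag: str) -> int:
    """Count elements named `tag` without keeping the full tree."""
    count = 0
    with gzip.open(io.BytesIO(blob)) as gf:
        for _event, elem in etree.iterparse(gf, events=("end",)):
            if elem.tag == tag:
                count += 1
            # Drop this element's text, attributes and children.
            # The emptied element stays attached to its parent until
            # the parent itself is cleared.
            elem.clear()
    return count


# Hypothetical sample data standing in for the blob read from the database.
xml_bytes = b"<form><field name='a'/><field name='b'/></form>"
blob = gzip.compress(xml_bytes)

tree = parse_gzipped(blob)
root_tag = tree.getroot().tag
n_fields = count_tags_gzipped(blob, "field")
```

The second variant only pays off if the rest of the code can work on one element at a time; anything that needs to navigate the whole document still wants the first one.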