Alrighty, I looked through a bunch of the code and then played around with it to see if I could figure it out. I put some code that should be easy to copy-paste and experiment with at [1]. The short story is that I learned a lot (meaning I internalized it), but probably none of it is anything you hadn't already figured out yourself.
IMO, there is a hurdle at [2]: it should be possible to use `seek()` on the `CompressedInputStream` because the `BufferReader` is seekable, but `CompressedInputStream` is constructed such that its input_stream is not seekable. I wrote what I think the `__init__` method should do for `CompressedInputStream` in the gist (specifically the patch.py file [3]).

That said, slicing the buffer itself didn't work (I got the same errors). After changing the stream positions and getting different error messages, I concluded that the zstd codec DOES need the whole stream to be decoded before you can seek around and extract data. The meta sink approach works because each time a `compressed_sink` (and the `sink` it writes to) is closed, it "completes" the encoding, and that completed segment/window can then be appended to another completed one.

All that just to say that the closing of the buffer isn't the problem (and IMO I expect a context manager to close its object when it "goes away"); the problem is the choice (or implementation) of compression algorithm.

[1]: https://gist.github.com/drin/886ece8a634bf58cba2d329152b2ebd5
[2]: https://github.com/apache/arrow/blob/main/python/pyarrow/io.pxi#L1799-L1805
[3]: https://gist.github.com/drin/886ece8a634bf58cba2d329152b2ebd5#file-patch-py

# ------------------------------ #
Aldrin

https://github.com/drin/
https://gitlab.com/octalene
https://keybase.io/octalene

On Friday, October 11th, 2024 at 13:32, Felipe Oliveira Carvalho <felipe...@gmail.com> wrote:

> Hi Robert,
>
> I hit the same problem recently but there’s a Python-only workaround you can
> use.
>
> https://github.com/apache/arrow-experiments/pull/35/files#r1797397257
>
> —
> Felipe
>
> On Fri, 11 Oct 2024 at 05:13 Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > Hi Robert,
> >
> > On Thu, 10 Oct 2024 08:33:28 -0700
> > Robert McLeod <robbmcl...@gmail.com> wrote:
> > >
> > > I think my trouble is coming from the fact that a `CompressedOutputStream`
> > > closes the `BufferedOutputStream` when it exits its context manager. I
> > > don't know if that is required or if it is simply an oversight because no
> > > one else has tried to fetch individual objects from a compressed stream
> > > before.
> >
> > It's not really required. We could change this if not closing the
> > underlying stream is more convenient.
> >
> > Could you perhaps open a feature request at
> > https://github.com/apache/arrow/issues ?
> >
> > (bonus points if you also submit a PR, but that's not mandatory :-))
> >
> > Regards
> >
> > Antoine.
> >