Alrighty, I looked at a bunch of the code and then played around with it 
to see if I could figure it out. I put some code that should be easy to 
copy-paste and experiment with at [1]. The short story is that I learned a 
lot (meaning I internalized it), but probably none of it is anything you 
haven't already figured out yourself.

IMO, there is a hurdle at [2]: it should be possible to use `seek()` on 
the `CompressedInputStream` because the `BufferReader` is seekable, but 
`CompressedInputStream` is constructed such that its `input_stream` is not 
seekable. I wrote what I think the `__init__` method should do for 
`CompressedInputStream` in the gist (specifically the patch.py file [3]).

That being said, slicing the buffer itself didn't work (I got the same 
errors). After changing the stream positions and getting different error 
messages, I concluded that the zstd codec DOES need the whole stream to be 
decoded before you can seek around and extract data. The meta sink approach 
works because each time a `compressed_sink` (and the `sink` it writes to) is 
closed, it "completes" the encoding, and that completed segment/window can 
then be appended to another completed one.
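
To make the "completed segments" idea concrete, here is a minimal sketch. It 
is not the code from the gist and it does not use pyarrow: it uses the 
standard-library `zlib` module as a stand-in for the zstd codec, and the 
names `write_segments` and `read_segment` are hypothetical. The assumption 
is only that each record is compressed as its own finished stream and that 
the byte offsets of those streams are recorded somewhere.

```python
import zlib


def write_segments(records):
    """Compress each record as its own complete stream and append it to one sink."""
    sink = bytearray()
    index = []  # (offset, length) of each compressed segment inside the sink
    for record in records:
        # zlib.compress returns a finished, self-contained stream; this plays
        # the role of closing the compressed sink after every record.
        frame = zlib.compress(record)
        index.append((len(sink), len(frame)))
        sink += frame
    return bytes(sink), index


def read_segment(blob, index, i):
    """Decode only segment i by slicing out exactly its bytes."""
    offset, length = index[i]
    return zlib.decompress(blob[offset:offset + length])


records = [b"first record " * 50, b"second record " * 50, b"third record " * 50]
blob, index = write_segments(records)

# Any segment can be decoded without touching the ones before it.
assert read_segment(blob, index, 1) == records[1]
```

The index is what makes "seeking" possible here: a slice is only decodable 
when it starts and ends on a segment boundary.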

All that just to basically say that closing the buffer isn't the problem (and 
IMO I expect a context manager to close its object when it "goes away"); the 
problem is the choice of (or implementation of) compression algorithm.


[1]: https://gist.github.com/drin/886ece8a634bf58cba2d329152b2ebd5
[2]: https://github.com/apache/arrow/blob/main/python/pyarrow/io.pxi#L1799-L1805
[3]: https://gist.github.com/drin/886ece8a634bf58cba2d329152b2ebd5#file-patch-py




# ------------------------------

# Aldrin


https://github.com/drin/

https://gitlab.com/octalene

https://keybase.io/octalene


On Friday, October 11th, 2024 at 13:32, Felipe Oliveira Carvalho 
<felipe...@gmail.com> wrote:

> Hi Robert,
> 

> I hit the same problem recently but there’s a Python-only workaround you can 
> use.
> 

> https://github.com/apache/arrow-experiments/pull/35/files#r1797397257
> 

> —
> Felipe
> 

> On Fri, 11 Oct 2024 at 05:13 Antoine Pitrou <anto...@python.org> wrote:
> 

> > 

> > Hi Robert,
> > 

> > On Thu, 10 Oct 2024 08:33:28 -0700
> > Robert McLeod <robbmcl...@gmail.com> wrote:
> > >
> > > I think my trouble is coming from the fact that a `CompressedOutputStream`
> > > closes the `BufferedOutputStream` when it exits its context manager. I
> > > don't know if that is required or if it is simply an oversight because no
> > > one else has tried to fetch individual objects from a compressed stream
> > > before.
> > 

> > It's not really required. We could change this if not closing the
> > underlying stream is more convenient.
> > 

> > Could you perhaps open a feature request at
> > https://github.com/apache/arrow/issues ?
> > 

> > (bonus points if you also submit a PR, but that's not mandatory :-))
> > 

> > Regards
> > 

> > Antoine.
> > 

