If I'm understanding the below correctly, it seems that the file format supports finding an arbitrary compressed buffer without decompressing anything else. Correct?
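
Concretely, once a Buffer entry (see the struct below) gives the offset and length of one buffer within the memory-mapped message body, decoding just that buffer would look roughly like the following. This is only a sketch of the BodyCompressionMethod.BUFFER framing quoted further down the thread: it assumes LZ4 frame compression and the third-party lz4 package, and decode_buffer is a hypothetical helper, not a pyarrow API.

import struct

import lz4.frame  # assumption: LZ4_FRAME codec; zstandard would be analogous


def decode_buffer(body: memoryview, offset: int, length: int) -> bytes:
    """Decode one buffer from the body of a compressed RecordBatch,
    given its Buffer{offset, length} entry from the batch metadata."""
    buf = body[offset:offset + length]
    # First 8 bytes: uncompressed length as a 64-bit little-endian signed
    # integer; -1 means the bytes that follow are stored uncompressed.
    (uncompressed_len,) = struct.unpack_from("<q", buf, 0)
    payload = bytes(buf[8:])
    if uncompressed_len == -1:
        return payload
    out = lz4.frame.decompress(payload)
    assert len(out) == uncompressed_len
    return out

Nothing else in the file would need to be decompressed for that step; the open question is how cheaply the (offset, length) pairs themselves can be reached.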

-John

/// ----------------------------------------------------------------------
/// A Buffer represents a single contiguous memory segment
struct Buffer {
  /// The relative offset into the shared memory page where the bytes for this
  /// buffer starts
  offset: long;

  /// The absolute length (in bytes) of the memory buffer. The memory is found
  /// from offset (inclusive) to offset + length (non-inclusive). When building
  /// messages using the encapsulated IPC message, padding bytes may be written
  /// after a buffer, but such padding bytes do not need to be accounted for in
  /// the size here.
  length: long;
}

On Thu, Sep 22, 2022 at 9:10 AM John Muehlhausen <j...@jgm.org> wrote:

> Regarding tab=feather.read_table(fname, memory_map=True)
>
> Uncompressed: low-cost setup and len(tab); data is read when sections of
> the map are "paged in" by the OS.
> Compressed (desired):
> * low-cost setup
> * read the length of the "table" without decompressing anything (len(tab))
> * low-cost slicing
> * data is decompressed only on dereference, such that there is a "batch
> column cache" corresponding to the lower-level page cache concept; batch
> columns are only decompressed when they are accessed. Repeated access does
> not re-decompress (subject to LRU column cache capacity).
>
> The above design would allow drop-in replacement of compressed files for
> uncompressed files for workloads that depend on the characteristics of
> uncompressed mmap, it seems to me. The only difference would be two levels
> of "paging in" (page cache -> a new column cache on the heap) instead of
> just the OS page cache level.
>
> Is there anything about the file format that would make this impossible?
> *For example, do we know where each compressed buffer starts without
> needing to decompress buffers around it?*
>
> Thanks,
> John
>
> On Wed, Sep 21, 2022 at 11:53 PM Jorge Cardoso Leitão
> <jorgecarlei...@gmail.com> wrote:
>
>> Hi,
>>
>> AFAIK compressed IPC arrow files do not support random access (like the
>> uncompressed counterparts) - you need to decompress the whole batch (or
>> at least the columns you need). A "RecordBatch" is the compression unit
>> of the file. Think of it like a parquet file whose every row group has a
>> single data page.
>>
>> With that said, the file contains the number of rows that each record
>> batch has, and thus you can compute which record batch you can start
>> from. This is available in the RecordBatch message "length" [1].
>> Unfortunately you do need to read the record batch message to know it,
>> and these messages are located in the middle of the file:
>>
>> [header][record_batch1_message][record_batch1_data][record_batch2_message][record_batch2_data]...[footer]
>>
>> The positions of the messages are declared in the file footer's
>> "record_batches" field.
>>
>> [1] https://github.com/apache/arrow/blob/master/format/Message.fbs#L87
>>
>> Best,
>> Jorge
>>
>> On Thu, Sep 22, 2022 at 3:01 AM John Muehlhausen <j...@jgm.org> wrote:
>>
>> > Why aren't all the compressed batches the chunk size I specified in
>> > write_feather (7000000)? How can I know which batch my slice resides
>> > in if this is not a constant? Using pyarrow 9.0.0.
>> >
>> > This file contains 1.5 billion rows. I need a way to know where to
>> > look for, say, [780567127, 922022522).
>> >
>> > 0.7492516040802002 done 0 len 7000000
>> > 1.7520167827606201 done 1 len 7000000
>> > 3.302407741546631 done 2 len 4995912
>> > 5.16457986831665 done 3 len 7000000
>> > 6.0424370765686035 done 4 len 4706276
>> > 7.58642315864563 done 5 len 7000000
>> > 7.719322681427002 done 6 len 289636
>> > 8.705692291259766 done 7 len 5698775
>> >
>> > On Wed, Sep 21, 2022 at 7:49 PM John Muehlhausen <j...@jgm.org> wrote:
>> >
>> > > The following seems like good news... like I should be able to
>> > > decompress just one column of a RecordBatch in the middle of a
>> > > compressed feather v2 file. Is there a Python API for this kind of
>> > > access? C++?
>> > >
>> > > /// Provided for forward compatibility in case we need to support
>> > > /// different strategies for compressing the IPC message body (like
>> > > /// whole-body compression rather than buffer-level) in the future
>> > > enum BodyCompressionMethod:byte {
>> > >   /// Each constituent buffer is first compressed with the indicated
>> > >   /// compressor, and then written with the uncompressed length in
>> > >   /// the first 8 bytes as a 64-bit little-endian signed integer
>> > >   /// followed by the compressed buffer bytes (and then padding as
>> > >   /// required by the protocol). The uncompressed length may be set
>> > >   /// to -1 to indicate that the data that follows is not
>> > >   /// compressed, which can be useful for cases where compression
>> > >   /// does not yield appreciable savings.
>> > >   BUFFER
>> > > }
>> > >
>> > > On Wed, Sep 21, 2022 at 7:03 PM John Muehlhausen <j...@jgm.org> wrote:
>> > >
>> > >> ``Internal structure supports random access and slicing from the
>> > >> middle. This also means that you can read a large file chunk by
>> > >> chunk without having to pull the whole thing into memory.''
>> > >> https://ursalabs.org/blog/2020-feather-v2/
>> > >>
>> > >> For a compressed v2 file, can I decompress just one column of a
>> > >> batch in the middle, or is the entire batch with all of its columns
>> > >> compressed as a unit?
>> > >>
>> > >> Unfortunately reader.get_batch(i) seems like it is doing a lot of
>> > >> work. Like maybe decompressing all the columns?
>> > >>
>> > >> Thanks,
>> > >> John
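
To make the row-range question above concrete, a lookup for [780567127, 922022522) could be approximated today with public pyarrow calls along these lines. This is a sketch only: read_row_range is an illustrative helper, and, as discussed in the thread, get_batch() appears to decompress every column of each batch it touches, so batches before the requested range are still paid for just to learn their row counts.

import pyarrow as pa
import pyarrow.ipc as ipc


def read_row_range(path: str, start: int, stop: int) -> pa.Table:
    """Return rows [start, stop) of a Feather v2 / Arrow IPC file (sketch)."""
    with pa.memory_map(path, "r") as source:
        reader = ipc.open_file(source)
        pieces = []
        first_row = 0  # global row index of the current batch's first row
        for i in range(reader.num_record_batches):
            batch = reader.get_batch(i)  # decompresses this whole batch
            last_row = first_row + batch.num_rows
            if last_row > start and first_row < stop:
                lo = max(start, first_row) - first_row
                hi = min(stop, last_row) - first_row
                pieces.append(batch.slice(lo, hi - lo))
            if last_row >= stop:
                break
            first_row = last_row
        return pa.Table.from_batches(pieces, schema=reader.schema)


# e.g. tab = read_row_range("data.feather", 780567127, 922022522)

A cumulative row-count index could be cached after the first pass, but ideally the per-batch lengths would come from the RecordBatch message metadata alone, without decompressing any buffers - which is exactly the gap discussed above.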