Re: [TOL] Changing Manifest Layer to Arrow File Format

Micah Kornfield Sat, 25 Jun 2022 10:32:53 -0700

Hi Taher,
First, manifest lists and manifests are Avro files, not JSON.

I think we should separate out two concerns:
1.  Encoding/decoding efficiency
2.  Parallelization

For 1, Arrow will be faster than Avro for reading.  Parquet is another
option.  For manifest lists it's not clear that a columnar format is better,
since in most cases we will want to read most of the fields, so decoding
cost vs. transposition cost would need to be measured.  For manifests I
think a columnar format could offer some clear benefits for working with
statistics, but it would require a deeper change to the manifest schema to
get the most efficiency.
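
As a rough illustration of the trade-off in point 1, here is a minimal
sketch in plain Python.  The field names (file_path, record_count, lower,
upper) are made up for the example and are not the actual manifest schema.
It shows why statistics-based pruning only needs to touch a couple of
columns, while reading most fields back as records pays a transposition
cost:

```python
# Illustrative only: a simplified manifest with hypothetical field names,
# not the actual Iceberg manifest schema.
rows = [
    {"file_path": "a.parquet", "record_count": 100, "lower": 1, "upper": 10},
    {"file_path": "b.parquet", "record_count": 200, "lower": 11, "upper": 20},
    {"file_path": "c.parquet", "record_count": 300, "lower": 21, "upper": 30},
]


def to_columns(rows):
    """Transpose row records into one list per field (columnar layout)."""
    return {name: [r[name] for r in rows] for name in rows[0]}


def to_rows(columns):
    """Transpose columns back into records.

    This is the extra work a columnar format incurs when a reader
    wants most of the fields of every entry.
    """
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


def prune(columns, value):
    """Return indices of files whose [lower, upper] range may hold value.

    Only the two statistics columns are read.
    """
    bounds = zip(columns["lower"], columns["upper"])
    return [i for i, (lo, hi) in enumerate(bounds) if lo <= value <= hi]


columns = to_columns(rows)
matches = prune(columns, 15)
selected = [columns["file_path"][i] for i in matches]
print(selected)
```

Whether the savings in prune() outweigh the cost of to_rows() for real
manifests is exactly what would need to be measured.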

2.  Avro supports parallel scans by file splitting.  This might be slightly
less efficient than Parquet/Arrow, which can determine the exact number of
batches via metadata, but the exact benefit would have to be measured.
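
A minimal sketch of the two planning styles in point 2, in plain Python.
The function names are hypothetical, and the second function assumes that
per-batch row counts are available from file metadata.  No real Avro,
Parquet, or Arrow API is used:

```python
def plan_byte_splits(file_size, split_size):
    """Byte-range planning, as with Avro file splitting.

    The file is cut into fixed byte ranges.  Each reader would then seek
    to the first sync marker at or after its start offset, so the number
    of records per split is not known when the plan is made.
    """
    return [
        (start, min(start + split_size, file_size))
        for start in range(0, file_size, split_size)
    ]


def plan_batch_splits(batch_row_counts, workers):
    """Metadata-based planning (assumes row counts per batch are known).

    Whole batches are assigned greedily, largest first, to the worker
    with the least rows so far.
    """
    assignments = [[] for _ in range(workers)]
    loads = [0] * workers
    by_size = sorted(enumerate(batch_row_counts), key=lambda item: -item[1])
    for batch_id, row_count in by_size:
        target = loads.index(min(loads))
        assignments[target].append(batch_id)
        loads[target] += row_count
    return assignments, loads


byte_plan = plan_byte_splits(file_size=10, split_size=4)
batch_plan, batch_loads = plan_batch_splits([5, 3, 2, 2], workers=2)
print(byte_plan)
print(batch_plan, batch_loads)
```

The byte-range plan can leave workers with uneven record counts, while the
metadata-based plan can balance them up front.  How much that matters for
manifests of typical size is the part that would need benchmarking.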

All of these would be substantial changes, so I think we should probably
do more concrete benchmarking before seriously considering the change.

Cheers,
Micah

On Friday, June 24, 2022, Taher Koitawala <taher...@gmail.com> wrote:

> Hi All,
>          I was looking at the Iceberg metadata layer of Manifest Lists and
> Manifest Files and they are all JSON formats. Thinking out loud, what if
> that layer was changed to Arrow file format?
>
> Since Arrow's in-memory representation is the same as the on-disk
> representation. A lot of overhead of serde, reading and writing could be
> saved. Also because Arrow has RecordBatches we could parallelize the read
> of that to multiple threads. This is just thinking out loud. Please let me
> know what your thoughts are.
>
> Regards,
> Taher Koitawala
>
