Hi Taher,

First, manifest lists and manifests are Avro files, not JSON.

I think we should separate out two concerns:
1. Encoding/decoding efficiency
2. Parallelization

For 1, Arrow will be faster than Avro for reading. Parquet is another option. For manifest lists, it's not clear that a columnar format is better, since in most cases we will want to read most of the fields, so decoding cost vs. transposition cost would need to be measured. For manifests, I think a columnar format could offer some clear benefits for working with statistics, but it would require a deeper change to the manifest schema to get the most efficiency.

For 2, Avro supports parallel scans by file splitting. This might be slightly less efficient than Parquet/Arrow, which can determine the exact number of batches via metadata, but the exact benefit would have to be measured.

All of these would be substantial changes, so I think we should probably do more concrete benchmarking before seriously considering the change.

Cheers,
Micah

On Friday, June 24, 2022, Taher Koitawala <taher...@gmail.com> wrote:

> Hi All,
> I was looking at the Iceberg metadata layer of Manifest Lists and
> Manifest Files and they are all JSON formats. Thinking out loud, what if
> that layer was changed to Arrow file format?
>
> Since Arrow's in-memory representation is the same as the on-disk
> representation. A lot of overhead of serde, reading and writing could be
> saved. Also because Arrow has RecordBatches we could parallelize the read
> of that to multiple threads. This is just thinking out loud. Please let me
> know what your thoughts are.
>
> Regards,
> Taher Koitawala
>
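
To make the "decoding cost vs. transposition cost" question in the reply above a bit more concrete, here is a rough, stdlib-only Python sketch of how such a measurement could be set up. It does not use Avro, Arrow, or Parquet: the fixed-width `struct` layouts are toy stand-ins for a row-oriented and a column-oriented encoding, and the three record fields and all function names are hypothetical. Real numbers would have to come from the actual libraries and real manifest data.

```python
import struct
import timeit

# Hypothetical manifest-list-like records (all three fields are made up):
# (manifest_length, added_files_count, deleted_files_count)
N = 10_000
rows = [(i * 100, i % 7, i % 3) for i in range(N)]

ROW = struct.Struct("<qii")  # one fixed-width 16-byte record per row


def encode_row_oriented(records):
    """Row layout: all fields of record 0, then all fields of record 1, ..."""
    return b"".join(ROW.pack(*r) for r in records)


def encode_column_oriented(records):
    """Column layout: every value of field 0, then field 1, then field 2."""
    n = len(records)
    c0, c1, c2 = zip(*records)
    return (struct.pack(f"<{n}q", *c0)
            + struct.pack(f"<{n}i", *c1)
            + struct.pack(f"<{n}i", *c2))


def read_all_rows_from_row_layout(buf):
    """Decode every record; no transposition is needed."""
    return list(ROW.iter_unpack(buf))


def read_all_rows_from_column_layout(buf, n):
    """Decode each column, then transpose back into row tuples."""
    c0 = struct.unpack_from(f"<{n}q", buf, 0)
    c1 = struct.unpack_from(f"<{n}i", buf, 8 * n)
    c2 = struct.unpack_from(f"<{n}i", buf, 12 * n)
    return list(zip(c0, c1, c2))


def read_one_column_from_row_layout(buf):
    """Every record must be decoded just to pick out one field."""
    return [r[1] for r in ROW.iter_unpack(buf)]


def read_one_column_from_column_layout(buf, n):
    """Only the bytes of the requested column are decoded."""
    return list(struct.unpack_from(f"<{n}i", buf, 8 * n))


def run_benchmark(repeats=20):
    """Time the four access patterns; returns seconds per pattern."""
    row_buf = encode_row_oriented(rows)
    col_buf = encode_column_oriented(rows)
    cases = {
        "all fields, row layout": lambda: read_all_rows_from_row_layout(row_buf),
        "all fields, column layout": lambda: read_all_rows_from_column_layout(col_buf, N),
        "one field, row layout": lambda: read_one_column_from_row_layout(row_buf),
        "one field, column layout": lambda: read_one_column_from_column_layout(col_buf, N),
    }
    return {name: timeit.timeit(fn, number=repeats) for name, fn in cases.items()}


if __name__ == "__main__":
    for name, seconds in run_benchmark().items():
        print(f"{name}: {seconds:.4f}s")
```

The "all fields" pair corresponds to the manifest-list case, where most fields are read, and the "one field" pair corresponds to reading a single statistics-like column from manifests. A realistic benchmark would replace the toy encoders with the actual Avro, Arrow, and Parquet readers.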