Somewhat, though maybe not as bad. The Arrow IPC format only writes the schema once, and the per-batch metadata is essentially just buffer lengths. For disk size I ran an experiment writing 100k rows x 10k columns of random float64 data with different numbers of record batches.
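Roughly, the setup was along these lines (a simplified pyarrow sketch rather than the exact script I ran; the column names and writing everything from one in-memory table are just for illustration):

import numpy as np
import pyarrow as pa
import pyarrow.ipc as ipc
import pyarrow.parquet as pq

NUM_ROWS = 100_000
NUM_COLS = 10_000

# ~8 GB of random float64 data (effectively incompressible)
table = pa.table(
    [pa.array(np.random.rand(NUM_ROWS)) for _ in range(NUM_COLS)],
    names=[f"f{i}" for i in range(NUM_COLS)],
)

for num_batches in (10, 100, 1000):
    rows_per_batch = NUM_ROWS // num_batches
    # IPC: one record batch per chunk
    with ipc.new_file(f"{num_batches}_batches.arrow", table.schema) as writer:
        for batch in table.to_batches(max_chunksize=rows_per_batch):
            writer.write_batch(batch)
    # Parquet: one row group per chunk
    pq.write_table(table, f"{num_batches}_batches.parquet",
                   row_group_size=rows_per_batch)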
The resulting file sizes:

-rw-rw-r-- 1 pace pace  8005840602 Nov 22 09:35 10_batches.arrow
-rw-rw-r-- 1 pace pace  8049051402 Nov 22 09:40 100_batches.arrow
-rw-rw-r-- 1 pace pace  8481159402 Nov 22 09:51 1000_batches.arrow
-rw-rw-r-- 1 pace pace  9786751894 Nov 22 09:52 10_batches.parquet
-rw-rw-r-- 1 pace pace  9577577138 Nov 22 09:53 100_batches.parquet
-rw-rw-r-- 1 pace pace 12107949064 Nov 22 10:12 1000_batches.parquet

Note that this is something of a pathological case for Parquet, since random float data is incompressible, so don't focus too much on Arrow vs. Parquet; I'm trying to illustrate the effect of having more record batches. I'm a little surprised by the difference between 10 and 100 batches in Parquet, but maybe someone else can guess why. Also note that the row groups don't have to be super large: between 10 and 100 batches the effect is pretty small compared to the size of the data, and it doesn't really become noticeable until you hit 1000 batches (which means 100 rows per batch).

Size on disk is only half the problem. The other question is processing time. That's a little harder to measure, and it's something that is always improving or has the potential to improve. For example, I timed reading a single column from those files with the datasets API. For each file I recorded the time of the second read, so these are cached-in-memory reads with little to no I/O time (a rough sketch of the timing script is at the bottom of this message, below the quote).

IPC:
  10 batches: 2.2 seconds
  100 batches: 2.6 seconds
  1000 batches: 4.9 seconds

Parquet:
  10 batches: 0.25 seconds
  100 batches: 1.75 seconds
  1000 batches: 16.87 seconds

Again, don't worry too much about IPC vs. Parquet. These results are from Arrow 6.0.1 and I'm not using memory-mapped files. The performance in 7.0.0 will probably be better for this test because of better IPC support for column projection pushdown (thanks Yue Ni!).

On Mon, Nov 22, 2021 at 8:17 AM Aldrin <akmon...@ucsc.edu.invalid> wrote:
>
> Hi Weston,
>
> This is slightly off-topic, but I'm curious if what you mentioned about the
> large metadata blocks (inlined below) also applies to IPC format?
>
> I am working with matrices and representing them as tables that can have
> hundreds of thousands of columns, but I'm splitting them into row groups to
> apply push down predicates.
>
> > Finally, one other issue that comes into play, is the width of your
> > data. Really wide datasets (e.g. tens of thousands of columns) suffer
> > from having rather large metadata blocks. If your row groups start to
> > get small then you end up spending a lot of time parsing metadata and
> > much less time actually reading data.
>
> Thanks!
>
> --
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
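P.S. For anyone curious, the timing measurement mentioned above was roughly shaped like this (a simplified sketch using the pyarrow datasets API, not the exact script; the file and column names just follow the write sketch earlier in this message):

import time
import pyarrow.dataset as ds

def time_second_read(path, fmt, column):
    # Read one column twice; the first read warms the OS page cache,
    # and only the second (cached-in-memory) read is timed.
    dataset = ds.dataset(path, format=fmt)
    dataset.to_table(columns=[column])
    start = time.perf_counter()
    dataset.to_table(columns=[column])
    return time.perf_counter() - start

for n in (10, 100, 1000):
    ipc_secs = time_second_read(f"{n}_batches.arrow", "ipc", "f0")
    pq_secs = time_second_read(f"{n}_batches.parquet", "parquet", "f0")
    print(f"{n} batches: IPC {ipc_secs:.2f}s, Parquet {pq_secs:.2f}s")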