Somewhat, though maybe not as bad.  The Arrow IPC format only lists the
schema once, and the per-batch metadata is essentially just buffer
lengths and offsets.  For disk size I ran an experiment on 100k rows x
10k columns of float64 and got:

-rw-rw-r--  1 pace pace  8005840602 Nov 22 09:35 10_batches.arrow
-rw-rw-r--  1 pace pace  8049051402 Nov 22 09:40 100_batches.arrow
-rw-rw-r--  1 pace pace  8481159402 Nov 22 09:51 1000_batches.arrow
-rw-rw-r--  1 pace pace  9786751894 Nov 22 09:52 10_batches.parquet
-rw-rw-r--  1 pace pace  9577577138 Nov 22 09:53 100_batches.parquet
-rw-rw-r--  1 pace pace 12107949064 Nov 22 10:12 1000_batches.parquet
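
In case it helps, files like these could be generated with pyarrow
along the lines below (just a sketch, not the exact script I used; the
column names and the reuse of a single random batch are only for
illustration):

    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    n_rows, n_cols = 100_000, 10_000   # same shape as above
    batches = 100                      # 10, 100, or 1000
    rows_per_batch = n_rows // batches

    # Build one batch of random float64 data and reuse it to keep memory modest.
    arrays = [pa.array(np.random.rand(rows_per_batch)) for _ in range(n_cols)]
    names = [f"f{i}" for i in range(n_cols)]
    batch = pa.record_batch(arrays, names=names)

    # IPC file: one record batch per write
    with pa.OSFile(f"{batches}_batches.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, batch.schema) as writer:
            for _ in range(batches):
                writer.write_batch(batch)

    # Parquet file: each write_table call becomes (at least) one row group
    with pq.ParquetWriter(f"{batches}_batches.parquet", batch.schema) as writer:
        for _ in range(batches):
            writer.write_table(pa.Table.from_batches([batch]))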

Note that this is something of a pathological case for Parquet, since
random float data is incompressible, so don't focus too much on Arrow
vs. Parquet.  I'm trying to illustrate the effect of having more record
batches.  I'm a little surprised by the difference between 10 and 100
batches in Parquet; maybe someone else can explain that.

Also, note that the row groups don't have to be super large.  Between
10 and 100 batches the effect is pretty small compared to the size of
the data.  It's not really noticeable until you hit 1000 batches
(which, at 100k rows, means only 100 rows per batch).

Size on disk is only half the problem.  The other question is
processing time.  That's a little harder to measure, and it's
something that is always improving or has the potential to improve.
For example, I timed reading a single column from each of those files
with the datasets API.  For each file I recorded the time of the
second read, so these are cached-in-memory reads with little to no I/O
time.

IPC:
10 batches: 2.2 seconds
100 batches: 2.6 seconds
1000 batches: 4.9 seconds

Parquet:
10 batches: 0.25 seconds
100 batches: 1.75 seconds
1000 batches: 16.87 seconds

Again, don't worry too much about IPC vs. Parquet.  These results are
from 6.0.1 and I'm not using memory-mapped files.  The performance in
7.0.0 will probably be better for this test because of better IPC
support for column projection pushdown (thanks Yue Ni!).
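
The read test was roughly along these lines (again a sketch, not the
exact script; "f0" is just a placeholder column name matching the
write sketch above):

    import time
    import pyarrow.dataset as ds

    for path, fmt in [("1000_batches.arrow", "ipc"),
                      ("1000_batches.parquet", "parquet")]:
        dataset = ds.dataset(path, format=fmt)
        dataset.to_table(columns=["f0"])     # first read warms the OS cache
        start = time.perf_counter()
        dataset.to_table(columns=["f0"])     # timed second read
        print(path, time.perf_counter() - start)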

On Mon, Nov 22, 2021 at 8:17 AM Aldrin <akmon...@ucsc.edu.invalid> wrote:
>
> Hi Weston,
>
> This is slightly off-topic, but I'm curious if what you mentioned about the
> large metadata blocks (inlined below) also applies to IPC format?
>
> I am working with matrices and representing them as tables that can have
> hundreds of thousands of columns, but I'm splitting them into row groups to
> apply push down predicates.
>
> Finally, one other issue that comes into play, is the width of your
> > data.  Really wide datasets (e.g. tens of thousands of columns) suffer
> > from having rather large metadata blocks.  If your row groups start to
> > get small then you end up spending a lot of time parsing metadata and
> > much less time actually reading data.
>
>
>
> Thanks!
>
> --
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
