Thanks all for the great feedback!

Thanks Daniel for the sample data sets. I loaded them up and they're quite
comparable in size to some of the data I'm dealing with. In my case the
shapes range from 150 to ~100 million rows, and from 2-3 columns to
~500,000 columns.

Thanks Wes for the insight regarding the inverse relationship between entropy
and Parquet's performance. I'm glad I now understand why my benchmarking set
would have skewed the results. The data sets I'm dealing with will be
fairly random, but have very low cardinality; in that benchmark I used
values ranging from 1 to 9. So if I understand you correctly,
repetitiveness is key to Parquet's performance, as opposed to cardinality
per se (even though the lower the cardinality, the more likely I am to have
repeated values because of the small number of possibilities).
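
To sanity-check my setup, here's roughly how I plan to redo the size
comparison, with a low-cardinality column next to a high-entropy one. This is
only a sketch (it assumes pandas' to_parquet with the pyarrow engine, and the
file names are made up):

import os

import numpy as np
import pandas as pd

rows = 1_000_000

# Low cardinality (values 1-9), randomly shuffled. Parquet's dictionary/RLE
# encodings should still be able to exploit the small value domain.
low_card = pd.DataFrame({"x": np.random.randint(1, 10, size=rows)})

# High-entropy floats: little repetition for the encodings to exploit.
high_entropy = pd.DataFrame({"x": np.random.random(size=rows)})

low_card.to_parquet("low_card.parquet", compression="snappy")
high_entropy.to_parquet("high_entropy.parquet", compression="snappy")

print("low cardinality :", os.path.getsize("low_card.parquet"), "bytes")
print("high entropy    :", os.path.getsize("high_entropy.parquet"), "bytes")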

Thanks Ted for the insight as well. Can I get some clarification on what you
meant when you said "You also have a very small number of rows, which can
penalize a system that expects to amortize column metadata over more data"?
If I understand you correctly, you're saying there's a per-column metadata
overhead, and that overhead is amortized or "paid off" when I have a large
amount of data. If that's the case, does that amortization also apply in the
case where I used 1 million rows?
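
For context, this is how I was planning to measure that overhead on my end,
again only as a sketch. It assumes pyarrow's ParquetFile metadata accessors,
and "low_card.parquet" is just the file from the sketch above:

import pyarrow.parquet as pq

md = pq.ParquetFile("low_card.parquet").metadata

print("rows              :", md.num_rows)
print("columns           :", md.num_columns)
print("row groups        :", md.num_row_groups)
print("footer (metadata) :", md.serialized_size, "bytes")

# Compressed size of the data pages themselves, summed per column chunk.
data_bytes = sum(
    md.row_group(rg).column(col).total_compressed_size
    for rg in range(md.num_row_groups)
    for col in range(md.num_columns)
)
print("data pages        :", data_bytes, "bytes")

My expectation is that the footer grows with the number of columns (and row
groups) while the data pages grow with the number of rows, which would explain
why wide-and-short tables feel the metadata overhead the most.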

Kind Regards
Simba

On Wed, 24 Jan 2018 at 21:30 Daniel Lemire <lem...@gmail.com> wrote:

> Here are some realistic tabular data sets...
>
> https://github.com/lemire/RealisticTabularDataSets
>
> They are small by modern standards but they are also one GitHub clone away.
>
> - Daniel
>
> On Wed, Jan 24, 2018 at 2:26 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>
> > Thanks Ted. I will echo these comments and recommend to run tests on
> > larger and preferably "real" datasets rather than randomly generated
> > ones. The more repetition and less entropy in a dataset, the better
> > Parquet performs relative to other storage options. Web-scale datasets
> > often exhibit these characteristics.
> >
> > If you can publish your benchmarking code that would also be helpful!
> >
> > best
> > Wes
> >
> > On Wed, Jan 24, 2018 at 1:21 PM, Ted Dunning <ted.dunn...@gmail.com>
> > wrote:
> > > Simba
> > >
> > > Nice summary. I think that there may be some issues with your tests. In
> > > particular, you are storing essentially uniform random values. That
> > > might be a viable test in some situations, but there are many where
> > > there is considerably less entropy in the data being stored. For
> > > instance, if you store measurements, it is very typical to have very
> > > strong correlations. Likewise if the rows are, say, the time evolution
> > > of an optimization. You also have a very small number of rows, which
> > > can penalize a system that expects to amortize column metadata over
> > > more data.
> > >
> > > This test might match your situation, but I would be leery of drawing
> > > overly broad conclusions from this single data point.
> > >
> > >
> > >
> > > On Jan 24, 2018 5:44 AM, "simba nyatsanga" <simnyatsa...@gmail.com>
> > wrote:
> > >
> > >> Hi Uwe, thanks.
> > >>
> > >> I've attached a Google Sheet link
> > >>
> > >> https://docs.google.com/spreadsheets/d/1by1vCaO2p24PLq_NAA5Ckh1n3i-
> > >> SoFYrRcfi1siYKFQ/edit#gid=0
> > >>
> > >> Kind Regards
> > >> Simba
> > >>
> > >> On Wed, 24 Jan 2018 at 15:07 Uwe L. Korn <uw...@xhochy.com> wrote:
> > >>
> > >> > Hello Simba,
> > >> >
> > >> > your plots did not come through. Try uploading them somewhere and
> > >> > link to them in the mails. Attachments are always stripped on Apache
> > >> > mailing lists.
> > >> > Uwe
> > >> >
> > >> >
> > >> > On Wed, Jan 24, 2018, at 1:48 PM, simba nyatsanga wrote:
> > >> > > Hi Everyone,
> > >> > >
> > >> > > I did some benchmarking to compare the disk size performance when
> > >> > > writing Pandas DataFrames to parquet files using Snappy and Brotli
> > >> > > compression. I then compared these numbers with those of my current
> > >> > > file storage solution.
> > >> > >
> > >> > > In my current (non Arrow+Parquet) solution, every column in a
> > >> > > DataFrame is extracted as a NumPy array, then compressed with blosc
> > >> > > and stored as a binary file. Additionally there's a small
> > >> > > accompanying json file with some metadata. Attached are my results
> > >> > > for several long and wide DataFrames:
> > >> > > Screen Shot 2018-01-24 at 14.40.48.png
> > >> > >
> > >> > > I was also able to corroborate this finding by looking at the
> > >> > > number of allocated blocks:
> > >> > > Screen Shot 2018-01-24 at 14.45.29.png
> > >> > >
> > >> > > From what I gather, Brotli and Snappy perform significantly better
> > >> > > for wide DataFrames. However, the reverse is true for long
> > >> > > DataFrames.
> > >> > >
> > >> > > The DataFrames used in the benchmark are entirely composed of
> > >> > > floats, and my understanding is that there's type-specific encoding
> > >> > > employed in the parquet file. Additionally, the compression codecs
> > >> > > are applied to individual segments of the parquet file.
> > >> > >
> > >> > > I'd like to get a better understanding of this disk size disparity,
> > >> > > specifically whether there are any additional encoding/compression
> > >> > > headers added to the parquet file in the long DataFrames case.
> > >> > > Kind Regards
> > >> > > Simba
> > >> >
> > >> >
> > >>
> >
>
