Hi Everyone,

Just an update on the above questions. I've updated the numbers in the Google Sheet using data with less entropy here:
https://docs.google.com/spreadsheets/d/1by1vCaO2p24PLq_NAA5Ckh1n3i-SoFYrRcfi1siYKFQ/edit#gid=0
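For context, by "data with less entropy" I mean columns that are not only low-cardinality but also highly repetitive, e.g. long runs of the same value rather than uniform random draws. Below is a minimal sketch of the kind of data generation and Parquet writes involved; it is illustrative only, and the row/column counts, run length and file names are placeholders, not the exact notebook code:

    import numpy as np
    import pandas as pd

    # Illustrative sketch, not the exact benchmark code: low-cardinality float
    # columns with long runs of repeated values have far less entropy than
    # uniform random draws, which is where Parquet's encodings tend to help most.
    rng = np.random.RandomState(42)
    n_rows, n_cols, run_length = 1_000_000, 10, 1_000  # placeholder sizes

    low_entropy = pd.DataFrame({
        # Each column: 1,000 randomly chosen values from {1.0, ..., 9.0},
        # each repeated 1,000 times -> 1,000,000 rows with heavy repetition.
        f"col_{i}": np.repeat(rng.choice(np.arange(1.0, 10.0), n_rows // run_length),
                              run_length)
        for i in range(n_cols)
    })

    # Write with the two codecs compared in the spreadsheet (requires pyarrow).
    low_entropy.to_parquet("low_entropy_snappy.parquet", compression="snappy")
    low_entropy.to_parquet("low_entropy_brotli.parquet", compression="brotli")

The real data generation and size measurements are in the notebook linked below.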
I've also included the benchmarking code. Although some of the data examples might be small by web-scale standards, these small sizes represent a significant portion (but not all) of the computation inputs I am dealing with, so I felt it necessary to include benchmarks against them. Additionally, due to the way the data is collected, low cardinality is guaranteed in most of the columns, but repetition is not. All the data sets and benchmarking code are reproducible here:
https://github.com/simnyatsanga/python-notebooks/blob/master/arrow_parquet_benchmark.ipynb

Hopefully this adds more clarity to the questions.

Kind Regards
Simba

On Thu, 25 Jan 2018 at 15:37 simba nyatsanga <simnyatsa...@gmail.com> wrote:

> Thanks all for the great feedback!
>
> Thanks Daniel for the sample data sets. I loaded them up and they're quite comparable in size to some of the data I'm dealing with. In my case the shapes range from 150 to ~100 million rows. Column-wise they range from 2-3 columns to ~500,000 columns.
>
> Thanks Wes for the insight regarding the inverse relationship between entropy and Parquet's performance. I'm glad I now understand why my benchmarking set would have skewed the results. The data sets I'm dealing with will be fairly random, but have very low cardinality. In that benchmark I used values that range from 1 to 9. So if I understand you correctly, repetitiveness is key for Parquet's performance, as opposed to cardinality (even though the lower the cardinality, the more likely I am to have repeated values, because of the small number of possibilities).
>
> Thanks Ted for the insight as well. Can I get some clarification on what you meant when you said *You also have a very small number of rows which can penalize the system that expects to amortize column meta data over more data*? If I understand you correctly, are you saying there's a per-column metadata overhead, and this overhead is amortized, or "paid off", when I have a large amount of data? If that's the case, does that amortization also apply in the case where I used 1 million rows?
>
> Kind Regards
> Simba
>
> On Wed, 24 Jan 2018 at 21:30 Daniel Lemire <lem...@gmail.com> wrote:
>
>> Here are some realistic tabular data sets...
>>
>> https://github.com/lemire/RealisticTabularDataSets
>>
>> They are small by modern standards but they are also one GitHub clone away.
>>
>> - Daniel
>>
>> On Wed, Jan 24, 2018 at 2:26 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>
>>> Thanks Ted. I will echo these comments and recommend running tests on larger and preferably "real" datasets rather than randomly generated ones. The more repetition and the less entropy in a dataset, the better Parquet performs relative to other storage options. Web-scale datasets often exhibit these characteristics.
>>>
>>> If you can publish your benchmarking code, that would also be helpful!
>>>
>>> best
>>> Wes
>>>
>>> On Wed, Jan 24, 2018 at 1:21 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>>
>>>> Simba,
>>>>
>>>> Nice summary. I think that there may be some issues with your tests. In particular, you are storing essentially uniform random values. While that might be a viable test in some situations, there are many where there is considerably less entropy in the data being stored. For instance, if you store measurements, it is very typical to have very strong correlations. Likewise if the rows are, say, the time evolution of an optimization.
>>>> You also have a very small number of rows, which can penalize systems that expect to amortize column metadata over more data.
>>>>
>>>> This test might match your situation, but I would be leery of drawing overly broad conclusions from this single data point.
>>>>
>>>> On Jan 24, 2018 5:44 AM, "simba nyatsanga" <simnyatsa...@gmail.com> wrote:
>>>>
>>>>> Hi Uwe, thanks.
>>>>>
>>>>> I've attached a Google Sheet link:
>>>>> https://docs.google.com/spreadsheets/d/1by1vCaO2p24PLq_NAA5Ckh1n3i-SoFYrRcfi1siYKFQ/edit#gid=0
>>>>>
>>>>> Kind Regards
>>>>> Simba
>>>>>
>>>>> On Wed, 24 Jan 2018 at 15:07 Uwe L. Korn <uw...@xhochy.com> wrote:
>>>>>
>>>>>> Hello Simba,
>>>>>>
>>>>>> your plots did not come through. Try uploading them somewhere and link to them in the mails. Attachments are always stripped on Apache mailing lists.
>>>>>>
>>>>>> Uwe
>>>>>>
>>>>>> On Wed, Jan 24, 2018, at 1:48 PM, simba nyatsanga wrote:
>>>>>>
>>>>>>> Hi Everyone,
>>>>>>>
>>>>>>> I did some benchmarking to compare the on-disk size when writing Pandas DataFrames to Parquet files using Snappy and Brotli compression. I then compared these numbers with those of my current file storage solution.
>>>>>>>
>>>>>>> In my current (non-Arrow+Parquet) solution, every column in a DataFrame is extracted as a NumPy array, then compressed with blosc and stored as a binary file. Additionally, there's a small accompanying JSON file with some metadata. Attached are my results for several long and wide DataFrames:
>>>>>>>
>>>>>>> Screen Shot 2018-01-24 at 14.40.48.png
>>>>>>>
>>>>>>> I was also able to corroborate this finding by looking at the number of allocated blocks:
>>>>>>>
>>>>>>> Screen Shot 2018-01-24 at 14.45.29.png
>>>>>>>
>>>>>>> From what I gather, Brotli and Snappy perform significantly better for wide DataFrames; however, the reverse is true for long DataFrames.
>>>>>>>
>>>>>>> The DataFrames used in the benchmark are entirely composed of floats, and my understanding is that there is type-specific encoding employed in the Parquet file. Additionally, the compression codecs are applied to individual segments of the Parquet file.
>>>>>>>
>>>>>>> I'd like to get a better understanding of this disk-size disparity, specifically whether there are any additional encoding/compression headers added to the Parquet file in the long-DataFrame case.
>>>>>>>
>>>>>>> Kind Regards
>>>>>>> Simba
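P.S. For anyone skimming the thread, the disk-size comparison I keep referring to boils down to something like the sketch below: Parquet with Snappy/Brotli on one side, per-column NumPy buffers compressed with blosc on the other. This is illustrative only; the blosc codec and compression level, the temporary file paths, and the test frame shape are placeholder assumptions, not the exact settings from my storage solution or the notebook.

    import os
    import blosc
    import numpy as np
    import pandas as pd

    def blosc_per_column_bytes(df, cname="lz4", clevel=5):
        # Compress each column's raw buffer with blosc, in the spirit of my
        # current per-column storage scheme (codec/level here are assumptions).
        total = 0
        for col in df.columns:
            values = np.ascontiguousarray(df[col].values)
            packed = blosc.compress(values.tobytes(),
                                    typesize=values.dtype.itemsize,
                                    cname=cname, clevel=clevel)
            total += len(packed)
        return total

    def parquet_bytes(df, codec):
        # Write via pandas/pyarrow and measure the resulting file size.
        path = f"/tmp/bench_{codec}.parquet"
        df.to_parquet(path, compression=codec)
        return os.path.getsize(path)

    rng = np.random.RandomState(0)
    df = pd.DataFrame(rng.choice(np.arange(1.0, 10.0), size=(1_000_000, 3)),
                      columns=["a", "b", "c"])

    print("blosc per-column:", blosc_per_column_bytes(df))
    print("parquet snappy  :", parquet_bytes(df, "snappy"))
    print("parquet brotli  :", parquet_bytes(df, "brotli"))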