That's extremely helpful; thank you, Todd. (And nice to "see" you again. I interviewed you years ago.)
-----Original Message-----
From: Todd Lipcon [mailto:t...@cloudera.com]
Sent: Thursday, February 25, 2016 2:23 PM
To: dev@arrow.apache.org
Subject: Re: Comparing with Parquet

I would say that another key difference is that Parquet puts a lot of effort into encodings and compression, while Arrow is mostly about an efficient representation to run operators over directly -- e.g., simple arrays in memory vs. bit-packed, RLE-encoded data on disk.

-Todd

On Thu, Feb 25, 2016 at 11:20 AM, Andrew Brust <andrew.br...@bluebadgeinsights.com> wrote:
> Is there a dumbed-down summary of how and why in-memory and on-disk
> formats differ? Is it mostly about aligning things for
> SIMD/vectorization?
>
> There is probably some ignorance in my question, but I'm comfortable
> with that. :-)
>
> -----Original Message-----
> From: Wes McKinney [mailto:w...@cloudera.com]
> Sent: Thursday, February 25, 2016 12:12 PM
> To: dev@arrow.apache.org
> Subject: Re: Comparing with Parquet
>
> We wrote about this in a recent blog post:
>
> http://blog.cloudera.com/blog/2016/02/introducing-apache-arrow-a-fast-interoperable-in-memory-columnar-data-structure-standard/
>
> "Apache Parquet is a compact, efficient columnar data storage designed for
> storing large amounts of data stored in HDFS. Arrow is an ideal in-memory
> “container” for data that has been deserialized from a Parquet file, and
> similarly in-memory Arrow data can be serialized to Parquet and written out
> to a filesystem like HDFS or Amazon S3. Arrow and Parquet are thus companion
> projects."
>
> For example, one of my personal motivations for being involved in both Arrow
> and Parquet is to use Arrow as the in-memory container for data deserialized
> from Parquet for use in Python and R.
>
> - Wes
>
> On Thu, Feb 25, 2016 at 8:20 AM, Henry Robinson <he...@cloudera.com> wrote:
>> Think of Parquet as a format well suited to writing very large datasets to
>> disk, whereas Arrow is a format best suited to efficient storage in memory.
>> You might read Parquet files from disk and then materialize them in memory
>> in Arrow's format.
>>
>> Both formats are designed around the idiosyncrasies of their target medium.
>> To give just one example, Parquet is not designed for efficient random
>> access because disks aren't good at that, whereas Arrow has fast random
>> access as a core design principle.
>>
>> Henry
>>
>>> On Feb 25, 2016, at 8:10 AM, Sourav Mazumder <sourav.mazumde...@gmail.com> wrote:
>>>
>>> Hi All,
>>>
>>> I'm new to this and still trying to figure out where exactly Arrow fits
>>> into the ecosystem of various Big Data technologies.
>>>
>>> In that respect, the first thing that came to my mind is how Arrow
>>> compares with Parquet.
>>>
>>> In my understanding, Parquet also supports a very efficient columnar
>>> format (with support for nested structures). It is already embraced
>>> (supported) by various technologies like Impala (its origin), Spark, Drill, etc.
>>>
>>> The only thing I see missing in Parquet is support for SIMD-based
>>> vectorized operations.
>>>
>>> Am I right, or am I missing many other differences between Arrow and
>>> Parquet?
>>>
>>> Regards,
>>> Sourav

--
Todd Lipcon
Software Engineer, Cloudera
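[Editor's note: the "companion projects" round trip that Wes and Henry describe -- Parquet as the compact, encoded on-disk format and Arrow as the plain, randomly accessible in-memory format -- looks roughly like the following in Python. This is a minimal sketch, assuming the pyarrow bindings (pyarrow.parquet's read_table/write_table); the file name and data are hypothetical.]

import pyarrow as pa
import pyarrow.parquet as pq

# Build an Arrow table in memory: plain, contiguous columnar arrays.
table = pa.table({
    "id": pa.array([1, 2, 3, 4], type=pa.int64()),
    "city": pa.array(["NYC", "NYC", "SF", "SF"]),  # low-cardinality column
})

# Serialize to Parquet: columns get encoded (e.g. dictionary/RLE) and
# compressed, trading random access for compact storage on disk.
pq.write_table(table, "example.parquet", compression="snappy")

# Deserialize back into Arrow: the in-memory "container" for the data,
# where each column is a simple array supporting direct element access.
roundtrip = pq.read_table("example.parquet")
print(roundtrip.column("city")[2])  # random access into the in-memory column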