That's extremely helpful; thank you, Todd. (And nice to "see" you again. I interviewed you years ago.)
-----Original Message-----
From: Todd Lipcon [mailto:t...@cloudera.com]
Sent: Thursday, February 25, 2016 2:23 PM
To: dev@arrow.apache.org
Subject: Re: Comparing with Parquet

I would say that another key difference is that Parquet puts a lot of effort into encodings and compression, while Arrow is mostly about an efficient representation to run operators over directly -- e.g., simple arrays in memory vs. bit-packed, RLE-encoded data on disk.

-Todd

On Thu, Feb 25, 2016 at 11:20 AM, Andrew Brust <andrew.br...@bluebadgeinsights.com> wrote:
> Is there a dumbed-down summary of how and why in-memory and on-disk
> formats differ? Is it mostly about aligning things for
> SIMD/vectorization?
>
> There is probably some ignorance in my question, but I'm comfortable
> with that. :-)
>
> -----Original Message-----
> From: Wes McKinney [mailto:w...@cloudera.com]
> Sent: Thursday, February 25, 2016 12:12 PM
> To: dev@arrow.apache.org
> Subject: Re: Comparing with Parquet
>
> We wrote about this in a recent blog post:
>
> http://blog.cloudera.com/blog/2016/02/introducing-apache-arrow-a-fast-interoperable-in-memory-columnar-data-structure-standard/
>
> "Apache Parquet is a compact, efficient columnar data storage designed for
> storing large amounts of data stored in HDFS. Arrow is an ideal in-memory
> “container” for data that has been deserialized from a Parquet file, and
> similarly in-memory Arrow data can be serialized to Parquet and written out
> to a filesystem like HDFS or Amazon S3. Arrow and Parquet are thus companion
> projects."
>
> For example, one of my personal motivations for being involved in both Arrow
> and Parquet is to use Arrow as the in-memory container for data deserialized
> from Parquet for use in Python and R.
>
> - Wes
>
> On Thu, Feb 25, 2016 at 8:20 AM, Henry Robinson <he...@cloudera.com> wrote:
>> Think of Parquet as a format well suited to writing very large datasets to
>> disk, whereas Arrow is a format best suited to efficient storage in memory.
>> You might read Parquet files from disk and then materialize them in memory
>> in Arrow's format.
>>
>> Both formats are designed around the idiosyncrasies of their target medium.
>> To give just one example, Parquet is not designed for efficient random
>> access because disks aren't good at that, whereas Arrow has fast random
>> access as a core design principle.
>>
>> Henry
>>
>>> On Feb 25, 2016, at 8:10 AM, Sourav Mazumder <sourav.mazumde...@gmail.com> wrote:
>>>
>>> Hi All,
>>>
>>> I'm new to this and still trying to figure out where exactly Arrow fits
>>> into the ecosystem of various Big Data technologies.
>>>
>>> In that respect, the first thing that came to my mind is how Arrow
>>> compares with Parquet.
>>>
>>> In my understanding, Parquet also supports a very efficient columnar
>>> format (with support for nested structures). It is already embraced
>>> (supported) by various technologies like Impala (its origin), Spark, Drill, etc.
>>>
>>> The only thing I see missing in Parquet is support for SIMD-based
>>> vectorized operations.
>>>
>>> Am I right, or am I missing many other differences between Arrow and
>>> Parquet?
>>>
>>> Regards,
>>> Sourav

--
Todd Lipcon
Software Engineer, Cloudera
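[Editor's note: the "companion projects" round trip that Wes and Henry describe -- Parquet as the compact, encoded on-disk format and Arrow as the plain, randomly accessible in-memory format -- looks roughly like the following in Python. This is a minimal sketch, assuming the pyarrow bindings (pyarrow.parquet's read_table/write_table); the file name and data are hypothetical.]

import pyarrow as pa
import pyarrow.parquet as pq

# Build an Arrow table in memory: plain, contiguous columnar arrays.
table = pa.table({
    "id": pa.array([1, 2, 3, 4], type=pa.int64()),
    "city": pa.array(["NYC", "NYC", "SF", "SF"]),  # low-cardinality column
})

# Serialize to Parquet: columns get encoded (e.g. dictionary/RLE) and
# compressed, trading random access for compact storage on disk.
pq.write_table(table, "example.parquet", compression="snappy")

# Deserialize back into Arrow: the in-memory "container" for the data,
# where each column is a simple array supporting direct element access.
roundtrip = pq.read_table("example.parquet")
print(roundtrip.column("city")[2])  # random access into the in-memory column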