Re: Comparing with Parquet

Chenliang (Liang, DataSight) Thu, 25 Feb 2016 17:09:07 -0800

I agree with Henry Robinson's points.

In addition, Arrow is well suited to exchanging data efficiently, but the 
data size it can handle may be limited to the TB level. Parquet can support 
much larger data, but its performance is not good enough for fast queries.

So for PB-level data with interactive queries (response times in seconds), 
does neither of them solve the problem?

Regards
Liang
-----Original Message-----
From: Henry Robinson [mailto:[email protected]] 
Sent: February 26, 2016 0:20
To: [email protected]
Subject: Re: Comparing with Parquet

Think of Parquet as a format well-suited to writing very large datasets to 
disk, whereas Arrow is a format most suited to efficient storage in memory. You 
might read Parquet files from disk, and then materialize them in memory in 
Arrow's format. 

Both formats are designed around the idiosyncrasies of the target medium: 
Parquet is not designed to support efficient random access because disks aren't 
good at that, but Arrow has fast random access as a core design principle, to 
give just one example. 
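
As a toy illustration of that trade-off (this is not the actual layout or API 
of either project; the function names and the run-length scheme are 
assumptions made up for this sketch), a compactly encoded column has to be 
walked to find element i, while a contiguous fixed-width buffer can be indexed 
directly once it has been materialized in memory:

```python
from array import array


def rle_encode(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


def rle_get(runs, index):
    """Fetch one element by walking the runs: cost grows with the run count."""
    seen = 0
    for value, length in runs:
        seen += length
        if index < seen:
            return value
    raise IndexError(index)


def rle_decode(runs):
    """Materialize the runs into a contiguous buffer of 64-bit integers."""
    out = array("q")
    for value, length in runs:
        out.extend([value] * length)
    return out


column = [7, 7, 7, 7, 3, 3, 9, 9, 9, 9, 9, 1]
on_disk = rle_encode(column)      # compact, but sequential to read
in_memory = rle_decode(on_disk)   # larger, but in_memory[i] is O(1)
```

Here `rle_get(on_disk, 6)` and `in_memory[6]` return the same value; only the 
second avoids scanning earlier runs.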

Henry

> On Feb 25, 2016, at 8:10 AM, Sourav Mazumder <[email protected]> 
> wrote:
> 
> Hi All,
> 
> New to this. And still trying to figure out where exactly Arrow fits 
> in the ecosystem of various Big Data technologies.
> 
> In that respect first thing which came to my mind is how does Arrow 
> compare with parquet.
> 
> In my understanding Parquet also supports a very efficient columnar 
> format (with support for nested structure). It is already embraced 
> (supported) by various technologies like Impala (origin), Spark, Drill etc.
> 
> The only thing I see missing in Parquet is support for SIMD-based 
> vectorized operations.
> 
> Am I right or am I missing many other differences between Arrow and 
> Parquet?
> 
> Regards,
> Sourav
