I was wondering if someone could also elaborate on the comparison with Tachyon (now called Alluxio).

On Feb 25, 2016 5:08 PM, "Chenliang (Liang, DataSight)" <chenliang...@huawei.com> wrote:
> In favor of Henry Robinson's points.
>
> In addition: Arrow is well suited to exchanging data very efficiently, but it may only support data sizes up to the TB level. Parquet can support much larger data, but its performance can't sustain fast queries.
>
> So for PB-level data and interactive (second-level) queries, can neither of them solve the problem?
>
> Regards
> Liang
>
> -----Original Message-----
> From: Henry Robinson [mailto:he...@cloudera.com]
> Sent: Feb 26, 2016 0:20
> To: dev@arrow.apache.org
> Subject: Re: Comparing with Parquet
>
> Think of Parquet as a format well suited to writing very large datasets to disk, whereas Arrow is a format best suited to efficient storage in memory. You might read Parquet files from disk, and then materialize them in memory in Arrow's format.
>
> Both formats are designed around the idiosyncrasies of the target medium: Parquet is not designed to support efficient random access because disks aren't good at that, but Arrow has fast random access as a core design principle, to give just one example.
>
> Henry
>
> > On Feb 25, 2016, at 8:10 AM, Sourav Mazumder <sourav.mazumde...@gmail.com> wrote:
> >
> > Hi All,
> >
> > New to this, and still trying to figure out where exactly Arrow fits in the ecosystem of various Big Data technologies.
> >
> > In that respect, the first thing that came to my mind is how Arrow compares with Parquet.
> >
> > In my understanding, Parquet also supports a very efficient columnar format (with support for nested structures). It is already embraced (supported) by various technologies like Impala (its origin), Spark, Drill, etc.
> >
> > The only thing I see missing in Parquet is support for SIMD-based vectorized operations.
> >
> > Am I right, or am I missing many other differences between Arrow and Parquet?
> >
> > Regards,
> > Sourav