Re: Should Nullable be a nested type?

2016-02-25 Thread Wes McKinney
Inline. On Thu, Feb 25, 2016 at 7:54 PM, Daniel Robinson wrote: > Thanks; specific responses inline below. > > I'm mostly persuaded, though—I never doubted the C++ code will end up > perfectly optimized. My main interest in this is the type definitions at > the metadata level, for which I really d

Re: Should Nullable be a nested type?

2016-02-25 Thread Daniel Robinson
Thanks; specific responses inline below. I'm mostly persuaded, though—I never doubted the C++ code will end up perfectly optimized. My main interest in this is the type definitions at the metadata level, for which I really do think Nullable should be considered a nested type like List or Struct.

Re: Should Nullable be a nested type?

2016-02-25 Thread Wes McKinney
Inline responses. On Thu, Feb 25, 2016 at 4:23 PM, Daniel Robinson wrote: > Hi Wes, > > Thanks for the response! > > I see the appeal of representing all arrays at the data structure level as > "nullable," but I think I've convinced myself that it's better to represent > them all as non-nullable,

Re: Comparing with Parquet

2016-02-25 Thread Venkat Krishnamurthy
I think they're fundamentally orthogonal. Tachyon provides a full-fledged storage system that uses memory as one of its tiers, which leaves it up to applications as to how they represent their data structures. When applications want to use Tachyon, it's up to them to decide how to serialize their

Re: Should Nullable be a nested type?

2016-02-25 Thread Wes McKinney
hi Paul, responses inline. On Thu, Feb 25, 2016 at 5:05 PM, Paul Weiss wrote: > Hi, > > For what it's worth my thoughts are as follows... Mostly from being burned > BTW > > * for doubles and ints use a min value, in Java I use Integer.MIN_VALUE > and the double equivalent. It is reasonable to a

Re: Comparing with Parquet

2016-02-25 Thread Pedro Miguel Duarte
I was wondering if someone could also elaborate on the comparison with Tachyon (now called Alluxio). On Feb 25, 2016 5:08 PM, "Chenliang (Liang, DataSight)" < chenliang...@huawei.com> wrote: > In favor of Henry Robinson's points. > > In addition. Arrow is suitable for exchanging data high efficient

Re: Comparing with Parquet

2016-02-25 Thread Chenliang (Liang, DataSight)
In favor of Henry Robinson's points. In addition: Arrow is well suited to exchanging data highly efficiently, but the data size may only reach the TB level. Parquet can support much bigger data, but its performance can't sustain fast queries. So for PB-level data and interactive queries (second-level

Re: Should Nullable be a nested type?

2016-02-25 Thread Paul Weiss
Hi,

For what it's worth, my thoughts are as follows... mostly from being burned. BTW:

* for doubles and ints, use a min value; in Java I use Integer.MIN_VALUE and the double equivalent. It is reasonable to assume a program will have one "special" value instead of null
* for collections or vectors j
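
Paul's sentinel idea and a separate validity bitmap (the scheme Arrow's spec uses) can be contrasted in a small sketch. The names and details below are illustrative only:

```python
INT_MIN = -(2**31)  # standing in for Java's Integer.MIN_VALUE

def sum_with_sentinel(values):
    # Sentinel approach: one "special" value means null, so that
    # value can never appear as real data.
    return sum(v for v in values if v != INT_MIN)

def sum_with_bitmap(values, valid):
    # Bitmap approach: a separate validity mask, so the full integer
    # range stays usable as data.
    return sum(v for v, ok in zip(values, valid) if ok)

data = [1, INT_MIN, 3]
assert sum_with_sentinel(data) == 4
assert sum_with_bitmap([1, 2, 3], [True, False, True]) == 4
```

The sentinel version is branch-free to store but sacrifices a legal value; the bitmap version costs one bit per slot and keeps the value domain intact.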

Re: Should Nullable be a nested type?

2016-02-25 Thread Daniel Robinson
Hi Wes, Thanks for the response! I see the appeal of representing all arrays at the data structure level as "nullable," but I think I've convinced myself that it's better to represent them all as non-nullable, and bolt on "nullability" as a wrapper type. In addition to the reasons I mentioned yes
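
The "nullability as a wrapper type" idea can be sketched as a type tree. This is an illustrative model with made-up names, not the actual Arrow metadata: every type is non-nullable by itself, and Nullable is one more single-child nested type alongside List and Struct.

```python
from dataclasses import dataclass, field
from typing import List as PyList

@dataclass
class DataType:
    # Hypothetical metadata node: a name plus child types, the same
    # shape List and Struct already use.
    name: str
    children: PyList["DataType"] = field(default_factory=list)

def nullable(child: DataType) -> DataType:
    # Nullable is just another single-child nested type.
    return DataType("nullable", [child])

int32 = DataType("int32")
# A nullable list of non-nullable int32 values:
ty = nullable(DataType("list", [int32]))
```

Under this model, nullability composes like any other nesting rather than being a flag carried on every type.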

Re: Question about mutability

2016-02-25 Thread Todd Lipcon
On Thu, Feb 25, 2016 at 12:05 PM, Henry Robinson wrote: > The way I'm thinking about is that someone upstream makes a Kudu-specific > request, but as part of that request provides a descriptor of a shared > ring-buffer. Reading Arrow batches from and writing to that buffer is part > of a simple st
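
The shared ring-buffer idea can be sketched in miniature. This is a toy single-producer/single-consumer queue over an ordinary bytearray; a real transport would place it in shared memory and add synchronization:

```python
class RingBuffer:
    # Minimal SPSC ring buffer. Monotonic read/write counters are
    # reduced mod `size` on access, so wrap-around is implicit.
    def __init__(self, size):
        self.buf = bytearray(size)
        self.size = size
        self.read = 0
        self.write = 0

    def put(self, data: bytes) -> bool:
        if self.size - (self.write - self.read) < len(data):
            return False  # not enough free space
        for b in data:
            self.buf[self.write % self.size] = b
            self.write += 1
        return True

    def get(self, n: int) -> bytes:
        n = min(n, self.write - self.read)
        out = bytes(self.buf[(self.read + i) % self.size] for i in range(n))
        self.read += n
        return out

rb = RingBuffer(8)
assert rb.put(b"arrow")
assert rb.get(5) == b"arrow"
```

In the scheme described above, the RPC request would carry a descriptor naming such a buffer, and Arrow batches would be framed into it.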

Re: Question about mutability

2016-02-25 Thread Henry Robinson
On 25 February 2016 at 11:57, Todd Lipcon wrote: > On Thu, Feb 25, 2016 at 11:48 AM, Henry Robinson > wrote: > > It seems like Arrow would benefit from a complementary effort to define a > > (simple) streaming memory transfer protocol between processes. Although > Wes > > mentioned RPC earlier,

RDMA, shared memory etc

2016-02-25 Thread Venkat Krishnamurthy
All, We're very interested in exploring how Arrow can be used in traditional scientific computing installations. I've read several initial overviews and am particularly interested in the mentions of RDMA where available, since that is a standard capability on HPC platforms. Are there specific de

Re: Comparing with Parquet

2016-02-25 Thread Jason Altekruse
That being said, sometimes encodings can be complementary to processing. Especially in the case of RLE, if a value is stored only once but in a way that represents a value shared across many rows, you only need to do the calculation once. This type of optimization is something that I
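
Jason's compute-once-per-run point in miniature (an illustrative run-length layout, not Parquet's actual encoding):

```python
def rle_sum(runs):
    # runs: list of (value, count) pairs. The per-value work (here
    # just the multiply) happens once per run, not once per row.
    return sum(value * count for value, count in runs)

# The row sequence 1,1,1,1,5,5,2 encoded as runs:
assert rle_sum([(1, 4), (5, 2), (2, 1)]) == 16
```

For low-cardinality columns with long runs, an operator that understands the encoding touches far fewer values than one scanning a decoded array.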

Re: Question about mutability

2016-02-25 Thread Todd Lipcon
On Thu, Feb 25, 2016 at 11:48 AM, Henry Robinson wrote: > It seems like Arrow would benefit from a complementary effort to define a > (simple) streaming memory transfer protocol between processes. Although Wes > mentioned RPC earlier, I'd hope that's really a placeholder for "fast > inter-process

RE: Comparing with Parquet

2016-02-25 Thread Andrew Brust
Also extremely helpful; thank you! -Original Message- From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, February 25, 2016 2:46 PM To: dev@arrow.apache.org Subject: Re: Comparing with Parquet To put it even more layman, on-disk formats are typically designed for more permane

Re: Question about mutability

2016-02-25 Thread Henry Robinson
On 25 February 2016 at 11:35, Todd Lipcon wrote: > One thing to keep in mind is that shared memory is not a performance > panacea. > > We did some experimentation (actually, an intern on our team did -- > credit where credit is due) with shared memory transport between the > Kudu C++ client and s

Re: Comparing with Parquet

2016-02-25 Thread Reynold Xin
To put it in even more layman's terms: on-disk formats are typically designed for more permanent storage on disks/SSDs, and as a result the format would want to reduce the size, because: 1. Some clusters are bottlenecked by the amount of disk space available. In these cases, you'd want to compress
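
The size argument is easy to demonstrate on the kind of repetitive, low-cardinality data columnar files often hold. A generic codec stands in here for the format-level compression an on-disk layout would use:

```python
import zlib

# Columnar, repetitive data (think of a country-code column).
raw = b"US" * 10_000
packed = zlib.compress(raw)

# Compression trades CPU on read/write for a much smaller footprint,
# which is the right trade on disk but often not in memory.
assert len(packed) < len(raw) // 100
```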

Re: Question about mutability

2016-02-25 Thread Todd Lipcon
One thing to keep in mind is that shared memory is not a performance panacea. We did some experimentation (actually, an intern on our team did -- credit where credit is due) with shared memory transport between the Kudu C++ client and server. What we found was that, for single-batch transfers, sha

RE: Comparing with Parquet

2016-02-25 Thread Andrew Brust
That's extremely helpful, thank you Todd. (And nice to "see" you again. I interviewed you years ago.) -Original Message- From: Todd Lipcon [mailto:t...@cloudera.com] Sent: Thursday, February 25, 2016 2:23 PM To: dev@arrow.apache.org Subject: Re: Comparing with Parquet I would say that

Re: Comparing with Parquet

2016-02-25 Thread Todd Lipcon
I would say that another key difference is that Parquet puts a lot of effort into encodings and compression, while Arrow is mostly about an efficient representation to run operators over directly, e.g., simple arrays in memory vs. bitpacked RLE-encoded data on disk. -Todd On Thu, Feb 25, 2016 at 11:20 AM, A
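
Todd's contrast ("simple arrays in memory vs. bitpacked data on disk") can be made concrete with a toy bit-packer; the layout below is illustrative, not Parquet's actual bit-packing scheme:

```python
def bit_pack(values, width):
    # On-disk style: pack each value into `width` bits. Compact, but
    # every read needs shifts and masks to reassemble a value.
    word = 0
    for i, v in enumerate(values):
        word |= v << (i * width)
    return word.to_bytes((len(values) * width + 7) // 8, "little")

def bit_unpack(data, width, count):
    word = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(count)]

vals = [3, 1, 2, 0, 3]
packed = bit_pack(vals, 2)   # 5 values in 10 bits -> 2 bytes
assert bit_unpack(packed, 2, 5) == vals
assert len(packed) == 2
```

An in-memory format like Arrow keeps the plain array instead, so operators index it directly without the unpack step.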

RE: Comparing with Parquet

2016-02-25 Thread Andrew Brust
Is there a dumbed-down summary of how and why in-mem and on-disk formats differ? Is it mostly around aligning things for SIMD/vectorization? There is probably some ignorance in my question, but I'm comfortable with that. :-) -Original Message- From: Wes McKinney [mailto:

Re: Comparing with Parquet

2016-02-25 Thread Wes McKinney
We wrote about this in a recent blog post: http://blog.cloudera.com/blog/2016/02/introducing-apache-arrow-a-fast-interoperable-in-memory-columnar-data-structure-standard/ "Apache Parquet is a compact, efficient columnar data storage designed for storing large amounts of data stored in HDFS. Arrow

Re: Question about mutability

2016-02-25 Thread Wes McKinney
hi Leif -- you've articulated almost exactly my vision for pandas interoperability with Spark via Arrow. There are some questions to sort out, like shared memory / mmap management and security / sandboxing questions, but in general moving toward a model where RPCs contain shared memory offsets to
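
The "RPCs contain shared memory offsets" model can be sketched in one process using an mmap'd file as the shared region. A real system would use named shared memory and an actual RPC layer; the descriptor shape here is hypothetical:

```python
import mmap, os, tempfile

# The bulk payload lives in a shared region; the RPC message carries
# only a small descriptor (offset, length) into that region.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 4096)
region = mmap.mmap(fd, 4096)

# "Producer" places a record batch payload in the region...
payload = b"arrow record batch bytes"
offset = 128
region[offset:offset + len(payload)] = payload

# ...and the RPC carries only this tiny descriptor.
rpc_message = {"offset": offset, "length": len(payload)}

# "Consumer" maps the same region and reads in place: the payload
# itself never crosses the RPC channel.
o, n = rpc_message["offset"], rpc_message["length"]
assert bytes(region[o:o + n]) == payload

region.close()
os.close(fd)
os.unlink(path)
```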

Re: Comparing with Parquet

2016-02-25 Thread Henry Robinson
Think of Parquet as a format well-suited to writing very large datasets to disk, whereas Arrow is a format most suited to efficient storage in memory. You might read Parquet files from disk, and then materialize them in memory in Arrow's format. Both formats are designed around the idiosyncras

Comparing with Parquet

2016-02-25 Thread Sourav Mazumder
Hi All, New to this, and still trying to figure out where exactly Arrow fits in the ecosystem of various Big Data technologies. In that respect, the first thing that came to mind is how Arrow compares with Parquet. In my understanding, Parquet also supports a very efficient columnar format (wi