Re: Should Nullable be a nested type?

2016-02-25 Thread Wes McKinney
Inline. On Thu, Feb 25, 2016 at 7:54 PM, Daniel Robinson wrote: > Thanks; specific responses inline below. > > I'm mostly persuaded, though—I never doubted the C++ code will end up > perfectly optimized. My main interest in this is the type definitions at > the metadata level, for which I really d

Re: Should Nullable be a nested type?

2016-02-25 Thread Daniel Robinson
Thanks; specific responses inline below. I'm mostly persuaded, though—I never doubted the C++ code will end up perfectly optimized. My main interest in this is the type definitions at the metadata level, for which I really do think Nullable should be considered a nested type like List or Struct.

Re: Should Nullable be a nested type?

2016-02-25 Thread Wes McKinney
Inline responses. On Thu, Feb 25, 2016 at 4:23 PM, Daniel Robinson wrote: > Hi Wes, > > Thanks for the response! > > I see the appeal of representing all arrays at the data structure level as > "nullable," but I think I've convinced myself that it's better to represent > them all as non-nullable,

Re: Comparing with Parquet

2016-02-25 Thread Venkat Krishnamurthy
I think they're fundamentally orthogonal. Tachyon provides a full-fledged storage system that uses memory as one of its tiers, which leaves it up to applications as to how they represent their data structures. When applications want to use Tachyon, it's up to them to decide how to serialize their

Re: Should Nullable be a nested type?

2016-02-25 Thread Wes McKinney
hi Paul, responses inline. On Thu, Feb 25, 2016 at 5:05 PM, Paul Weiss wrote: > Hi, > > For what it's worth my thoughts are as follows... Mostly from being burned > BTW > > * for doubles and ints use a min value, in Java I use Integer.MIN_VALUE > and the double equivalent. It is reasonable to a

Re: Comparing with Parquet

2016-02-25 Thread Pedro Miguel Duarte
I was wondering if someone could also elaborate on the comparison with Tachyon (now called Alluxio). On Feb 25, 2016 5:08 PM, "Chenliang (Liang, DataSight)" < chenliang...@huawei.com> wrote: > In favor of Henry Robinson's points. > > In addition. Arrow is suitable for exchanging data high efficient

Re: Comparing with Parquet

2016-02-25 Thread Chenliang (Liang, DataSight)
In favor of Henry Robinson's points. In addition: Arrow is well suited to exchanging data highly efficiently, but the data size may only reach the TB level. Parquet can support much bigger data, but its performance can't sustain fast queries. So for PB-level data and interactive queries (second-level

Re: Should Nullable be a nested type?

2016-02-25 Thread Paul Weiss
Hi,

For what it's worth, my thoughts are as follows... mostly from being burned. BTW:

* for doubles and ints, use a min value; in Java I use Integer.MIN_VALUE and the double equivalent. It is reasonable to assume a program will have one "special" value instead of null
* for collections or vectors j
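
Paul's sentinel idea and a separate validity bitmap (the scheme Arrow's spec uses) can be contrasted in a small sketch. The names and details below are illustrative only:

```python
INT_MIN = -(2**31)  # standing in for Java's Integer.MIN_VALUE

def sum_with_sentinel(values):
    # Sentinel approach: one "special" value means null, so that
    # value can never appear as real data.
    return sum(v for v in values if v != INT_MIN)

def sum_with_bitmap(values, valid):
    # Bitmap approach: a separate validity mask, so the full integer
    # range stays usable as data.
    return sum(v for v, ok in zip(values, valid) if ok)

data = [1, INT_MIN, 3]
assert sum_with_sentinel(data) == 4
assert sum_with_bitmap([1, 2, 3], [True, False, True]) == 4
```

The sentinel version is branch-free to store but sacrifices a legal value; the bitmap version costs one bit per slot and keeps the value domain intact.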

Re: Should Nullable be a nested type?

2016-02-25 Thread Daniel Robinson
Hi Wes, Thanks for the response! I see the appeal of representing all arrays at the data structure level as "nullable," but I think I've convinced myself that it's better to represent them all as non-nullable, and bolt on "nullability" as a wrapper type. In addition to the reasons I mentioned yes
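
The "nullability as a wrapper type" idea can be sketched as a type tree. This is an illustrative model with made-up names, not the actual Arrow metadata: every type is non-nullable by itself, and Nullable is one more single-child nested type alongside List and Struct.

```python
from dataclasses import dataclass, field
from typing import List as PyList

@dataclass
class DataType:
    # Hypothetical metadata node: a name plus child types, the same
    # shape List and Struct already use.
    name: str
    children: PyList["DataType"] = field(default_factory=list)

def nullable(child: DataType) -> DataType:
    # Nullable is just another single-child nested type.
    return DataType("nullable", [child])

int32 = DataType("int32")
# A nullable list of non-nullable int32 values:
ty = nullable(DataType("list", [int32]))
```

Under this model, nullability composes like any other nesting rather than being a flag carried on every type.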

Re: Question about mutability

2016-02-25 Thread Todd Lipcon
On Thu, Feb 25, 2016 at 12:05 PM, Henry Robinson wrote: > The way I'm thinking about is that someone upstream makes a Kudu-specific > request, but as part of that request provides a descriptor of a shared > ring-buffer. Reading Arrow batches from and writing to that buffer is part > of a simple st
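
The shared ring-buffer idea can be sketched in miniature. This is a toy single-producer/single-consumer queue over an ordinary bytearray; a real transport would place it in shared memory and add synchronization:

```python
class RingBuffer:
    # Minimal SPSC ring buffer. Monotonic read/write counters are
    # reduced mod `size` on access, so wrap-around is implicit.
    def __init__(self, size):
        self.buf = bytearray(size)
        self.size = size
        self.read = 0
        self.write = 0

    def put(self, data: bytes) -> bool:
        if self.size - (self.write - self.read) < len(data):
            return False  # not enough free space
        for b in data:
            self.buf[self.write % self.size] = b
            self.write += 1
        return True

    def get(self, n: int) -> bytes:
        n = min(n, self.write - self.read)
        out = bytes(self.buf[(self.read + i) % self.size] for i in range(n))
        self.read += n
        return out

rb = RingBuffer(8)
assert rb.put(b"arrow")
assert rb.get(5) == b"arrow"
```

In the scheme described above, the RPC request would carry a descriptor naming such a buffer, and Arrow batches would be framed into it.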

Re: Question about mutability

2016-02-25 Thread Henry Robinson
On 25 February 2016 at 11:57, Todd Lipcon wrote: > On Thu, Feb 25, 2016 at 11:48 AM, Henry Robinson > wrote: > > It seems like Arrow would benefit from a complementary effort to define a > > (simple) streaming memory transfer protocol between processes. Although > Wes > > mentioned RPC earlier,

RDMA, shared memory etc

2016-02-25 Thread Venkat Krishnamurthy
All, We're very interested in exploring how Arrow can be used in traditional scientific computing installations. I've read several initial overviews and am particularly interested in the mentions of RDMA where available, since that is a standard capability on HPC platforms. Are there specific de

Re: Comparing with Parquet

2016-02-25 Thread Jason Altekruse
That being said, sometimes encodings can be complementary to processing. Especially in the case of RLE, if a value is stored only once but in a way that represents a value shared across many rows, you only need to do the calculation once. This type of optimization is something that I
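
Jason's compute-once-per-run point in miniature (an illustrative run-length layout, not Parquet's actual encoding):

```python
def rle_sum(runs):
    # runs: list of (value, count) pairs. The per-value work (here
    # just the multiply) happens once per run, not once per row.
    return sum(value * count for value, count in runs)

# The row sequence 1,1,1,1,5,5,2 encoded as runs:
assert rle_sum([(1, 4), (5, 2), (2, 1)]) == 16
```

For low-cardinality columns with long runs, an operator that understands the encoding touches far fewer values than one scanning a decoded array.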

Re: Question about mutability

2016-02-25 Thread Todd Lipcon
On Thu, Feb 25, 2016 at 11:48 AM, Henry Robinson wrote: > It seems like Arrow would benefit from a complementary effort to define a > (simple) streaming memory transfer protocol between processes. Although Wes > mentioned RPC earlier, I'd hope that's really a placeholder for "fast > inter-process

RE: Comparing with Parquet

2016-02-25 Thread Andrew Brust
Also extremely helpful; thank you! -Original Message- From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, February 25, 2016 2:46 PM To: dev@arrow.apache.org Subject: Re: Comparing with Parquet To put it even more layman, on-disk formats are typically designed for more permane

Re: Question about mutability

2016-02-25 Thread Henry Robinson
On 25 February 2016 at 11:35, Todd Lipcon wrote: > One thing to keep in mind is that shared memory is not a performance > panacea. > > We did some experimentation (actually, an intern on our team did -- > credit where credit is due) with shared memory transport between the > Kudu C++ client and s

Re: Comparing with Parquet

2016-02-25 Thread Reynold Xin
To put it in even more layman's terms: on-disk formats are typically designed for more permanent storage on disks/SSDs, and as a result the format would want to reduce the size, because: 1. Some clusters are bottlenecked by the amount of disk space available. In these cases, you'd want to compress
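
The size argument is easy to demonstrate on the kind of repetitive, low-cardinality data columnar files often hold. A generic codec stands in here for the format-level compression an on-disk layout would use:

```python
import zlib

# Columnar, repetitive data (think of a country-code column).
raw = b"US" * 10_000
packed = zlib.compress(raw)

# Compression trades CPU on read/write for a much smaller footprint,
# which is the right trade on disk but often not in memory.
assert len(packed) < len(raw) // 100
```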

Re: Question about mutability

2016-02-25 Thread Todd Lipcon
One thing to keep in mind is that shared memory is not a performance panacea. We did some experimentation (actually, an intern on our team did -- credit where credit is due) with shared memory transport between the Kudu C++ client and server. What we found was that, for single-batch transfers, sha

RE: Comparing with Parquet

2016-02-25 Thread Andrew Brust
That's extremely helpful, thank you Todd. (And nice to "see" you again. I interviewed you years ago.) -Original Message- From: Todd Lipcon [mailto:t...@cloudera.com] Sent: Thursday, February 25, 2016 2:23 PM To: dev@arrow.apache.org Subject: Re: Comparing with Parquet I would say that

Re: Comparing with Parquet

2016-02-25 Thread Todd Lipcon
I would say that another key difference is that Parquet puts a lot of effort into encodings and compression, while Arrow is mostly about an efficient representation to run operators over directly, e.g., simple arrays in memory vs. bitpacked RLE-encoded data on disk. -Todd On Thu, Feb 25, 2016 at 11:20 AM, A
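
Todd's contrast ("simple arrays in memory vs. bitpacked data on disk") can be made concrete with a toy bit-packer; the layout below is illustrative, not Parquet's actual bit-packing scheme:

```python
def bit_pack(values, width):
    # On-disk style: pack each value into `width` bits. Compact, but
    # every read needs shifts and masks to reassemble a value.
    word = 0
    for i, v in enumerate(values):
        word |= v << (i * width)
    return word.to_bytes((len(values) * width + 7) // 8, "little")

def bit_unpack(data, width, count):
    word = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(count)]

vals = [3, 1, 2, 0, 3]
packed = bit_pack(vals, 2)   # 5 values in 10 bits -> 2 bytes
assert bit_unpack(packed, 2, 5) == vals
assert len(packed) == 2
```

An in-memory format like Arrow keeps the plain array instead, so operators index it directly without the unpack step.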

RE: Comparing with Parquet

2016-02-25 Thread Andrew Brust
Is there a dumbed-down summary of how and why in-mem and on-disk formats differ? Is it mostly around aligning things for SIMD/vectorization? There is probably some ignorance in my question, but I'm comfortable with that. :-) -Original Message- From: Wes McKinney [mailto:

Re: Comparing with Parquet

2016-02-25 Thread Wes McKinney
We wrote about this in a recent blog post: http://blog.cloudera.com/blog/2016/02/introducing-apache-arrow-a-fast-interoperable-in-memory-columnar-data-structure-standard/ "Apache Parquet is a compact, efficient columnar data storage designed for storing large amounts of data stored in HDFS. Arrow

Re: Question about mutability

2016-02-25 Thread Wes McKinney
hi Leif -- you've articulated almost exactly my vision for pandas interoperability with Spark via Arrow. There are some questions to sort out, like shared memory / mmap management and security / sandboxing questions, but in general moving toward a model where RPCs contain shared memory offsets to
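
The "RPCs contain shared memory offsets" model can be sketched in one process using an mmap'd file as the shared region. A real system would use named shared memory and an actual RPC layer; the descriptor shape here is hypothetical:

```python
import mmap, os, tempfile

# The bulk payload lives in a shared region; the RPC message carries
# only a small descriptor (offset, length) into that region.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 4096)
region = mmap.mmap(fd, 4096)

# "Producer" places a record batch payload in the region...
payload = b"arrow record batch bytes"
offset = 128
region[offset:offset + len(payload)] = payload

# ...and the RPC carries only this tiny descriptor.
rpc_message = {"offset": offset, "length": len(payload)}

# "Consumer" maps the same region and reads in place: the payload
# itself never crosses the RPC channel.
o, n = rpc_message["offset"], rpc_message["length"]
assert bytes(region[o:o + n]) == payload

region.close()
os.close(fd)
os.unlink(path)
```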

Re: Comparing with Parquet

2016-02-25 Thread Henry Robinson
Think of Parquet as a format well-suited to writing very large datasets to disk, whereas Arrow is a format most suited to efficient storage in memory. You might read Parquet files from disk, and then materialize them in memory in Arrow's format. Both formats are designed around the idiosyncras

Comparing with Parquet

2016-02-25 Thread Sourav Mazumder
Hi All, New to this, and still trying to figure out where exactly Arrow fits in the ecosystem of various Big Data technologies. In that respect, the first thing that came to mind is how Arrow compares with Parquet. In my understanding, Parquet also supports a very efficient columnar format (wi