Inline
On Thu, Feb 25, 2016 at 7:54 PM, Daniel Robinson
wrote:
> Thanks; specific responses inline below.
>
> I'm mostly persuaded, though—I never doubted the C++ code will end up
> perfectly optimized. My main interest in this is the type definitions at
> the metadata level, for which I really d
Thanks; specific responses inline below.
I'm mostly persuaded, though—I never doubted the C++ code will end up
perfectly optimized. My main interest in this is the type definitions at
the metadata level, for which I really do think Nullable should be
considered a nested type like List or Struct.
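To make the idea concrete, here is a minimal sketch (in Python, with entirely hypothetical names, not Arrow's actual metadata) of what treating Nullable as a nested wrapper type, on the same footing as List or Struct, might look like:

```python
from dataclasses import dataclass

# Hypothetical metadata-level type definitions: every leaf type is
# non-nullable, and nullability is a wrapper type, just like List.
@dataclass(frozen=True)
class Int64:
    pass

@dataclass(frozen=True)
class List:
    item: object  # element type

@dataclass(frozen=True)
class Nullable:
    inner: object  # the wrapped, otherwise non-nullable type

# The nesting then distinguishes two cases a flat "nullable" flag cannot:
a = Nullable(List(Int64()))  # a nullable list of non-null ints
b = List(Nullable(Int64()))  # a non-null list of nullable ints
```

The point of the sketch is only that, as a nested type, nullability composes at every level of the schema rather than being a single boolean on the column.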
Inline responses
On Thu, Feb 25, 2016 at 4:23 PM, Daniel Robinson
wrote:
> Hi Wes,
>
> Thanks for the response!
>
> I see the appeal of representing all arrays at the data structure level as
> "nullable," but I think I've convinced myself that it's better to represent
> them all as non-nullable,
I think they're fundamentally orthogonal.
Tachyon provides a full-fledged storage system that uses memory as one of
its tiers, which leaves it up to applications as to how they represent
their data structures. When applications want to use Tachyon, it's up to
them to decide how to serialize their
hi Paul,
responses inline.
On Thu, Feb 25, 2016 at 5:05 PM, Paul Weiss wrote:
> Hi,
>
> For what it's worth my thoughts are as follows... Mostly from being burned
> BTW
>
> * for doubles and ints use a min value, in Java I use Integer.MIN_VALUE
> and the double equivalent. It is reasonable to a
I was wondering if someone could also elaborate on the comparison with

Tachyon (now called Alluxio)
On Feb 25, 2016 5:08 PM, "Chenliang (Liang, DataSight)" <
chenliang...@huawei.com> wrote:
> In favor of Henry Robinson's points.
>
> In addition. Arrow is suitable for exchanging data high efficient
In favor of Henry Robinson's points.
In addition, Arrow is well suited to exchanging data with high efficiency, but the
data sizes it handles may only reach the TB level. Parquet can support much larger
data, but its performance can't sustain fast queries.
So for PB-level data and interactive queries (second-level
Hi,
For what it's worth my thoughts are as follows... Mostly from being burned
BTW
* for doubles and ints use a min value, in Java I use Integer.MIN_VALUE
and the double equivalent. It is reasonable to assume a program will have
one "special" value instead of null
* for collections or vectors j
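A minimal sketch of the sentinel approach Paul describes (translated to Python for illustration; `INT_NULL` and `is_null` are hypothetical names, not part of any library): one reserved "special" value stands in for null, so no separate validity bitmap is needed, at the cost that the sentinel can never appear as real data.

```python
# Sentinel-based nulls: reserve one value (Java's Integer.MIN_VALUE
# in Paul's suggestion) to mean "null".
INT_NULL = -(2**31)  # same value as Java's Integer.MIN_VALUE

def is_null(x):
    return x == INT_NULL

values = [3, INT_NULL, 7]
print([None if is_null(v) else v for v in values])  # [3, None, 7]
```

The trade-off versus a bitmap is that `is_null` is a value comparison on the data itself, and any genuine occurrence of the sentinel is silently lost.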
Hi Wes,
Thanks for the response!
I see the appeal of representing all arrays at the data structure level as
"nullable," but I think I've convinced myself that it's better to represent
them all as non-nullable, and bolt on "nullability" as a wrapper type. In
addition to the reasons I mentioned yes
On Thu, Feb 25, 2016 at 12:05 PM, Henry Robinson wrote:
> The way I'm thinking about is that someone upstream makes a Kudu-specific
> request, but as part of that request provides a descriptor of a shared
> ring-buffer. Reading Arrow batches from and writing to that buffer is part
> of a simple st
On 25 February 2016 at 11:57, Todd Lipcon wrote:
> On Thu, Feb 25, 2016 at 11:48 AM, Henry Robinson
> wrote:
> > It seems like Arrow would benefit from a complementary effort to define a
> > (simple) streaming memory transfer protocol between processes. Although
> Wes
> > mentioned RPC earlier,
All
We're very interested in exploring how Arrow can be used in traditional
scientific computing installations.
I've read several initial overviews, and am particularly interested in the
mentions of RDMA where available, since that is a standard capability on
HPC platforms.
Are there specific de
That being said, sometimes encodings can be complementary to processing.
Especially in the case of RLE, if a value is only stored once but stored in
a way that it represents a value shared across many rows, you only need to
do the calculation once.
This type of optimization is something that I
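The run-length-encoding optimization described above can be sketched in a few lines (a toy illustration, not any library's actual API): when a value is stored once for a run of identical rows, an operator can evaluate a function once per run instead of once per row.

```python
def rle_encode(values):
    # Collapse consecutive equal values into (value, run_length) pairs.
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def rle_map(runs, f):
    # f is applied once per run, not once per row.
    return [(f(v), n) for v, n in runs]

data = [5, 5, 5, 2, 2, 9]
runs = rle_encode(data)                   # [(5, 3), (2, 2), (9, 1)]
squared = rle_map(runs, lambda x: x * x)  # [(25, 3), (4, 2), (81, 1)]
```

Here six rows cost three evaluations; the saving grows with the average run length.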
On Thu, Feb 25, 2016 at 11:48 AM, Henry Robinson wrote:
> It seems like Arrow would benefit from a complementary effort to define a
> (simple) streaming memory transfer protocol between processes. Although Wes
> mentioned RPC earlier, I'd hope that's really a placeholder for "fast
> inter-process
Also extremely helpful; thank you!
-Original Message-
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Thursday, February 25, 2016 2:46 PM
To: dev@arrow.apache.org
Subject: Re: Comparing with Parquet
To put it in even more layman's terms, on-disk formats are typically designed for more
permane
On 25 February 2016 at 11:35, Todd Lipcon wrote:
> One thing to keep in mind is that shared memory is not a performance
> panacea.
>
> We did some experimentation (actually, an intern on our team did --
> credit where credit is due) with shared memory transport between the
> Kudu C++ client and s
To put it in even more layman's terms, on-disk formats are typically designed for more
permanent storage on disks/ssds, and as a result the format would want to
reduce the size, because:
1. For some clusters, they are bottlenecked by the amount of disk space
available. In these cases, you'd want to compress
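A toy illustration of the disk-space point (standard-library Python, nothing Arrow- or Parquet-specific): a repetitive column of int64s shrinks dramatically under general-purpose compression, which matters on disk but costs CPU to decode before an operator can touch the data.

```python
import struct
import zlib

# A highly repetitive column, packed as raw little-endian int64s.
column = [42] * 10000 + [7] * 10000
raw = struct.pack(f"<{len(column)}q", *column)

# General-purpose compression collapses the repetition.
compressed = zlib.compress(raw)
print(len(raw), len(compressed))  # compressed is a tiny fraction of raw
```

In memory, by contrast, the uncompressed flat buffer is what lets operators run directly over the data.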
One thing to keep in mind is that shared memory is not a performance panacea.
We did some experimentation (actually, an intern on our team did --
credit where credit is due) with shared memory transport between the
Kudu C++ client and server. What we found was that, for single-batch
transfers, sha
That's extremely helpful, thank you Todd.
(And nice to "see" you again. I interviewed you years ago.)
-Original Message-
From: Todd Lipcon [mailto:t...@cloudera.com]
Sent: Thursday, February 25, 2016 2:23 PM
To: dev@arrow.apache.org
Subject: Re: Comparing with Parquet
I would say that
I would say that another key difference is that Parquet puts a lot of
effort on encodings and compression, and Arrow is mostly about
efficient representation to directly run operators over. eg simple
arrays in memory vs bitpacked RLE-encoded data on disk.
-Todd
On Thu, Feb 25, 2016 at 11:20 AM, A
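Todd's contrast between "simple arrays in memory" and "bitpacked RLE-encoded data on disk" can be sketched as follows (a simplified illustration, not either format's actual encoding): a flat fixed-width buffer gives O(1) random access, while run-length-encoded data must be scanned run by run.

```python
import struct

values = [5, 5, 5, 2, 2, 9]

# Arrow-style: a flat fixed-width buffer; element i sits at byte offset 8*i.
buf = struct.pack(f"<{len(values)}q", *values)

def get(buf, i):
    return struct.unpack_from("<q", buf, 8 * i)[0]

# Parquet-style (greatly simplified): (value, run_length) pairs;
# random access has to walk the runs.
runs = [(5, 3), (2, 2), (9, 1)]

def get_rle(runs, i):
    for v, n in runs:
        if i < n:
            return v
        i -= n
    raise IndexError(i)

assert get(buf, 4) == get_rle(runs, 4) == 2
```

The RLE form is smaller; the flat form is what SIMD-friendly operators want to iterate over directly.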
Is there a dumbed-down summary of how and why in-mem and on-disk
formats differ? Is it mostly around aligning things for SIMD/vectorization?
There is probably some ignorance in my question, but I'm comfortable with that.
:-)
-Original Message-
From: Wes McKinney [mailto:
We wrote about this in a recent blog post:
http://blog.cloudera.com/blog/2016/02/introducing-apache-arrow-a-fast-interoperable-in-memory-columnar-data-structure-standard/
"Apache Parquet is a compact, efficient columnar data storage designed
for storing large amounts of data stored in HDFS. Arrow
hi Leif -- you've articulated almost exactly my vision for pandas
interoperability with Spark via Arrow. There are some questions to sort
out, like shared memory / mmap management and security / sandboxing
questions, but in general moving toward a model where RPC's contain shared
memory offsets to
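The "RPCs contain shared memory offsets" idea can be sketched with the standard library (a single-process stand-in: a real system would use a named shared segment or file-backed mapping visible to both processes, and all names here are hypothetical):

```python
import mmap
import struct

# Anonymous mapping as a stand-in for a shared memory region.
region = mmap.mmap(-1, 4096)

# "Producer" writes a record batch at some offset, then the RPC payload
# carries only the (offset, length) descriptor -- the bytes themselves
# never travel through the RPC channel.
batch = b"arrow-record-batch-bytes"
offset = 128
region[offset:offset + len(batch)] = batch
rpc_message = struct.pack("<II", offset, len(batch))

# "Consumer" decodes the descriptor and reads directly from the region.
off, length = struct.unpack("<II", rpc_message)
print(region[off:off + length])
```

The descriptor is 8 bytes regardless of batch size, which is the whole appeal: the cost of the transfer stops scaling with the data.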
Think of Parquet as a format well-suited to writing very large datasets to
disk, whereas Arrow is a format most suited to efficient storage in memory. You
might read Parquet files from disk, and then materialize them in memory in
Arrow's format.
Both formats are designed around the idiosyncras
Hi All,
New to this. And still trying to figure out where exactly Arrow fits in the
ecosystem of various Big Data technologies.
In that respect first thing which came to my mind is how does Arrow compare
with parquet.
In my understanding Parquet also supports a very efficient columnar format
(wi