That being said, encodings can sometimes be complementary to processing.
RLE is a good example: if a value is stored once but represents a value
shared across many rows, you only need to do the calculation once rather
than once per row.
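
To make that concrete, here is a minimal Java sketch of the idea; the Run
and RleCompute names are my own illustration, not part of any Arrow API:

    import java.util.List;
    import java.util.function.DoubleUnaryOperator;

    final class Run {
        final double value;  // the value shared by every row in the run
        final int length;    // how many consecutive rows hold that value
        Run(double value, int length) { this.value = value; this.length = length; }
    }

    final class RleCompute {
        // Evaluate f once per run, then fan the result out across the run's rows.
        static double[] map(List<Run> runs, DoubleUnaryOperator f) {
            int total = runs.stream().mapToInt(r -> r.length).sum();
            double[] out = new double[total];
            int pos = 0;
            for (Run r : runs) {
                double result = f.applyAsDouble(r.value); // one evaluation per run
                for (int i = 0; i < r.length; i++) {
                    out[pos++] = result;
                }
            }
            return out;
        }
    }

A run of a million identical values costs one call to f instead of a million.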

This type of optimization is something I think would be useful to include
in the Arrow spec, especially for sparse datasets. Currently the Java
implementation of Arrow, which lacks support for different encodings,
allocates memory for every possible value (including four unused bytes for
each null int value, to preserve random access to the non-null values) as
well as a secondary vector representing the nullability of each value, even
when all values are null or all values are non-null.
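
As a rough back-of-the-envelope sketch (my own arithmetic over the layout
described above, not actual Arrow Java code), the footprint looks like:

    final class IntVectorFootprint {
        static long bytesFor(int valueCount) {
            long dataBuffer = 4L * valueCount;          // 4 bytes per int slot, null or not
            long validityBuffer = (valueCount + 7) / 8; // 1 validity bit per slot, rounded up
            return dataBuffer + validityBuffer;
        }

        public static void main(String[] args) {
            // An all-null batch of one million ints still pays for the full data buffer.
            System.out.println(bytesFor(1_000_000)); // prints 4125000, roughly 4 MB
        }
    }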

This approach has two downsides. First, there is redundancy that is stored
in full fidelity rather than being compressed. Second, it spreads the
information about which values are present throughout the vector, so we end
up running a filter or a calculation over a bunch of rows that could have
been a single evaluation if we had kept better track of what we had written
into the vector and stored that information efficiently.
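
As a hedged sketch of what that single evaluation could look like, suppose
batch-level statistics recorded the null count and whether the batch holds
one constant value (BatchStats and StatAwareFilter are hypothetical names,
not Arrow APIs; per-row validity checks in the general case are omitted for
brevity):

    import java.util.function.IntPredicate;

    final class BatchStats {
        final int rowCount;
        final int nullCount;
        final Integer constantValue; // non-null if every row holds the same value
        BatchStats(int rowCount, int nullCount, Integer constantValue) {
            this.rowCount = rowCount;
            this.nullCount = nullCount;
            this.constantValue = constantValue;
        }
    }

    final class StatAwareFilter {
        // Count rows passing the predicate, short-circuiting on batch statistics.
        static int countMatches(int[] values, BatchStats stats, IntPredicate pred) {
            if (stats.nullCount == stats.rowCount) {
                return 0; // all null: nothing to evaluate
            }
            if (stats.constantValue != null) {
                // a single evaluation covers every non-null row
                return pred.test(stats.constantValue) ? stats.rowCount - stats.nullCount : 0;
            }
            int matches = 0;
            for (int v : values) { // general case: per-row evaluation
                if (pred.test(v)) matches++;
            }
            return matches;
        }
    }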

There is a trade-off to doing this kind of analysis on the fly mid-query,
as there is overhead to maintaining statistics about the batch, but since
this data will in many cases be read out of efficient disk formats like
Parquet, it makes sense to preserve this information where it is useful for
processing.
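
For what it's worth, the write-time bookkeeping is cheap: a running null
count costs one increment per appended value, far less than re-scanning the
batch later. A sketch of the kind of statistics I mean (illustrative names
only, not Arrow code):

    final class StatTrackingWriter {
        private int rowCount = 0;
        private int nullCount = 0;

        void appendNull()       { rowCount++; nullCount++; }
        void appendValue(int v) { rowCount++; /* write v into the data buffer */ }

        boolean allNull()    { return rowCount > 0 && nullCount == rowCount; }
        boolean allNonNull() { return nullCount == 0; }
    }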

On Thu, Feb 25, 2016 at 11:48 AM, Andrew Brust <
andrew.br...@bluebadgeinsights.com> wrote:

> Also extremely helpful; thank you!
>
> -----Original Message-----
> From: Reynold Xin [mailto:r...@databricks.com]
> Sent: Thursday, February 25, 2016 2:46 PM
> To: dev@arrow.apache.org
> Subject: Re: Comparing with Parquet
>
> To put it in even more layman's terms: on-disk formats are typically
> designed for more permanent storage on disks/SSDs, and as a result the
> format would want to reduce the size, because:
>
> 1. Some clusters are bottlenecked by the amount of disk space available.
> In these cases, you'd want to compress the data heavily.
>
> 2. Disks are slower than memory, and as a result things might speed up if
> data is compressed (using more CPU cycles).
>
>
> In-memory formats are typically ephemeral, and have the opposite
> characteristics.
>
>
>
>
> On Thu, Feb 25, 2016 at 11:27 AM, Andrew Brust <
> andrew.br...@bluebadgeinsights.com> wrote:
>
> > That's extremely helpful, thank you Todd.
> >
> > (And nice to "see" you again.  I interviewed you years ago.)
> >
> > -----Original Message-----
> > From: Todd Lipcon [mailto:t...@cloudera.com]
> > Sent: Thursday, February 25, 2016 2:23 PM
> > To: dev@arrow.apache.org
> > Subject: Re: Comparing with Parquet
> >
> > I would say that another key difference is that Parquet puts a lot of
> > effort into encodings and compression, while Arrow is mostly about an
> > efficient representation to run operators over directly, e.g. simple
> > arrays in memory vs. bitpacked RLE-encoded data on disk.
> >
> > -Todd
> >
> > On Thu, Feb 25, 2016 at 11:20 AM, Andrew Brust <
> > andrew.br...@bluebadgeinsights.com> wrote:
> > > Is there a dumbed-down summary of how and why in-mem and on-disk
> > > formats differ?  Is it mostly around aligning things for
> > > SIMD/vectorization?
> > >
> > > There is probably some ignorance in my question, but I'm comfortable
> > > with that. :-)
> > >
> > > -----Original Message-----
> > > From: Wes McKinney [mailto:w...@cloudera.com]
> > > Sent: Thursday, February 25, 2016 12:12 PM
> > > To: dev@arrow.apache.org
> > > Subject: Re: Comparing with Parquet
> > >
> > > We wrote about this in a recent blog post:
> > >
> > > http://blog.cloudera.com/blog/2016/02/introducing-apache-arrow-a-fast-interoperable-in-memory-columnar-data-structure-standard/
> > >
> > > "Apache Parquet is a compact, efficient columnar data storage
> > > designed
> > for storing large amounts of data stored in HDFS. Arrow is an ideal
> > in-memory “container” for data that has been deserialized from a
> > Parquet file, and similarly in-memory Arrow data can be serialized to
> > Parquet and written out to a filesystem like HDFS or Amazon S3. Arrow
> > and Parquet are thus companion projects."
> > >
> > > For example, one of my personal motivations for being involved in both
> > > Arrow and Parquet is to use Arrow as the in-memory container for data
> > > deserialized from Parquet for use in Python and R.
> > >
> > > - Wes
> > >
> > > On Thu, Feb 25, 2016 at 8:20 AM, Henry Robinson <he...@cloudera.com>
> > > wrote:
> > >> Think of Parquet as a format well-suited to writing very large
> > >> datasets to disk, whereas Arrow is a format most suited to efficient
> > >> storage in memory. You might read Parquet files from disk, and then
> > >> materialize them in memory in Arrow's format.
> > >>
> > >> Both formats are designed around the idiosyncrasies of the target
> > >> medium: Parquet is not designed to support efficient random access
> > >> because disks aren't good at that, but Arrow has fast random access as
> > >> a core design principle, to give just one example.
> > >>
> > >> Henry
> > >>
> > >>> On Feb 25, 2016, at 8:10 AM, Sourav Mazumder <
> > >>> sourav.mazumde...@gmail.com> wrote:
> > >>>
> > >>> Hi All,
> > >>>
> > >>> New to this. And still trying to figure out where exactly Arrow
> > >>> fits in the ecosystem of various Big Data technologies.
> > >>>
> > >>> In that respect first thing which came to my mind is how does
> > >>> Arrow compare with parquet.
> > >>>
> > >>> In my understanding Parquet also supports a very efficient
> > >>> columnar format (with support for nested structure). It is already
> > >>> embraced
> > >>> (supported) by various technologies like Impala (origin), Spark,
> > >>> Drill etc.
> > >>>
> > >>> The only thing I see missing in Parquet is support for SIMD-based
> > >>> vectorized operations.
> > >>>
> > >>> Am I right, or am I missing many other differences between Arrow
> > >>> and Parquet?
> > >>>
> > >>> Regards,
> > >>> Sourav
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>
