Re: [Discuss] Format additions to Arrow for sparse data and data integrity

Jacques Nadeau Fri, 05 Jul 2019 12:32:32 -0700

Hey Micah, you're formatting seems to be messed up on this mail. Some kind
of copy/paste error?


On Fri, Jul 5, 2019 at 11:54 AM Micah Kornfield <[email protected]>
wrote:

> Hi Arrow-dev,
>
> I’d like to make a straw-man proposal to cover some features that I think
> would be useful to Arrow, and that I would like to make a proof-of-concept
> implementation for in Java and C++.  In particular, the proposal covers
> allowing for smaller data sizes via compression and encoding [1][2][8],
> data integrity [3] and avoiding unnecessary data transfer [4][5].
>
> I’ve put together a PR  [6] that has proposed changes to the flatbuffer
> metadata to support the new features.  The PR introduces:
>
>    -
>
>    A new “SparseRecordBatch” that can support one of multiple possible
>    encodings (both dense and sparse), compression and column elision.
>    -
>
>    A “Digest” message type to support optional data integrity.
>
>
> Going into more details on the specific features in the PR:
>
>    1.
>
>    Sparse encodings for arrays and buffers.  The guiding principles behind
>    the suggested encodings are to support encodings that can be exploited
> by
>    compute engines for more efficient computation (I don’t think parquet
> style
>    bit-packing belongs in Arrow).  While the encodings don’t maintain O(1)
>    data element access, they support sublinear, O(log(N)), element access.
> The
>    suggested encodings are:
>    1.
>
>       Array encodings:
>       1.
>
>          Add a run-length encoding scheme to efficiently represent repeated
>          values (the actual scheme encodes run ends instead of length
> to preserve
>          sub-linear random access).
>          2.
>
>          Add a “packed” sparse representation (null values don’t take up
>          space in value buffers)
>          2.
>
>       Buffer encodings:
>       1.
>
>          Add frame of reference integer encoding [7] (this allows for lower
>          bit-width encoding of integer types by subtracting a
> “reference” value from
>          all values in the buffer).
>          2.
>
>          Add a sparse integer set encoding.  This encoding allows more
>          efficient encoding of validity bit-masks for cases when all
> values are
>          either null or not null.
>          2.
>
>    Data compression.  Similar to encodings but compression is solely for
>    reduction of data at rest/on the wire.  The proposal is to allow
>    compression of individual buffers. Right now zstd is proposed, but I
> don’t
>    feel strongly on the specific technologies here.
>    3.
>
>    Column Elision.  For some use-cases, like structured logging, the
>    overhead of including array metadata for columns with no data present
>    represents non-negligible overhead.   The proposal provides a mechanism
> for
>    omitting meta-data for such arrays.
>    4.
>
>    Data Integrity.  While the arrow file format isn’t meant for archiving
>    data, I think it is important to allow for optional native data
> integrity
>    checks in the format.  To this end, I proposed a new “Digest” message
> type
>    that can be added after other messages to record a digest/hash of the
>    preceding data. I suggested xxhash, but I don’t have a strong opinion
> here,
>    as long as there is some minimal support that can potentially be
> expanded
>    later.
>
>
> In the proposal I chose to use Tables and Unions everywhere for flexibility
> but in all likelihood some could be replaced by enums.
>
> My initial plan would be to solely focus on an IPC mechanism that can send
> a SparseRecordBatch and immediately translate it to a normal RecordBatch in
> both Java and C++.
>
> As a practical matter the proposal represents a lot of work to get an MVP
> working in time for 1.0.0 release (provided they are accepted by the
> community), so I'd greatly appreciate if anyone wants to collaborate on
> this.
>
> If it is easier I’m happy to start a separate thread for feature if people
> feel like it would make the conversation easier.  I can also create a
> Google Doc for direct comments if that is preferred.
>
> Thanks,
>
> Micah
>
>
>
> P.S. In the interest of full disclosure, these ideas evolved in
> collaboration with Brian Hulette and other colleagues at Google who are
> interested in making use of Arrow in both internal and external projects.
>
> [1] https://issues.apache.org/jira/browse/ARROW-300
>
> [2]  https://issues.apache.org/jira/browse/ARROW-5224
>
> [3]
>
> https://lists.apache.org/thread.html/36ab9c2b8b5d9f04493b3f9ea3b63c3ca3bc0f90743aa726b7a3199b@%3Cdev.arrow.apache.org%3E
>
> [4]
>
> https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
>
> [5]
>
> https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16244812#comment-16244812
>
> [6] https://github.com/apache/arrow/pull/4815
>
> [7]
>
> https://lemire.me/blog/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/
>
> [8] https://issues.apache.org/jira/browse/ARROW-5821
>

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

Reply via email to