Hi Micah,
Like Jacques, I'm not disagreeing, but I'm wondering whether these
features belong in Arrow or can be done externally. I'm mostly interested
in changes that might impact SIMD processing, considering Arrow has
already made conscious design decisions to trade memory for speed.
Apologies in advance if I've misunderstood any of the proposals.
a. Add a run-length encoding scheme to efficiently represent repeated
values (the actual scheme encodes run ends instead of length to preserve
sub-linear random access).
Couldn't one do RLE at the buffer level via a custom
FixedSizeBinary/Binary/Utf8 encoding? Perhaps as a new ExtensionType?
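For readers following along, the reason run ends (rather than lengths)
preserve sub-linear access is that they're cumulative and hence
binary-searchable. A minimal sketch, with hypothetical names rather than
the proposed layout:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Run-end-encoded column: values[i] covers logical positions
    // [run_ends[i-1], run_ends[i]).
    struct RunEndEncoded {
      std::vector<int32_t> run_ends;  // cumulative, ascending
      std::vector<double> values;

      // O(log N): find the first run end greater than the index.
      // Caller must ensure index < run_ends.back().
      double At(int32_t index) const {
        auto it = std::upper_bound(run_ends.begin(), run_ends.end(), index);
        return values[it - run_ends.begin()];
      }
    };

    // {1, 1, 1, 7, 7} encodes as run_ends = {3, 5}, values = {1, 7}.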
b. Add a “packed” sparse representation (null values don’t take up
space in value buffers)
This would be fine for simple SIMD aggregations like count/sum/mean, but
compacting away null slots complicates more advanced parallel routines
that execute independently and rely on indices aligning with an element's
logical position.
It sounds like the physical position here depends on knowing the number
of nulls up to that logical point (via something like sequentially
iterating both the data and validity buffers). An efficient parallel
routine would likely need to scan beforehand to inflate the packed
representation, where today it can simply slice/mmap the data buffer
directly.
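A sketch of that logical-to-physical translation, assuming an LSB-first
validity bitmap stored in 64-bit words and GCC/Clang's
__builtin_popcountll:

    #include <cstdint>

    // Slot of a logical position in a packed (null-free) value buffer:
    // the count of valid bits that precede it. O(N/64) per lookup unless
    // a prefix-sum index is built ahead of time.
    int64_t PhysicalIndex(const uint64_t* validity, int64_t logical) {
      int64_t count = 0;
      for (int64_t w = 0; w < logical / 64; ++w)
        count += __builtin_popcountll(validity[w]);
      uint64_t mask = (uint64_t(1) << (logical % 64)) - 1;
      count += __builtin_popcountll(validity[logical / 64] & mask);
      return count;  // meaningful only if the bit at `logical` is set
    }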
a. Add frame of reference integer encoding [7] (this allows for lower
bit-width encoding of integer types by subtracting a
“reference” value from all values in the buffer).
I agree this is useful, but couldn't it also live in userland/an
ExtensionType?
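The transform is simple enough that a proof-of-concept could live outside
the spec. An untested sketch, assuming a non-empty buffer whose range
fits in 16 bits:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Frame-of-reference: store values as small deltas from the buffer
    // minimum; decoding is a vectorizable add.
    struct ForEncoded {
      int64_t reference;
      std::vector<uint16_t> deltas;  // assumes max - min fits in 16 bits
    };

    ForEncoded Encode(const std::vector<int64_t>& values) {
      // Assumes values is non-empty.
      ForEncoded out{*std::min_element(values.begin(), values.end()), {}};
      out.deltas.reserve(values.size());
      for (int64_t v : values)
        out.deltas.push_back(static_cast<uint16_t>(v - out.reference));
      return out;
    }

    int64_t Decode(const ForEncoded& e, size_t i) {
      return e.reference + e.deltas[i];
    }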
b. Add a sparse integer set encoding. This encoding allows more
efficient encoding of validity bit-masks for cases when all values are
either null or not null.
If this is in reference to the discussion at link #4 [1], it sounds
similar to the BufferLayout metadata that used to exist but was removed
a while back [2]. Knowing the buffer layouts allows an implementation to
generically elide any buffer at will, but would probably be a lot to
bring back in. I can't say whether adding a different set of metadata
would raise the same concerns Jacques mentioned in the JIRA
thread in [2].
Data compression. Similar to encodings but compression is solely for
reduction of data at rest/on the wire. The proposal is to allow
compression of individual buffers. Right now zstd is proposed, but I don’t
feel strongly on the specific technologies here.
What's the goal for this? Random element access into compressed
in-memory columns, or compression at I/O boundaries?
* If the former, is Parquet a better alternative here? Again, I'm
cautious about the impact on parallel routines. CPU speeds are
plateauing while memory and tx/rx bandwidth keep growing, and compressed
element access sits on the CPU side of that equation (meanwhile parallel
deflate already exists, and I remember seeing research into parallel
inflate).
* If the latter, could we do a comparison of Arrow dictionary-encoding +
different compression formats vs. building them into the spec? I know
content-aware compression yields significant size reductions, but I
wonder whether the maintenance burden on Arrow contributors is worth it
compared to a simpler dictionary-encoding + streaming gzip.
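To spell out the baseline I have in mind: dictionary-encode first so the
generic compressor only sees small integer indices plus each unique value
once. A toy sketch (hypothetical names, not Arrow's actual API):

    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Dictionary encoding: unique values stored once, data becomes
    // integer indices that gzip/zstd can then squeeze further.
    struct DictEncoded {
      std::vector<std::string> dictionary;
      std::vector<int32_t> indices;
    };

    DictEncoded DictEncode(const std::vector<std::string>& values) {
      DictEncoded out;
      std::unordered_map<std::string, int32_t> seen;
      for (const auto& v : values) {
        auto [it, inserted] =
            seen.emplace(v, static_cast<int32_t>(out.dictionary.size()));
        if (inserted) out.dictionary.push_back(v);
        out.indices.push_back(it->second);
      }
      return out;
    }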
Data Integrity. While the Arrow file format isn’t meant for archiving
data, I think it is important to allow for optional native data integrity
checks in the format. To this end, I proposed a new “Digest” message type
that can be added after other messages to record a digest/hash of the
preceding data. I suggested xxhash, but I don’t have a strong opinion here,
as long as there is some minimal support that can potentially be expanded
later.
:thumbs up:
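For anyone weighing the cost: computing or verifying such a digest is a
single pass over the preceding bytes. A sketch assuming the xxHash C
library:

    #include <cstddef>
    #include <cstdint>
    #include <xxhash.h>  // https://github.com/Cyan4973/xxHash

    // Hash the bytes of the preceding IPC messages so a reader can
    // check them against a trailing "Digest" message.
    uint64_t DigestOf(const uint8_t* data, size_t length) {
      return XXH64(data, length, /*seed=*/0);
    }

    bool Verify(const uint8_t* data, size_t length, uint64_t expected) {
      return XXH64(data, length, 0) == expected;
    }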
Best,
Paul
1. https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
2. https://issues.apache.org/jira/browse/ARROW-1693?focusedCommentId=16236902&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16236902
On 7/5/19 11:53 AM, Micah Kornfield wrote:
Hi Arrow-dev,
I’d like to make a straw-man proposal to cover some features that I think
would be useful to Arrow, and that I would like to make a proof-of-concept
implementation for in Java and C++. In particular, the proposal covers
allowing for smaller data sizes via compression and encoding [1][2][8],
data integrity [3] and avoiding unnecessary data transfer [4][5].
I’ve put together a PR [6] that has proposed changes to the flatbuffer
metadata to support the new features. The PR introduces:
- A new “SparseRecordBatch” that can support one of multiple possible
  encodings (both dense and sparse), compression and column elision.
- A “Digest” message type to support optional data integrity.
Going into more details on the specific features in the PR:
1. Sparse encodings for arrays and buffers. The guiding principles behind
the suggested encodings are to support encodings that can be exploited by
compute engines for more efficient computation (I don’t think
Parquet-style bit-packing belongs in Arrow). While the encodings don’t
maintain O(1) data element access, they support sublinear, O(log(N)),
element access. The suggested encodings are:
   1. Array encodings:
      a. Add a run-length encoding scheme to efficiently represent
         repeated values (the actual scheme encodes run ends instead of
         lengths to preserve sub-linear random access).
      b. Add a “packed” sparse representation (null values don’t take up
         space in value buffers).
   2. Buffer encodings:
      a. Add frame of reference integer encoding [7] (this allows for
         lower bit-width encoding of integer types by subtracting a
         “reference” value from all values in the buffer).
      b. Add a sparse integer set encoding. This encoding allows more
         efficient encoding of validity bit-masks for cases when all
         values are either null or not null.
2. Data compression. Similar to encodings but compression is solely for
reduction of data at rest/on the wire. The proposal is to allow
compression of individual buffers. Right now zstd is proposed, but I don’t
feel strongly on the specific technologies here.
3. Column Elision. For some use-cases, like structured logging, the
overhead of including array metadata for columns with no data present is
non-negligible. The proposal provides a mechanism for omitting metadata
for such arrays.
4. Data Integrity. While the Arrow file format isn’t meant for archiving
data, I think it is important to allow for optional native data integrity
checks in the format. To this end, I proposed a new “Digest” message type
that can be added after other messages to record a digest/hash of the
preceding data. I suggested xxhash, but I don’t have a strong opinion
here, as long as there is some minimal support that can potentially be
expanded later.
In the proposal I chose to use Tables and Unions everywhere for flexibility
but in all likelihood some could be replaced by enums.
My initial plan would be to solely focus on an IPC mechanism that can send
a SparseRecordBatch and immediately translate it to a normal RecordBatch in
both Java and C++.
As a practical matter, the proposal represents a lot of work to get an
MVP working in time for the 1.0.0 release (provided the features are
accepted by the community), so I'd greatly appreciate it if anyone wants
to collaborate on this.
I’m happy to start a separate thread per feature if people feel that
would make the conversation easier. I can also create a Google Doc for
direct comments if that is preferred.
Thanks,
Micah
P.S. In the interest of full disclosure, these ideas evolved in
collaboration with Brian Hulette and other colleagues at Google who are
interested in making use of Arrow in both internal and external projects.
[1] https://issues.apache.org/jira/browse/ARROW-300
[2] https://issues.apache.org/jira/browse/ARROW-5224
[3] https://lists.apache.org/thread.html/36ab9c2b8b5d9f04493b3f9ea3b63c3ca3bc0f90743aa726b7a3199b@%3Cdev.arrow.apache.org%3E
[4] https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
[5] https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16244812#comment-16244812
[6] https://github.com/apache/arrow/pull/4815
[7] https://lemire.me/blog/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/
[8] https://issues.apache.org/jira/browse/ARROW-5821
[8] https://issues.apache.org/jira/browse/ARROW-5821