Hey Micah, you're formatting seems to be messed up on this mail. Some kind of copy/paste error?
On Fri, Jul 5, 2019 at 11:54 AM Micah Kornfield <emkornfi...@gmail.com> wrote: > Hi Arrow-dev, > > I’d like to make a straw-man proposal to cover some features that I think > would be useful to Arrow, and that I would like to make a proof-of-concept > implementation for in Java and C++. In particular, the proposal covers > allowing for smaller data sizes via compression and encoding [1][2][8], > data integrity [3] and avoiding unnecessary data transfer [4][5]. > > I’ve put together a PR [6] that has proposed changes to the flatbuffer > metadata to support the new features. The PR introduces: > > - > > A new “SparseRecordBatch” that can support one of multiple possible > encodings (both dense and sparse), compression and column elision. > - > > A “Digest” message type to support optional data integrity. > > > Going into more details on the specific features in the PR: > > 1. > > Sparse encodings for arrays and buffers. The guiding principles behind > the suggested encodings are to support encodings that can be exploited > by > compute engines for more efficient computation (I don’t think parquet > style > bit-packing belongs in Arrow). While the encodings don’t maintain O(1) > data element access, they support sublinear, O(log(N)), element access. > The > suggested encodings are: > 1. > > Array encodings: > 1. > > Add a run-length encoding scheme to efficiently represent repeated > values (the actual scheme encodes run ends instead of length > to preserve > sub-linear random access). > 2. > > Add a “packed” sparse representation (null values don’t take up > space in value buffers) > 2. > > Buffer encodings: > 1. > > Add frame of reference integer encoding [7] (this allows for lower > bit-width encoding of integer types by subtracting a > “reference” value from > all values in the buffer). > 2. > > Add a sparse integer set encoding. This encoding allows more > efficient encoding of validity bit-masks for cases when all > values are > either null or not null. > 2. > > Data compression. Similar to encodings but compression is solely for > reduction of data at rest/on the wire. The proposal is to allow > compression of individual buffers. Right now zstd is proposed, but I > don’t > feel strongly on the specific technologies here. > 3. > > Column Elision. For some use-cases, like structured logging, the > overhead of including array metadata for columns with no data present > represents non-negligible overhead. The proposal provides a mechanism > for > omitting meta-data for such arrays. > 4. > > Data Integrity. While the arrow file format isn’t meant for archiving > data, I think it is important to allow for optional native data > integrity > checks in the format. To this end, I proposed a new “Digest” message > type > that can be added after other messages to record a digest/hash of the > preceding data. I suggested xxhash, but I don’t have a strong opinion > here, > as long as there is some minimal support that can potentially be > expanded > later. > > > In the proposal I chose to use Tables and Unions everywhere for flexibility > but in all likelihood some could be replaced by enums. > > My initial plan would be to solely focus on an IPC mechanism that can send > a SparseRecordBatch and immediately translate it to a normal RecordBatch in > both Java and C++. > > As a practical matter the proposal represents a lot of work to get an MVP > working in time for 1.0.0 release (provided they are accepted by the > community), so I'd greatly appreciate if anyone wants to collaborate on > this. > > If it is easier I’m happy to start a separate thread for feature if people > feel like it would make the conversation easier. I can also create a > Google Doc for direct comments if that is preferred. > > Thanks, > > Micah > > > > P.S. In the interest of full disclosure, these ideas evolved in > collaboration with Brian Hulette and other colleagues at Google who are > interested in making use of Arrow in both internal and external projects. > > [1] https://issues.apache.org/jira/browse/ARROW-300 > > [2] https://issues.apache.org/jira/browse/ARROW-5224 > > [3] > > https://lists.apache.org/thread.html/36ab9c2b8b5d9f04493b3f9ea3b63c3ca3bc0f90743aa726b7a3199b@%3Cdev.arrow.apache.org%3E > > [4] > > https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E > > [5] > > https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16244812#comment-16244812 > > [6] https://github.com/apache/arrow/pull/4815 > > [7] > > https://lemire.me/blog/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/ > > [8] https://issues.apache.org/jira/browse/ARROW-5821 >