hi Micah,

It seems like the checksum could be included in the Message flatbuffer
table instead of having to add things to the protocol

https://github.com/apache/arrow/blob/master/format/Message.fbs#L94

Am I correct that computing a checksum on the message body is what is
mainly of interest? Beyond that I feel like we're getting into the
realm of truly esoteric issues

Note that checksums are part of the Parquet file specification but
IIUC seldom used, so our mileage may vary on how useful this is. I
would probably want this turned off by default

- Wes

On Wed, Mar 6, 2019 at 12:07 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Doing some light research it looks xxhash has better cross-platform support
> as is faster then a vanilla implementation of crc32 [1].  However, crc32c
> (a slightly different crc32 algorithm) is hardware accelerated on newer
> (circa 2016) Intel CPUs [2] and is potentially faster.
>
> [1] https://cyan4973.github.io/xxHash/
> [2] https://github.com/Cyan4973/xxHash/issues/62
>
> On Tue, Mar 5, 2019 at 9:55 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > Thanks Philipp,
> >
> > Yeah, I probably shouldn't have said SHA1 either :)    I'm not too
> > concerned with a particular hash/checksum implementation.  It would be good
> > to have at least 1 or 2 well supported ones, and a migration path to
> > support more if necessary without breaking file/streaming formats for
> > backwards compatibility.
> >
> > Best,
> > -Micah
> >
> > On Tue, Mar 5, 2019 at 9:47 PM Philipp Moritz <pcmor...@gmail.com> wrote:
> >
> >> (I meant to say SHA256 instead of SHA1)
> >>
> >> On Tue, Mar 5, 2019 at 9:45 PM Philipp Moritz <pcmor...@gmail.com> wrote:
> >>
> >>> Hey Micah,
> >>>
> >>> in plasma, we are using xxhash to compute a hash/checksum [1] (it is
> >>> computed in parallel using multiple threads) and have good experience with
> >>> it -- all data in Ray is checksummed this way. Initially there were
> >>> problems with uninitialized bits in the arrow representation, but that has
> >>> been resolved a while back, so there should be no blocker for this. It
> >>> would also be great to benchmark xxhash against CRC32 and see how they
> >>> compare performance wise. Initially we used SHA1 but there was non-trivial
> >>> overhead. Maybe there is a better implementation out there (we used
> >>> https://github.com/ray-project/ray/blob/master/src/ray/thirdparty/sha256.c
> >>> ).
> >>>
> >>> Best,
> >>> Philipp.
> >>>
> >>> [1]
> >>> https://github.com/apache/arrow/blob/master/cpp/src/plasma/client.cc#L684
> >>>
> >>> On Tue, Mar 5, 2019 at 9:33 PM Micah Kornfield <emkornfi...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi Arrow Dev,
> >>>> As we expand the use-cases for Arrow to move it more across system
> >>>> boundaries (Flight) and make it live longer (e.g. in the file format),
> >>>> it
> >>>> seems to make sense to build in a mechanism for data integrity
> >>>> verification
> >>>> (e.g. a checksum like CRC32 or in some cases a cryptographic hash like
> >>>> SHA1).
> >>>>
> >>>> This can be done a backwards compatible manner for the actual data
> >>>> buffers
> >>>> by adding metadata to the headers (this could be a use-case for custom
> >>>> metadata but I would prefer to make it explicit).  However, to make
> >>>> sure we
> >>>> have full coverage, we would need to augment the stream [1] to be
> >>>> something
> >>>> like:
> >>>>
> >>>> <metadata_size: int32>
> >>>> <metadata_flatbuffer: bytes>
> >>>> <signature_size: int16>
> >>>> <metadata signature>
> >>>> <padding>
> >>>> <message body>
> >>>>
> >>>> I don't think we should require implementations to actual use this
> >>>> functionality but we should make it a possibility (signature size could
> >>>> be
> >>>> zero meaning no checksum/hash is provided) and have it be standardized
> >>>> if
> >>>> possible.
> >>>>
> >>>> Thoughts?
> >>>>
> >>>> Sorry if this has already been discussed but I could find anything from
> >>>> searching JIRA or the mailing list archive, and it doesn't look like it
> >>>> is
> >>>> in the format spec.
> >>>>
> >>>> Thanks,
> >>>> Micah
> >>>>
> >>>> [1] https://arrow.apache.org/docs/ipc.html
> >>>>
> >>>

Reply via email to