Doing some light research, it looks like xxhash has better cross-platform support and is faster than a vanilla implementation of crc32 [1]. However, crc32c (a slightly different crc32 algorithm) is hardware accelerated on newer (circa 2016) Intel CPUs [2] and is potentially faster.
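For illustration, here is a minimal sketch of the verify-on-receipt pattern being discussed, using Python's built-in zlib.crc32 as a stand-in (the function names and corruption example are mine, not from any Arrow implementation; xxhash or crc32c would slot into `checksum` the same way):

```python
import zlib

def checksum(buf: bytes) -> int:
    # Vanilla (software) CRC32; xxhash or hardware crc32c would be
    # drop-in replacements here.
    return zlib.crc32(buf) & 0xFFFFFFFF

def verify(buf: bytes, expected: int) -> bool:
    # Recompute on receipt and compare against the stored checksum.
    return checksum(buf) == expected

payload = b"arrow record batch bytes"
stored = checksum(payload)
assert verify(payload, stored)

# A single changed byte is detected (CRC32 catches all single-byte errors).
corrupted = b"Arrow record batch bytes"
assert not verify(corrupted, stored)
```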
[1] https://cyan4973.github.io/xxHash/
[2] https://github.com/Cyan4973/xxHash/issues/62

On Tue, Mar 5, 2019 at 9:55 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Thanks Philipp,
>
> Yeah, I probably shouldn't have said SHA1 either :) I'm not too
> concerned with a particular hash/checksum implementation. It would be good
> to have at least 1 or 2 well supported ones, and a migration path to
> support more if necessary without breaking file/streaming formats for
> backwards compatibility.
>
> Best,
> -Micah
>
> On Tue, Mar 5, 2019 at 9:47 PM Philipp Moritz <pcmor...@gmail.com> wrote:
>
>> (I meant to say SHA256 instead of SHA1)
>>
>> On Tue, Mar 5, 2019 at 9:45 PM Philipp Moritz <pcmor...@gmail.com> wrote:
>>
>>> Hey Micah,
>>>
>>> in plasma, we are using xxhash to compute a hash/checksum [1] (it is
>>> computed in parallel using multiple threads) and have good experience
>>> with it -- all data in Ray is checksummed this way. Initially there were
>>> problems with uninitialized bits in the arrow representation, but that
>>> has been resolved a while back, so there should be no blocker for this.
>>> It would also be great to benchmark xxhash against CRC32 and see how
>>> they compare performance wise. Initially we used SHA1 but there was
>>> non-trivial overhead. Maybe there is a better implementation out there
>>> (we used
>>> https://github.com/ray-project/ray/blob/master/src/ray/thirdparty/sha256.c
>>> ).
>>>
>>> Best,
>>> Philipp.
>>>
>>> [1]
>>> https://github.com/apache/arrow/blob/master/cpp/src/plasma/client.cc#L684
>>>
>>> On Tue, Mar 5, 2019 at 9:33 PM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>>> Hi Arrow Dev,
>>>> As we expand the use-cases for Arrow to move it more across system
>>>> boundaries (Flight) and make it live longer (e.g. in the file format),
>>>> it seems to make sense to build in a mechanism for data integrity
>>>> verification (e.g. a checksum like CRC32 or in some cases a
>>>> cryptographic hash like SHA1).
>>>>
>>>> This can be done in a backwards-compatible manner for the actual data
>>>> buffers by adding metadata to the headers (this could be a use-case for
>>>> custom metadata, but I would prefer to make it explicit). However, to
>>>> make sure we have full coverage, we would need to augment the stream
>>>> [1] to be something like:
>>>>
>>>> <metadata_size: int32>
>>>> <metadata_flatbuffer: bytes>
>>>> <signature_size: int16>
>>>> <metadata signature>
>>>> <padding>
>>>> <message body>
>>>>
>>>> I don't think we should require implementations to actually use this
>>>> functionality, but we should make it a possibility (signature size
>>>> could be zero, meaning no checksum/hash is provided) and have it be
>>>> standardized if possible.
>>>>
>>>> Thoughts?
>>>>
>>>> Sorry if this has already been discussed, but I couldn't find anything
>>>> from searching JIRA or the mailing list archive, and it doesn't look
>>>> like it is in the format spec.
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> [1] https://arrow.apache.org/docs/ipc.html
>>>
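A rough sketch of how the quoted framing proposal could be written and read back (field order follows the layout in the quoted message; the 8-byte padding alignment, little-endian sizes, and function names are my assumptions, not part of the proposal):

```python
import struct

PAD = 8  # assumed alignment; the proposal only says <padding>

def write_message(metadata: bytes, signature: bytes, body: bytes) -> bytes:
    # <metadata_size: int32> <metadata_flatbuffer: bytes>
    # <signature_size: int16> <metadata signature> <padding> <message body>
    head = struct.pack("<i", len(metadata)) + metadata
    head += struct.pack("<h", len(signature)) + signature
    head += b"\x00" * (-len(head) % PAD)  # pad header to alignment
    return head + body

def read_message(buf: bytes):
    (meta_len,) = struct.unpack_from("<i", buf, 0)
    meta = buf[4:4 + meta_len]
    off = 4 + meta_len
    (sig_len,) = struct.unpack_from("<h", buf, off)
    off += 2
    sig = buf[off:off + sig_len]  # empty when signature_size == 0
    off += sig_len
    off += -off % PAD  # skip padding
    return meta, sig, buf[off:]

msg = write_message(b"flatbuffer-bytes", b"\xde\xad\xbe\xef", b"body")
assert read_message(msg) == (b"flatbuffer-bytes", b"\xde\xad\xbe\xef", b"body")

# signature_size == 0 means no checksum/hash is provided.
msg2 = write_message(b"meta", b"", b"body")
assert read_message(msg2) == (b"meta", b"", b"body")
```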