Re: [Discuss][Format] Checksum/Hash signature for data

2019-03-18 Thread Antoine Pitrou
XXH3 (by the xxhash author) was recently presented, though it's still experimental for now: https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html It is claimed to be significantly faster than xxhash, on all message sizes. Regards Antoine. Le 06/03/2019 à 07:06, Micah Kornfield a

Re: [Discuss][Format] Checksum/Hash signature for data

2019-03-11 Thread Micah Kornfield
Hi Wes, Thanks for the response. I was thinking of being able to checksum everything. I agree it should be off by default. I'll put this on the back burner for now. If I can find some spare time (which won't likely be any time soon), I'll submit a PR for further discussion. Cheers, Micah On Wed

Re: [Discuss][Format] Checksum/Hash signature for data

2019-03-06 Thread Wes McKinney
hi Micah, It seems like the checksum could be included in the Message flatbuffer table instead of having to add things to the protocol https://github.com/apache/arrow/blob/master/format/Message.fbs#L94 Am I correct that computing a checksum on the message body is what is mainly of interest? Beyo
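To illustrate the idea of carrying a body checksum alongside the message metadata, here is a minimal sketch in Python. It does not use the real Message flatbuffer schema: the `Message` dataclass, its `body_crc32` field, and the helper names `make_message` and `verify_body` are all hypothetical stand-ins, and CRC32 from the standard library is only one possible choice of checksum. The field is optional, so a writer that does not compute a checksum produces a message that readers simply skip verifying.

```python
import zlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    """Hypothetical stand-in for message metadata plus body (not the Arrow schema)."""
    body: bytes
    body_crc32: Optional[int] = None  # optional, so checksumming can be off by default


def make_message(body: bytes, with_checksum: bool = False) -> Message:
    """Build a message, computing a CRC32 of the body only when requested."""
    crc = zlib.crc32(body) if with_checksum else None
    return Message(body=body, body_crc32=crc)


def verify_body(msg: Message) -> bool:
    """Return True if no checksum is present or if the stored checksum matches."""
    if msg.body_crc32 is None:
        return True  # nothing to verify
    return zlib.crc32(msg.body) == msg.body_crc32
```

A reader would call `verify_body` after receiving a message and reject it when the function returns False. Because the checksum lives in the metadata, the body layout itself is unchanged in this sketch.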

Re: [Discuss][Format] Checksum/Hash signature for data

2019-03-05 Thread Micah Kornfield
Doing some light research, it looks like xxhash has better cross-platform support and is faster than a vanilla implementation of crc32 [1]. However, crc32c (a slightly different crc32 algorithm) is hardware accelerated on newer (circa 2016) Intel CPUs [2] and is potentially faster. [1] https://cyan4973.

Re: [Discuss][Format] Checksum/Hash signature for data

2019-03-05 Thread Micah Kornfield
Thanks Philipp, Yeah, I probably shouldn't have said SHA1 either :) I'm not too concerned with a particular hash/checksum implementation. It would be good to have at least 1 or 2 well-supported ones, and a migration path to support more if necessary without breaking file/streaming formats for

Re: [Discuss][Format] Checksum/Hash signature for data

2019-03-05 Thread Philipp Moritz
Hey Micah, in plasma, we are using xxhash to compute a hash/checksum [1] (it is computed in parallel using multiple threads) and have good experience with it -- all data in Ray is checksummed this way. Initially there were problems with uninitialized bits in the arrow representation, but that has
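As a rough illustration of computing a checksum in parallel with multiple threads, the sketch below splits a buffer into fixed-size chunks, checksums each chunk on a thread pool, and then folds the per-chunk values into one result. This is not the plasma implementation: it swaps in CRC32 from the Python standard library in place of xxhash, and the function name `parallel_checksum`, the chunk size, the worker count, and the combining step are all assumptions made for the example.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def parallel_checksum(data: bytes, chunk_size: int = 1 << 20, workers: int = 4) -> int:
    """Checksum `data` chunk by chunk on a thread pool (hypothetical scheme)."""
    view = memoryview(data)  # slicing a memoryview avoids copying each chunk
    chunks = [view[i:i + chunk_size] for i in range(0, len(view), chunk_size)]

    # pool.map returns results in chunk order regardless of completion order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(zlib.crc32, chunks))

    # Fold the ordered per-chunk CRCs into a single value by running
    # CRC32 over their 4-byte little-endian encodings.
    combined = 0
    for partial in partials:
        combined = zlib.crc32(partial.to_bytes(4, "little"), combined)
    return combined
```

Note that the result depends on `chunk_size`, so in a scheme like this the writer and the reader would have to agree on it. The result does not depend on the number of workers.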

Re: [Discuss][Format] Checksum/Hash signature for data

2019-03-05 Thread Philipp Moritz
(I meant to say SHA256 instead of SHA1) On Tue, Mar 5, 2019 at 9:45 PM Philipp Moritz wrote: > Hey Micah, > > in plasma, we are using xxhash to compute a hash/checksum [1] (it is > computed in parallel using multiple threads) and have good experience with > it -- all data in Ray is checksummed t

[Discuss][Format] Checksum/Hash signature for data

2019-03-05 Thread Micah Kornfield
Hi Arrow Dev, As we expand the use-cases for Arrow to move it more across system boundaries (Flight) and make it live longer (e.g. in the file format), it seems to make sense to build in a mechanism for data integrity verification (e.g. a checksum like CRC32 or in some cases a cryptographic hash li
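For readers unfamiliar with the two kinds of integrity check mentioned here, the following is a small sketch using only the Python standard library. It is not Arrow code; the helper names `crc32_checksum` and `sha256_digest` and the sample payloads are made up for illustration. CRC32 yields a 32-bit value suited to catching accidental corruption, while SHA-256 is a cryptographic hash that yields a 256-bit digest.

```python
import hashlib
import zlib


def crc32_checksum(data: bytes) -> int:
    """Return the CRC32 of `data` as an unsigned 32-bit integer."""
    return zlib.crc32(data)


def sha256_digest(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()


payload = b"example record batch bytes"
corrupted = b"example record batch bytez"  # last byte altered

# A receiver recomputes the value and compares it with the one sent along.
crc_matches = crc32_checksum(payload) == crc32_checksum(corrupted)
sha_matches = sha256_digest(payload) == sha256_digest(corrupted)
```

In this example both `crc_matches` and `sha_matches` are False, so either check flags the altered payload.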