On 15/07/2019 at 16:15, Wes McKinney wrote:
> If we adopt the position (as we already are in practice, I think)
> that the encapsulated IPC message format is the main way that we
> expose data from one process to another, then having digests at the
> message level seems like the simplest and most useful thing.
>
> FWIW, the Parquet format technically provides for CRC checksums, but
> they have never been widely implemented, so there is a certain YAGNI
> feeling to doing anything complex on this.
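To make the message-level option concrete: the sketch below hashes the
encapsulated IPC message bytes of a record batch in one pass. It is a
minimal illustration in pyarrow; the digest_message helper and the
choice of SHA-256 are assumptions for the example, not part of the
proposal or of any Arrow API.

    import hashlib

    import pyarrow as pa

    def digest_message(batch: pa.RecordBatch) -> bytes:
        # Serialize the batch to its encapsulated IPC message bytes,
        # then hash the entire message in one pass.
        buf = batch.serialize()
        return hashlib.sha256(memoryview(buf)).digest()

    batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["x"])
    print(digest_message(batch).hex())

Because the digest covers the whole message, essentially the same code
path works for schema, record batch, and dictionary messages alike,
which is what makes this option the simplest to implement.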
You may be right.  Also, if the transport uses TLS, there's some data
integrity built in already.

I suspect checksumming may be desirable mostly for archival purposes,
which Arrow is not aimed at.

Regards

Antoine.

> On Fri, Jul 12, 2019 at 4:30 AM Antoine Pitrou <anto...@python.org> wrote:
>>
>> On 12/07/2019 at 09:56, Micah Kornfield wrote:
>>> Per Antoine's recommendation, I'm splitting off the discussion about
>>> data integrity from the previous e-mail thread about the format
>>> additions [1].  To recap, I made a proposal including data integrity
>>> [2] by adding a new message type to the format.
>>>
>>> From the previous thread, the main question was at what level to
>>> apply digests to Arrow data (message level, array, buffer, or
>>> potentially some hybrid).
>>>
>>> Some trade-offs I've thought of for each approach:
>>>
>>> Message level:
>>> + Simplest implementation; it can be applied across all messages
>>> with pretty much the same code.
>>> + Smallest amount of additional data (each digest will likely be
>>> 8-64 bytes).
>>> - It lacks the granularity to recover partial data from a record
>>> batch if there is corruption.
>>
>> Also:
>> - Will only apply to transmission errors using the IPC mechanism, not
>> other kinds of errors that may occur.
>>
>>> Array level:
>>> + Allows for reading non-corrupted columns.
>>> + Allows for potentially more complicated use cases, like having
>>> different compute engines "collaborate" and sign each array they
>>> computed to establish a "chain of trust".
>>> - Adds some implementation complexity.  We will need different
>>> schemes for message types other than RecordBatch and for message
>>> metadata.  We also need to determine digest boundaries (would a
>>> complex column be consumed entirely, or would child arrays be
>>> separate?).
>>
>> Also:
>> - Need to compute a new checksum when slicing an array?
>>
>>> Buffer level:
>>> More or less the same issues as the array level, with the following
>>> other factors:
>>> - The largest amount of additional data.
>>
>> It's not clear that's much of a problem (currently?), especially if
>> checksumming is optional.  Arrow isn't well suited for use cases with
>> many tiny buffers...
>>
>>> - It's not clear there is a benefit to detecting that a single
>>> buffer is corrupted if it means we can't accurately decode the
>>> array.
>>
>> Also:
>> + Decorrelated from the logical interpretation of the buffer, e.g.
>> slicing.
>>
>> I think the possibility of a hybrid scheme should be discussed as
>> well.  For example, compute physical checksums at the buffer level,
>> then devise a lightweight formula for the checksum of an array based
>> on those physical checksums, and a formula for an IPC message's
>> checksum based on its type (schema, record batch, dictionary...).
>>
>> Regards
>>
>> Antoine.
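To illustrate the hybrid scheme sketched above: compute a physical
checksum per buffer, then derive the array-level digest from those
buffer digests, Merkle-style.  This is a minimal pyarrow sketch under
stated assumptions; buffer_digest and array_digest are illustrative
helpers, not Arrow APIs, and SHA-256 stands in for whatever digest the
format would actually adopt.

    import hashlib

    import pyarrow as pa

    def buffer_digest(buf: pa.Buffer) -> bytes:
        # Physical checksum of one buffer's bytes.
        return hashlib.sha256(memoryview(buf)).digest()

    def array_digest(arr: pa.Array) -> bytes:
        # Combine the per-buffer digests into an array-level digest,
        # so the array checksum is derivable from the physical ones
        # without rereading the underlying data.
        h = hashlib.sha256()
        for buf in arr.buffers():
            # A buffer slot may be None (e.g. an absent validity bitmap).
            h.update(buffer_digest(buf) if buf is not None else b"\x00" * 32)
        return h.digest()

    arr = pa.array([1, None, 3])
    print(array_digest(arr).hex())

Because the per-buffer digests cover the physical bytes, a sliced
array that shares its parent's buffers keeps the same buffer
checksums, matching the "decorrelated from logical interpretation"
point above; only the cheap combination step would ever need
recomputing.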