On 15/07/2019 at 16:15, Wes McKinney wrote:
> If we adopt the position (as we already are in practice, I think)
> that the encapsulated IPC message format is the main way that we
> expose data from one process to another, then having digests at the
> message level seems like the simplest and most useful thing.
>
> FWIW, the Parquet format technically provides for CRC checksums, but
> they have never been widely implemented, so there is a certain YAGNI
> feeling to doing anything complex on this.
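To make the message-level option concrete: the sketch below hashes the
encapsulated IPC message bytes of a record batch in one pass. It is a
minimal illustration in pyarrow; the digest_message helper and the
choice of SHA-256 are assumptions for the example, not part of the
proposal or of any Arrow API.

    import hashlib

    import pyarrow as pa

    def digest_message(batch: pa.RecordBatch) -> bytes:
        # Serialize the batch to its encapsulated IPC message bytes,
        # then hash the entire message in one pass.
        buf = batch.serialize()
        return hashlib.sha256(memoryview(buf)).digest()

    batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["x"])
    print(digest_message(batch).hex())

Because the digest covers the whole message, essentially the same code
path works for schema, record batch, and dictionary messages alike,
which is what makes this option the simplest to implement.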
You may be right.  Also, if the transport uses TLS, there's some data
integrity built in already.

I suspect checksumming may be desirable mostly for archival purposes,
which Arrow is not aimed at.

Regards

Antoine.

> On Fri, Jul 12, 2019 at 4:30 AM Antoine Pitrou <anto...@python.org> wrote:
>>
>> On 12/07/2019 at 09:56, Micah Kornfield wrote:
>>> Per Antoine's recommendation, I'm splitting off the discussion about
>>> data integrity from the previous e-mail thread about the format
>>> additions [1].  To recap, I made a proposal including data integrity
>>> [2] by adding a new message type to the format.
>>>
>>> From the previous thread, the main question was at what level to
>>> apply digests to Arrow data (message level, array, buffer, or
>>> potentially some hybrid).
>>>
>>> Some trade-offs I've thought of for each approach:
>>>
>>> Message level:
>>> + Simplest implementation; it can be applied across all messages
>>> with pretty much the same code.
>>> + Smallest amount of additional data (each digest will likely be
>>> 8-64 bytes).
>>> - It lacks the granularity to recover partial data from a record
>>> batch if there is corruption.
>>
>> Also:
>> - Will only apply to transmission errors using the IPC mechanism, not
>> other kinds of errors that may occur.
>>
>>> Array level:
>>> + Allows for reading non-corrupted columns.
>>> + Allows for potentially more complicated use cases, like having
>>> different compute engines "collaborate" and sign each array they
>>> computed to establish a "chain of trust".
>>> - Adds some implementation complexity.  We will need different
>>> schemes for message types other than RecordBatch and for message
>>> metadata.  We also need to determine digest boundaries (would a
>>> complex column be consumed entirely, or would child arrays be
>>> separate?).
>>
>> Also:
>> - Need to compute a new checksum when slicing an array?
>>
>>> Buffer level:
>>> More or less the same issues as the array level, with the following
>>> other factors:
>>> - The largest amount of additional data.
>>
>> It's not clear that's much of a problem (currently?), especially if
>> checksumming is optional.  Arrow isn't well suited for use cases with
>> many tiny buffers...
>>
>>> - It's not clear there is a benefit to detecting that a single
>>> buffer is corrupted if it means we can't accurately decode the
>>> array.
>>
>> Also:
>> + Decorrelated from the logical interpretation of the buffer, e.g.
>> slicing.
>>
>> I think the possibility of a hybrid scheme should be discussed as
>> well.  For example, compute physical checksums at the buffer level,
>> then devise a lightweight formula for the checksum of an array based
>> on those physical checksums, and a formula for an IPC message's
>> checksum based on its type (schema, record batch, dictionary...).
>>
>> Regards
>>
>> Antoine.
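To illustrate the hybrid scheme sketched above: compute a physical
checksum per buffer, then derive the array-level digest from those
buffer digests, Merkle-style.  This is a minimal pyarrow sketch under
stated assumptions; buffer_digest and array_digest are illustrative
helpers, not Arrow APIs, and SHA-256 stands in for whatever digest the
format would actually adopt.

    import hashlib

    import pyarrow as pa

    def buffer_digest(buf: pa.Buffer) -> bytes:
        # Physical checksum of one buffer's bytes.
        return hashlib.sha256(memoryview(buf)).digest()

    def array_digest(arr: pa.Array) -> bytes:
        # Combine the per-buffer digests into an array-level digest,
        # so the array checksum is derivable from the physical ones
        # without rereading the underlying data.
        h = hashlib.sha256()
        for buf in arr.buffers():
            # A buffer slot may be None (e.g. an absent validity bitmap).
            h.update(buffer_digest(buf) if buf is not None else b"\x00" * 32)
        return h.digest()

    arr = pa.array([1, None, 3])
    print(array_digest(arr).hex())

Because the per-buffer digests cover the physical bytes, a sliced
array that shares its parent's buffers keeps the same buffer
checksums, matching the "decorrelated from logical interpretation"
point above; only the cheap combination step would ever need
recomputing.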