Re: [DISCUSS][FORMAT] Data Integrity

Antoine Pitrou Fri, 12 Jul 2019 02:31:00 -0700

On 12/07/2019 at 09:56, Micah Kornfield wrote:
> Per Antoine's recommendation, I'm splitting off the discussion about data
> integrity from the previous e-mail thread about the format additions [1].
> To recap, I made a proposal including data integrity [2] by adding a new
> message type to the
> 
> From the previous thread the main question was at what level to apply
> digests to Arrow data (Message level, array, buffer or potentially some
> hybrid).
> 
> Some trade-offs I've thought of for each approach:
> * Message level
> + Simplest implementation and can be applied across all messages with
> pretty much the same code.
> + Smallest amount of additional data (each digest will likely be 8-64 bytes)
> - It lacks granularity to recover partial data from a record batch if there
> is corruption.

Also:
- Will only apply to transmission errors using the IPC mechanism, not
other kinds of errors that may occur

> Array level:
> + Allows for reading non-corrupted columns
> + Allows for potentially more complicated use-cases, like having different
> compute engines "collaborate" and sign each array they computed to
> establish a "chain-of-trust"
> - Adds some implementation complexity. Will need different schemes for
> message types other than RecordBatch and for message metadata.  We also
> need to determine digest boundaries (would a complex column be consumed
> entirely or would child arrays be separate).

Also:
- Need to compute a new checksum when slicing an array?

> Buffer level:
> More or less same issues as array but with the following other factors:
> - The largest amount of additional data

It's not clear that's much of a problem (currently?), especially if
checksumming is optional.  Arrow isn't well-suited for use cases with
many tiny buffers...

> - It's not clear there is a benefit to detecting that a single buffer is
> corrupted if it means we can't accurately decode the array anyway.

Also:
+ Decoupled from the logical interpretation of the buffer (e.g. slicing
does not change the checksum)
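
To illustrate the slicing point, here is a minimal sketch in plain Python.
It does not use any Arrow API: the buffer is a hypothetical stand-in built
with the standard `array` module, CRC32 is an arbitrary choice of digest,
and `buffer_checksum` / `logical_checksum` are made-up helper names. The
assumption is that a slice is just an (offset, length) view over the same
physical buffer.

```python
import zlib
from array import array

# Hypothetical stand-in for an Arrow data buffer: eight fixed-width integers.
values = array("i", [10, 20, 30, 40, 50, 60, 70, 80])
data_buffer = values.tobytes()
WIDTH = values.itemsize  # bytes per value (platform-dependent, usually 4)


def buffer_checksum(buf: bytes) -> int:
    """Physical checksum: covers the whole buffer, whatever view is taken."""
    return zlib.crc32(buf)


def logical_checksum(buf: bytes, offset: int, length: int) -> int:
    """Array-level checksum: covers only the values visible in the view."""
    start = offset * WIDTH
    return zlib.crc32(buf[start:start + length * WIDTH])


# Two views over the same buffer, each described as (offset, length).
parent = (0, 8)
sliced = (2, 3)

# The physical checksum is computed once and is valid for both views.
physical = buffer_checksum(data_buffer)

# The logical checksum has to be recomputed for every distinct slice.
parent_logical = logical_checksum(data_buffer, *parent)
sliced_logical = logical_checksum(data_buffer, *sliced)
```

With a buffer-level digest the slice carries `physical` along unchanged,
whereas an array-level digest needs a new pass over the sliced values.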

I think the possibility of a hybrid scheme should be discussed as well.
For example, compute physical checksums at the buffer level, then
devise a lightweight formula for the checksum of an array based on those
physical checksums.  And a formula for an IPC message's checksum based
on its type (schema, record batch, dictionary...).
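
As a rough sketch of what such a hybrid could look like (nothing here is
part of the Arrow format; the function names, the use of CRC32 for buffers
and SHA-256 for combining, and the byte layout are all assumptions made up
for illustration):

```python
import hashlib
import struct
import zlib


def buffer_checksum(buf: bytes) -> int:
    """Physical checksum of one buffer (CRC32 is an arbitrary choice)."""
    return zlib.crc32(buf)


def array_checksum(buffer_crcs, offset, length, child_digests=()):
    """Combine precomputed buffer checksums into an array-level digest.

    Only the small per-buffer checksums are hashed, never the buffer
    contents, so this stays cheap when an array is sliced.
    """
    h = hashlib.sha256()
    h.update(struct.pack("<qq", offset, length))
    h.update(struct.pack("<q", len(buffer_crcs)))
    for crc in buffer_crcs:
        if crc is None:  # absent buffer, e.g. no validity bitmap
            h.update(b"\x00")
        else:
            h.update(b"\x01" + struct.pack("<I", crc))
    h.update(struct.pack("<q", len(child_digests)))
    for digest in child_digests:  # nested types contribute child digests
        h.update(digest)
    return h.digest()


def message_checksum(message_type: str, array_digests) -> bytes:
    """Digest for an IPC message, keyed by its (hypothetical) type tag."""
    h = hashlib.sha256()
    h.update(message_type.encode("utf-8") + b"\x00")
    for digest in array_digests:
        h.update(digest)
    return h.digest()


# One column with no validity bitmap and a single data buffer.
crcs = [None, buffer_checksum(b"\x01\x00\x00\x00\x02\x00\x00\x00")]
whole = array_checksum(crcs, offset=0, length=2)
tail = array_checksum(crcs, offset=1, length=1)  # slice: buffers not rehashed
batch = message_checksum("record_batch", [whole])
```

One open question with a formula like this is that it vouches for the whole
underlying buffers rather than only the sliced values, so a receiver would
still need the complete buffers to verify it.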

Regards

Antoine.