Adding dev@spark, because the authors of the Variant spec may not be subscribed to dev@parquet.

On Mon, Sep 16, 2024 at 7:33 PM Antoine Pitrou <anto...@python.org> wrote:

>
> Hello,
>
> I've been reading the spec in more detail here:
>
> https://github.com/apache/spark/blob/d84f1a3575c4125009374521d2f179089ebd71ad/common/variant/README.md#encoding-types
>
> and I think that it should have a Security section listing potential
> security issues with this format (especially for readers).
>
> Given that Parquet is frequently used to make data publicly available
> online, it is important for implementers to know of potential issues to
> look for, and ideally protect against.
>
>
> One specific concern is the following snippet about the Object encoding:
>
> "The field ids and field offsets must be in lexicographical order of the
> corresponding field names in the metadata dictionary. However, the
> actual value entries do not need to be in any particular order. This
> implies that the field_offset values may not be monotonically
> increasing."
>
> Having field offsets which are not monotonically increasing makes it
> difficult to verify that the encoded values do not overlap. In general,
> it's useful for data formats to enable easy validation and error reporting.
> In this particular case, an attacker could perhaps craft a malicious
> Variant with deeply nested overlapping values to achieve a denial of
> service attack, similar to
> https://en.wikipedia.org/wiki/Billion_laughs_attack
>
> (I'm not saying such a malicious Variant is practically doable given
> specifics of the binary encoding, but it will be difficult to prove
> that it isn't)
>
> Regards
>
> Antoine.
>
>
>
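To make the validation concern above concrete, here is a minimal sketch of the overlap check a reader would need for a single object. It is not taken from the spec or from any existing implementation: the names `find_overlap`, `offsets`, `sizes` and `data_len` are hypothetical, and the sketch assumes the byte size of each value has already been determined by parsing its header. Because the offsets may appear in any order, the ranges have to be sorted first; with monotonically increasing offsets a single pass would do.

```python
from typing import Optional, Sequence


def find_overlap(
    offsets: Sequence[int], sizes: Sequence[int], data_len: int
) -> Optional[str]:
    """Return None if every value range is in bounds and no two ranges
    overlap; otherwise return a short description of the first problem.

    offsets[i] and sizes[i] describe the byte range of the i-th field value
    inside a value region of data_len bytes (all hypothetical inputs).
    """
    if len(offsets) != len(sizes):
        raise ValueError("offsets and sizes must have the same length")

    spans = []
    for i, (start, size) in enumerate(zip(offsets, sizes)):
        end = start + size
        if start < 0 or size < 0 or end > data_len:
            return f"field {i}: range [{start}, {end}) is out of bounds"
        spans.append((start, end, i))

    # Offsets are not guaranteed to be increasing, so sort by start offset.
    spans.sort()

    prev_end, prev_idx = 0, None
    for start, end, idx in spans:
        if start < prev_end:
            return f"fields {prev_idx} and {idx} overlap"
        prev_end, prev_idx = end, idx
    return None
```

For example, `find_overlap([4, 0, 2], [2, 2, 2], 6)` accepts three out-of-order but disjoint values, while `find_overlap([0, 0], [4, 4], 4)` reports that two fields point at the same bytes. The sketch covers one object level only; a reader would have to repeat the check for every nested object, which is where the cost of validating a deeply nested value adds up.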
