+ dev@spark because authors of variant may not subscribe to dev@parquet

On Mon, Sep 16, 2024 at 7:33 PM Antoine Pitrou <anto...@python.org> wrote:
> Hello,
>
> I've been reading the spec in more detail here:
>
> https://github.com/apache/spark/blob/d84f1a3575c4125009374521d2f179089ebd71ad/common/variant/README.md#encoding-types
>
> and I think that it should have a Security section listing potential
> security issues with this format (especially for readers).
>
> Given that Parquet is frequently used to make data publicly available
> online, it is important for implementers to know of potential issues to
> look for, and ideally protect against.
>
> One specific concern is the following snippet about the Object encoding:
>
> "The field ids and field offsets must be in lexicographical order of the
> corresponding field names in the metadata dictionary. However, the
> actual value entries do not need to be in any particular order. This
> implies that the field_offset values may not be monotonically
> increasing."
>
> Having field offsets which are not monotonically increasing makes it
> difficult to verify that the encoded values do not overlap. In general,
> it's useful for data formats to enable easy validation and error report.
> In this particular case, an attacker could perhaps craft a malicious
> Variant with deeply nested overlapping values to achieve a denial of
> service attack, similar to
> https://en.wikipedia.org/wiki/Billion_laughs_attack
>
> (I'm not saying such a malicious Variant is practically doable given
> specifics of the binary encoding, but it will be difficult to prove
> that it isn't)
>
> Regards
>
> Antoine.
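
To make the overlap concern in the quoted message concrete, here is a minimal sketch of the check a reader might run on one object. It is only an illustration, not code from the Variant spec or from any existing reader. The function name `find_overlap` and its inputs are hypothetical: it assumes the reader has already decoded each field value far enough to know its byte length, and that `total_size` is the size of the object's value area.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]  # (offset, length) of one field value, in bytes


def find_overlap(ranges: List[Range], total_size: int) -> Optional[Tuple[Range, Optional[Range]]]:
    """Return the first offending range(s), or None if the layout is clean.

    `ranges` may arrive in any order, since field offsets are sorted by
    field name rather than by position. Hypothetical helper for illustration.
    """
    prev: Optional[Range] = None
    prev_end = 0
    # Sorting by offset is the extra work that non-monotonic offsets force.
    for start, length in sorted(ranges):
        end = start + length
        if start < 0 or length < 0 or end > total_size:
            return ((start, length), None)  # out of bounds
        if prev is not None and start < prev_end:
            return (prev, (start, length))  # two values share bytes
        prev, prev_end = (start, length), end
    return None


# Values stored out of order but not overlapping: accepted.
print(find_overlap([(4, 4), (0, 4)], total_size=8))
# Two values sharing bytes 4..5: rejected.
print(find_overlap([(0, 6), (4, 4)], total_size=8))
```

The sort makes this O(n log n) per object, and it has to be repeated at every nesting level. If offsets were required to be monotonically increasing, a single linear pass comparing neighbouring offsets would be enough.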