Thanks all for the input,

> But it cannot be a `deletion-vector-v1`

Yes, sorry – I was too liberal with the copy and paste there.
> I recommend discussing with the community and contributing those stats to
> the Iceberg Puffin spec and standardizing them to support interoperability
> with other engines.

Agree wholeheartedly, and I hope to do so in the coming months. That said, I'd like to formalize that readers should not fail when they encounter statistic types they don't recognize. I feel this is especially important for environments that may trail the latest spec/release version by a considerable margin. Assume we agree on and develop support for the most common values as part of a 1.13 release; I'd like to be able to safely provide that statistic on tables in mixed environments running 1.10 and 1.13.

In terms of changes to the spec, I don't think I have anything in mind right now for Puffin, but I would be curious to hear opinions on changing the Table spec to read something to the effect of:

"… Table statistics files are valid Puffin files<https://iceberg.apache.org/puffin-spec/>. Statistics are informational. A reader can choose to ignore statistics information and should ignore unrecognized blob types. Statistics support is not required to read the table correctly. …"

--Carl

From: Gábor Kaszab <gaborkas...@apache.org>
Reply to: "dev@iceberg.apache.org" <dev@iceberg.apache.org>
Date: Wednesday, 6 August 2025 at 13:19
To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>
Cc: "summ...@amazon.co.uk.invalid" <summ...@amazon.co.uk.invalid>
Subject: RE: [EXTERNAL] Clarification on the flexibility of table statistics information

Hi,

Puffin files are defined in a way that, in practice, engines can put anything into them, not just what is standardized by the spec. I recall Hive serializes its own stats object and stores it in Puffin. However, because this information is not standardized, no engine other than the writer can be expected to understand it. The practice is that engines simply ignore the blob types in Puffin that they don't understand.

Is there any specific addition to the Puffin spec you have in mind?

Regards,
Gabor

Ajantha Bhat <ajanthab...@gmail.com> wrote (on Wednesday, 6 August 2025 at 9:09):

Hi,

> From my read of the spec, which may be overly pedantic, it seems like
> attaching anything other than NDV + an associated compact theta sketch is
> not compliant with the spec:

True.

> In the section on Table Statistics<http://iceberg.apache.org/spec/#table-statistics> it's explicit
> that statistics are meant to be informational only, and that readers can
> ignore statistics at will: "Statistics are informational. A reader can
> choose to ignore statistics information. Statistics support is not
> required to read the table correctly."
>
> However, it says earlier that statistics are stored in 'valid puffin
> files', which can contain exactly two blob types<https://iceberg.apache.org/puffin-spec/#blob-types>:
> `apache-datasketches-theta-v1` and `deletion-vector-v1`.

Yes. Statistics are optional. But it cannot be a `deletion-vector-v1`. It is the other way around: Puffin files can be used to store statistics as well as indexes and delete files (like deletion vectors). When we enhanced the Puffin spec to include `deletion-vector-v1`, we didn't update the table statistics spec to clarify that it can only be `apache-datasketches-theta-v1`. Feel free to open a PR to clarify it.
> Asked explicitly: Is an Iceberg table with a Statistics file containing a
> blob type other than `apache-datasketches-theta-v1` and
> `deletion-vector-v1` a valid Iceberg table? Should engines ignore
> unrecognized blob types in blob metadata structs and associated statistics
> file?

Yes. Engines should ignore them, as stats are optional. Currently the Spark integration<https://github.com/apache/iceberg/blob/772c8275598e43d2c5ef029bfe83aeaa6c713e8a/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L226> ignores them.

Lastly, I sense that you have proprietary stats stored in Puffin files as a new blob type and are figuring out how interoperability works if other engines cannot understand it. I recommend discussing with the community and contributing those stats to the Iceberg Puffin spec and standardizing them to support interoperability with other engines.

- Ajantha

On Tue, Aug 5, 2025 at 8:44 PM Summers, Carl <summ...@amazon.co.uk.invalid> wrote:

Hi,

I'm looking to better understand the intent of some of the language around table statistics and related Puffin file usage. From my read of the spec, which may be overly pedantic, it seems like attaching anything other than NDV + an associated compact theta sketch is not compliant with the spec:

In the section on Table Statistics<http://iceberg.apache.org/spec/#table-statistics> it's explicit that statistics are meant to be informational only, and that readers can ignore statistics at will: "Statistics are informational. A reader can choose to ignore statistics information. Statistics support is not required to read the table correctly."

However, it says earlier that statistics are stored in 'valid puffin files', which can contain exactly two blob types<https://iceberg.apache.org/puffin-spec/#blob-types>: `apache-datasketches-theta-v1` and `deletion-vector-v1`.

I can appreciate that a reasonable engine author, upon encountering an unexpected blob type in a Puffin file, would ignore it, as statistics are purely informational. However, given that Puffin files are now both informational and critical for correctness (albeit in different contexts), I could see another reasonable engine author choosing to fail a query because the table isn't compliant with the spec. "Breaking" a customer's usage of their table is just about the worst thing we can do, so I'd really appreciate some community guidance here.

Asked explicitly: Is an Iceberg table with a Statistics file containing a blob type other than `apache-datasketches-theta-v1` and `deletion-vector-v1` a valid Iceberg table? Should engines ignore unrecognized blob types in blob metadata structs and associated statistics file?

--Carl
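For illustration, here is a minimal Java sketch of the "ignore unrecognized blob types" behaviour discussed above, loosely modelled on the filtering done in the linked SparkScan code. The helper class and method names are hypothetical, and the Iceberg API calls it assumes (Table.statisticsFiles(), StatisticsFile.blobMetadata(), BlobMetadata.type(), StandardBlobTypes.APACHE_DATASKETCHES_THETA_V1) should be verified against the Iceberg release you build against.

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

import org.apache.iceberg.BlobMetadata;
import org.apache.iceberg.Table;
import org.apache.iceberg.puffin.StandardBlobTypes;

// Hypothetical helper, not from the Iceberg codebase: collect only the blob
// metadata entries whose type this engine understands, silently skipping the
// rest. Unknown blob types degrade to "no usable statistics", never an error.
public class StatsBlobFilter {

  // Blob types this (hypothetical) engine knows how to consume.
  private static final Set<String> RECOGNIZED_TYPES =
      Set.of(StandardBlobTypes.APACHE_DATASKETCHES_THETA_V1);

  public static List<BlobMetadata> usableBlobs(Table table, long snapshotId) {
    return table.statisticsFiles().stream()
        .filter(f -> f.snapshotId() == snapshotId)        // stats written for the scanned snapshot
        .flatMap(f -> f.blobMetadata().stream())          // each blob entry in the statistics file
        .filter(b -> RECOGNIZED_TYPES.contains(b.type())) // skip unrecognized blob types
        .collect(Collectors.toList());
  }
}

The point is simply that an unrecognized blob type reduces to "no usable stats for this snapshot" rather than a query failure, which matches the reader behaviour proposed earlier in the thread.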