Hi,

I’m looking to better understand the intent of some of the language around table statistics and related Puffin file usage. From my read of the spec, which may be overly pedantic, it seems like attaching anything other than NDV + an associated compact theta sketch is not compliant with the spec:
In the section on Table Statistics<http://iceberg.apache.org/spec/#table-statistics> it’s explicit that statistics are meant to be informational only, and that readers can ignore statistics at will: “Statistics are informational. A reader can choose to ignore statistics information. Statistics support is not required to read the table correctly.” However, it says earlier that statistics are stored in ‘valid Puffin files’, which can contain exactly two blob types<https://iceberg.apache.org/puffin-spec/#blob-types>: `apache-datasketches-theta-v1` and `deletion-vector-v1`.

I can appreciate that a reasonable engine author, upon encountering an unexpected blob type in a Puffin file, would ignore it, since statistics are purely informational. However, given that Puffin files are now both informational and critical for correctness (albeit in different contexts), I could see another reasonable engine author choosing to fail a query because the table isn’t compliant with the spec. “Breaking” a customer’s usage of their table is just about the worst thing we can do, so I’d really appreciate some community guidance here.

Asked explicitly:

1. Is an Iceberg table with a statistics file containing a blob type other than `apache-datasketches-theta-v1` and `deletion-vector-v1` a valid Iceberg table?
2. Should engines ignore unrecognized blob types in blob metadata structs and the associated statistics files?

--Carl