Hi,

I’m looking to better understand the intent of some of the language around 
table statistics and related Puffin file usage.  From my reading of the spec, 
which may be overly pedantic, it seems that attaching anything other than an 
NDV with an associated compact theta sketch is not compliant with the spec:

In the section on Table 
Statistics<http://iceberg.apache.org/spec/#table-statistics> it’s explicit that 
statistics are meant to be informational only, and that readers can ignore 
statistics at will: “Statistics are informational. A reader can choose to 
ignore statistics information. Statistics support is not required to read the 
table correctly.”  However, it says earlier that statistics are stored in 
‘valid puffin files’, which can contain exactly two blob 
types<https://iceberg.apache.org/puffin-spec/#blob-types>: 
`apache-datasketches-theta-v1` and `deletion-vector-v1`.

I can appreciate that a reasonable engine author, upon encountering an 
unexpected blob type in a Puffin file, would ignore it as statistics are purely 
informational.  However, given that Puffin files are now both informational 
(statistics) and critical for correctness (deletion vectors), I could see 
another reasonable engine author choosing to fail a query because the table 
doesn’t comply with the spec.  “Breaking” a customer’s usage of their table is 
just about the worst thing we can do, so I’d really appreciate some community 
guidance here.
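
To make the two behaviors concrete, here is a minimal sketch of the choice an 
engine faces.  It is not taken from any Iceberg implementation: the function 
name, the `strict` flag, and the dict-shaped blob metadata are all made up for 
illustration, and `my-histogram-v1` is a hypothetical unrecognized blob type.

```python
# Blob types currently listed in the Puffin spec.
KNOWN_BLOB_TYPES = {"apache-datasketches-theta-v1", "deletion-vector-v1"}


class UnknownBlobTypeError(ValueError):
    """Raised by the hypothetical 'strict' engine on an unrecognized blob type."""


def select_blobs(blob_metadata, strict=False):
    """Return the blobs this engine understands.

    blob_metadata: list of dicts with at least a "type" key (assumed shape).
    strict=False: skip unrecognized blob types (the 'ignore' reading).
    strict=True:  fail on the first unrecognized type (the 'not compliant' reading).
    """
    usable = []
    for blob in blob_metadata:
        if blob["type"] in KNOWN_BLOB_TYPES:
            usable.append(blob)
        elif strict:
            raise UnknownBlobTypeError(f"unrecognized blob type: {blob['type']}")
        # Lenient mode: silently skip the blob and keep going.
    return usable


blobs = [
    {"type": "apache-datasketches-theta-v1"},
    {"type": "my-histogram-v1"},  # hypothetical engine-specific statistic
]
lenient_result = select_blobs(blobs)  # keeps only the theta sketch blob
```

With the same input, `select_blobs(blobs, strict=True)` raises instead, which 
is the outcome I’m worried about.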

Asked explicitly: Is an Iceberg table with a statistics file containing a blob 
type other than `apache-datasketches-theta-v1` or `deletion-vector-v1` a valid 
Iceberg table?  Should engines ignore unrecognized blob types in blob metadata 
structs and the associated statistics files?

--Carl
