Thanks all for the input,

> But it cannot be a `deletion-vector-v1`
Yes, sorry – I was too liberal with the copy and paste there.

> I recommend discussing with the community and contributing that stats to 
> Iceberg puffin spec and standardizing to support interoperability with other 
> engines.
Agree wholeheartedly, and I hope to do so in the coming months.  That said, I’d 
like to formalize that readers should not fail when they encounter statistic 
types they don’t recognize.  I feel this is especially important for 
environments that may trail the latest spec/release version by a considerable 
margin.  Assume we agree on and develop support for the most common values as 
part of a 1.13 release; I’d like to be able to safely provide those statistics 
on tables in mixed environments running 1.10 and 1.13.
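To illustrate, the lenient reader behavior I have in mind might look roughly 
like this (a hypothetical Python sketch, not Iceberg’s actual API; 
`my-engine-custom-stat-v1` is an invented blob type):

```python
# Hypothetical sketch of a lenient statistics reader: keep the blob-metadata
# entries whose type this reader understands and silently skip the rest,
# rather than failing the whole read. Not Iceberg's actual API.

# Blob types this (imaginary) reader understands; only the theta sketch
# is standardized as a statistics blob today.
RECOGNIZED_STAT_BLOB_TYPES = {"apache-datasketches-theta-v1"}

def usable_stat_blobs(blob_metadata):
    """Filter a statistics file's blob-metadata list down to known types."""
    # Unrecognized types are ignored, not errors: statistics are
    # informational and not required to read the table correctly.
    return [b for b in blob_metadata if b["type"] in RECOGNIZED_STAT_BLOB_TYPES]

# A 1.10-era reader scanning a statistics file written by a newer engine:
footer_blobs = [
    {"type": "apache-datasketches-theta-v1", "fields": [1]},
    {"type": "my-engine-custom-stat-v1", "fields": [2]},  # invented example
]
print([b["type"] for b in usable_stat_blobs(footer_blobs)])
# -> ['apache-datasketches-theta-v1']
```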

In terms of changes to the spec, I don’t think I have anything in mind right 
now for Puffin, but I would be curious to hear opinions on changing the Table 
spec to read something to the effect of:
“… Table statistics files are valid Puffin files 
(https://iceberg.apache.org/puffin-spec/). Statistics are informational. A 
reader can choose to ignore statistics information and should ignore 
unrecognized blob types. Statistics support is not required to read the table 
correctly. …”

--Carl

From: Gábor Kaszab <gaborkas...@apache.org>
Reply to: "dev@iceberg.apache.org" <dev@iceberg.apache.org>
Date: Wednesday, 6 August 2025 at 13:19
To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>
Cc: "summ...@amazon.co.uk.invalid" <summ...@amazon.co.uk.invalid>
Subject: RE: [EXTERNAL] Clarification on the flexibility of table statistics 
information



Hi,

Puffin files are defined in a way that, in practice, lets engines put anything 
into them, not just what is standardized by the spec. I recall that Hive 
serializes its own stats object and stores it in Puffin. However, since this 
information is not standardized, no engine other than the writer can be 
expected to understand it. In practice, engines simply ignore the blob types in 
Puffin that they don't understand.
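For readers unfamiliar with the format, the layout that makes this possible can 
be sketched roughly as follows (a simplified, illustrative Python sketch of the 
Puffin file structure; required blob-metadata fields such as `fields`, 
`snapshot-id`, and `sequence-number` are omitted, and this is not production 
code):

```python
import json
import struct

MAGIC = b"PFA1"  # Puffin magic bytes: 0x50 0x46 0x41 0x31

def write_puffin(blobs):
    """Write a minimal Puffin-like file from (blob_type, payload_bytes) pairs.

    Simplified: real blob metadata also carries fields, snapshot-id,
    sequence-number, etc.; flags and compression are left at zero/none.
    """
    body = MAGIC  # the file starts with the magic
    metadata = []
    for blob_type, data in blobs:
        # The footer records where each blob lives; the type string is
        # free-form, which is how engines can store non-standard blobs.
        metadata.append({"type": blob_type, "offset": len(body), "length": len(data)})
        body += data
    payload = json.dumps({"blobs": metadata}).encode("utf-8")
    # Footer: magic, JSON payload, 4-byte little-endian payload size,
    # 4 flag bytes (all zero here), trailing magic.
    footer = MAGIC + payload + struct.pack("<i", len(payload)) + b"\x00" * 4 + MAGIC
    return body + footer

data = write_puffin([("my-engine-custom-stat-v1", b"opaque-bytes")])
```

A reader walks the footer’s JSON blob list and can simply skip any `type` it 
does not recognize.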

Is there any specific addition to the Puffin spec you have in mind?

Regards,
Gabor

Ajantha Bhat <ajanthab...@gmail.com> wrote (on Wed, 6 Aug 2025 at 9:09):
Hi,
> From my read of the spec, which may be overly pedantic, it seems like 
> attaching anything other than NDV + an associated compact theta sketch is 
> not compliant with the spec:

True.
> In the section on Table Statistics 
> (http://iceberg.apache.org/spec/#table-statistics) it’s explicit that 
> statistics are meant to be informational only, and that readers can ignore 
> statistics at will: “Statistics are informational. A reader can choose to 
> ignore statistics information. Statistics support is not required to read 
> the table correctly.”  However, it says earlier that statistics are stored 
> in ‘valid puffin files’, which can contain exactly two blob types 
> (https://iceberg.apache.org/puffin-spec/#blob-types): 
> `apache-datasketches-theta-v1` and `deletion-vector-v1`.

Yes. Statistics are optional.
But it cannot be a `deletion-vector-v1`. It is the other way around: Puffin 
files can be used to store statistics as well as indexes and delete files 
(like deletion vectors). When we enhanced the Puffin spec to include 
`deletion-vector-v1`, we didn't update the table spec's statistics section to 
clarify that a statistics blob can only be `apache-datasketches-theta-v1`. 
Feel free to open a PR to clarify it.
> Asked explicitly: Is an Iceberg table with a Statistics file containing a 
> blob type other than `apache-datasketches-theta-v1` and `deletion-vector-v1` 
> a valid Iceberg table?  Should engines ignore unrecognized blob types in 
> blob metadata structs and associated statistics file?

Yes. Engines should ignore it, as stats are optional.
Currently, the Spark integration 
(https://github.com/apache/iceberg/blob/772c8275598e43d2c5ef029bfe83aeaa6c713e8a/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L226) 
ignores it.

Lastly, I sense that you have proprietary stats stored in Puffin files as a new 
blob type and are figuring out how interoperability works if other engines 
cannot understand them. I recommend discussing this with the community, 
contributing those stats to the Iceberg Puffin spec, and standardizing them to 
support interoperability with other engines.

- Ajantha

On Tue, Aug 5, 2025 at 8:44 PM Summers, Carl <summ...@amazon.co.uk.invalid> 
wrote:
Hi,

I’m looking to better understand the intent of some of the language around 
table statistics and related Puffin file usage.  From my read of the spec, 
which may be overly pedantic, it seems like attaching anything other than NDV + 
an associated compact theta sketch is not compliant with the spec:

In the section on Table Statistics 
(http://iceberg.apache.org/spec/#table-statistics) it’s explicit that 
statistics are meant to be informational only, and that readers can ignore 
statistics at will: “Statistics are informational. A reader can choose to 
ignore statistics information. Statistics support is not required to read the 
table correctly.”  However, it says earlier that statistics are stored in 
‘valid puffin files’, which can contain exactly two blob types 
(https://iceberg.apache.org/puffin-spec/#blob-types): 
`apache-datasketches-theta-v1` and `deletion-vector-v1`.

I can appreciate that a reasonable engine author, upon encountering an 
unexpected blob type in a Puffin file, would ignore it, as statistics are 
purely informational.  However, given that Puffin files are now both 
informational and critical for correctness (albeit in different contexts), I 
could see another reasonable engine author choosing to fail a query because the 
table isn’t compliant with the spec.  “Breaking” a customer’s usage of their 
table is just about the worst thing we can do, so I’d really appreciate some 
community guidance here.

Asked explicitly: Is an Iceberg table with a Statistics file containing a blob 
type other than `apache-datasketches-theta-v1` and `deletion-vector-v1` a valid 
Iceberg table?  Should engines ignore unrecognized blob types in blob metadata 
structs and associated statistics file?

--Carl
