Hi,

Puffin files are defined in such a way that, in practice, engines can put
anything into them, not just what the spec standardizes. I recall that
Hive serializes its own stats objects and stores them in Puffin. However,
since this information is not standardized, no engine other than the
writer can be expected to understand it. In practice, engines simply
ignore the blob types in Puffin that they don't understand.
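
To make the "ignore what you don't understand" behaviour concrete, here is a
rough sketch in Python. It is not Iceberg code: the function name, the
engine-specific blob type and the sample footer are made up for illustration.
The only thing taken from the Puffin spec is that the footer payload is JSON
with a "blobs" list whose entries carry a "type" field.

```python
import json

# Blob types this hypothetical reader knows how to interpret.
KNOWN_BLOB_TYPES = {"apache-datasketches-theta-v1"}


def readable_blobs(footer_json, known_types=KNOWN_BLOB_TYPES):
    """Return the blob metadata entries whose type the reader recognizes.

    Entries with any other type are skipped instead of being treated as errors.
    """
    footer = json.loads(footer_json)
    return [blob for blob in footer.get("blobs", []) if blob.get("type") in known_types]


# Made-up footer payload: one standardized blob and one engine-specific blob.
sample_footer = json.dumps({
    "blobs": [
        {"type": "apache-datasketches-theta-v1", "fields": [1], "offset": 4, "length": 32},
        {"type": "example-engine-stats-v1", "fields": [2], "offset": 36, "length": 64},
    ]
})

kept = readable_blobs(sample_footer)
print([blob["type"] for blob in kept])
```

The point of the sketch is only that an unknown type is filtered out quietly,
so a reader that knows nothing about the engine-specific blob still works.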

Is there any specific addition to the Puffin spec you have in mind?

Regards,
Gabor

Ajantha Bhat <ajanthab...@gmail.com> wrote (on Wed, Aug 6, 2025,
9:09):

> Hi,
>
> From my read of the spec, which may be overly pedantic, it seems like
>> attaching anything other than NDV + an associated compact theta sketch is
>> *not* compliant with the spec:
>
>
> True.
>
> In the section on Table Statistics
>> <http://iceberg.apache.org/spec/#table-statistics> it’s explicit that
>> statistics are meant to be informational only, and that readers can ignore
>> statistics at will: “Statistics are informational. A reader can choose
>> to ignore statistics information. Statistics support is not required to
>> read the table correctly.”  However, it says earlier that statistics are
>> stored in ‘valid puffin files’, which can contain *exactly two* blob
>> types <https://iceberg.apache.org/puffin-spec/#blob-types>:
>> `apache-datasketches-theta-v1` and `deletion-vector-v1`.
>
>
> Yes. Statistics are optional.
> But it cannot be a `deletion-vector-v1`. It is the other way around:
> Puffin files can be used to store statistics as well as indexes and delete
> files (like deletion vectors). When we enhanced the Puffin spec to include
> `deletion-vector-v1`, we didn't update the table statistics spec to clarify
> that a statistics blob can only be `apache-datasketches-theta-v1`. Feel free
> to open a PR to clarify it.
>
> Asked explicitly: Is an Iceberg table with a Statistics file containing a
>> blob type other than `apache-datasketches-theta-v1` and
>> `deletion-vector-v1` a valid Iceberg table?  Should engines ignore
>> unrecognized blob types in blob metadata structs and associated statistics
>> file?
>
>
> Yes. Engines should ignore it as stats are optional.
> Currently the Spark integration
> <https://github.com/apache/iceberg/blob/772c8275598e43d2c5ef029bfe83aeaa6c713e8a/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L226>
> ignores it.
>
> Lastly, I sense that you have proprietary stats stored in Puffin files as
> a new blob type and are figuring out how interoperability works if other
> engines cannot understand it. I recommend discussing this with the
> community, contributing those stats to the Iceberg Puffin spec, and
> standardizing them to support interoperability with other engines.
>
> - Ajantha
>
> On Tue, Aug 5, 2025 at 8:44 PM Summers, Carl <summ...@amazon.co.uk.invalid>
> wrote:
>
>> Hi,
>>
>>
>>
>> I’m looking to better understand the intent of some of the language
>> around table statistics and related puffin file usage.  From my read of the
>> spec, which may be overly pedantic, it seems like attaching anything other
>> than NDV + an associated compact theta sketch is *not* compliant with
>> the spec:
>>
>>
>>
>> In the section on Table Statistics
>> <http://iceberg.apache.org/spec/#table-statistics> it’s explicit that
>> statistics are meant to be informational only, and that readers can ignore
>> statistics at will: “Statistics are informational. A reader can choose
>> to ignore statistics information. Statistics support is not required to
>> read the table correctly.”  However, it says earlier that statistics are
>> stored in ‘valid puffin files’, which can contain *exactly two* blob
>> types <https://iceberg.apache.org/puffin-spec/#blob-types>:
>> `apache-datasketches-theta-v1` and `deletion-vector-v1`.
>>
>>
>>
>> I can appreciate that a reasonable engine author, upon encountering an
>> unexpected blob type in a Puffin file, would ignore it as statistics are
>> purely informational.  However, given that puffin files are now both
>> informational and critical for correctness (albeit in different contexts),
>> I could see another reasonable engine author choosing to fail a query as
>> the table isn’t compliant to the spec.  “Breaking” a customer’s usage of
>> their table is just about the worst thing we can do, so I’d really
>> appreciate some community guidance here.
>>
>>
>>
>> Asked explicitly: Is an Iceberg table with a Statistics file containing a
>> blob type other than `apache-datasketches-theta-v1` and
>> `deletion-vector-v1` a valid Iceberg table?  Should engines ignore
>> unrecognized blob types in blob metadata structs and associated statistics
>> file?
>>
>>
>>
>> --Carl
>>
>
