[ https://issues.apache.org/jira/browse/IMPALA-15004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18082041#comment-18082041 ]
Zoltán Borók-Nagy commented on IMPALA-15004:
--------------------------------------------
I think we should at least give Option 1 (Theta sketches) a chance because of
interoperability, even though we will still need to write an Impala-specific
blob to store the non-NDV stats.
If stats written by another engine are available, we could just build upon
them, and vice versa.
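
To illustrate why a sketch-based NDV blob makes "building upon" another
engine's stats possible, here is a minimal toy example of a mergeable
distinct-count sketch. It is a plain K-Minimum-Values sketch, which is the idea
Theta sketches generalize; it is NOT the DataSketches implementation and does
not read or write the Puffin blob format. The class name KmvSketch and its
methods (update, merge, estimate) are made up for this illustration.

```python
import hashlib


class KmvSketch:
    """Toy K-Minimum-Values NDV sketch (illustrative only, not Theta/Puffin)."""

    def __init__(self, k=1024):
        self.k = k
        self.theta = 1.0      # only hashes below this threshold are retained
        self.hashes = set()   # at most k smallest hashes seen so far, in [0, 1)

    @staticmethod
    def _hash(value):
        # Map a value to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, value):
        h = self._hash(value)
        if h < self.theta:
            self.hashes.add(h)
            if len(self.hashes) > self.k:
                # Drop the largest hash; the new largest becomes the threshold.
                self.hashes.remove(max(self.hashes))
                self.theta = max(self.hashes)

    def merge(self, other):
        """Return a new sketch summarizing the union of both inputs."""
        result = KmvSketch(min(self.k, other.k))
        smallest = sorted(self.hashes | other.hashes)[:result.k]
        result.hashes = set(smallest)
        if len(smallest) == result.k:
            result.theta = smallest[-1]
        return result

    def estimate(self):
        if len(self.hashes) < self.k:
            return float(len(self.hashes))   # fewer than k distinct values: exact
        return (self.k - 1) / self.theta     # standard KMV estimator
```

In this toy model, one engine could sketch the files it wrote, another engine
could sketch later files, and merging the two sketches gives an NDV estimate
for the union without rescanning the data either side already covered.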
With today's LLMs I think it should be fairly easy to generate enough test
coverage.
> Puffin stats writer for Iceberg tables
> --------------------------------------
>
> Key: IMPALA-15004
> URL: https://issues.apache.org/jira/browse/IMPALA-15004
> Project: IMPALA
> Issue Type: New Feature
> Reporter: Zoltán Borók-Nagy
> Assignee: Mihaly Szjatinya
> Priority: Major
> Labels: impala-iceberg, impala-iceberg-active-backlog
>
> Currently COMPUTE STATS stores column statistics only in HMS.
> Iceberg has Puffin files for this purpose, but currently there's only a
> single blob type (Apache DataSketches Theta sketches) we can store, and it
> only supports NDV.
> Impala should comply with Iceberg's standards and write Puffin files. The
> stats that cannot be stored in well-known Iceberg Puffin blob types could be
> stored in custom Impala blobs. That way all statistics information could be
> retrieved from a single place.