[ 
https://issues.apache.org/jira/browse/IMPALA-13370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900652#comment-17900652
 ] 

ASF subversion and git services commented on IMPALA-13370:
----------------------------------------------------------

Commit e5919f13f93ae6e5cfa9fb01219bddb84c9cc474 in impala's branch 
refs/heads/master from Daniel Becker
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=e5919f13f ]

IMPALA-13370: Read Puffin stats from metadata.json property if available

When Trino writes Puffin stats for a column, it includes the NDV as a
property (with key "ndv") in the "statistics" section of the
metadata.json file, in addition to the Theta sketch in the Puffin file.
When we are only reading the stats and not writing/updating them, it is
enough to read this property if it is present.

After this change, Impala only opens and reads a Puffin stats file if it
contains stats for at least one column for which the "ndv" property is
not set in the metadata.json file.
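
The rule above can be sketched in a few lines of Python. This is only an
illustration, not Impala's actual frontend code (which is Java): the function
names are hypothetical, the JSON keys are taken from the example "statistics"
section in the issue description, and it assumes that only Theta sketch blobs
are relevant for NDV.

```python
# Illustrative sketch only; function names are hypothetical and this is not
# Impala's implementation. Keys follow the "statistics" example in the issue.
THETA_BLOB_TYPE = "apache-datasketches-theta-v1"


def columns_missing_ndv(stats_file):
    """Return the field ids whose blob metadata lacks an "ndv" property."""
    missing = []
    for blob in stats_file.get("blob-metadata", []):
        # Assumption: only Theta sketch blobs carry NDV information.
        if blob.get("type") != THETA_BLOB_TYPE:
            continue
        if "ndv" not in blob.get("properties", {}):
            missing.extend(blob.get("fields", []))
    return missing


def should_open_puffin_file(stats_file):
    """Open the Puffin file only if at least one column has no "ndv" property."""
    return bool(columns_missing_ndv(stats_file))
```

With this shape, a statistics entry in which every blob has the "ndv"
property never triggers a read of the Puffin file, which is the case the new
test exercises with corrupt sketches.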

Testing:
 - added a test in test_iceberg_with_puffin.py that verifies that the
   Puffin stats file is not read if the metadata.json file contains
   the NDV property. It uses the newly added stats file with corrupt
   datasketches: 'metadata_ndv_ok_sketches_corrupt.stats'.

Change-Id: I5e92056ce97c4849742db6309562af3b575f647b
Reviewed-on: http://gerrit.cloudera.org:8080/21959
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Read Puffin stats from metadata.json property if available
> ----------------------------------------------------------
>
>                 Key: IMPALA-13370
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13370
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>            Reporter: Daniel Becker
>            Assignee: Daniel Becker
>            Priority: Major
>              Labels: impala-iceberg
>
> When Trino writes Puffin stats for a column, it includes the NDV as a 
> property in the "statistics" section of the metadata.json file, in addition 
> to the Theta sketch in the Puffin file. When we are only reading the stats 
> and not writing/updating them, it would be enough to read this property if it 
> is present.
> An example of the "statistics" section:
> {code:java}
> "statistics" : [ {
>     "snapshot-id" : 1226095104912303892,
>     "statistics-path" : "hdfs://localhost:20500/test-warehouse/iceberg_test/hadoop_catalog/ice/iceberg_query_metadata/metadata/20240829_112839_00004_p6sck-7f433a45-607b-4561-89a3-fc4c58ef60d9.stats",
>     "file-size-in-bytes" : 306,
>     "file-footer-size-in-bytes" : 257,
>     "blob-metadata" : [ {
>       "type" : "apache-datasketches-theta-v1",
>       "snapshot-id" : 1226095104912303892,
>       "sequence-number" : 4,
>       "fields" : [ 1 ],
>       "properties" : {
>         "ndv" : "2"
>       }
>     } ]
>   } ]{code}
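
As a rough illustration of reading the property from such a section, the
Python sketch below maps each field id to its NDV. It is not Impala's code:
the function name is hypothetical, the input is a trimmed copy of the example
above (path and size fields omitted, wrapped in braces to form valid JSON),
and it assumes the "ndv" value is a decimal string, as in the example.

```python
import json

# Trimmed copy of the example "statistics" section, wrapped in an object.
METADATA_FRAGMENT = """
{
  "statistics" : [ {
    "snapshot-id" : 1226095104912303892,
    "blob-metadata" : [ {
      "type" : "apache-datasketches-theta-v1",
      "snapshot-id" : 1226095104912303892,
      "sequence-number" : 4,
      "fields" : [ 1 ],
      "properties" : {
        "ndv" : "2"
      }
    } ]
  } ]
}
"""


def ndv_by_field_id(statistics):
    """Hypothetical helper: map field id -> NDV for blobs that have "ndv"."""
    result = {}
    for stats_file in statistics:
        for blob in stats_file.get("blob-metadata", []):
            ndv = blob.get("properties", {}).get("ndv")
            if ndv is None:
                continue  # no property; the sketch in the Puffin file would be needed
            for field_id in blob.get("fields", []):
                # Assumption: the value is stored as a decimal string.
                result[field_id] = int(ndv)
    return result


metadata = json.loads(METADATA_FRAGMENT)
print(ndv_by_field_id(metadata["statistics"]))  # prints {1: 2}
```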



