Re: [DISCUSS] Unencoded Variable Length Column Size Statistics

Ajantha Bhat Mon, 15 Jul 2024 06:16:17 -0700

Hi Samrose,

Thanks for the proposal.
+1 from my side, as Iceberg should definitely leverage all the info provided
by Parquet.
This can help in query planning (especially as joins and exchanges operate
on raw data).
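
To make the planning point concrete, here is a rough sketch in Python. It is
only an illustration: the function name, the 10 MiB threshold, and the byte
counts are all made up, and none of it is an Iceberg or engine API. It shows
how a planner that only sees the compressed size could choose a broadcast
join for a column that is far larger once decoded.

```python
# Hypothetical sketch: none of these names are Iceberg or query-engine APIs.
BROADCAST_LIMIT_BYTES = 10 * 1024 * 1024  # assumed 10 MiB broadcast threshold


def pick_join_strategy(estimated_bytes: int, limit: int = BROADCAST_LIMIT_BYTES) -> str:
    """Broadcast a join side only if its estimated in-memory size fits the limit."""
    return "broadcast" if estimated_bytes <= limit else "shuffle"


# Made-up numbers for one string column of a dimension table.
compressed_bytes = 4 * 1024 * 1024   # what `columnSizes` reports today
unencoded_bytes = 60 * 1024 * 1024   # what the proposed statistic would report

plan_from_compressed = pick_join_strategy(compressed_bytes)
plan_from_unencoded = pick_join_strategy(unencoded_bytes)
print(plan_from_compressed, plan_from_unencoded)
```

With these assumed numbers the two estimates lead to different plans, which
is the kind of mismatch the unencoded size would help avoid.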

I have also tagged Micah on the proposal, as he worked on the same thing on
the Parquet side.

Note:
Iceberg currently uses Parquet 1.13.1, which depends on parquet-format 2.9.0
<https://github.com/apache/parquet-java/blob/apache-parquet-1.13.1/pom.xml#L74>.
So we need to bump the Parquet version to 1.14.1, which uses
parquet-format 2.10.0, to leverage these stats.
Fokko has an open PR for this
(https://github.com/apache/iceberg/pull/10209), but it has some blockers.

- Ajantha

On Mon, Jul 15, 2024 at 1:54 PM Samrose Ahmed <samroseah...@gmail.com>
wrote:

> Hello,
>
> I have added a proposal to be able to optionally track uncompressed
> unencoded column size statistics for variable length columns. Currently, it
> isn't possible to estimate memory size of variable length columns as
> `columnSizes` only contains compressed sizes.
>
> I've created an issue (https://github.com/apache/iceberg/issues/10703)
> and a document (
> https://docs.google.com/document/d/189kIZxx_dUloBCDPUz2Fh0BBOZSm2fXHHXWpdpq3DrU),
> would appreciate any feedback.
>
> Thanks,
> Samrose
>
