Hi Everyone,

Apologies if you get this email twice; for some reason I am unable to see
the email I sent earlier on the mailing list, so I am sending it again.

I wanted to revisit the discussion about using partition stats for min/max
values and null counts. It seems we might need to compute the null count at
query time in any case. During manifest scanning, some data files may be
filtered out based on query predicates, so if the null counts are collected
statically, the number of rows that survive the scan can end up smaller
than the recorded number of nulls for a partition or table. In that case
Spark may incorrectly estimate zero rows for an isNotNull predicate, since
it roughly estimates the non-null rows as rowCount - nullCount.
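To make this concrete, here is a rough, self-contained Java sketch (the
FileStats record, the numbers, and estimateIsNotNull are invented for
illustration and are not Iceberg or Spark APIs). It shows how a statically
collected null count can exceed the post-filter row count and drive the
isNotNull estimate to zero, while a null count recomputed from the
surviving files does not:

// Rough sketch, not Iceberg/Spark code: shows why a null count collected
// before planning can break Spark's isNotNull estimate after file pruning.
import java.util.List;

public class NullCountSketch {

    // Hypothetical per-file stats, as they might appear in a manifest entry.
    record FileStats(long rowCount, long nullCount, boolean matchesPredicate) {}

    // Spark-style isNotNull estimate: non-null rows ~= rowCount - nullCount,
    // clamped at zero.
    static long estimateIsNotNull(long rowCount, long nullCount) {
        return Math.max(rowCount - nullCount, 0);
    }

    public static void main(String[] args) {
        List<FileStats> partitionFiles = List.of(
            new FileStats(100, 95, false), // pruned by the query predicates
            new FileStats(100, 10, true)); // survives manifest filtering

        // Partition-level null count collected statically, over all files.
        long staticNullCount = partitionFiles.stream()
            .mapToLong(FileStats::nullCount).sum();           // 105

        // Row and null counts over only the files that survive filtering.
        long scannedRows = partitionFiles.stream()
            .filter(FileStats::matchesPredicate)
            .mapToLong(FileStats::rowCount).sum();            // 100
        long scannedNulls = partitionFiles.stream()
            .filter(FileStats::matchesPredicate)
            .mapToLong(FileStats::nullCount).sum();           // 10

        // Stale null count exceeds the post-filter row count -> estimate is 0.
        System.out.println(estimateIsNotNull(scannedRows, staticNullCount)); // 0
        // Null count recomputed at query time gives a sensible estimate.
        System.out.println(estimateIsNotNull(scannedRows, scannedNulls));    // 90
    }
}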

However, min/max values can still be pre-computed at the partition level,
as they remain valid as lower and upper bounds even with additional
filtering.

Any thoughts? If collecting null counts (and possibly min/max values) on
the fly seems reasonable, I can open a PR to implement it.

Thanks,

Guy
