Hi Jack,

For the NDV sketch, how much space does it typically take per file?  One
concern is that it might increase manifest file size.  One way to make
additional statistics more palatable might be to use a columnar format as
an alternative to Avro, so that there is less overhead for readers that
don't need the additional stats (I think this has been discussed previously
as a possible area of interest).
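
To make the size question concrete, here is a minimal sketch of a mergeable
distinct-count summary. It assumes a plain K-Minimum-Values estimator as a
stand-in for the theta sketch, and the names (`K`, `build_sketch`,
`merge_sketches`, `estimate_ndv`) are hypothetical, not Iceberg or Puffin
APIs. In this toy version a sketch holds at most `K` 64-bit hashes, so it
costs at most 8 * K bytes per column per file; the real theta sketch
encoding may differ.

```python
import hashlib
import heapq

K = 1024              # hypothetical number of retained hashes per sketch
HASH_SPACE = 2 ** 64  # hashes are treated as uniform over [0, 2^64)


def _hash64(value):
    """Map a value to a 64-bit integer hash (illustrative choice of hash)."""
    digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build_sketch(values, k=K):
    """Per-file sketch: the k smallest distinct hashes, in ascending order."""
    return heapq.nsmallest(k, {_hash64(v) for v in values})


def merge_sketches(sketches, k=K):
    """Combine file-level sketches, e.g. for the files that survive pruning."""
    merged = set()
    for sketch in sketches:
        merged.update(sketch)
    return heapq.nsmallest(k, merged)


def estimate_ndv(sketch, k=K):
    """Exact count below k entries, otherwise the (k - 1) / U_k estimate."""
    if len(sketch) < k:
        return float(len(sketch))
    return (k - 1) * HASH_SPACE / sketch[-1]


def sketch_size_bytes(sketch):
    """Raw payload size of this toy sketch: 8 bytes per retained hash."""
    return 8 * len(sketch)
```

Merging the sketches of two files gives the same result as sketching the
union of their values, which is the property that makes file-level sketches
composable after predicate-based file pruning.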

For item #2, I think this would be valuable, but at the moment, at least
for Parquet, memory usage for variable-size columns isn't easily available
(only the encoded size is).  [1] is aimed at addressing this.
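
As a rough illustration of why encoded size is a poor proxy for memory
usage, the toy calculation below compares the decoded size of a string
column with a simplified dictionary encoding. The 4-byte length prefix and
the bit-packed indexes are assumptions made for the example, not Parquet's
actual layout, and compression is ignored entirely.

```python
def plain_size_bytes(values):
    """Decoded size: UTF-8 bytes of every value plus a 4-byte length each."""
    return sum(len(v.encode("utf-8")) + 4 for v in values)


def dictionary_encoded_size_bytes(values):
    """Simplified encoded size: one dictionary entry per distinct value,
    plus a bit-packed dictionary index for every row."""
    distinct = set(values)
    dict_bytes = sum(len(v.encode("utf-8")) + 4 for v in distinct)
    bit_width = max(1, (len(distinct) - 1).bit_length())
    index_bytes = (len(values) * bit_width + 7) // 8
    return dict_bytes + index_bytes


# A low-cardinality string column: 3 distinct values repeated over 3000 rows.
regions = ["us-east-1", "us-west-2", "eu-central-1"] * 1000
decoded = plain_size_bytes(regions)
encoded = dictionary_encoded_size_bytes(regions)
```

With these made-up values the decoded size is 42000 bytes against 792
bytes encoded, so a reader that only sees the encoded size would
underestimate memory by a large factor.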

Thanks,
Micah

[1] https://github.com/apache/parquet-format/pull/197

On Tue, Jun 6, 2023 at 11:18 AM Jack Ye <yezhao...@gmail.com> wrote:

> Hi everyone,
>
> I would like to know what people think about adding additional indexes for
> data files in manifests. I remember we talked about this topic multiple
> times in the past but never got to any conclusion.
>
> There are 2 use cases we are thinking about at this moment:
>
> 1. *NDV sketch*: we have been evaluating the NDV theta sketch stored in
> Puffin, which is generated at table level. At data file level, we only
> store the distinct_count values (long). It seems like both are not as
> useful as directly storing file level NDV sketch. The individual sketches
> can be aggregated to produce the estimated NDV when predicates are applied
> on the table. This seems even more useful than partition level stats
> because we can improve estimates with non-partition column predicates.
>
> 2. *Column size*: at this moment, column size is the size on disk instead
> of the size in memory. This does not help much in estimating the expected
> memory usage of variable size column types. For example, currently some magic
> numbers
> <https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsReader.java#L194-L207>
> are used in Trino to estimate that.
>
> What is the community's thought around adding more stats / index storage
> in manifest for each data file, and updating writers to write more stats and
> indexes? Are we going to handle this case by case to add to the manifest
> spec, or should we make this a pluggable interface in manifest? Or should
> we extend Puffin to support these additional data file level indexes?
>
> I definitely want to do a more concrete design around this topic, but
> would like to know some general ideas first around this subject.
>
> Best,
> Jack Ye
>
>
>
