Hi Jack,

For the NDV sketch, how much space does it typically take per file? One concern is that it might increase manifest file size. I think one way to make additional statistics more palatable is to look at using a columnar format as an alternative to Avro, so that readers that don't need the additional stats pay less overhead (I think this has been discussed previously as a possible area of interest).

For item #2, I think this would be valuable, but at the moment, at least for Parquet, memory usage for variable-size columns isn't easily available (only the encoded size is). [1] is aimed at addressing this.

Thanks,
Micah

[1] https://github.com/apache/parquet-format/pull/197

On Tue, Jun 6, 2023 at 11:18 AM Jack Ye <yezhao...@gmail.com> wrote:

> Hi everyone,
>
> I would like to know what people think about adding additional indexes
> for data files in manifests. I remember we talked about this topic
> multiple times in the past but never reached a conclusion.
>
> There are 2 use cases we are thinking about at this moment:
>
> 1. *NDV sketch*: we have been evaluating the NDV theta sketch stored in
> Puffin, which is generated at table level. At data file level, we only
> store the distinct_count values (long). It seems like neither is as
> useful as directly storing a file-level NDV sketch. The individual
> sketches can be aggregated to produce the estimated NDV when predicates
> are applied on the table. This seems even more useful than
> partition-level stats because we can improve the estimate with
> non-partition column predicates.
>
> 2. *Column size*: at this moment, column size is the size on disk
> instead of the size in memory. This does not help much in estimating
> the expected memory usage of variable-size column types. For example,
> currently some magic numbers
> <https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsReader.java#L194-L207>
> are used in Trino to estimate that.
>
> What are the community's thoughts on adding more stats / index storage
> in the manifest for each data file, and updating writers to write more
> stats and indexes? Are we going to handle this case by case by adding
> to the manifest spec, or should we make this a pluggable interface in
> the manifest? Or should we extend Puffin to support these additional
> data-file-level indexes?
>
> I definitely want to do a more concrete design around this topic, but
> would like to know some general ideas first around this subject.
>
> Best,
> Jack Ye
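
To make the aggregation in use case 1 concrete, here is a minimal sketch of merging per-file NDV sketches for the files that survive a predicate. It uses a simple K-minimum-values sketch as a stand-in; it is not the theta sketch stored in Puffin, nor any Iceberg or Apache DataSketches API. The names `build_sketch`, `merge_sketches`, `estimate_ndv`, and the value of `K` are all assumptions made for illustration.

```python
import hashlib

K = 1024  # hashes kept per sketch; an assumed value, larger means lower error


def hash01(value):
    """Map a value to a roughly uniform float in the unit interval."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def build_sketch(values, k=K):
    """Per-file sketch: the k smallest distinct hash values in the file."""
    return sorted({hash01(v) for v in values})[:k]


def merge_sketches(sketches, k=K):
    """Union of sketches: the k smallest hashes across all inputs."""
    merged = set()
    for sketch in sketches:
        merged.update(sketch)
    return sorted(merged)[:k]


def estimate_ndv(sketch, k=K):
    """Exact count below k retained hashes, otherwise (k - 1) / k-th smallest."""
    if len(sketch) < k:
        return len(sketch)
    return round((k - 1) / sketch[k - 1])


# Three hypothetical data files with overlapping values of one column.
file_a = build_sketch(range(0, 6000))
file_b = build_sketch(range(4000, 10000))
file_c = build_sketch(range(8000, 12000))

# Suppose a predicate prunes file_c; only the surviving files are merged.
print(estimate_ndv(merge_sketches([file_a, file_b])))  # true NDV is 10000
print(estimate_ndv(merge_sketches([file_a, file_b, file_c])))  # true NDV is 12000
```

In this toy version, if each retained hash were stored as a 64-bit value, a sketch would take up to 8 * K bytes per column per file (8 KiB at K = 1024), which is the kind of per-file cost that would add to manifest size.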