Hi everyone, I would like to know what people think about adding additional indexes for data files in manifests. I remember we have talked about this topic multiple times in the past but never reached a conclusion.
There are 2 use cases we are thinking about at the moment:

1. *NDV sketch*: we have been evaluating the NDV theta sketch stored in Puffin, which is generated at the table level. At the data file level, we only store the distinct_count value (a long). Neither seems as useful as directly storing a file-level NDV sketch. The individual file sketches can be aggregated to produce an estimated NDV for just the files that match the predicates applied to the table (see the illustration in the P.S. below). This feels even more useful than partition-level stats, because we can improve the estimate even for predicates on non-partition columns.

2. *Column size*: at the moment, column size is the size on disk rather than the size in memory. This does not help much in estimating the expected memory usage of variable-size column types. For example, Trino currently uses some magic numbers <https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsReader.java#L194-L207> to estimate that.

What are the community's thoughts on adding more stats / index storage in the manifest for each data file, and updating writers to produce these additional stats and indexes? Should we handle this case by case by adding to the manifest spec, make it a pluggable interface in the manifest, or extend Puffin to support these additional data-file-level indexes?

I definitely want to do a more concrete design around this topic, but would like to hear some general ideas first.

Best,
Jack Ye
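P.S. For concreteness, here is a rough sketch of the aggregation idea in use case 1, written against the Apache DataSketches theta library (package name assumes a recent datasketches-java release). The class and method names below are purely illustrative, not an Iceberg API; they just assume each data file carries a serializable theta sketch for a column.

```java
import java.util.List;

import org.apache.datasketches.theta.CompactSketch;
import org.apache.datasketches.theta.SetOperation;
import org.apache.datasketches.theta.Union;
import org.apache.datasketches.theta.UpdateSketch;

public class PerFileNdvSketchExample {

  // Hypothetical writer side: build a theta sketch for one column of one data file.
  // In the proposal this would be stored per data file (in the manifest or in Puffin).
  static CompactSketch sketchForDataFile(long[] columnValues) {
    UpdateSketch sketch = UpdateSketch.builder().build();
    for (long value : columnValues) {
      sketch.update(value);
    }
    return sketch.compact();
  }

  // Hypothetical reader side: union the sketches of only the data files that survive
  // predicate / partition pruning, then read an NDV estimate for the scanned subset.
  static double estimateNdv(List<CompactSketch> matchingFileSketches) {
    Union union = SetOperation.builder().buildUnion();
    for (CompactSketch fileSketch : matchingFileSketches) {
      union.union(fileSketch);
    }
    return union.getResult().getEstimate();
  }

  public static void main(String[] args) {
    CompactSketch fileA = sketchForDataFile(new long[] {1, 2, 3, 4});
    CompactSketch fileB = sketchForDataFile(new long[] {3, 4, 5, 6});
    // The two files overlap on {3, 4}; the union estimates ~6 distinct values,
    // which a plain sum of per-file distinct_count (4 + 4) would overestimate.
    System.out.println(estimateNdv(List.of(fileA, fileB)));
  }
}
```

The point is that per-file sketches compose under union, so file-level pruning and NDV estimation work together, which neither a single table-level sketch nor raw per-file distinct_count values can give us.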