Hi everyone,

I would like to know what people think about adding additional indexes for
data files in manifests. I remember we have discussed this topic multiple
times in the past but never reached a conclusion.

There are two use cases we are thinking about at the moment:

1. *NDV sketch*: we have been evaluating the NDV theta sketch stored in
Puffin, which is generated at the table level. At the data file level, we
only store the distinct_count value (a long). Neither seems as useful as
directly storing a file-level NDV sketch: the individual file sketches can
be aggregated to produce an estimated NDV for just the files that survive
predicate pruning. This feels even more useful than partition-level stats,
because it can improve the estimate even for predicates on non-partition
columns. (A rough sketch of the aggregation is shown after this list.)

2. *Column size*: at the moment, column size is the size on disk rather
than the size in memory, which does not help much in estimating the
expected memory usage of variable-size column types. For example, Trino
currently uses some magic numbers
<https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsReader.java#L194-L207>
to estimate that. (See the second example after this list.)
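
To make the use case in (1) concrete, here is a minimal sketch of how
file-level theta sketches could be aggregated at planning time, using the
Apache DataSketches Java library. The writer-side hook and the way the
matching sketches are looked up are hypothetical, not existing Iceberg
APIs, and Union.union(Sketch) assumes a recent datasketches-java release:

  import org.apache.datasketches.theta.CompactSketch;
  import org.apache.datasketches.theta.SetOperation;
  import org.apache.datasketches.theta.Union;
  import org.apache.datasketches.theta.UpdateSketch;

  import java.util.List;

  public class FileLevelNdvSketchExample {

    // Hypothetical writer-side hook: build a theta sketch for one column
    // while writing a single data file.
    static CompactSketch sketchForDataFile(List<Long> columnValues) {
      UpdateSketch sketch = UpdateSketch.builder().build();
      for (long value : columnValues) {
        sketch.update(value);
      }
      return sketch.compact();
    }

    // Planning side: union only the sketches of the data files that
    // survive predicate pruning, then read off the NDV estimate for the
    // scanned subset of the table.
    static double estimateNdv(List<CompactSketch> matchingFileSketches) {
      Union union = SetOperation.builder().buildUnion();
      for (CompactSketch fileSketch : matchingFileSketches) {
        union.union(fileSketch);
      }
      return union.getResult().getEstimate();
    }
  }

The point is that per-file sketches compose, while a single table-level
sketch or a per-file distinct_count long does not.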
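
And for (2), a toy illustration of the gap: with only on-disk column
sizes, an engine has to guess per-type expansion factors, whereas a stored
file-level in-memory size (or average value width) would make the
computation direct. The factor values and the avgInMemoryWidthBytes stat
below are made up for illustration and are not Trino's actual constants:

  import java.util.Map;

  public class ColumnMemorySizeExample {

    // Stand-ins for the kind of magic numbers an engine must guess today,
    // because only the encoded/compressed on-disk size is available.
    private static final Map<String, Double> GUESSED_EXPANSION_FACTORS =
        Map.of("string", 3.0, "binary", 2.0);

    // Today: in-memory size estimated from on-disk size via a guessed factor.
    static long estimateFromDiskSize(String type, long onDiskBytes) {
      return (long) (onDiskBytes *
          GUESSED_EXPANSION_FACTORS.getOrDefault(type, 1.0));
    }

    // With a hypothetical file-level in-memory size stat, the estimate
    // becomes a straightforward multiplication.
    static long estimateFromStats(long valueCount, long avgInMemoryWidthBytes) {
      return valueCount * avgInMemoryWidthBytes;
    }
  }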

What are the community's thoughts on adding more stats / index storage to
the manifest for each data file, and updating writers to produce those
additional stats and indexes? Should we handle this case by case by adding
to the manifest spec, should we make it a pluggable interface in the
manifest, or should we extend Puffin to support these additional data file
level indexes?

I definitely want to put together a more concrete design for this, but
would first like to hear some general thoughts on the subject.

Best,
Jack Ye
