I would agree with including distinct counts. As you point out, there are a number of strategies the engine can employ based on additional information. You mentioned the non-overlapping bounds case; similarly, if the bounds overlap almost entirely, you might be able to assume an even distribution and average the counts. If the overall delta between the lower and upper bounds is narrow, you might even be able to choose the max value (at least for whole-number columns).
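For illustration only, here is a minimal sketch of how a planner might pick among those merge strategies from file-level bounds and distinct counts. The class, field names, and thresholds are hypothetical and not part of any Iceberg API; it just shows the sum / max / average choices described above.

```java
import java.util.List;

// Hypothetical per-file column stats; not an Iceberg type.
record FileColumnStats(long lowerBound, long upperBound, long distinctCount) {}

class DistinctCountEstimator {

  // Chooses a merge strategy for a whole-number column based on how the
  // file-level bounds relate to each other.
  static long estimate(List<FileColumnStats> files) {
    boolean disjoint = true;
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    long sum = 0;
    long maxCount = 0;

    for (int i = 0; i < files.size(); i++) {
      FileColumnStats f = files.get(i);
      min = Math.min(min, f.lowerBound());
      max = Math.max(max, f.upperBound());
      sum += f.distinctCount();
      maxCount = Math.max(maxCount, f.distinctCount());
      for (int j = 0; j < i; j++) {
        FileColumnStats g = files.get(j);
        if (f.lowerBound() <= g.upperBound() && g.lowerBound() <= f.upperBound()) {
          disjoint = false; // these two files' value ranges overlap
        }
      }
    }

    long overallRange = max - min + 1;

    if (disjoint) {
      // Non-overlapping bounds suggest the column is sorted across files,
      // so the per-file counts can simply be added together.
      return sum;
    } else if (overallRange <= maxCount * 2) {
      // Arbitrary illustrative threshold: the overall range is not much
      // larger than the largest per-file count, so for whole numbers the
      // true distinct count is close to the max (and capped by the range).
      return Math.min(maxCount, overallRange);
    } else {
      // Heavily overlapping bounds: assume an even distribution and
      // average the per-file counts as a rough estimate.
      return sum / files.size();
    }
  }
}
```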
Another alternative would be to use an approximate distinct count with some form of sketch/digest that would allow for better merging, but I feel the tradeoff in space and complexity may not net out to a better overall outcome (a rough illustration of that approach follows below the quoted message).

-Dan

On Thu, Jul 1, 2021 at 5:58 PM Ryan Blue <b...@tabular.io> wrote:

> Hi everyone,
>
> I'm working on finalizing the spec for v2 right now and one thing that's
> outstanding is the map of file-level distinct counts.
>
> This field has some history. I added it in the original spec because I
> thought we'd want distinct value counts for cost-based optimization in SQL
> planners. But we later removed it because the counts aren't mergeable,
> making it hard to determine what to do with file-level distinct counts. In
> some cases, you'd want to add them together (when sorted by the column)
> and in others you'd want to use the max across files. I thought that the
> idea of having counts was misguided, so we removed the column.
>
> I've recently talked with people working on SQL planners and they
> suggested adding the column back and populating it because even distinct
> counts that are hard to work with are better than nothing.
>
> There may also be heuristics for working with the counts that make it
> possible to get decent estimates across files. For example, if the column
> bounds do not overlap between files (like 0-10, 11-20, 21-30), that is an
> indication that the column is sorted and the distinct counts should be
> added together.
>
> Thanks to Yan, we now have a metrics framework we could use to populate
> these, although it would take some work to find a good way to estimate the
> distinct counts. For v2, should we add the distinct counts map back to
> file metadata and populate it?
>
> Ryan
>
> --
> Ryan Blue
> Tabular
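As a rough illustration of the sketch/digest alternative mentioned above, the snippet below uses the Apache DataSketches HLL implementation as one possible choice of mergeable sketch. The library choice, package names, and lgK setting are assumptions for the example, not something proposed in the thread.

```java
import org.apache.datasketches.hll.HllSketch;
import org.apache.datasketches.hll.Union;

public class SketchMergeExample {
  public static void main(String[] args) {
    // Per-file sketches, e.g. built while writing each data file.
    // lgK = 12 is an arbitrary accuracy/size tradeoff for this example.
    HllSketch fileA = new HllSketch(12);
    HllSketch fileB = new HllSketch(12);
    for (long v = 0; v < 1000; v++) {
      fileA.update(v);
      fileB.update(v + 500); // overlaps fileA by 500 values
    }

    // Unlike plain distinct counts, sketches merge correctly no matter how
    // the files' value ranges overlap.
    Union union = new Union(12);
    union.update(fileA);
    union.update(fileB);

    System.out.printf("fileA ~%.0f, fileB ~%.0f, merged ~%.0f (true 1500)%n",
        fileA.getEstimate(), fileB.getEstimate(), union.getEstimate());
  }
}
```

The space/complexity tradeoff Dan refers to is that each such sketch is on the order of kilobytes per column per file, versus a single integer for a plain distinct count.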