I should be clearer about why I didn't suggest a sketch as an option here. Sketches like HLL that can be merged are too large to go into data file metadata; I think partition-level stats and a secondary index structure are the right place for those. The question I want to think through here is whether it is worth adding back the column for non-mergeable, simple distinct counts.
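To make that concrete, here is a small, purely illustrative Python sketch (not Iceberg code or an Iceberg API) of how an engine could combine non-mergeable per-file distinct counts using the bounds-overlap heuristic discussed in the quoted thread below: add the counts when the per-file value ranges are disjoint, and otherwise fall back to the max as a conservative estimate. The combine_distinct_counts helper and its (lower, upper, distinct_count) tuples are assumptions made up for this example.

# Hypothetical illustration only -- not Iceberg's API or implementation.
# Each file is represented as (lower_bound, upper_bound, distinct_count),
# mirroring the per-column bounds and counts kept in data file metadata.

def combine_distinct_counts(files):
    """Estimate a column's distinct count across files from per-file counts."""
    ordered = sorted(files, key=lambda f: f[0])
    # Disjoint bounds (e.g. 0-10, 11-20, 21-30) suggest the data is sorted or
    # partitioned on this column, so per-file counts can simply be added.
    disjoint = all(prev[1] < curr[0] for prev, curr in zip(ordered, ordered[1:]))
    if disjoint:
        return sum(f[2] for f in ordered)
    # Overlapping bounds: the true distinct count is somewhere between the max
    # per-file count and the sum, so take the max as a conservative estimate.
    return max(f[2] for f in ordered)

# Sorted data: bounds don't overlap, so counts are additive.
print(combine_distinct_counts([(0, 10, 8), (11, 20, 9), (21, 30, 10)]))  # 27
# Unsorted data: bounds overlap heavily, so fall back to the max.
print(combine_distinct_counts([(0, 100, 50), (5, 95, 60)]))              # 60

The point is that this kind of heuristic would live entirely in the engine; the format would only need to store the raw per-file counts alongside the existing lower/upper bounds.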
On Thu, Jul 1, 2021 at 8:29 PM Daniel Weeks <dwe...@apache.org> wrote:

> I would agree with including distinct counts.
>
> As you point out, there are a number of strategies that can be employed by the engine based on additional information. You pointed out the non-overlapping bounds, but similarly, if the bounds overlap almost entirely, you might be able to assume an even distribution and average them. If the delta between the lower and upper bounds overall is narrow, you might even be able to choose the max value (at least for whole numbers).
>
> Another alternative would be to use an approximate distinct count with some form of sketch/digest that would allow for better merging, but I feel the tradeoff in space/complexity may not net out to better overall outcomes.
>
> -Dan
>
> On Thu, Jul 1, 2021 at 5:58 PM Ryan Blue <b...@tabular.io> wrote:
>
>> Hi everyone,
>>
>> I'm working on finalizing the spec for v2 right now, and one thing that's outstanding is the map of file-level distinct counts.
>>
>> This field has some history. I added it in the original spec because I thought we'd want distinct value counts for cost-based optimization in SQL planners. But we later removed it because the counts aren't mergeable, making it hard to determine what to do with file-level distinct counts. In some cases you'd want to add them together (when sorted by the column), and in others you'd want to use the max across files. I thought that the idea of having counts was misguided, so we removed the column.
>>
>> I've recently talked with people working on SQL planners, and they suggested adding the column back and populating it because even distinct counts that are hard to work with are better than nothing.
>>
>> There may also be heuristics for working with the counts that make it possible to get decent estimates across files. For example, if the column bounds do not overlap between files (like 0-10, 11-20, 21-30), that is an indication that the column is sorted and the distinct counts should be added together.
>>
>> Thanks to Yan, we now have a metrics framework we could use to populate these, although it would take some work to find a good way to estimate the distinct counts. For v2, should we add the distinct counts map back to file metadata and populate it?
>>
>> Ryan
>>
>> --
>> Ryan Blue
>> Tabular

--
Ryan Blue
Tabular