I should be clearer about why I didn't suggest a sketch as an option
here. Sketches like HLL that can be merged are too large to go into data
file metadata. For sketches, I think partition-level stats and the
secondary index structure are the right place. The question I want
to think through here is whether it is worth adding back the column for
non-mergeable, simple distinct counts.
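
To make the "non-mergeable" point concrete, here is a toy illustration (plain
Java, not Iceberg code; the method name is made up). Given only two per-file
distinct counts, the combined NDV is only known to lie somewhere between the
larger count and their sum, whereas a mergeable sketch like HLL unions down to
a single estimate:

  // Toy illustration: all you can derive from two plain per-file NDVs.
  public final class PlainNdvMerge {
    // Returns [lower, upper] bounds on the distinct count of file A combined with file B.
    static long[] mergedNdvBounds(long ndvA, long ndvB) {
      long lower = Math.max(ndvA, ndvB); // best case: the smaller file's values all appear in the larger
      long upper = ndvA + ndvB;          // worst case: the two files share no values at all
      return new long[] {lower, upper};
    }

    public static void main(String[] args) {
      long[] bounds = mergedNdvBounds(1_000, 1_000);
      // Prints "1000 .. 2000": anywhere from identical value sets to disjoint ones.
      System.out.println(bounds[0] + " .. " + bounds[1]);
    }
  }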

On Thu, Jul 1, 2021 at 8:29 PM Daniel Weeks <dwe...@apache.org> wrote:

> I would agree with including distinct counts.
>
> As you point out there are a number of strategies that can be employed by
> the engine based on additional information.  You pointed out the
> non-overlapping bounds, but similarly if the bounds overlap almost
> entirely, you might be able to assume an even distribution and average
> them.  If the delta between the lower and upper bounds overall is narrow, you
> might even be able to choose the max value (at least for whole numbers).
>
> Another alternative would be to use an approx distinct with some form of
> sketch/digest that would allow for better merging, but I feel the tradeoff
> in space/complexity may not net out to better overall outcomes.
>
> -Dan
>
>
>
> On Thu, Jul 1, 2021 at 5:58 PM Ryan Blue <b...@tabular.io> wrote:
>
>> Hi everyone,
>>
>> I'm working on finalizing the spec for v2 right now and one thing that's
>> outstanding is the map of file-level distinct counts.
>>
>> This field has some history. I added it in the original spec because I
>> thought we'd want distinct value counts for cost-based optimization in SQL
>> planners. But we later removed it because the counts aren't mergeable,
>> making it hard to determine what to do with file-level distinct counts. In
>> some cases, you'd want to add them together (when sorted by the column) and
>> in others you'd want to use the max across files. I concluded that having the
>> counts at all was misguided.
>>
>> I've recently talked with people working on SQL planners and they
>> suggested adding the column back and populating it because even distinct
>> counts that are hard to work with are better than nothing.
>>
>> There may also be heuristics for working with the counts that make it
>> possible to get decent estimates across files. For example, if the column
>> bounds do not overlap between files (like 0-10, 11-20, 21-30), that is an
>> indication that the column is sorted and the distinct counts should be
>> added together.
>>
>> Thanks to Yan, we now have a metrics framework we could use to populate
>> these, although it would take some work to find a good way to estimate the
>> distinct counts. For v2, should we add the distinct counts map back to file
>> metadata and populate it?
>>
>> Ryan
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
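
To make the bounds-based heuristics discussed above a bit more concrete, here
is a rough sketch of how an engine might combine file-level distinct counts
using the existing lower/upper bound metrics (shown for a long-typed column).
This is purely illustrative: the class name, the disjoint-range check, and the
max fallback are choices I'm making up here, not proposed spec behavior.

  // Illustrative only: combine per-file distinct counts for one column.
  import java.util.Comparator;
  import java.util.List;

  public final class DistinctCountHeuristic {
    record FileStats(long lowerBound, long upperBound, long distinctCount) {}

    static long estimateDistinct(List<FileStats> files) {
      // If the value ranges are pairwise disjoint (e.g. 0-10, 11-20, 21-30), the
      // column is likely sorted or range-partitioned: summing the counts is safe.
      List<FileStats> sorted = files.stream()
          .sorted(Comparator.comparingLong(FileStats::lowerBound))
          .toList();
      boolean disjoint = true;
      for (int i = 1; i < sorted.size(); i++) {
        if (sorted.get(i).lowerBound() <= sorted.get(i - 1).upperBound()) {
          disjoint = false;
          break;
        }
      }
      if (disjoint) {
        return files.stream().mapToLong(FileStats::distinctCount).sum();
      }
      // Otherwise assume heavy overlap and fall back to the max per-file count; an
      // engine could instead average, or cap by the overall value range for whole numbers.
      return files.stream().mapToLong(FileStats::distinctCount).max().orElse(0L);
    }
  }

Even something that naive is probably better than a planner working with no
NDV information at all, which was the point the SQL planner folks made.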

-- 
Ryan Blue
Tabular
