I feel it’s better to ensure as much correctness in the statistics as
possible and then to let the engines make educated decisions about how they
want to work on that information.

I agree with this, but I’m wondering where the line is for “as much
correctness … as possible”.

It hadn’t occurred to me that someone might compact two files and also
merge the distinct counts. In that case, I completely agree that we should
require that the distinct count be based on the data and not on a
calculation from other distinct counts.

But, I think it is probably important that these can be set from sketches
that don’t require keeping a set of full values. Otherwise, it could be
prohibitively expensive to produce them.

It probably makes sense to be clear in the docs, like Jack suggests: this
is an estimate of the number of distinct values produced from the actual
data, not from merging estimates (but possibly from merging the underlying
sketches).
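
For example (illustration only; the spec wouldn't mandate a particular sketch,
and Apache DataSketches' HLL here is just one mergeable option), the difference
between merging the underlying sketches and merging the estimates looks like:

    import org.apache.datasketches.hll.HllSketch;
    import org.apache.datasketches.hll.Union;

    public class DistinctCountSketchExample {
      public static void main(String[] args) {
        // Per-file sketches would be built while writing the data files
        HllSketch file1 = new HllSketch(12);      // lgK = 12
        HllSketch file2 = new HllSketch(12);
        for (int i = 0; i < 1000; i++) {
          file1.update("key-" + i);               // keys 0..999
          file2.update("key-" + (i + 500));       // keys 500..1499, overlapping
        }

        // Compaction merges the sketches, then derives a new estimate from them
        Union union = new Union(12);
        union.update(file1);
        union.update(file2);
        double fromSketches = union.getResult().getEstimate();   // ~1500

        // Merging the estimates instead would double-count the overlap (~2000)
        double fromEstimates = file1.getEstimate() + file2.getEstimate();

        System.out.printf("from merged sketches: %.0f%n", fromSketches);
        System.out.printf("from summed estimates: %.0f%n", fromEstimates);
      }
    }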

On Fri, Jul 2, 2021 at 2:49 PM Jack Ye <yezhao...@gmail.com> wrote:

> Yes I think Dan has a good point here that I was trying to get to, the
> correctness aspect of it is the major reason that led me to consider the
> upper and lower bound approach, otherwise as Ryan described, the current
> count metrics could already be sufficient for planning purposes. With a
> bound, at least that bound can always be calculated correctly during any
> operation, whereas the distinct value count might drift if we use some
> heuristics. So if we decide to go with adding the count, it should be a
> nullable count such that if we cannot decide the true value it can be
> omitted. Or it could directly be defined as a distinct value estimate, but
> in that case I would prefer to have it as some sort of sketch implemented
> as a secondary index.
>
> -Jack Ye
>
> On Fri, Jul 2, 2021 at 9:01 AM Daniel Weeks <dwe...@apache.org> wrote:
>
>> Jack, that's the same thought I had initially but I think we can actually
>> break this down into two separate issues.
>>
>> One is on the scan side, which is how we merge the information that we
>> have, and I think what you're describing is something we can do even
>> without storing the lower and upper bounds.  The advantage is that on the
>> scan side you can actually use other information that you have in order to
>> make better decisions about how to do that merging. If all you have is
>> lower and upper bounds, you may actually lose a little bit of that fidelity
>> based on previous merges, compactions, etc.
>>
>> On the file side, I'm a little concerned about using statistics that can
>> drift over time. If we're merging files, stats can quickly become
>> non-representative of the actual data in the files themselves.  Beyond
>> merges, even compactions can impact the actual stats within the file, so in
>> many cases you would need to recalculate them anyway.
>>
>> I feel it's better to ensure as much correctness in the statistics as
>> possible and then to let the engines make educated decisions about how they
>> want to work on that information.
>>
>> I'd vote for having distinct counts in the stats but requiring that they
>> be accurate. I feel like it's better to require that they're dropped in the
>> event that they cannot be accurate.  This may cause some problems with
>> row-level deletes, though, so we have to be a little careful with the
>> implementation.
>>
>> -Dan
>>
>>
>>
>>
>> On Fri, Jul 2, 2021, 7:55 AM Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> What about instead of a distinct count, we introduce a min and max possible
>>> distinct count? In the best case, min and max are equal, we know exactly how
>>> many distinct values there are, and we can directly update the new distinct
>>> count. In the worst case, when merging 2 unsorted files, without any
>>> complicated estimation we know the max for the merged file is (max_file1 +
>>> max_file2) and the min is max(min_file1, min_file2). If files keep being
>>> merged without a sort order, this gap will continue to grow and provide less
>>> useful information for planning. But when we sort the rows in the files, we
>>> can set min and max back to the same value, closing the gap and improving
>>> planning.
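>>>
>>> As a rough sketch (a hypothetical record, not an Iceberg API; it only shows
>>> the arithmetic above):
>>>
>>>     // Combine the distinct-count bounds of two files being merged
>>>     record DistinctBounds(long min, long max) {
>>>       static DistinctBounds merge(DistinctBounds f1, DistinctBounds f2) {
>>>         // worst case: the two files share no values at all
>>>         long mergedMax = f1.max() + f2.max();
>>>         // best case: one file's values are entirely contained in the other's
>>>         long mergedMin = Math.max(f1.min(), f2.min());
>>>         return new DistinctBounds(mergedMin, mergedMax);
>>>       }
>>>     }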
>>>
>>> -Jack Ye
>>>
>>> On Thu, Jul 1, 2021 at 8:29 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>
>>>> I would agree with including distinct counts.
>>>>
>>>> As you point out, there are a number of strategies that can be employed
>>>> by the engine based on additional information.  You pointed out the
>>>> non-overlapping bounds case, but similarly, if the bounds overlap almost
>>>> entirely, you might be able to assume an even distribution and average
>>>> them.  If the delta between the lower and upper bounds overall is narrow,
>>>> you might even be able to choose the max value (at least for whole numbers).
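>>>>
>>>> Roughly, an engine-side heuristic could look like this (names are made up
>>>> for illustration; this is not an Iceberg or engine API):
>>>>
>>>>     class DistinctEstimator {
>>>>       // Combine two files' distinct counts using the column's value bounds
>>>>       static long estimate(long count1, long lower1, long upper1,
>>>>                            long count2, long lower2, long upper2) {
>>>>         boolean disjoint = upper1 < lower2 || upper2 < lower1;
>>>>         if (disjoint) {
>>>>           // non-overlapping bounds suggest a sort across files: add counts
>>>>           return count1 + count2;
>>>>         }
>>>>         // bounds overlap almost entirely: assume an even distribution
>>>>         return (count1 + count2) / 2;
>>>>       }
>>>>     }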
>>>>
>>>> Another alternative would be to use an approximate distinct count with some
>>>> form of sketch/digest that would allow for better merging, but I feel the
>>>> tradeoff in space/complexity may not net out to better overall outcomes.
>>>>
>>>> -Dan
>>>>
>>>>
>>>>
>>>> On Thu, Jul 1, 2021 at 5:58 PM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I'm working on finalizing the spec for v2 right now and one thing
>>>>> that's outstanding is the map of file-level distinct counts.
>>>>>
>>>>> This field has some history. I added it in the original spec because I
>>>>> thought we'd want distinct value counts for cost-based optimization in SQL
>>>>> planners. But we later removed it because the counts aren't mergeable,
>>>>> making it hard to determine what to do with file-level distinct counts. In
>>>>> some cases, you'd want to add them together (when sorted by the column) 
>>>>> and
>>>>> in others you'd want to use the max across files. I thought that the idea
>>>>> of having counts was misguided, so we removed the column.
>>>>>
>>>>> I've recently talked with people working on SQL planners and they
>>>>> suggested adding the column back and populating it because even distinct
>>>>> counts that are hard to work with are better than nothing.
>>>>>
>>>>> There may also be heuristics for working with the counts that make it
>>>>> possible to get decent estimates across files. For example, if the column
>>>>> bounds do not overlap between files (like 0-10, 11-20, 21-30), that is an
>>>>> indication that the column is sorted and the distinct counts should be
>>>>> added together.
>>>>>
>>>>> Thanks to Yan, we now have a metrics framework we could use to
>>>>> populate these, although it would take some work to find a good way to
>>>>> estimate the distinct counts. For v2, should we add the distinct counts 
>>>>> map
>>>>> back to file metadata and populate it?
>>>>>
>>>>> Ryan
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>

-- 
Ryan Blue
Tabular
