Hey Piotr,

There are a few proposals for secondary indexes floating around [1][2]. The current thinking is that this would be the best place for sketches to live.
Best,
Ryan

[1] https://docs.google.com/document/d/11o3T7XQVITY_5F9Vbri9lF9oJjDZKjHIso7K8tEaFfY/edit#heading=h.uqr5wcfm85p7
[2] https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit

On Fri, Jul 23, 2021 at 11:11 AM Piotr Findeisen <pi...@starburstdata.com> wrote:

> Hi,
>
> A file-level distinct count (a number) has limited applicability in Trino. It's useful for pointed queries, where we can prune all the other files away, but in other cases the Trino optimizer wouldn't be able to make educated use of it.
>
> Internally, Łukasz and I were talking about sketches like HLL as well, and I am happy to see them being mentioned here too. Do you have any design plans for that already? Did you consider making them part of file metadata?
>
> Of course, for this to be useful we would need a well-defined hash function (we already have one for bucketing purposes), as well as a portable representation that can be imported by a query engine.
>
> Best,
> PF
>
> On Sat, Jul 3, 2021 at 2:27 AM Ryan Blue <b...@tabular.io> wrote:
>
>> I feel it's better to ensure as much correctness in the statistics as possible and then to let the engines make educated decisions about how they want to work on that information.
>>
>> I agree with this, but I'm wondering where the line is for "as much correctness … as possible".
>>
>> It hadn't occurred to me that someone might compact two files and also merge the distinct counts. In that case, I completely agree that we should require that the distinct count be based on the data and not on a calculation from other distinct counts.
>>
>> But I think it is probably important that these can be set from sketches that don't require keeping a set of full values. Otherwise, it could be prohibitively expensive to produce them.
>>
>> It probably makes sense to be clear in the docs, like Jack suggests: this is an estimate of the number of distinct values produced from the actual data, not from merging estimates (but possibly from merging the underlying sketches).
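A minimal sketch of what that per-file sketch idea could look like, assuming the Apache DataSketches HLL library; the thread does not settle on a library, hash function, or serialization format, so all names here are illustrative. A writer builds a sketch while writing a data file, stores the compact bytes alongside the file metadata, and engines merge the sketches at plan time instead of merging plain counts.

import org.apache.datasketches.hll.HllSketch;
import org.apache.datasketches.hll.Union;

public class NdvSketchExample {
  // Build a per-file sketch while writing a data file; no raw values are retained.
  static byte[] sketchForFile(Iterable<String> columnValues) {
    HllSketch sketch = new HllSketch(12); // lgK = 12, roughly 1-2% relative error
    for (String value : columnValues) {
      sketch.update(value);
    }
    return sketch.toCompactByteArray(); // portable bytes that could be stored as file metadata
  }

  // Merge the per-file sketches at plan time; unlike plain counts, the merged estimate stays valid.
  static double mergedDistinctEstimate(Iterable<byte[]> serializedSketches) {
    Union union = new Union(12);
    for (byte[] bytes : serializedSketches) {
      union.update(HllSketch.heapify(bytes));
    }
    return union.getResult().getEstimate();
  }
}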
>> On Fri, Jul 2, 2021 at 2:49 PM Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> Yes, I think Dan has a good point here that I was trying to get to: the correctness aspect is the major reason that led me to consider the upper and lower bound approach; otherwise, as Ryan described, the current count metrics could already be sufficient for planning purposes. With a bound, at least that bound can always be calculated correctly during any operation, whereas the distinct value count might drift if we use some heuristics. So if we decide to go with adding the count, it should be a nullable count such that it can be omitted when we cannot determine the true value. Or it could directly be defined as a distinct value estimate, but in that case I would prefer to have it as some sort of sketch implemented as a secondary index.
>>>
>>> -Jack Ye
>>>
>>> On Fri, Jul 2, 2021 at 9:01 AM Daniel Weeks <dwe...@apache.org> wrote:
>>>
>>>> Jack, that's the same thought I had initially, but I think we can actually break this down into two separate issues.
>>>>
>>>> One is the scan side, which is how we merge the information that we have, and I think that what you're describing is something we can do even without storing the lower and upper bounds. The advantage is that on the scan side you can actually use other information that you have in order to make better decisions about how to do that merging. If all you have is lower and upper bounds, you may actually lose a little bit of that fidelity based on previous merges, compactions, etc.
>>>>
>>>> On the file side, I'm a little concerned about using statistics that can drift over time. If we're merging files, stats can quickly become non-representative of the actual data in the files themselves. Beyond merges, even compactions can impact the actual stats within the file, so in many cases you would need to recalculate them anyway.
>>>>
>>>> I feel it's better to ensure as much correctness in the statistics as possible and then to let the engines make educated decisions about how they want to work on that information.
>>>>
>>>> I'd vote for having distinct counts in the stats but requiring that they be accurate. I feel like it's better to require that they're dropped in the event that they cannot be accurate. This may cause some problems with row-level deletes, though, so we have to be a little careful with the implementation.
>>>>
>>>> -Dan
>>>>
>>>> On Fri, Jul 2, 2021, 7:55 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>>> What about introducing a min and max possible distinct count instead of a single distinct count? In the best case, min and max are equal, we know exactly how many distinct values there are, and we can directly update the new distinct count. In the worst case, when merging 2 unsorted files, without the need for complicated estimation we know the max for file1 and file2 is (max_file1 + max_file2), and the min is max(min_file1, min_file2). If files are merged without a sort order, this gap will continue to grow and provide less useful information for planning. But when we sort the rows in the files, we can update the min and max to the same value, closing the gap and improving planning.
>>>>>
>>>>> -Jack Ye
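A small worked example of the bound merge Jack describes above; the class and field names are hypothetical, only for illustration:

// Distinct-count bounds for one file. Merging two unsorted files keeps the bound correct
// without any estimation: the merged set has at least as many distinct values as the
// larger of the two minimums, and no more than the sum of the two maximums.
class DistinctBound {
  final long min; // at least this many distinct values
  final long max; // at most this many distinct values

  DistinctBound(long min, long max) {
    this.min = min;
    this.max = max;
  }

  static DistinctBound mergeUnsorted(DistinctBound file1, DistinctBound file2) {
    long mergedMin = Math.max(file1.min, file2.min); // min is max(min_file1, min_file2)
    long mergedMax = file1.max + file2.max;          // max is max_file1 + max_file2
    return new DistinctBound(mergedMin, mergedMax);
  }
}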
>>>>> On Thu, Jul 1, 2021 at 8:29 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>>>
>>>>>> I would agree with including distinct counts.
>>>>>>
>>>>>> As you point out, there are a number of strategies that can be employed by the engine based on additional information. You pointed out the non-overlapping bounds, but similarly, if the bounds overlap almost entirely, you might be able to assume an even distribution and average them. If the delta between the lower and upper bounds overall is narrow, you might even be able to choose the max value (at least for whole numbers).
>>>>>>
>>>>>> Another alternative would be to use an approximate distinct count with some form of sketch/digest that would allow for better merging, but I feel the tradeoff in space/complexity may not net out to better overall outcomes.
>>>>>>
>>>>>> -Dan
>>>>>>
>>>>>> On Thu, Jul 1, 2021 at 5:58 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I'm working on finalizing the spec for v2 right now, and one thing that's outstanding is the map of file-level distinct counts.
>>>>>>>
>>>>>>> This field has some history. I added it in the original spec because I thought we'd want distinct value counts for cost-based optimization in SQL planners. But we later removed it because the counts aren't mergeable, making it hard to determine what to do with file-level distinct counts. In some cases you'd want to add them together (when sorted by the column), and in others you'd want to use the max across files. I thought the idea of having counts was misguided, so we removed the column.
>>>>>>>
>>>>>>> I've recently talked with people working on SQL planners, and they suggested adding the column back and populating it, because even distinct counts that are hard to work with are better than nothing.
>>>>>>>
>>>>>>> There may also be heuristics for working with the counts that make it possible to get decent estimates across files. For example, if the column bounds do not overlap between files (like 0-10, 11-20, 21-30), that is an indication that the column is sorted and the distinct counts should be added together.
>>>>>>>
>>>>>>> Thanks to Yan, we now have a metrics framework we could use to populate these, although it would take some work to find a good way to estimate the distinct counts. For v2, should we add the distinct counts map back to file metadata and populate it?
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>
>> --
>> Ryan Blue
>> Tabular
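For reference, a rough sketch of the bound-overlap heuristic Ryan describes in the original message, with hypothetical stats fields: a planner could add per-file counts when the files' value ranges are disjoint (suggesting a sorted column) and fall back to a conservative estimate, such as the max across files or Dan's averaging idea, when they overlap.

import java.util.Comparator;
import java.util.List;

// Hypothetical per-file stats an engine would read from file metadata.
class FileStats {
  final long lowerBound;    // column lower bound
  final long upperBound;    // column upper bound
  final long distinctCount; // per-file distinct value count

  FileStats(long lowerBound, long upperBound, long distinctCount) {
    this.lowerBound = lowerBound;
    this.upperBound = upperBound;
    this.distinctCount = distinctCount;
  }
}

class DistinctCountEstimator {
  static long estimate(List<FileStats> files) {
    files.sort(Comparator.comparingLong((FileStats f) -> f.lowerBound));
    boolean disjoint = true;
    for (int i = 1; i < files.size(); i++) {
      if (files.get(i).lowerBound <= files.get(i - 1).upperBound) {
        disjoint = false; // value ranges overlap, so files may share distinct values
        break;
      }
    }
    if (disjoint) {
      // Non-overlapping bounds (e.g. 0-10, 11-20, 21-30) suggest the column is sorted
      // across files, so the per-file counts can simply be added.
      return files.stream().mapToLong(f -> f.distinctCount).sum();
    }
    // Overlapping bounds: fall back to a conservative estimate, e.g. the max across files.
    return files.stream().mapToLong(f -> f.distinctCount).max().orElse(0L);
  }
}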