Yeah, like Ryan said, we are currently thinking about storing secondary
indexes and sketches at the partition level. To do that, we're considering
a new partition-granularity metadata file that holds stats useful for job
planning, along with pointers to indexes and sketches.

As for the sketches you suggest, I was thinking about using Theta
sketches instead, because they support set intersection, which would be
helpful for joins.
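
For example, here's the kind of join-overlap estimate Theta sketches make
possible, using the Apache DataSketches Java library (the data here is made
up for illustration, and exact method names may differ between library
versions):

import org.apache.datasketches.theta.Intersection;
import org.apache.datasketches.theta.SetOperation;
import org.apache.datasketches.theta.Sketch;
import org.apache.datasketches.theta.UpdateSketch;

public class ThetaJoinOverlapExample {
  public static void main(String[] args) {
    // Pretend these are the join-key values observed while writing two files.
    UpdateSketch left = UpdateSketch.builder().build();
    UpdateSketch right = UpdateSketch.builder().build();
    for (long key = 0; key < 1_000_000; key++) {
      left.update(key);
    }
    for (long key = 500_000; key < 1_500_000; key++) {
      right.update(key);
    }

    // Plain HLL only supports union; Theta sketches also support
    // intersection, which estimates how many keys the two sides of a
    // join have in common.
    Intersection intersection = SetOperation.builder().buildIntersection();
    intersection.intersect(left);
    intersection.intersect(right);
    Sketch overlap = intersection.getResult();

    System.out.printf("estimated shared join keys: %.0f%n",
        overlap.getEstimate());
  }
}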

On Fri, Jul 23, 2021 at 2:35 AM Ryan Murray <rym...@gmail.com> wrote:

> Hey Piotr,
>
> There are a few proposals around secondary indexes floating around[1][2].
> The current thinking is that this would be the best place for sketches to
> live.
>
> Best,
> Ryan
>
> [1]
> https://docs.google.com/document/d/11o3T7XQVITY_5F9Vbri9lF9oJjDZKjHIso7K8tEaFfY/edit#heading=h.uqr5wcfm85p7
> [2]
> https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit
>
> On Fri, Jul 23, 2021 at 11:11 AM Piotr Findeisen <pi...@starburstdata.com>
> wrote:
>
>> Hi,
>>
>> A file-level distinct count (a single number) has limited applicability
>> in Trino. It's useful for point queries, where we can prune all the other
>> files away, but in other cases the Trino optimizer wouldn't be able to
>> make educated use of it.
>>
>> Internally, Łukasz and I were talking about sketches like HLL as well,
>> and I am happy to see them mentioned here too.
>> Do you have any design plans for that already?
>> Did you consider making them part of file metadata?
>>
>> Of course, for this to be useful, we would need a well-defined hash
>> function (we already have one for bucketing purposes), as well as a
>> portable representation that can be imported by a query engine.
>>
>> Best,
>> PF
>>
>>
>>
>>
>>
>> On Sat, Jul 3, 2021 at 2:27 AM Ryan Blue <b...@tabular.io> wrote:
>>
>>> I feel it’s better to ensure as much correctness in the statistics as
>>> possible and then to let the engines make educated decisions about how they
>>> want to work on that information.
>>>
>>> I agree with this, but I’m wondering where the line is for “as much
>>> correctness … as possible”.
>>>
>>> It hadn’t occurred to me that someone might compact two files and also
>>> merge the distinct counts. In that case, I completely agree that we
>>> should require that the distinct count be based on the data and not on
>>> a calculation from other distinct counts.
>>>
>>> But, I think it is probably important that these can be set from
>>> sketches that don’t require keeping a set of full values. Otherwise, it
>>> could be prohibitively expensive to produce them.
>>>
>>> It probably makes sense to be clear in the docs, like Jack suggests:
>>> this is an estimate of the number of distinct values produced from the
>>> actual data, not from merging estimates (but possibly from merging the
>>> underlying sketches).
>>>
>>> On Fri, Jul 2, 2021 at 2:49 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> Yes, I think Dan has a good point here that I was trying to get to: the
>>>> correctness aspect is the major reason that led me to consider the
>>>> upper and lower bound approach; otherwise, as Ryan described, the current
>>>> count metrics could already be sufficient for planning purposes. With a
>>>> bound, at least that bound can always be calculated correctly during any
>>>> operation, whereas the distinct value count might drift if we use some
>>>> heuristics. So if we decide to go with adding the count, it should be a
>>>> nullable count, so that if we cannot determine the true value it can be
>>>> omitted. Or it could directly be defined as a distinct value estimate,
>>>> but in that case I would prefer to have it as some sort of sketch
>>>> implemented as a secondary index.
>>>>
>>>> -Jack Ye
>>>>
>>>> On Fri, Jul 2, 2021 at 9:01 AM Daniel Weeks <dwe...@apache.org> wrote:
>>>>
>>>>> Jack, that's the same thought I had initially, but I think we can
>>>>> actually break this down into two separate issues.
>>>>>
>>>>> One is the scan side, which is how we merge the information that we
>>>>> have, and I think what you're describing is something that we can do
>>>>> even without storing the lower and upper bounds.  The advantage is that
>>>>> on the scan side you can actually use other information that you have
>>>>> in order to make better decisions about how to do that merging. If all
>>>>> you have is lower and upper bounds, you may actually lose a little bit
>>>>> of that fidelity based on previous merges, compactions, etc.
>>>>>
>>>>> On the file side, I'm a little concerned about using statistics that
>>>>> can drift over time. If we're merging files, stats can quickly become
>>>>> non-representative of the actual data in the files themselves.  Beyond
>>>>> merges, even compactions can impact the actual stats within the file,
>>>>> so in many cases you would need to recalculate them anyway.
>>>>>
>>>>> I feel it's better to ensure as much correctness in the statistics as
>>>>> possible and then to let the engines make educated decisions about how
>>>>> they want to work on that information.
>>>>>
>>>>> I'd vote for having distinct counts in the stats but requiring that
>>>>> they be accurate. I feel like it's better to require that they're
>>>>> dropped in the event that they cannot be accurate.  This may cause some
>>>>> problems with row-level deletes, though, so we have to be a little
>>>>> careful with the implementation.
>>>>>
>>>>> -Dan
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jul 2, 2021, 7:55 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>
>>>>>> What if, instead of a single distinct count, we introduce min and max
>>>>>> possible distinct counts? In the best case, min and max are equal, we
>>>>>> know exactly how many distinct values there are, and we can directly
>>>>>> update the new distinct count. In the worst case, when merging 2
>>>>>> unsorted files, without the need for complicated estimation, we know
>>>>>> the max for file1 and file2 combined is (max_file1 + max_file2), and
>>>>>> the min is max(min_file1, min_file2). If files are merged without a
>>>>>> sort order, this gap will continue to grow and provide less useful
>>>>>> information for planning. But when we sort the rows in the files, we
>>>>>> can update the min and max to the same value, closing the gap and
>>>>>> improving planning.
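>>>>>>
>>>>>> As a rough sketch of that merge rule (the class and field names here
>>>>>> are hypothetical, not part of any existing Iceberg metadata):
>>>>>>
>>>>>> // Hypothetical per-file bounds on the distinct count of one column.
>>>>>> record DistinctCountBounds(long min, long max) {
>>>>>>
>>>>>>   // Merging two unsorted files: the result has at least as many
>>>>>>   // distinct values as the larger of the two minimums (the files
>>>>>>   // could overlap completely) and at most the sum of the two
>>>>>>   // maximums (the files could be completely disjoint).
>>>>>>   DistinctCountBounds merge(DistinctCountBounds other) {
>>>>>>     return new DistinctCountBounds(
>>>>>>         Math.max(this.min, other.min),
>>>>>>         this.max + other.max);
>>>>>>   }
>>>>>>
>>>>>>   // After sorting and rewriting the merged file, a recount collapses
>>>>>>   // the bounds back to an exact value.
>>>>>>   static DistinctCountBounds exact(long distinctCount) {
>>>>>>     return new DistinctCountBounds(distinctCount, distinctCount);
>>>>>>   }
>>>>>> }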
>>>>>>
>>>>>> -Jack Ye
>>>>>>
>>>>>> On Thu, Jul 1, 2021 at 8:29 PM Daniel Weeks <dwe...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> I would agree with including distinct counts.
>>>>>>>
>>>>>>> As you point out, there are a number of strategies that can be
>>>>>>> employed by the engine based on additional information.  You pointed
>>>>>>> out the non-overlapping bounds, but similarly, if the bounds overlap
>>>>>>> almost entirely, you might be able to assume an even distribution and
>>>>>>> average them.  If the delta between lower and upper bounds overall is
>>>>>>> narrow, you might even be able to choose the max value (at least for
>>>>>>> whole numbers).
>>>>>>>
>>>>>>> Another alternative would be to use an approx distinct with some
>>>>>>> form of sketch/digest that would allow for better merging, but I feel 
>>>>>>> the
>>>>>>> tradeoff in space/complexity may not net out to better overall outcomes.
>>>>>>>
>>>>>>> -Dan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 1, 2021 at 5:58 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I'm working on finalizing the spec for v2 right now and one thing
>>>>>>>> that's outstanding is the map of file-level distinct counts.
>>>>>>>>
>>>>>>>> This field has some history. I added it in the original spec
>>>>>>>> because I thought we'd want distinct value counts for cost-based
>>>>>>>> optimization in SQL planners. But we later removed it because the 
>>>>>>>> counts
>>>>>>>> aren't mergeable, making it hard to determine what to do with 
>>>>>>>> file-level
>>>>>>>> distinct counts. In some cases, you'd want to add them together (when
>>>>>>>> sorted by the column) and in others you'd want to use the max across 
>>>>>>>> files.
>>>>>>>> I thought that the idea of having counts was misguided, so we removed 
>>>>>>>> the
>>>>>>>> column.
>>>>>>>>
>>>>>>>> I've recently talked with people working on SQL planners and they
>>>>>>>> suggested adding the column back and populating it because even 
>>>>>>>> distinct
>>>>>>>> counts that are hard to work with are better than nothing.
>>>>>>>>
>>>>>>>> There may also be heuristics for working with the counts that make
>>>>>>>> it possible to get decent estimates across files. For example, if the
>>>>>>>> column bounds do not overlap between files (like 0-10, 11-20, 21-30), 
>>>>>>>> that
>>>>>>>> is an indication that the column is sorted and the distinct counts 
>>>>>>>> should
>>>>>>>> be added together.
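>>>>>>>>
>>>>>>>> As a rough sketch of that heuristic (the stats class and fields below
>>>>>>>> are hypothetical stand-ins for the per-file column bounds and the
>>>>>>>> proposed distinct counts):
>>>>>>>>
>>>>>>>> import java.util.ArrayList;
>>>>>>>> import java.util.Comparator;
>>>>>>>> import java.util.List;
>>>>>>>>
>>>>>>>> class DistinctCountHeuristic {
>>>>>>>>   // Hypothetical per-file stats for one column.
>>>>>>>>   record FileColumnStats(long lowerBound, long upperBound, long distinctCount) {}
>>>>>>>>
>>>>>>>>   // If the files' value ranges are pairwise disjoint (like 0-10,
>>>>>>>>   // 11-20, 21-30), the column is likely sorted across files, so
>>>>>>>>   // summing the per-file distinct counts is a reasonable estimate.
>>>>>>>>   // Otherwise, fall back to the max as a conservative estimate.
>>>>>>>>   static long estimateDistinct(List<FileColumnStats> files) {
>>>>>>>>     List<FileColumnStats> sorted = new ArrayList<>(files);
>>>>>>>>     sorted.sort(Comparator.comparingLong(FileColumnStats::lowerBound));
>>>>>>>>     boolean disjoint = true;
>>>>>>>>     for (int i = 1; i < sorted.size(); i++) {
>>>>>>>>       if (sorted.get(i).lowerBound() <= sorted.get(i - 1).upperBound()) {
>>>>>>>>         disjoint = false;
>>>>>>>>         break;
>>>>>>>>       }
>>>>>>>>     }
>>>>>>>>     return disjoint
>>>>>>>>         ? sorted.stream().mapToLong(FileColumnStats::distinctCount).sum()
>>>>>>>>         : sorted.stream().mapToLong(FileColumnStats::distinctCount).max().orElse(0);
>>>>>>>>   }
>>>>>>>> }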
>>>>>>>>
>>>>>>>> Thanks to Yan, we now have a metrics framework we could use to
>>>>>>>> populate these, although it would take some work to find a good way to
>>>>>>>> estimate the distinct counts. For v2, should we add the distinct 
>>>>>>>> counts map
>>>>>>>> back to file metadata and populate it?
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>

-- 
Ryan Blue
Tabular
