The motivation is that some query engines want to at least estimate a min/max range for distinct value counts. Even if these bounds are imperfect, they are better than no information.
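To illustrate what such a range could look like, here is a minimal sketch of the bound-merging rule Jack describes further down the thread: when two files are combined without a sort, the merged upper bound is the sum of the per-file upper bounds, and the merged lower bound is the larger of the per-file lower bounds. The DistinctCountRange helper below is hypothetical, not an existing Iceberg class.

// Hypothetical helper, not an Iceberg API: tracks a [min, max] range for a file's distinct value count.
public final class DistinctCountRange {
  private final long min; // lowest possible number of distinct values
  private final long max; // highest possible number of distinct values

  public DistinctCountRange(long min, long max) {
    if (min < 0 || max < min) {
      throw new IllegalArgumentException("invalid bounds: " + min + ".." + max);
    }
    this.min = min;
    this.max = max;
  }

  // A file whose exact distinct count is known, e.g. computed while writing sorted, deduplicated data.
  public static DistinctCountRange exact(long distinctCount) {
    return new DistinctCountRange(distinctCount, distinctCount);
  }

  // Merging two files without sorting: at least the larger per-file count survives,
  // and at most no values are shared between the files, so the upper bounds add up.
  public DistinctCountRange mergeUnsorted(DistinctCountRange other) {
    return new DistinctCountRange(Math.max(this.min, other.min), this.max + other.max);
  }

  public long min() {
    return min;
  }

  public long max() {
    return max;
  }
}

For example, DistinctCountRange.exact(100).mergeUnsorted(DistinctCountRange.exact(40)) gives the range [100, 140]; rewriting the merged file with a sort and recomputing the count collapses the range back to a single value.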
On Fri, Jul 23, 2021 at 4:08 PM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:

> I am OK returning the metric back as long as it is based on writing data and is an approximation (to avoid too big a performance and space overhead on write).
>
> It seems the biggest problem is that a per-file metric is not useful unless we query a single file. That's why we should have an idea of how this per-file metric is going to be used by query engines.
>
> Piotr, what kind of information will be useful for the Trino optimizer? Is it per split or per partition?
>
> Also a question to Iceberg folks: do we expect to use the per-file metric to build partition-level numbers? Or will it be a separate process? If query engines would rather benefit from partition-level stats and file-level metrics will not help in building partition-level stats, I am not sure adding the per-file column back will give us much.
>
> - Anton
>
> On 23 Jul 2021, at 08:58, Ryan Blue <b...@tabular.io> wrote:
>
> Yeah, like Ryan said, we are currently thinking about storing secondary indexes and sketches at the partition level. To do that, we're considering a new partition-granularity metadata file that has stats that are useful for job planning and pointers to indexes and sketches.
>
> As for the sketches you suggest, I was thinking more about using Theta sketches instead because they support the set intersection that would be helpful for joins.
>
> On Fri, Jul 23, 2021 at 2:35 AM Ryan Murray <rym...@gmail.com> wrote:
>
>> Hey Piotr,
>>
>> There are a few proposals around secondary indexes floating around [1][2]. The current thinking is that this would be the best place for sketches to live.
>>
>> Best,
>> Ryan
>>
>> [1] https://docs.google.com/document/d/11o3T7XQVITY_5F9Vbri9lF9oJjDZKjHIso7K8tEaFfY/edit#heading=h.uqr5wcfm85p7
>> [2] https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit
>>
>> On Fri, Jul 23, 2021 at 11:11 AM Piotr Findeisen <pi...@starburstdata.com> wrote:
>>
>>> Hi,
>>>
>>> A file-level distinct count (a number) has limited applicability in Trino. It's useful for pointed queries, where we can prune all the other files away, but in other cases the Trino optimizer wouldn't be able to make educated use of it.
>>>
>>> Internally, Łukasz and I were talking about sketches like HLL as well, and I am happy to see them mentioned here too. Do you have any design plans for that already? Did you consider making them part of file metadata?
>>>
>>> Of course, for this to be useful, we would need a well-defined hash function (we already have one for bucketing purposes), as well as a portable representation that can be imported by a query engine.
>>>
>>> Best,
>>> PF
>>>
>>> On Sat, Jul 3, 2021 at 2:27 AM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> I feel it’s better to ensure as much correctness in the statistics as possible and then to let the engines make educated decisions about how they want to work on that information.
>>>>
>>>> I agree with this, but I’m wondering where the line is for “as much correctness … as possible”.
>>>>
>>>> It hadn’t occurred to me that someone might compact two files and also merge the distinct counts. In that case, I completely agree that we should require that the distinct count be based on the data and not on a calculation from other distinct counts.
>>>>
>>>> But I think it is probably important that these can be set from sketches that don’t require keeping a set of full values. Otherwise, it could be prohibitively expensive to produce them.
>>>>
>>>> It probably makes sense to be clear in the docs, like Jack suggests: this is an estimate of the number of distinct values produced from the actual data, not from merging estimates (but possibly from merging the underlying sketches).
>>>>
>>>> On Fri, Jul 2, 2021 at 2:49 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>>> Yes, I think Dan has a good point here that I was trying to get to: the correctness aspect is the major reason that led me to consider the upper- and lower-bound approach; otherwise, as Ryan described, the current count metrics could already be sufficient for planning purposes. With a bound, at least that bound can always be calculated correctly during any operation, whereas the distinct value count might drift if we use some heuristics. So if we decide to go with adding the count, it should be a nullable count, so that it can be omitted if we cannot determine the true value. Or it could be defined directly as a distinct value estimate, but in that case I would prefer to have it as some sort of sketch implemented as a secondary index.
>>>>>
>>>>> -Jack Ye
>>>>>
>>>>> On Fri, Jul 2, 2021 at 9:01 AM Daniel Weeks <dwe...@apache.org> wrote:
>>>>>
>>>>>> Jack, that's the same thought I had initially, but I think we can actually break this down into two separate issues.
>>>>>>
>>>>>> One is on the scan side: how do we merge the information that we have? I think what you're describing is something we can do even without storing the lower and upper bounds. The advantage is that on the scan side you can actually use other information that you have in order to make better decisions about how to do that merging. If all you have is lower and upper bounds, you may actually lose a little bit of that fidelity based on previous merges, compactions, etc.
>>>>>>
>>>>>> On the file side, I'm a little concerned about using statistics that can drift over time. If we're merging files, stats can quickly become non-representative of the actual data in the files themselves. Beyond merges, even compactions can impact the actual stats within the file, so in many cases you would need to recalculate them anyway.
>>>>>>
>>>>>> I feel it's better to ensure as much correctness in the statistics as possible and then to let the engines make educated decisions about how they want to work on that information.
>>>>>>
>>>>>> I'd vote for having distinct counts in the stats but requiring that they be accurate. I feel it's better to require that they be dropped in the event that they cannot be accurate. This may cause some problems with row-level deletes, though, so we have to be a little careful with the implementation.
>>>>>>
>>>>>> -Dan
>>>>>>
>>>>>> On Fri, Jul 2, 2021, 7:55 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>>> What about, instead of a distinct count, we introduce min and max possible distinct counts? In the best case, min and max are equal, we know exactly how many distinct values there are, and we can directly update the new distinct count.
>>>>>>> In the worst case, when merging two unsorted files, without the need for complicated estimation, we know the max for file1 and file2 combined is (max_file1 + max_file2) and the min is max(min_file1, min_file2). If files are merged without a sort order, this gap will continue to grow and provide less and less useful information to planning. But when we perform a sort for the rows in the files, we can update the min and max to the same value, reducing this gap and improving planning.
>>>>>>>
>>>>>>> -Jack Ye
>>>>>>>
>>>>>>> On Thu, Jul 1, 2021 at 8:29 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>>>>>
>>>>>>>> I would agree with including distinct counts.
>>>>>>>>
>>>>>>>> As you point out, there are a number of strategies that can be employed by the engine based on additional information. You pointed out the non-overlapping bounds, but similarly, if the bounds overlap almost entirely, you might be able to assume an even distribution and average them. If the deltas between lower and upper bounds are narrow overall, you might even be able to choose the max value (at least for whole numbers).
>>>>>>>>
>>>>>>>> Another alternative would be to use an approximate distinct count with some form of sketch/digest that would allow for better merging, but I feel the tradeoff in space/complexity may not net out to better overall outcomes.
>>>>>>>>
>>>>>>>> -Dan
>>>>>>>>
>>>>>>>> On Thu, Jul 1, 2021 at 5:58 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> I'm working on finalizing the spec for v2 right now, and one thing that's outstanding is the map of file-level distinct counts.
>>>>>>>>>
>>>>>>>>> This field has some history. I added it in the original spec because I thought we'd want distinct value counts for cost-based optimization in SQL planners. But we later removed it because the counts aren't mergeable, making it hard to determine what to do with file-level distinct counts. In some cases you'd want to add them together (when sorted by the column), and in others you'd want to use the max across files. I thought that the idea of having counts was misguided, so we removed the column.
>>>>>>>>>
>>>>>>>>> I've recently talked with people working on SQL planners, and they suggested adding the column back and populating it, because even distinct counts that are hard to work with are better than nothing.
>>>>>>>>>
>>>>>>>>> There may also be heuristics for working with the counts that make it possible to get decent estimates across files. For example, if the column bounds do not overlap between files (like 0-10, 11-20, 21-30), that is an indication that the column is sorted and the distinct counts should be added together.
>>>>>>>>>
>>>>>>>>> Thanks to Yan, we now have a metrics framework we could use to populate these, although it would take some work to find a good way to estimate the distinct counts. For v2, should we add the distinct counts map back to file metadata and populate it?
>>>>>>>>>
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Tabular
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>
> --
> Ryan Blue
> Tabular

--
Ryan Blue
Tabular
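For reference, the bounds-overlap heuristic discussed in the thread above (non-overlapping column bounds like 0-10, 11-20, 21-30 suggest the column is sorted across files, so per-file distinct counts can be summed; overlapping bounds call for a more conservative combination) could look roughly like the sketch below. The FileColumnStats record and the estimateDistinct helper are hypothetical and simplified to long-typed bounds; they are not part of Iceberg or any engine.

import java.util.List;

// Hypothetical per-file column stats, roughly what a planner could read from file metadata.
record FileColumnStats(long lowerBound, long upperBound, long distinctCount) {}

final class NdvEstimate {
  private NdvEstimate() {}

  // Rough cross-file estimate: if no two files' value ranges overlap, the column looks sorted
  // across files and the per-file counts can be summed exactly. Otherwise the true count lies
  // somewhere between the largest per-file count and the sum; the midpoint used here is an
  // arbitrary placeholder that an engine would refine with more information.
  static long estimateDistinct(List<FileColumnStats> files) {
    boolean overlapping = false;
    for (int i = 0; i < files.size() && !overlapping; i++) {
      for (int j = i + 1; j < files.size(); j++) {
        FileColumnStats a = files.get(i);
        FileColumnStats b = files.get(j);
        if (a.lowerBound() <= b.upperBound() && b.lowerBound() <= a.upperBound()) {
          overlapping = true;
          break;
        }
      }
    }

    long sum = files.stream().mapToLong(FileColumnStats::distinctCount).sum();
    if (!overlapping) {
      return sum; // disjoint value ranges: a distinct value cannot appear in more than one file
    }
    long largest = files.stream().mapToLong(FileColumnStats::distinctCount).max().orElse(0);
    return (largest + sum) / 2;
  }
}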