Hi Ryan, I agree on the NDV estimate (approximate number of distinct values) goal.
Anton, re your question about the Trino optimizer: I don't want to be overly focused on supporting the current state to the extent that we make future extensions impossible. On the other hand, it's worth knowing what information can be consumed easily by the query engine, and what would require architectural changes in the query engine (though, as I am speaking from the Trino perspective only, I cannot comment on others).

The Trino optimizer would want to know the NDV for a relation -- a table, or a subset of a table in the case of filtering. Having per-partition information would be useful, especially when it can be combined across multiple partitions (i.e. a sketch, not just a number). Having per-file information can be useful as well, unless we end up processing a lot of small pieces of information during planning. When a query is going to access a large number of files, having something pre-aggregated (e.g. per partition) may be the only way to have *something* in a reasonably short time during query planning. With a file-per-file proposal, my intuition is that it could be prohibitively expensive (IO time) to gather that information.

Regarding the use of Theta sketches for calculating NDV for joins -- that's definitely a very interesting idea. Naturally, it would require changes to the Trino core to consume this information, so I would want to address the more urgent needs first (reasonably estimated NDVs) and iterate from there.

Besides NDV, there are other kinds of information we may be interested in from the Trino perspective, even if they are not consumed at query planning time, for example bloom filters -- basically any kind of information that can be used for pruning files/splits. If we additionally have e.g. bloom filters at an aggregated level (e.g. per partition), they can be used for pruning at planning time too.

(Anton, I am aware this doesn't answer your question directly, nor simply. The simple, direct answer would be "it depends", though.)

Last but not least, I am concerned about reader/writer compatibility, especially about invalidation of this information. Consider a case where application A uses pre-aggregated indexes to prune data (e.g. a partition-level bloom filter). Another application, application B, is not aware of these indexes, so it doesn't update them when writing. Now, if application A still sees the pre-aggregated indexes and uses them for pruning, it can return incorrect results. You have undoubtedly already thought about this, so I am curious what the compatibility-ensuring protocol is going to be.

Best,
PF

On Sat, Jul 24, 2021 at 1:11 AM Ryan Blue <b...@tabular.io> wrote:

> The motivation is that some query engines want to at least estimate a min/max range for distinct value counts. Even if these are imperfect, at least it is better than no information.
>
> On Fri, Jul 23, 2021 at 4:08 PM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
>
>> I am OK returning the metric back as long as it is based on writing data and is an approximation (to avoid too big a performance and space overhead on write).
>>
>> It seems the biggest problem is that the metric per file is not useful unless we query a single file. That's why we should have an idea how this per-file metric is going to be used by query engines.
>>
>> Piotr, what kind of information will be useful for the Trino optimizer? Is it per split or per partition?
>>
>> Also a question to Iceberg folks, do we expect to use the per-file metric to build partition-level numbers? Or will it be a separate process?
>>
>> If query engines would rather benefit from partition-level stats, and file-level metrics will not help in building partition-level stats, I am not sure adding the per-file column back will give us much.
>>
>> - Anton
>>
>> On 23 Jul 2021, at 08:58, Ryan Blue <b...@tabular.io> wrote:
>>
>> Yeah, like Ryan said we are currently thinking about storing secondary indexes and sketches at the partition level. To do that, we're considering a new partition-granularity metadata file that has stats that are useful for job planning and pointers to indexes and sketches.
>>
>> As for the sketches you suggest, I was thinking more about using Theta sketches instead because they support set intersection, which would be helpful for joins.
>>
>> On Fri, Jul 23, 2021 at 2:35 AM Ryan Murray <rym...@gmail.com> wrote:
>>
>>> Hey Piotr,
>>>
>>> There are a few proposals around secondary indexes floating around[1][2]. The current thinking is that this would be the best place for sketches to live.
>>>
>>> Best,
>>> Ryan
>>>
>>> [1] https://docs.google.com/document/d/11o3T7XQVITY_5F9Vbri9lF9oJjDZKjHIso7K8tEaFfY/edit#heading=h.uqr5wcfm85p7
>>> [2] https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit
>>>
>>> On Fri, Jul 23, 2021 at 11:11 AM Piotr Findeisen <pi...@starburstdata.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> A file-level distinct count (a number) has limited applicability in Trino. It's useful for pointed queries, where we can prune all the other files away, but in other cases the Trino optimizer wouldn't be able to make educated use of it.
>>>>
>>>> Internally, Łukasz and I were talking about sketches like HLL as well, and I am happy to see them being mentioned here too. Do you have any design plans for that already? Did you consider making them part of the file metadata?
>>>>
>>>> Of course, for this to be useful we would need a well-defined hash function (we already have one for bucketing purposes), as well as a portable representation that can be imported by a query engine.
>>>>
>>>> Best,
>>>> PF
>>>>
>>>> On Sat, Jul 3, 2021 at 2:27 AM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> I feel it's better to ensure as much correctness in the statistics as possible and then to let the engines make educated decisions about how they want to work on that information.
>>>>>
>>>>> I agree with this, but I'm wondering where the line is for "as much correctness … as possible".
>>>>>
>>>>> It hadn't occurred to me that someone might compact two files and also merge the distinct counts. In that case, I completely agree that we should require that the distinct count be based on the data and not on a calculation from other distinct counts.
>>>>>
>>>>> But I think it is probably important that these can be set from sketches that don't require keeping a set of full values. Otherwise, it could be prohibitively expensive to produce them.
>>>>>
>>>>> It probably makes sense to be clear in the docs, like Jack suggests: this is an estimate of the number of distinct values produced from the actual data, not from merging estimates (but possibly from merging the underlying sketches).
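As an illustration of the Theta sketch idea discussed above (merging the underlying sketches rather than merging estimates, and intersecting sketches for joins), the following minimal example assumes the Apache DataSketches Java library (org.apache.datasketches:datasketches-java); the theta-package class and method names should be checked against the library version in use, and the file contents here are purely illustrative.

    import org.apache.datasketches.theta.Intersection;
    import org.apache.datasketches.theta.SetOperation;
    import org.apache.datasketches.theta.Union;
    import org.apache.datasketches.theta.UpdateSketch;

    public class ThetaNdvExample {
      public static void main(String[] args) {
        // Per-file sketches, updated while writing each data file.
        UpdateSketch file1 = UpdateSketch.builder().build();
        UpdateSketch file2 = UpdateSketch.builder().build();
        for (long key = 0; key < 1_000_000; key++) file1.update(key);
        for (long key = 500_000; key < 1_500_000; key++) file2.update(key);

        // Union merges per-file sketches into a partition- or table-level NDV estimate.
        Union union = SetOperation.builder().buildUnion();
        union.union(file1);
        union.union(file2);
        System.out.printf("merged NDV estimate: %.0f%n", union.getResult().getEstimate());

        // Intersection estimates how many keys the two sides share, which is the
        // property that makes Theta sketches interesting for join estimation.
        Intersection overlap = SetOperation.builder().buildIntersection();
        overlap.intersect(file1);
        overlap.intersect(file2);
        System.out.printf("join-key overlap estimate: %.0f%n", overlap.getResult().getEstimate());
      }
    }

Because the sketches are mergeable, the same per-file sketches can be rolled up to partition- or table-level NDV estimates without re-reading data, which is what makes them attractive compared to storing only a bare distinct count.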
>>>>>
>>>>> On Fri, Jul 2, 2021 at 2:49 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>
>>>>>> Yes, I think Dan has a good point here that I was trying to get to: the correctness aspect of it is the major reason that led me to consider the upper and lower bound approach; otherwise, as Ryan described, the current count metrics could already be sufficient for planning purposes. With a bound, at least that bound can always be calculated correctly during any operation, whereas the distinct value count might drift if we use some heuristics. So if we decide to go with adding the count, it should be a nullable count such that if we cannot determine the true value it can be omitted. Or it could directly be defined as a distinct value estimate, but in that case I would prefer to have it as some sort of sketch implemented as a secondary index.
>>>>>>
>>>>>> -Jack Ye
>>>>>>
>>>>>> On Fri, Jul 2, 2021 at 9:01 AM Daniel Weeks <dwe...@apache.org> wrote:
>>>>>>
>>>>>>> Jack, that's the same thought I had initially, but I think we can actually break this down into two separate issues.
>>>>>>>
>>>>>>> One is on the scan side, which is how we merge the information that we have, and I think what you're describing is something that we can do even without storing the lower and upper bounds. The advantage is that on the scan side you can actually use other information that you have in order to make better decisions about how to do that merging. If all you have is lower and upper bounds, you may actually lose a little bit of that fidelity based on previous merges, compactions, etc.
>>>>>>>
>>>>>>> On the file side, I'm a little concerned about using statistics that can drift over time. If we're merging files, stats can quickly become non-representative of the actual data in the files themselves. Beyond merges, even compactions can impact the actual stats within the file, so in many cases you would need to recalculate them anyway.
>>>>>>>
>>>>>>> I feel it's better to ensure as much correctness in the statistics as possible and then to let the engines make educated decisions about how they want to work on that information.
>>>>>>>
>>>>>>> I'd vote for having distinct counts in the stats but requiring that they be accurate. I feel like it's better to require that they're dropped in the event that they cannot be accurate. This may cause some problems with row-level deletes, though, so we have to be a little careful with the implementation.
>>>>>>>
>>>>>>> -Dan
>>>>>>>
>>>>>>> On Fri, Jul 2, 2021, 7:55 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>
>>>>>>>> What about, instead of a distinct count, we introduce min and max possible distinct counts? In the best-case scenario, min and max are equal: we know exactly how many distinct values there are and can directly update the new distinct count. In the worst case, when merging 2 unsorted files, without the need for complicated estimation, we can know that the max for file1 and file2 combined is (max_file1 + max_file2), and the min is max(min_file1, min_file2). If files are merged without a sort order, this gap will continue to grow and provide less useful information for planning. But when we sort the rows in the files, we can set the min and max back to the same value, reducing this gap and improving planning.
>>>>>>>>
>>>>>>>> -Jack Ye
>>>>>>>>
>>>>>>>> On Thu, Jul 1, 2021 at 8:29 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> I would agree with including distinct counts.
>>>>>>>>>
>>>>>>>>> As you point out, there are a number of strategies that can be employed by the engine based on additional information. You pointed out the non-overlapping bounds, but similarly, if the bounds overlap almost entirely, you might be able to assume an even distribution and average them. If the delta between lower and upper bounds overall is narrow, you might even be able to choose the max value (at least for whole numbers).
>>>>>>>>>
>>>>>>>>> Another alternative would be to use an approximate distinct count with some form of sketch/digest that would allow for better merging, but I feel the tradeoff in space/complexity may not net out to better overall outcomes.
>>>>>>>>>
>>>>>>>>> -Dan
>>>>>>>>>
>>>>>>>>> On Thu, Jul 1, 2021 at 5:58 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> I'm working on finalizing the spec for v2 right now, and one thing that's outstanding is the map of file-level distinct counts.
>>>>>>>>>>
>>>>>>>>>> This field has some history. I added it in the original spec because I thought we'd want distinct value counts for cost-based optimization in SQL planners. But we later removed it because the counts aren't mergeable, making it hard to determine what to do with file-level distinct counts. In some cases, you'd want to add them together (when sorted by the column) and in others you'd want to use the max across files. I thought that the idea of having counts was misguided, so we removed the column.
>>>>>>>>>>
>>>>>>>>>> I've recently talked with people working on SQL planners, and they suggested adding the column back and populating it because even distinct counts that are hard to work with are better than nothing.
>>>>>>>>>>
>>>>>>>>>> There may also be heuristics for working with the counts that make it possible to get decent estimates across files. For example, if the column bounds do not overlap between files (like 0-10, 11-20, 21-30), that is an indication that the column is sorted and the distinct counts should be added together.
>>>>>>>>>>
>>>>>>>>>> Thanks to Yan, we now have a metrics framework we could use to populate these, although it would take some work to find a good way to estimate the distinct counts. For v2, should we add the distinct counts map back to file metadata and populate it?
>>>>>>>>>>
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Tabular
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>
>> --
>> Ryan Blue
>> Tabular
>
> --
> Ryan Blue
> Tabular
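For reference, Jack's bound-merging rule from the quoted thread above can be written as a small helper. This is a minimal, self-contained sketch (the DistinctBound name, field layout, and values are illustrative, not part of any Iceberg metadata format): when two unsorted files are merged, the column has at least max(min_file1, min_file2) and at most (max_file1 + max_file2) distinct values.

    public class DistinctBoundExample {
      // Illustrative holder for a per-file [min, max] distinct-count bound.
      record DistinctBound(long min, long max) {}

      // Merging two unsorted files: merged NDV >= the larger lower bound, and
      // merged NDV <= the sum of the upper bounds (every value could be unique to its file).
      static DistinctBound merge(DistinctBound a, DistinctBound b) {
        return new DistinctBound(Math.max(a.min(), b.min()), a.max() + b.max());
      }

      public static void main(String[] args) {
        DistinctBound file1 = new DistinctBound(1_000, 1_000); // exact, e.g. right after a sorted rewrite
        DistinctBound file2 = new DistinctBound(800, 800);
        DistinctBound merged = merge(file1, file2);
        // Prints [1000, 1800]: the growing gap Jack describes for unsorted merges.
        System.out.println("merged NDV bound: [" + merged.min() + ", " + merged.max() + "]");
      }
    }

A sorted rewrite (or an exact recount) collapses the bound back to a single value, which is the gap reduction Jack mentions.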