Hi,
Ryan, I agree on the NDV (approximate number of distinct values) estimate
goal.
Anton, re your question about the Trino optimizer.
I don't want to be overly focused on supporting the current state to the
extent that we make future extensions impossible. On the other hand, it's
worth knowing what
The motivation is that some query engines want to at least estimate a
min/max range for distinct value counts. Even if these bounds are imperfect,
they are better than no information.
On Fri, Jul 23, 2021 at 4:08 PM Anton Okolnychyi wrote:
I am OK returning the metric back as long as it is based on the data being
written and is an approximation (to avoid too much performance and space
overhead on write). The biggest problem seems to be that a per-file metric is
not useful unless we query a single file. That’s why we should have an idea how th
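(To make the merge problem concrete, with made-up numbers: if file A reports
ndv = 100 and file B reports ndv = 100, the true NDV of a scan over both can
be anywhere from 100, if the two value sets fully overlap, to 200, if they
are disjoint. A single per-file number therefore only ever yields a range
once files are combined, never an exact count.)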
Yeah, like Ryan said, we are currently thinking about storing secondary
indexes and sketches at the partition level. To do that, we're considering a
new partition-granularity metadata file that holds stats useful for job
planning, plus pointers to indexes and sketches.
As for the sketches you
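(To make that idea concrete, here is one hypothetical shape for an entry in
such a partition-level stats file; every name below is invented for
illustration and comes from no Iceberg proposal or spec:)

    import java.util.Map;

    // Hypothetical partition-level stats entry; field names are invented
    // for illustration only, not taken from any Iceberg proposal or spec.
    record PartitionStats(
        String partition,                  // partition tuple, e.g. "date=2021-07-23"
        long rowCount,                     // total rows across the partition's files
        Map<Integer, Long> ndvPerColumn,   // column field id -> approximate NDV
        Map<Integer, String> sketchPaths   // column field id -> path to a serialized sketch
    ) {}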
Hey Piotr,
There are a few secondary index proposals floating around [1][2].
The current thinking is that this would be the best place for sketches to
live.
Best,
Ryan
[1] https://docs.google.com/document/d/11o3T7XQVITY_5F9Vbri9lF9oJjDZKjHIso7K8tEaFfY/edit#heading=h.uqr5wcfm85p7
[2] http
Hi,
A file-level distinct count (a single number) has limited applicability in
Trino. It's useful for point queries, where we can prune all the other files
away, but in other cases the Trino optimizer wouldn't be able to make
educated use of it.
Internally, Łukasz and I were talking about sketches
I feel it’s better to ensure as much correctness in the statistics as
possible and then let the engines make educated decisions about how they
want to work with that information.
I agree with this, but I’m wondering where the line is for “as much
correctness … as possible”.
It hadn’t occurred to
Yes, I think Dan has a good point here that I was trying to get to: the
correctness aspect is the major reason that led me to consider the upper and
lower bound approach. Otherwise, as Ryan described, the current count metrics
could already be sufficient for planning purposes. With a bound, at
Jack, that's the same thought I had initially, but I think we can actually
break this down into two separate issues.
One is on the scan side: how do we merge the information that we have? I
think what you're describing is something we can do even without storing the
lower and
Jack, I don't think it is very valuable to keep a range of distinct values
for a single file. The count that we store will be close enough for planning
purposes. The ranges you suggest for estimating the min and max distinct
counts are exactly what I'm getting at: we can estimate the range based o
I should be clearer about why I didn't suggest a sketch as an option here.
Sketches like HLL that can be merged are too large to go into data file
metadata. For sketches, I think partition-level stats and the secondary index
structure are the right place. The question I want to think th
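(For a sense of scale, a minimal example with Apache DataSketches HLL,
assuming the datasketches-java API, shows both the mergeability and the size
problem: the serialized sketch is on the order of kilobytes, versus the 8
bytes a plain count costs per column per file:)

    import org.apache.datasketches.hll.HllSketch;
    import org.apache.datasketches.hll.Union;

    public class HllMergeExample {
      public static void main(String[] args) {
        // Two per-file sketches over overlapping value sets: file A holds
        // v-0..v-99999 and file B holds v-50000..v-149999, so the true
        // union NDV is 150,000.
        HllSketch fileA = new HllSketch(12);  // lgK=12, roughly 1.6% std error
        HllSketch fileB = new HllSketch(12);
        for (int i = 0; i < 100_000; i++) fileA.update("v-" + i);
        for (int i = 50_000; i < 150_000; i++) fileB.update("v-" + i);

        // Unlike plain counts, sketches merge losslessly.
        Union union = new Union(12);
        union.update(fileA);
        union.update(fileB);
        HllSketch merged = union.getResult();

        System.out.printf("estimate=%.0f bounds=[%.0f, %.0f] bytes=%d%n",
            merged.getEstimate(),
            merged.getLowerBound(2), merged.getUpperBound(2),
            merged.toCompactByteArray().length);
      }
    }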
What about, instead of a distinct count, we introduce a min and max possible
distinct count? In the best case, min and max are equal, we know exactly how
many distinct values there are, and we can directly update the new distinct
count. In the worst case, when merging 2 unsorted files, without
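(A rough sketch of how such bounds could merge, with invented names, not a
spec proposal: the merged minimum is at least the larger of the two minimums,
the merged maximum is at most the sum of the maximums, and when the files'
column value ranges are disjoint the minimums add as well:)

    // Hypothetical illustration of merging per-file [min, max] bounds on
    // the distinct count; names and shapes are invented for this example.
    record NdvRange(long min, long max) {}

    final class NdvRangeMerge {
      // With no other information: the union has at least as many distinct
      // values as the larger file claims, and at most the sum of both maxes.
      static NdvRange merge(NdvRange a, NdvRange b) {
        return new NdvRange(Math.max(a.min(), b.min()), a.max() + b.max());
      }

      // If the files' column value ranges don't overlap, no value can appear
      // in both files, so the minimums add too and the bounds stay tight.
      static NdvRange mergeDisjoint(NdvRange a, NdvRange b) {
        return new NdvRange(a.min() + b.min(), a.max() + b.max());
      }

      public static void main(String[] args) {
        NdvRange a = new NdvRange(100, 100);
        NdvRange b = new NdvRange(100, 100);
        System.out.println(merge(a, b));         // NdvRange[min=100, max=200]
        System.out.println(mergeDisjoint(a, b)); // NdvRange[min=200, max=200]
      }
    }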
I would agree with including distinct counts.
As you point out, there are a number of strategies the engine can employ
based on additional information. You pointed out the non-overlapping bounds
case, but similarly, if the bounds overlap almost entirely, you might be able
to assume an even