What if, instead of a single distinct count, we introduce a min and max
possible distinct count? In the best case, min and max are equal, we know
exactly how many distinct values there are, and we can directly compute the
new distinct count. In the worst case, when merging two unsorted files
without …
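
To make the merge rule concrete, here's a minimal sketch of the idea; the
DistinctRange class and merge method are illustrative names, not anything
in Iceberg:

    // Illustrative only, not an Iceberg API: a per-file range of possible
    // distinct counts and how two ranges combine when files are merged.
    class DistinctRange {
      final long min; // lowest possible distinct count
      final long max; // highest possible distinct count

      DistinctRange(long min, long max) {
        this.min = min;
        this.max = max;
      }

      // The merged file has at least as many distinct values as the larger
      // of the two minimums (one file's values are all still present), and
      // at most the sum of the maximums (if the files share no values).
      static DistinctRange merge(DistinctRange a, DistinctRange b) {
        return new DistinctRange(Math.max(a.min, b.min), a.max + b.max);
      }
    }
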
I should be clearer about why I didn't suggest a sketch as an option
here. Sketches like HLL that can be merged are too large to go into data
file metadata. For sketches, I think partition-level stats and a
secondary index structure are the right places. The question I want
to think through …
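
For a sense of scale, here is a rough illustration using the Apache
DataSketches HLL implementation (the library choice is mine, purely for
illustration): the merge is clean, but even a modest sketch serializes to
kilobytes per column per file.

    import org.apache.datasketches.hll.HllSketch;
    import org.apache.datasketches.hll.Union;

    // Two per-file sketches with overlapping value ranges.
    HllSketch a = new HllSketch(12); // lgK = 12
    HllSketch b = new HllSketch(12);
    for (long i = 0; i < 100_000; i++) a.update(i);
    for (long i = 50_000; i < 150_000; i++) b.update(i);

    // Merging is lossless with respect to the sketch's error bounds.
    Union union = new Union(12);
    union.update(a);
    union.update(b);
    HllSketch merged = union.getResult();

    merged.getEstimate();               // ~150,000 distinct values
    merged.toCompactByteArray().length; // on the order of kilobytes
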
Jack, I don't think it is very valuable to keep a range of distinct
values for a single file. The count that we store will be close enough for
planning purposes. The ranges you suggest for estimating the min and max
distinct counts are exactly what I'm getting at: we can estimate the range
based on …
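
As one illustration of estimating a bound (the choice of inputs and the
helper name here are my assumptions): metrics that files already carry,
like row count, null count, and value lower/upper bounds, already cap the
possible distinct count.

    // Hypothetical helper: an integer column cannot have more distinct
    // values than it has non-null rows, nor more than the width of its
    // observed value range.
    static long maxDistinct(long rowCount, long nullCount,
                            long lowerBound, long upperBound) {
      return Math.min(rowCount - nullCount, upperBound - lowerBound + 1);
    }
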
Jack, that's the same thought I had initially, but I think we can actually
break this down into two separate issues.
One is on the scan side, which is how we merge the information that we
have, and I think what you're describing is something we can do
even without storing the lower and …
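
As a sketch of that scan-side merge using nothing but today's per-file
distinct counts (FileStats and ndv are made-up names): the merged value is
bounded below by the largest per-file count and above by the sum of all of
them.

    import java.util.List;

    record FileStats(long ndv) {}

    // Derive a [lower, upper] bound on the combined distinct count at
    // scan time, with no stored ranges at all.
    static long[] mergedNdvBounds(List<FileStats> files) {
      long lower = files.stream().mapToLong(FileStats::ndv).max().orElse(0L);
      long upper = files.stream().mapToLong(FileStats::ndv).sum();
      return new long[] {lower, upper};
    }
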
Yes, I think Dan has a good point here that I was trying to get to: the
correctness aspect is the major reason that led me to consider the
upper- and lower-bound approach; otherwise, as Ryan described, the current
count metrics could already be sufficient for planning purposes. With a
bound, at …
Hey Sreeram,
I feel like some of your points about why there are row groups are valid,
but there are some really good reasons why you might want to have multiple
row groups in a file (and I can share some situations where that can be
valuable).
If you think historically about how distributed processing …
Dan's email gives a lot of great background on why row groups exist and how
they are still useful. I'd add that there are a few considerations for
choosing a row group size that can affect how many row groups to target in
a file. The first consideration is what Dan pointed out: larger …
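
For reference, the row-group target is already tunable per table in
Iceberg via the write.parquet.row-group-size-bytes table property; the
surrounding calls below are the standard Table API, and the 128 MB value
is just an example.

    // Larger row groups (here 128 MB) mean fewer row groups per file.
    table.updateProperties()
        .set("write.parquet.row-group-size-bytes",
             String.valueOf(128L * 1024 * 1024))
        .commit();
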
> I feel it’s better to ensure as much correctness in the statistics as
> possible and then to let the engines make educated decisions about how
> they want to work with that information.

I agree with this, but I’m wondering where the line is for “as much
correctness … as possible”.
It hadn’t occurred to …