Re: [DISCUSS] suggest using granularityNumber in ColumnStats

Becket Qin Sat, 04 Jun 2022 01:54:06 -0700

Hi Jing,

Hmm, granularity and NDV still don't seem to mean the same thing to me.
Granularity basically means how detailed the data is, in other words,
whether a field / column can be further divided. For example, a field like
"age" cannot be further divided, so it is quite granular. In contrast, an
"address" field can be further divided into "street", "city", "country",
etc. Therefore "address" is less granular. NDV, on the other hand,
means how many distinct values there are in the field / column,
which is orthogonal to the granularity.
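
To make the difference concrete, here is a minimal Java sketch. The class
name NdvSketch, the method numDistinctValues, and the sample values are made
up for illustration and are not part of Flink. It only shows that NDV is
computed from the values alone, whether or not the field could be split
further:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

/** Hypothetical helper, not part of Flink: NDV depends on values, not on field structure. */
final class NdvSketch {

    /** NDV: how many distinct values the column holds. */
    static long numDistinctValues(List<String> column) {
        return new HashSet<>(column).size();
    }

    public static void main(String[] args) {
        // An atomic "age" column: it cannot be divided further.
        List<String> age = Arrays.asList("31", "31", "45", "27");
        // A composite "address" column: it could be divided into street / city / country.
        List<String> address = Arrays.asList(
                "1 Main St, Berlin, DE", "1 Main St, Berlin, DE", "9 Elm Rd, Paris, FR");

        System.out.println("age ndv = " + numDistinctValues(age));
        System.out.println("address ndv = " + numDistinctValues(address));
    }
}
```

With this sample data the atomic "age" column has 3 distinct values and the
composite "address" column has 2, so how divisible a field is says nothing
about its NDV.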


Anyways, it looks like most people think NDV or its full phrase is a better
name. It probably makes sense to just use either of them.
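
As a rough sketch of what keeping the short name together with proper
documentation might look like (ColumnStatsSketch is a hypothetical stand-in,
not the actual Flink class, and treating null as "unknown" is an assumption
based on the Long type used in the proposal):

```java
/** Hypothetical stand-in for a column statistics holder; not the actual Flink class. */
final class ColumnStatsSketch {

    /** Number of distinct values (NDV) in the column; null is assumed to mean "unknown". */
    private final Long ndv;

    ColumnStatsSketch(Long ndv) {
        this.ndv = ndv;
    }

    /**
     * Returns the number of distinct values (NDV) of the column, or null if
     * the statistic is unknown (an assumption of this sketch).
     */
    Long getNdv() {
        return ndv;
    }
}
```

The full-name variant would only rename the field and getter; the Javadoc
would stay the same.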

Thanks,

Jiangjie (Becket) Qin


On Fri, Jun 3, 2022 at 9:45 PM Jark Wu <[email protected]> wrote:

> Hi Jing,
>
> I agree with you that "NDV is more SQL-oriented(implementation)
> and granularity is more data analytics-oriented". As you said,
> "granularity" may be commonly used in data modeling and business contexts.
> However, TableStats is not used for data modeling but is an implementation
> detail for SQL optimization. NDV is the terminology in the optimizer field,
> and Calcite also uses this word[1]. I haven't noticed any vendors
> using "granularity" for this purpose. If I missed any, please correct me.
>
> If NDV sounds like a function to you, I'm OK to use "numDistinctVals" as
> Calcite does.
>
> Best,
> Jark
>
>
> [1]: https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/RelMdUtil.html#numDistinctVals(java.lang.Double,java.lang.Double)
>
> On Fri, 3 Jun 2022 at 00:14, Jing Ge <[email protected]> wrote:
>
> > Thanks all for your feedback! It is very informative.
> >
> > to Becket:
> >
> > At the beginning, I chose the same word because we used it in daily work.
> > Before I started this discussion, to make sure it is the right one, I did
> > some checking and it turns out that *cardinality* has a very different
> > (also very common) meaning within data modeling[1]. And on the other side
> > *granularity* is actually the right word for the meaning when we use
> > cardinality in the context of NDV[2].
> >
> > to Jark, Jingsong,
> >
> > NDV seems to me more like a function than a field defined in a class.
> > Briefly speaking, NDV is more SQL-oriented(implementation) and
> > *granularity* is more data analytics-oriented(abstraction/concept)[3][4].
> >
> > Best regards,
> > Jing
> >
> > [1] https://en.wikipedia.org/wiki/Cardinality_(data_modeling)
> > [2] https://www.talon.one/glossary/granularity
> > [3] https://www.quora.com/What-is-granularity-in-database
> > [4] https://www.statisticshowto.com/data-granularity/
> >
> > On Thu, Jun 2, 2022 at 11:16 AM Jingsong Li <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > +1 for NDV (number of distinct values); it is a widely used term in
> > > table statistics.
> > >
> > > I've also seen it called `distinctCount`.
> > >
> > > This name can be found in databases like Oracle too. [1]
> > >
> > > So it is not good to change it to a completely different name.
> > >
> > > [1] https://docs.oracle.com/database/121/TGSQL/glossary.htm#GUID-34DC46FD-32CE-4242-8ED9-945AE7A9F922
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Thu, Jun 2, 2022 at 4:46 PM Jark Wu <[email protected]> wrote:
> > >
> > > > Hi Jing,
> > > >
> > > > I can see there might be developers who don't understand the meaning
> > > > at first glance.
> > > > However, NDV is a widely used term in table statistics, see
> > > > [1][2][3].
> > > > If we use another name, it may confuse developers who are familiar
> > > > with stats and optimization.
> > > > I think at least the Javadoc is needed to explain the meaning and
> > > > full name.
> > > > If we want to change the name, we can use the full name
> > > > "numberOfDistinctValues()".
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > > [1]: https://www.alibabacloud.com/help/en/maxcompute/latest/collect-information-for-the-optimizer-of-maxcompute
> > > > [2]: https://docs.dremio.com/software/sql-reference/sql-functions/functions/ndv/
> > > > [3]: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md
> > > >
> > > > On Thu, 2 Jun 2022 at 14:44, Becket Qin <[email protected]>
> wrote:
> > > >
> > > > > Hi Jing,
> > > > >
> > > > > While I do agree that NDV is a little confusing at first sight, it
> > > > > seems quite concise once I got the meaning. So personally I am OK with
> > > > > keeping it as is, but proper documentation would be helpful. If we
> > > > > really want to replace it with a more professional name, *cardinality*
> > > > > might be a good alternative.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jiangjie (Becket) Qin
> > > > >
> > > > > On Thu, Jun 2, 2022 at 12:51 AM Jing Ge <[email protected]>
> wrote:
> > > > >
> > > > > > Hi Dev,
> > > > > >
> > > > > > I am not really sure if it is feasible to start this discussion.
> > > > > > According to the contribution guidelines, the dev ML is the right
> > > > > > place to reach consensus.
> > > > > >
> > > > > > In ColumnStats, ndv, which stands for "number of distinct values", is
> > > > > > currently used. First of all, it is difficult to understand the
> > > > > > meaning from the abbreviation. Second, it might be good to use a more
> > > > > > professional name instead.
> > > > > >
> > > > > > Suggestion:
> > > > > >
> > > > > > replace ndv with granularityNumber:
> > > > > >
> > > > > > The good news, afaik, is that the method getNdv() hasn't been used
> > > > > > within Flink, which means the renaming will have very limited impact.
> > > > > >
> > > > > > ColumnStats {
> > > > > >
> > > > > >     /** number of distinct values. */
> > > > > >     @Deprecated
> > > > > >     private final Long ndv;
> > > > > >
> > > > > >     /**
> > > > > >      * Granularity refers to the level of details used to sort and
> > > > > >      * separate data at column level. Highly granular data is
> > > > > >      * categorized or separated very precisely. For example, the
> > > > > >      * granularity number of gender columns should normally be 2. The
> > > > > >      * granularity number of the month column will be 12. In the SQL
> > > > > >      * world, it means the number of distinct values.
> > > > > >      */
> > > > > >     private final Long granularityNumber;
> > > > > >
> > > > > >     @Deprecated
> > > > > >     public Long getNdv() {
> > > > > >         return ndv;
> > > > > >     }
> > > > > >
> > > > > >     public Long getGranularityNumber() {
> > > > > >         return granularityNumber;
> > > > > >     }
> > > > > > }
> > > > > >
> > > > > > Best regards,
> > > > > > --
> > > > > >
> > > > > > Jing
> > > > > >
> > > > >
> > > >
> > >
> >
>
