[
https://issues.apache.org/jira/browse/SPARK-18000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zhenhua Wang updated SPARK-18000:
---------------------------------
Description:
For a column, we will generate an equi-width or equi-height histogram, depending
on whether its number of distinct values (ndv) is larger than the maximum number
of bins allowed in one histogram (denoted as numBins).
The agg function for a column returns the bins of an equi-width histogram, as
(distinct value, frequency) pairs, when the number of distinct values is less
than or equal to numBins. Otherwise, 1) for a column of string type, it returns
an empty map; 2) for a column of numeric type (including DateType and
TimestampType), it returns the endpoints of an equi-height histogram, i.e. the
approximate percentiles at percentages 0.0, 1/numBins, 2/numBins, ...,
(numBins-1)/numBins, 1.0.
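
The branching described above can be sketched as follows. This is only an
illustration under stated assumptions, not the Spark implementation: the names
histogram_agg and percentages are hypothetical, the whole column is held in
memory, only plain Python numbers are treated as numeric (dates and timestamps
are not handled), and exact nearest-rank percentiles stand in for the
approximate percentiles the issue refers to.

```python
import math
from collections import Counter
from numbers import Number


def percentages(num_bins):
    """Return 0.0, 1/num_bins, ..., (num_bins-1)/num_bins, 1.0."""
    return [i / num_bins for i in range(num_bins + 1)]


def histogram_agg(values, num_bins):
    """Hypothetical stand-in for the aggregate described in this issue."""
    counts = Counter(values)
    if len(counts) <= num_bins:
        # Few distinct values: return (distinct value, frequency) pairs.
        return dict(counts)
    if not all(isinstance(v, Number) for v in counts):
        # Too many distinct non-numeric (e.g. string) values: empty map.
        return {}
    # Too many distinct numeric values: equi-height endpoints.
    # Assumption: exact nearest-rank percentiles replace the approximate
    # percentile computation mentioned in the description.
    ordered = sorted(values)
    n = len(ordered)
    endpoints = []
    for p in percentages(num_bins):
        rank = max(math.ceil(p * n), 1)  # 1-based nearest rank
        endpoints.append(ordered[rank - 1])
    return endpoints
```

With numBins = 4, a numeric column with more than four distinct values yields
five endpoints, one per percentage in 0.0, 0.25, 0.5, 0.75, 1.0.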
was:
For a column of numeric type (including date and timestamp), we will generate an
equi-width or equi-height histogram, depending on whether its number of distinct
values (ndv) is larger than the maximum number of bins allowed in one histogram
(denoted as numBins).
This agg function computes values and their frequencies using a small hashmap
whose size is less than or equal to "numBins", and returns an equi-width
histogram.
When the size of the hashmap exceeds "numBins", it clears the hashmap and uses
ApproximatePercentile to return the endpoints of an equi-height histogram.
> Aggregation function for computing endpoints for histograms
> -----------------------------------------------------------
>
> Key: SPARK-18000
> URL: https://issues.apache.org/jira/browse/SPARK-18000
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Zhenhua Wang