[ 
https://issues.apache.org/jira/browse/FLINK-3664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212508#comment-15212508
 ] 

Fabian Hueske commented on FLINK-3664:
--------------------------------------

The number of distinct values can be approximated with the HyperLogLog 
algorithm. I think a first version is fine without distinct value counts, 
though. The approach sketched in the design doc looks quite extensible, so 
distinct counts and possibly other metrics can be added later.
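
For illustration, here is a minimal sketch of how HyperLogLog approximates a 
distinct count in a single pass. It is not taken from the design doc or from 
Flink: the class name SimpleHyperLogLog, its methods, and the choice of hash 
(FNV-1a followed by a 64-bit mixing step) are assumptions made for the 
example. The merge method hints at how per-partition sketches could be 
combined.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative HyperLogLog sketch; hypothetical class, not part of Flink. */
final class SimpleHyperLogLog {
    private final int p;            // number of hash bits used to pick a register
    private final int m;            // number of registers (2^p)
    private final byte[] registers; // per-register maximum rank seen so far

    SimpleHyperLogLog(int p) {
        if (p < 4 || p > 16) {
            throw new IllegalArgumentException("p must be between 4 and 16");
        }
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
    }

    /** Records one value; adding a duplicate never changes the sketch. */
    void add(String value) {
        long h = hash64(value);
        int index = (int) (h >>> (64 - p)); // top p bits select the register
        long rest = h << p;                 // remaining bits, left-aligned
        // Rank = position of the first 1 bit in the remaining bits.
        int rank = (rest == 0) ? (64 - p + 1) : Long.numberOfLeadingZeros(rest) + 1;
        if (rank > registers[index]) {
            registers[index] = (byte) rank;
        }
    }

    /** Merges another sketch built with the same p (element-wise maximum). */
    void merge(SimpleHyperLogLog other) {
        if (other.p != p) {
            throw new IllegalArgumentException("sketches must use the same p");
        }
        for (int i = 0; i < m; i++) {
            if (other.registers[i] > registers[i]) {
                registers[i] = other.registers[i];
            }
        }
    }

    /** Returns the approximate number of distinct values added so far. */
    long estimate() {
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha;
        if (m == 16) {
            alpha = 0.673;
        } else if (m == 32) {
            alpha = 0.697;
        } else if (m == 64) {
            alpha = 0.709;
        } else {
            alpha = 0.7213 / (1.0 + 1.079 / m);
        }
        double raw = alpha * m * (double) m / sum;
        if (raw <= 2.5 * m && zeros > 0) {
            // Small-range correction: fall back to linear counting.
            raw = m * Math.log((double) m / zeros);
        }
        return Math.round(raw);
    }

    /** FNV-1a over the UTF-8 bytes, then a 64-bit finalizer to spread the bits. */
    private static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }
}
```

With p = 12 the sketch keeps 4096 one-byte registers, and the expected 
relative error is roughly 1.04 / sqrt(4096), or about 1.6%. A column summary 
could call add(...) once per value and estimate() at the end.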


> Create a method to easily Summarize a DataSet
> ---------------------------------------------
>
>                 Key: FLINK-3664
>                 URL: https://issues.apache.org/jira/browse/FLINK-3664
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Todd Lisonbee
>         Attachments: DataSet-Summary-Design-March2016-v1.txt
>
>
> Here is an example:
> {code}
> /**
>  * Summarize a DataSet of Tuples by collecting single-pass statistics
>  * for all columns.
>  */
> public Tuple summarize()
>
> // Example usage:
> DataSet<Tuple3<Double, String, Boolean>> input = // [...]
> Tuple3<DoubleColumnSummary, StringColumnSummary, BooleanColumnSummary> summary =
>     input.summarize();
> summary.getField(0).stddev();
> summary.getField(1).maxStringLength();
> {code}
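
As a rough illustration of the "single pass statistics" mentioned in the 
quoted description, the sketch below accumulates count, mean, min, max, and 
standard deviation for one numeric column using Welford's method, with a 
merge step for combining partial results. The class RunningDoubleSummary and 
its method names are hypothetical and are not the DoubleColumnSummary from 
the proposal.

```java
/** Hypothetical single-pass accumulator for one numeric column. */
final class RunningDoubleSummary {
    private long count;
    private double mean;
    private double m2; // sum of squared deviations from the running mean
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    /** Folds one value into the summary (Welford's update). */
    void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
        min = Math.min(min, x);
        max = Math.max(max, x);
    }

    /** Combines a summary computed on another partition into this one. */
    void merge(RunningDoubleSummary other) {
        long n = count + other.count;
        if (n == 0) {
            return; // both sides are empty
        }
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * ((double) count * other.count / n);
        mean += delta * other.count / n;
        count = n;
        min = Math.min(min, other.min);
        max = Math.max(max, other.max);
    }

    long count() { return count; }
    double mean() { return count == 0 ? Double.NaN : mean; }
    double min() { return min; }
    double max() { return max; }

    /** Sample standard deviation; NaN for fewer than two values. */
    double stddev() {
        return count < 2 ? Double.NaN : Math.sqrt(m2 / (count - 1));
    }
}
```

Because merge only needs the count, mean, and squared-deviation sum of each 
side, partial summaries can be computed independently and combined 
afterwards without revisiting the data.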



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
