[ https://issues.apache.org/jira/browse/FLINK-3664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212508#comment-15212508 ]
Fabian Hueske commented on FLINK-3664:
--------------------------------------

The number of distinct values can be approximated with the HyperLogLog algorithm.
I think a first version is fine without distinct value counts, though.
The approach sketched in the design doc looks quite extensible, so distinct counts and possibly other metrics can be added later.

> Create a method to easily Summarize a DataSet
> ---------------------------------------------
>
>                 Key: FLINK-3664
>                 URL: https://issues.apache.org/jira/browse/FLINK-3664
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Todd Lisonbee
>         Attachments: DataSet-Summary-Design-March2016-v1.txt
>
>
> Here is an example:
> {code}
> /**
>  * Summarize a DataSet of Tuples by collecting single pass statistics for all columns
>  */
> public Tuple summarize()
>
> DataSet<Tuple3<Double, String, Boolean>> input = // [...]
> Tuple3<DoubleColumnSummary, StringColumnSummary, BooleanColumnSummary> summary = input.summarize();
> summary.getField(0).stddev();
> summary.getField(1).maxStringLength();
> {code}
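
To make the HyperLogLog suggestion in the comment concrete, here is a minimal standalone Java sketch of the estimator. It is not Flink code and not taken from the design doc: the class name HyperLogLogSketch, the precision parameter p, the merge() method, and the SplitMix64-style mixing of hashCode() are all assumptions chosen for illustration. It only shows why a distinct-count metric would fit a single-pass summary: each value updates one small register, and partial sketches can be combined by taking the per-register maximum.

```java
/**
 * Minimal HyperLogLog sketch for approximate distinct counting.
 * Hypothetical illustration only: not a Flink class and not the design doc's API.
 */
final class HyperLogLogSketch {
    private final int p;            // number of hash bits used as the register index
    private final int m;            // number of registers, 2^p
    private final byte[] registers; // largest rank observed per register

    public HyperLogLogSketch(int p) {
        // The alpha constant used in estimate() is only valid for m >= 128.
        if (p < 7 || p > 16) {
            throw new IllegalArgumentException("p must be in [7, 16]");
        }
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
    }

    /** SplitMix64 finalizer; spreads 32-bit hashCode() values over 64 bits (assumption). */
    private static long mix64(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    /** Records one value; duplicates never change the registers. */
    public void add(Object value) {
        long h = mix64(value.hashCode() + 0x9e3779b97f4a7c15L);
        int idx = (int) (h >>> (64 - p));   // top p bits select the register
        long rest = h << p;                 // remaining 64 - p bits
        // Rank = position of the first 1-bit in the remaining bits (1-based).
        int rank = (rest == 0) ? (64 - p + 1) : Long.numberOfLeadingZeros(rest) + 1;
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    /** Combines a partial sketch, e.g. one built on another partition. */
    public void merge(HyperLogLogSketch other) {
        if (other.p != p) {
            throw new IllegalArgumentException("sketches must use the same p");
        }
        for (int i = 0; i < m; i++) {
            if (other.registers[i] > registers[i]) {
                registers[i] = other.registers[i];
            }
        }
    }

    /** Returns the approximate number of distinct values added so far. */
    public double estimate() {
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m);
        double raw = alpha * m * m / sum;
        // Small-range correction: fall back to linear counting.
        if (raw <= 2.5 * m && zeros > 0) {
            return m * Math.log((double) m / zeros);
        }
        return raw;
    }

    public static void main(String[] args) {
        HyperLogLogSketch sketch = new HyperLogLogSketch(12);
        for (int i = 0; i < 100_000; i++) {
            sketch.add(i);
            sketch.add(i); // duplicate, does not affect the estimate
        }
        System.out.println("approximate distinct count: " + sketch.estimate());
    }
}
```

With p = 12 the sketch keeps 4096 one-byte registers and the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%. Because merge() is just a per-register maximum, such a sketch could in principle be aggregated the same way as the other column statistics, which is why it seems reasonable to defer it to a later version.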