I wrote another design for a summarize() function on DataSet. https://issues.apache.org/jira/browse/FLINK-3664
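As a rough illustration of the kind of single-pass statistics a summarize() function could collect for one Double column, here is a minimal sketch. It is an assumption for illustration only: the class DoubleSummarySketch and its methods are hypothetical, not part of Flink or of the FLINK-3664 design, and it uses Welford's online algorithm to get the standard deviation without a second pass over the data.

```java
// Hypothetical sketch: not the FLINK-3664 design and not Flink API.
// Accumulates count, min, max, mean and standard deviation for one
// column of doubles in a single pass (Welford's online algorithm).
public class DoubleSummarySketch {
    private long count = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // running sum of squared deviations from the mean
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    // Fold one value into the running statistics.
    public void add(double value) {
        count++;
        double delta = value - mean;
        mean += delta / count;
        m2 += delta * (value - mean);
        min = Math.min(min, value);
        max = Math.max(max, value);
    }

    public long count() { return count; }
    public double mean() { return mean; }
    public double min() { return min; }
    public double max() { return max; }

    // Sample standard deviation; NaN when fewer than two values were seen.
    public double stddev() {
        return count > 1 ? Math.sqrt(m2 / (count - 1)) : Double.NaN;
    }

    public static void main(String[] args) {
        DoubleSummarySketch s = new DoubleSummarySketch();
        for (double v : new double[] {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0}) {
            s.add(v);
        }
        System.out.println(s.count() + " values, mean=" + s.mean()
                + ", stddev=" + s.stddev());
    }
}
```

A distributed version would also need a way to merge two partial summaries; that step is left out of this sketch.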
I think this would be a better place for me to start than working on generic Aggregations: I could move ahead with it immediately, and there are no tricky decisions if people more or less like the design.

Any support for a summarize() function?

    // Summarize a DataSet of Tuples by collecting single-pass statistics for all columns
    // Example usage:
    DataSet<Tuple3<Double, String, Boolean>> input = // [...]
    Tuple3<DoubleColumnSummary, StringColumnSummary, BooleanColumnSummary> summary = input.summarize();
    summary.getField(0).stddev();
    summary.getField(1).maxStringLength();

Thanks.

-----Original Message-----
From: Lisonbee, Todd [mailto:todd.lison...@intel.com]
Sent: Wednesday, March 23, 2016 9:46 AM
To: dev@flink.apache.org
Subject: Aggregation Design Questions

Hello,

I'm working on adding Standard Deviation and others to the list of Aggregations:
https://issues.apache.org/jira/browse/FLINK-3613

Unfortunately, I didn't get very far because the general design of Aggregation on DataSets needs to change, and each solution seems to have drawbacks. For example, one easy solution would be to modify AggregateOperator to extend CustomUnaryOperation, but that seems weird because then it wouldn't be an Operator.

I wrote a design explaining some of the current limitations and background:
https://issues.apache.org/jira/secure/attachment/12794820/DataSet-Aggregation-Design-March2016-v1.txt

The design is in progress. I wanted to check in with people before going much further. I'd appreciate any feedback.

Thanks,
Todd