I wrote another design for a summarize() function on DataSet.
https://issues.apache.org/jira/browse/FLINK-3664
I think this would be a better place for me to start than working on generic
Aggregations. If people more or less like the design, I could move ahead with
it immediately, since it involves no tricky decisions.
Any support for a summarize() function?
// Summarize a DataSet of Tuples by collecting single-pass statistics
// for all columns.
// Example usage:
DataSet<Tuple3<Double, String, Boolean>> input = // [...]
Tuple3<DoubleColumnSummary, StringColumnSummary, BooleanColumnSummary>
    summary = input.summarize();
summary.getField(0).stddev();
summary.getField(1).maxStringLength();
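To make "single pass" concrete, here is a rough sketch of how a numeric column summary could accumulate its statistics without a second scan over the data. The class name RunningDoubleSummary and its methods are hypothetical, not the proposed DoubleColumnSummary; it uses Welford's update for the variance and assumes the sample (n - 1) definition of standard deviation, which the design may or may not choose.

```java
// Hypothetical illustration only: not part of Flink and not the proposed
// DoubleColumnSummary. Shows one way to collect single-pass statistics.
final class RunningDoubleSummary {
    private long count = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // running sum of squared deviations from the mean
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    // Fold one value into the summary (Welford's online update).
    void add(double value) {
        count++;
        double delta = value - mean;
        mean += delta / count;
        m2 += delta * (value - mean);
        min = Math.min(min, value);
        max = Math.max(max, value);
    }

    // Combine with a partial summary, e.g. one computed on another partition.
    void merge(RunningDoubleSummary other) {
        if (other.count == 0) {
            return;
        }
        long total = count + other.count;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * ((double) count * other.count / total);
        mean += delta * other.count / total;
        count = total;
        min = Math.min(min, other.min);
        max = Math.max(max, other.max);
    }

    long count() { return count; }
    double mean() { return mean; }
    double min() { return min; }
    double max() { return max; }

    // Sample variance (divides by n - 1); NaN when fewer than two values.
    double variance() { return count > 1 ? m2 / (count - 1) : Double.NaN; }

    double stddev() { return Math.sqrt(variance()); }
}
```

The merge() method is there because partial results from different partitions would need to be combined into one summary per column; add() alone only covers a single sequential scan.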
Thanks.
-----Original Message-----
From: Lisonbee, Todd [mailto:[email protected]]
Sent: Wednesday, March 23, 2016 9:46 AM
To: [email protected]
Subject: Aggregation Design Questions
Hello,
I'm working on adding Standard Deviation and others to the list of Aggregations,
https://issues.apache.org/jira/browse/FLINK-3613
Unfortunately, I didn't get very far because the general design of Aggregation
on DataSets needs to change and each solution seems to have drawbacks. For
example, one easy solution would be to modify AggregateOperator to extend
CustomUnaryOperation but that seems weird because then it wouldn't be an
Operator.
I wrote a design explaining some of the current limitations and background,
https://issues.apache.org/jira/secure/attachment/12794820/DataSet-Aggregation-Design-March2016-v1.txt
The design is in progress. I wanted to check in with people before going much
further.
I'd appreciate any feedback.
Thanks,
Todd