I wrote another design for a summarize() function on DataSet.
https://issues.apache.org/jira/browse/FLINK-3664

I think this would be a better place for me to start than working on generic 
Aggregations.  I could move ahead with it immediately, and there are no tricky 
decisions to make if people more or less like the design.

Any support for a summarize() function?

        // Summarize a DataSet of Tuples by collecting single-pass statistics for all columns
        // example usage:

        DataSet<Tuple3<Double, String, Boolean>> input = // [...]
        Tuple3<DoubleColumnSummary, StringColumnSummary, BooleanColumnSummary> summary = input.summarize();
        summary.getField(0).stddev();
        summary.getField(1).maxStringLength();
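
To make "single pass statistics" a bit more concrete, here is a minimal sketch of how a numeric column could be accumulated in one pass and combined across partitions. It uses Welford's online update for mean and variance plus the usual pairwise merge formula. The class name DoubleColumnSummarySketch and all of its methods are hypothetical, made up for illustration; this is not the design in the JIRA and not existing Flink API.

```java
/**
 * Hypothetical single-pass accumulator for one numeric column.
 * Illustration only; the name and methods are assumptions, not Flink API.
 */
final class DoubleColumnSummarySketch {
    private long count;
    private double mean;
    private double m2; // running sum of squared deviations from the mean
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    /** Folds one value into the summary (Welford's online update). */
    void add(double value) {
        count++;
        double delta = value - mean;
        mean += delta / count;
        m2 += delta * (value - mean);
        min = Math.min(min, value);
        max = Math.max(max, value);
    }

    /** Merges a summary computed on another partition into this one. */
    void merge(DoubleColumnSummarySketch other) {
        if (other.count == 0) {
            return;
        }
        long total = count + other.count;
        double delta = other.mean - mean;
        // Pairwise combination of the two partial sums of squares.
        m2 += other.m2 + delta * delta * ((double) count * other.count / total);
        mean += delta * other.count / total;
        count = total;
        min = Math.min(min, other.min);
        max = Math.max(max, other.max);
    }

    long count() { return count; }
    double mean() { return count > 0 ? mean : Double.NaN; }
    double min() { return min; }
    double max() { return max; }

    /** Sample variance (n - 1 in the denominator); NaN for fewer than two values. */
    double variance() { return count > 1 ? m2 / (count - 1) : Double.NaN; }

    /** Sample standard deviation. */
    double stddev() { return Math.sqrt(variance()); }
}
```

In a sketch like this, each parallel task would call add() per record and the partial results would be reduced with merge(), so the data is only read once. Whether to report sample or population variance is a choice the real design would have to make.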

Thanks.


-----Original Message-----
From: Lisonbee, Todd [mailto:todd.lison...@intel.com] 
Sent: Wednesday, March 23, 2016 9:46 AM
To: dev@flink.apache.org
Subject: Aggregation Design Questions

Hello,

I'm working on adding Standard Deviation and others to the list of Aggregations,
https://issues.apache.org/jira/browse/FLINK-3613

Unfortunately, I didn't get very far because the general design of Aggregation 
on DataSets needs to change and each solution seems to have drawbacks.  For 
example, one easy solution would be to modify AggregateOperator to extend 
CustomUnaryOperation but that seems weird because then it wouldn't be an 
Operator.

I wrote a design explaining some of the current limitations and background, 
https://issues.apache.org/jira/secure/attachment/12794820/DataSet-Aggregation-Design-March2016-v1.txt

The design is in progress.  I wanted to check in with people before going much 
further.

I'd appreciate any feedback.

Thanks,

Todd
