[ https://issues.apache.org/jira/browse/FLINK-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15062207#comment-15062207 ]
ASF GitHub Bot commented on FLINK-2716:
---------------------------------------

Github user StephanEwen commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1462#discussion_r47920383

    --- Diff: flink-java/src/main/java/org/apache/flink/api/java/DataSet.java ---
    @@ -394,6 +396,21 @@ public long count() throws Exception {
             return res.<Long> getAccumulatorResult(id);
         }

    +    /**
    +     * Convenience method to get the count (number of elements) of a DataSet
    +     * as well as the checksum (sum over element hashes).
    +     *
    +     * @return A Checksum that represents the count and checksum of elements in the data set.
    +     */
    +    public Checksum checksum() throws Exception {
    +        final String id = new AbstractID().toString();
    +
    +        flatMap(new Utils.ChecksumHelper<T>(id)).name("checksum()")
    +                .output(new DiscardingOutputFormat<NullValue>()).name("checksum() sink");
    --- End diff --

    This trick to have a `flatMap()` function and then a discarding sink comes from earlier versions of Flink, where Sinks could not use Accumulators. Since RichSinkFunctions can use accumulators, it would be good to actually have the Checksum helper as a sink.


> Checksum method for DataSet and Graph
> -------------------------------------
>
>                 Key: FLINK-2716
>                 URL: https://issues.apache.org/jira/browse/FLINK-2716
>             Project: Flink
>          Issue Type: Improvement
>          Components: Gelly, Java API, Scala API
>    Affects Versions: 0.10.0
>            Reporter: Greg Hogan
>            Assignee: Greg Hogan
>            Priority: Minor
>
> {{DataSet.count()}}, {{Graph.numberOfVertices()}}, and
> {{Graph.numberOfEdges()}} provide measures of the number of distributed data
> elements. New {{DataSet.checksum()}} and {{Graph.checksum()}} methods will
> summarize the content of data elements and support algorithm validation,
> integration testing, and benchmarking.
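
For readers following the thread, the "count plus sum over element hashes" idea from the Javadoc can be sketched in plain Java without any Flink dependency. This is only an illustration: the class `ChecksumSketch`, its nested `Checksum` holder, and the methods `of` and `merge` are hypothetical names, not the classes in the pull request, and treating each hash as an unsigned 32-bit value is an assumption of this sketch.

```java
import java.util.Objects;

// Hypothetical, Flink-free illustration; not the code from the pull request.
final class ChecksumSketch {

    /** Hypothetical result holder: element count plus sum of element hashes. */
    static final class Checksum {
        final long count;
        final long checksum;

        Checksum(long count, long checksum) {
            this.count = count;
            this.checksum = checksum;
        }
    }

    /** Computes count and checksum for one partition of elements. */
    static <T> Checksum of(Iterable<T> elements) {
        long count = 0L;
        long sum = 0L;
        for (T element : elements) {
            count++;
            // Assumption: interpret the int hash as unsigned before summing.
            sum += Objects.hashCode(element) & 0xffffffffL;
        }
        return new Checksum(count, sum);
    }

    /** Combines two partial results, as an accumulator merge would. */
    static Checksum merge(Checksum a, Checksum b) {
        return new Checksum(a.count + b.count, a.checksum + b.checksum);
    }
}
```

Because addition is commutative and associative, merging per-partition results gives the same value regardless of how the elements are distributed or ordered, which is what makes such a summary usable for validating distributed results.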