Hi all,

we have a DataSet pipeline which reads CSV input data and then essentially does a combinable GroupReduce via first(n).
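For reference, here is roughly what the pipeline looks like in the Java DataSet API. This is a simplified sketch: the input path, the Tuple2<String, Long> schema, and n are placeholders; sortGroup needs an explicit Order in the Java API; and the pre-partitioned variant I describe below uses partitionByHash(0), which is what I mean by partitionBy(0) in the text.

import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class FirstNPipeline {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder path and schema: grouping key in field 0, payload in field 1.
        DataSet<Tuple2<String, Long>> input = env
                .readCsvFile("hdfs:///path/to/input.csv")
                .types(String.class, Long.class);

        int n = 10;  // placeholder for the actual n

        // First iteration: combinable GroupReduce via first(n).
        DataSet<Tuple2<String, Long>> topN = input
                .groupBy(0)
                .sortGroup(0, Order.ASCENDING)  // sortGroup(0); the Java API needs an explicit Order
                .first(n);

        // Second iteration: pre-partition on the key before grouping.
        DataSet<Tuple2<String, Long>> topNPrePartitioned = input
                .partitionByHash(0)
                .groupBy(0)
                .sortGroup(0, Order.ASCENDING)
                .first(n);

        topN.print();
        topNPrePartitioned.print();
    }
}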
In our first iteration (readCsvFile -> groupBy(0) -> sortGroup(0) -> first(n)), we got a job graph like this:

source --[Forward]--> combine --[Hash Partition on 0, Sort]--> reduce

This works, but we found the combine phase to be inefficient because not enough combinable elements fit into a sorter. My idea was to pre-partition the DataSet to increase the chance that combinable elements end up in the same sorter (readCsvFile -> partitionBy(0) -> groupBy(0) -> sortGroup(0) -> first(n)).

To my surprise, this changed the job graph to

source --[Hash Partition on 0]--> partition(noop) --[Forward]--> combine --[Hash Partition on 0, Sort]--> reduce

which materializes and spills the entire partitions at the partition(noop) operator!

Is there any way I can partition the data on the way from source to combine without spilling? That is, can I get a job graph that looks like

source --[Hash Partition on 0]--> combine --[Hash Partition on 0, Sort]--> reduce

instead?

Thanks,
Urs

--
Urs Schönenberger - urs.schoenenber...@tngtech.com
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Managing Directors (Geschäftsführer): Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Registered office (Sitz): Unterföhring * Amtsgericht München * HRB 135082