Hi all,

we have a DataSet pipeline which reads CSV input data and then essentially does a combinable GroupReduce via first(n).
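For reference, here is roughly what the pipeline looks like in the Java DataSet API. This is a simplified sketch: the input path, the Tuple2<String, Long> schema, and n are placeholders; sortGroup needs an explicit Order in the Java API; and the pre-partitioned variant I describe below uses partitionByHash(0), which is what I mean by partitionBy(0) in the text.

import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class FirstNPipeline {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder path and schema: grouping key in field 0, payload in field 1.
        DataSet<Tuple2<String, Long>> input = env
                .readCsvFile("hdfs:///path/to/input.csv")
                .types(String.class, Long.class);

        int n = 10;  // placeholder for the actual n

        // First iteration: combinable GroupReduce via first(n).
        DataSet<Tuple2<String, Long>> topN = input
                .groupBy(0)
                .sortGroup(0, Order.ASCENDING)  // sortGroup(0); the Java API needs an explicit Order
                .first(n);

        // Second iteration: pre-partition on the key before grouping.
        DataSet<Tuple2<String, Long>> topNPrePartitioned = input
                .partitionByHash(0)
                .groupBy(0)
                .sortGroup(0, Order.ASCENDING)
                .first(n);

        topN.print();
        topNPrePartitioned.print();
    }
}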
In our first iteration (readCsvFile -> groupBy(0) -> sortGroup(0) -> first(n)), we got a job graph like this:

source --[Forward]--> combine --[Hash Partition on 0, Sort]--> reduce

This works, but we found the combine phase to be inefficient because not enough combinable elements fit into a sorter. My idea was to pre-partition the DataSet to increase the chance that combinable elements end up in the same sorter (readCsvFile -> partitionBy(0) -> groupBy(0) -> sortGroup(0) -> first(n)).

To my surprise, this changed the job graph to

source --[Hash Partition on 0]--> partition(noop) --[Forward]--> combine --[Hash Partition on 0, Sort]--> reduce

which materializes and spills the entire partitions at the partition(noop) operator!

Is there any way I can partition the data on the way from source to combine without spilling? That is, can I get a job graph that looks like

source --[Hash Partition on 0]--> combine --[Hash Partition on 0, Sort]--> reduce

instead?

Thanks,
Urs

--
Urs Schönenberger - urs.schoenenber...@tngtech.com
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Managing Directors (Geschäftsführer): Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Registered office (Sitz): Unterföhring * Amtsgericht München * HRB 135082