Hi again,
I had a bug in my own logic. The approach works as expected (which is perfect).
For others who might run into the same situation, here is a summary:
Problem:
- execute superstep-dependent UDFs on datasets which do not have access
to the iteration context
Solution:
- add dummy element to the working set (W) at the beginning of the step
function
- extract dummy from W using a filter function
- convert dummy into DataSet<Integer> (superstep) using a map function
- broadcast that 1-element dataset to the UDFs applied on the "external"
datasets
- keep only the non-dummy elements via a filter (if necessary) and continue
with the step function
Note that it should also work with cross instead of broadcasting. I have
not yet tested which way is faster.
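To make the steps above a bit more concrete, here is a minimal sketch. Please note that it is a plain-Java simulation of the data flow and not Flink code: lists stand in for datasets, a method parameter stands in for the iteration context, and every name (SuperstepDummySketch, TElem, DUMMY_ID, keyPerSuperstep) is made up for illustration. In an actual job, the map on W would presumably be a rich function reading getIterationRuntimeContext().getSuperstepNumber(), and the 1-element dataset would be handed to the UDF on T via withBroadcastSet.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Plain-Java simulation of the dummy-element trick. All names are hypothetical. */
final class SuperstepDummySketch {

    /** Reserved id marking the dummy element in W (assumption: never a real id). */
    static final long DUMMY_ID = -1L;

    /** Element of the "external" dataset T: an id plus one join key per superstep. */
    static final class TElem {
        final long id;
        final long[] keyPerSuperstep;

        TElem(long id, long... keyPerSuperstep) {
            this.id = id;
            this.keyPerSuperstep = keyPerSuperstep;
        }
    }

    /**
     * One run of the step function. The superstep parameter stands in for the
     * iteration context; it is only read by the map applied to W.
     */
    static List<Long> step(List<Long> w, List<TElem> t, int superstep) {
        // add the dummy element to W at the beginning of the step function
        List<Long> withDummy = new ArrayList<>(w);
        withDummy.add(DUMMY_ID);

        // extract the dummy with a filter and map it to the superstep number
        List<Integer> superstepSet = withDummy.stream()
                .filter(id -> id == DUMMY_ID)
                .map(dummy -> superstep)
                .collect(Collectors.toList());

        // "broadcast" the 1-element set to the UDF on T, which picks the join key
        int broadcast = superstepSet.get(0);
        Map<Long, Long> tByKey = new HashMap<>();
        for (TElem e : t) {
            // simplification: join keys are assumed unique within one superstep
            tByKey.put(e.keyPerSuperstep[broadcast - 1], e.id);
        }

        // keep the non-dummy elements and join them with T on the selected key
        return withDummy.stream()
                .filter(id -> id != DUMMY_ID)
                .filter(tByKey::containsKey)
                .map(tByKey::get)
                .collect(Collectors.toList());
    }

    /** Bulk iteration: feeds the result of each step into the next one. */
    static List<Long> run(List<Long> w, List<TElem> t, int maxSupersteps) {
        List<Long> current = w;
        for (int s = 1; s <= maxSupersteps; s++) {
            current = step(current, t, s);
        }
        return current;
    }

    public static void main(String[] args) {
        List<TElem> t = List.of(
                new TElem(10L, 1L, 20L),
                new TElem(20L, 2L, 10L),
                new TElem(30L, 3L, 30L));
        System.out.println(run(List.of(1L, 2L), t, 2)); // prints [20, 10]
    }
}
```

The point of the sketch is only that the UDF on T never touches the iteration context itself; it sees the superstep solely through the 1-element set derived from the dummy, so the join can always use the same field.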
Apologies if anyone spent time thinking about this when it was my own error in the end :)
Cheers,
Martin
On 29.05.2016 14:05, Martin Junghanns wrote:
Hi everyone,
In a step-function (bulk) I'd like to join the working set W
with another data set T. The join field of T depends on
the current superstep. Unfortunately, W has no access
to the iteration runtime context.
I tried to extract the current superstep at the beginning of
the step function and broadcast it to a UDF applied on T
(which sets the correct value of the join field), so that the join
is always performed on the same fields. Unfortunately, this does
not seem to work either.
I could work around that by replicating the elements of T and
joining multiple times, but this does not scale very well.
Any suggestion would be appreciated.
Cheers and thank you,
Martin