Hi Pieter,
a FlatMapFunction can only return values when the map() method is called.
However, in your use case, you would like to return values *after* the
function was called the last time. This is not possible with a
FlatMapFunction, because you cannot identify the last map() call.
The MapPartit
Hi Fabian,
I have a question regarding the first approach. Is there a benefit gained
from choosing a RichMapPartitionFunction over a RichMapFunction in this
case? I assume that each broadcasted dataset is sent only once to each task
manager?
If I would broadcast dataset B, then I could for each e
Hi Gabor, Fabian,
thank you for your suggestions. I am intending to scale up so that I'm sure
that both A and B won't fit in memory. I'll see if I can come up with a
nice way to partition the datasets but if that will take too much time I'll
just have to accept that it wont work on large datasets.
Hello,
Alternatively, if dataset B fits in memory, but dataset A doesn't,
then you can do it with broadcasting B to a RichMapPartitionFunction
on A:
In the open method of mapPartition, you sort B. Then, for each element
of A, you do a binary search in B, and look at the index found by the
binary s
The idea is to partition both datasets by range.
Assume your dataset A is [1,2,3,4,5,6] you create two partitions: p1:
[1,2,3] and p2: [4,5,6].
Each partition is given to a different instance of a MapPartition operator
(this is a bit tricky, because you cannot use broadcastSet. You could load
the c
Hi Fabian,
thanks for your tips!
Do you have some pointers for getting started with the 'tricky range
partitioning'? I am quite keen to get this working with large datasets ;-)
Cheers,
Pieter
2015-09-30 10:24 GMT+02:00 Fabian Hueske :
> Hi Pieter,
>
> cross is indeed too expensive for this ta
Hi Pieter,
cross is indeed too expensive for this task.
If dataset A fits into memory, you can do the following: Use a
RichMapPartitionFunction to process dataset B and add dataset A as a
broadcastSet. In the open method of mapPartition, you can load the
broadcasted set and sort it by a.propertyX
Good day everyone,
I am looking for a good way to do the following:
I have dataset A and dataset B, and for each element in dataset A I would
like to filter dataset B and obtain the size of the result. To say it short:
*for each element a in A -> B.filter( _ < a.propertyx).count*
Currently I am