Re: For each element in a dataset, do something with another dataset

2015-10-05 Thread Fabian Hueske
Hi Pieter, a FlatMapFunction can only return values when the flatMap() method is called. However, in your use case, you would like to return values *after* the function has been called for the last time. This is not possible with a FlatMapFunction, because you cannot identify the last flatMap() call. The MapPartit

Re: For each element in a dataset, do something with another dataset

2015-10-05 Thread Pieter Hameete
Hi Fabian, I have a question regarding the first approach. Is there any benefit to choosing a RichMapPartitionFunction over a RichMapFunction in this case? I assume that each broadcast dataset is sent only once to each task manager? If I broadcast dataset B, then I could for each e

Re: For each element in a dataset, do something with another dataset

2015-09-30 Thread Pieter Hameete
Hi Gabor, Fabian, thank you for your suggestions. I intend to scale up, so I am sure that neither A nor B will fit in memory. I'll see if I can come up with a nice way to partition the datasets, but if that takes too much time I'll just have to accept that it won't work on large datasets.

Re: For each element in a dataset, do something with another dataset

2015-09-30 Thread Gábor Gévay
Hello, Alternatively, if dataset B fits in memory, but dataset A doesn't, then you can do it with broadcasting B to a RichMapPartitionFunction on A: In the open method of mapPartition, you sort B. Then, for each element of A, you do a binary search in B, and look at the index found by the binary s
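The sort-then-binary-search idea in this message can be sketched in plain Java, outside of Flink. This is only an illustration under stated assumptions: elements are modelled as plain `int` values instead of objects with a `propertyx` field, and the names `CountSmaller`, `lowerBound`, and `countSmaller` are hypothetical. In a Flink job, the sorting step would presumably sit in the `open` method, as the message describes.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Flink: counts, for each a in A,
// how many elements of B are strictly smaller than a.
final class CountSmaller {
    private CountSmaller() {}

    // Binary search on a sorted array: returns the index of the first
    // element >= key, which equals the number of elements < key.
    static int lowerBound(int[] sortedB, int key) {
        int lo = 0;
        int hi = sortedB.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedB[mid] < key) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Sorts a copy of B once, then answers each element of A in O(log |B|).
    static int[] countSmaller(int[] a, int[] b) {
        int[] sortedB = b.clone();
        Arrays.sort(sortedB);
        int[] counts = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            counts[i] = lowerBound(sortedB, a[i]);
        }
        return counts;
    }
}
```

A hand-written lower bound is used here instead of `Arrays.binarySearch`, because the latter does not guarantee which index it returns when B contains duplicates.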

Re: For each element in a dataset, do something with another dataset

2015-09-30 Thread Fabian Hueske
The idea is to partition both datasets by range. Assume your dataset A is [1,2,3,4,5,6]; you create two partitions: p1: [1,2,3] and p2: [4,5,6]. Each partition is given to a different instance of a MapPartition operator (this is a bit tricky, because you cannot use broadcastSet. You could load the c
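The range split in the example above can be illustrated with a small plain-Java sketch. It only shows how values are cut into contiguous ranges; how the partitions are then handed to the MapPartition operators is cut off in this snippet and is not covered here. The names `RangePartitionSketch` and `partitionByRange` are hypothetical, and elements are assumed to be plain `int` values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Flink: splits values into
// numPartitions contiguous ranges of (nearly) equal size.
final class RangePartitionSketch {
    private RangePartitionSketch() {}

    static List<int[]> partitionByRange(int[] values, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        // Sort a copy so that each partition covers one value range.
        int[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        List<int[]> partitions = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            // long arithmetic avoids overflow for large inputs
            int from = (int) ((long) n * p / numPartitions);
            int to = (int) ((long) n * (p + 1) / numPartitions);
            partitions.add(Arrays.copyOfRange(sorted, from, to));
        }
        return partitions;
    }
}
```

With the values 1 to 6 and two partitions, this yields [1,2,3] and [4,5,6], matching p1 and p2 in the message.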

Re: For each element in a dataset, do something with another dataset

2015-09-30 Thread Pieter Hameete
Hi Fabian, thanks for your tips! Do you have some pointers for getting started with the 'tricky range partitioning'? I am quite keen to get this working with large datasets ;-) Cheers, Pieter

Re: For each element in a dataset, do something with another dataset

2015-09-30 Thread Fabian Hueske
Hi Pieter, cross is indeed too expensive for this task. If dataset A fits into memory, you can do the following: Use a RichMapPartitionFunction to process dataset B and add dataset A as a broadcastSet. In the open method of mapPartition, you can load the broadcasted set and sort it by a.propertyX

For each element in a dataset, do something with another dataset

2015-09-29 Thread Pieter Hameete
Good day everyone, I am looking for a good way to do the following: I have dataset A and dataset B, and for each element in dataset A I would like to filter dataset B and obtain the size of the result. In short: *for each element a in A -> B.filter( _ < a.propertyx).count* Currently I am
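For reference, the semantics of the question can be written down as a naive plain-Java sketch that filters B once per element of A. It is not a Flink program and not the poster's code: elements are assumed to be plain `int` values standing in for `propertyx`, and the names `NaiveCount` and `countSmallerNaive` are hypothetical. Its cost grows with |A| times |B|, which is why the replies in this thread look for cheaper alternatives.

```java
import java.util.Arrays;

// Hypothetical reference implementation of
// "for each a in A -> B.filter(_ < a).count".
final class NaiveCount {
    private NaiveCount() {}

    static long[] countSmallerNaive(int[] a, int[] b) {
        long[] counts = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            final int threshold = a[i];
            // Full scan of B for every element of A.
            counts[i] = Arrays.stream(b).filter(x -> x < threshold).count();
        }
        return counts;
    }
}
```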