Hi,

Flink's batch DataSet API does already support (manual) theta-joins via the
CrossFunction. It combines each pair of records of two input data sets.
This is done by broadcasting (and hence replicating) one of the inputs.
@Xingcan, so I think what you describe is already there.
However, as I said before, it is often prohibitively expensive to compute.
When you are at a point where a MapFunction with a broadcast set is no
longer sufficient (the smaller data set does not fit into memory), your
problem is often too big to compute.
The complexity of a Cartesian product (Cross) is simply quadratic.
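To illustrate the cost, here is a minimal, Flink-independent sketch in plain Java of a block-wise cross (in-memory lists stand in for the spilled and streamed inputs; all names and the block size are made up for the example): memory use is bounded by the block size, but the number of produced pairs is still the product of the input sizes.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockCross {
    // Cross two inputs block-wise: load a fixed-size block of the "outer"
    // input into memory, stream the whole "inner" input against it, then
    // move on to the next block. Memory use is bounded by blockSize, but
    // the total number of pairs is always |outer| * |inner| -- quadratic.
    static <A, B> List<String> blockCross(List<A> outer, List<B> inner, int blockSize) {
        List<String> pairs = new ArrayList<>();
        for (int start = 0; start < outer.size(); start += blockSize) {
            List<A> block = outer.subList(start, Math.min(start + blockSize, outer.size()));
            for (B b : inner) {          // one full pass over the inner input per block
                for (A a : block) {
                    pairs.add(a + "x" + b);
                }
            }
        }
        return pairs;
    }
}
```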

Regarding the specific problem of joining spatial shapes and points, I
would go with a spatial partitioning as follows:
- Partition the space and compute for each shape which partitions it
belongs to (could be more than one).
- Do the same for the points (will be exactly one).
- Do a 1-n join on the partition ids + an additional check if the point is
actually in the shape.

The challenge here is to have partitions of similar size.
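To make the three steps concrete, here is a small, Flink-independent sketch in plain Java (the grid cell size, the use of bounding boxes, and all names are assumptions of the example, not Flink API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class GridPartitioning {
    static final double CELL = 10.0; // assumed grid cell size

    // Step 2: a point falls into exactly one cell.
    static String pointCell(double x, double y) {
        return (long) Math.floor(x / CELL) + ":" + (long) Math.floor(y / CELL);
    }

    // Step 1: a shape (approximated by its axis-aligned bounding box
    // {minX, minY, maxX, maxY}) may overlap several cells, so it is
    // replicated into each of them.
    static List<String> shapeCells(double minX, double minY, double maxX, double maxY) {
        List<String> cells = new ArrayList<>();
        for (long cx = (long) Math.floor(minX / CELL); cx <= (long) Math.floor(maxX / CELL); cx++) {
            for (long cy = (long) Math.floor(minY / CELL); cy <= (long) Math.floor(maxY / CELL); cy++) {
                cells.add(cx + ":" + cy);
            }
        }
        return cells;
    }

    // Step 3: join on the cell id, then verify the exact containment
    // (here a rectangle test stands in for a real point-in-shape check).
    static List<double[]> shapesContaining(double px, double py,
                                           Map<String, List<double[]>> shapesByCell) {
        List<double[]> result = new ArrayList<>();
        for (double[] box : shapesByCell.getOrDefault(pointCell(px, py), List.of())) {
            if (px >= box[0] && py >= box[1] && px <= box[2] && py <= box[3]) {
                result.add(box);
            }
        }
        return result;
    }
}
```

In Flink terms this corresponds to a flatMap emitting (cellId, shape) pairs, a map emitting (cellId, point) pairs, and a join on the cell id followed by the exact containment check.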

Cheers, Fabian

2017-02-23 5:59 GMT+01:00 Xingcan Cui <xingc...@gmail.com>:

> Hi all,
>
> @Gwen From the database's point of view, the only way to avoid Cartesian
> product in join is to use index, which exhibits as key grouping in Flink.
> However, it only supports many-to-one mapping now, i.e., a shape or a point
> can only be distributed to a single group. Only points and shapes belonging
> to the same group can be joined and that could reduce the inherent pair
> comparisons (compared with a Cartesian product). It's perfectly suitable
> for equi-join.
>
> @Fabian I saw this thread when I was just considering theta-joins
> (which will eventually be supported) in Flink. Since it's impossible to
> group (index) a dataset for an arbitrary theta-join, I think we may need
> some duplication mechanism here. For example, split one dataset into n
> parts and send the other dataset to all of these parts. This could be even
> more useful in stream joins. BTW, it seems that I've seen another thread
> discussing this, but I cannot find it now. What do you think?
>
> Best,
> Xingcan
>
> On Thu, Feb 23, 2017 at 6:41 AM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Gwen,
>>
>> Flink usually performs a block nested loop join to cross two data sets.
>> This algorithm spills one input to disk and streams the other input. For
>> each input it fills a memory buffer to perform the cross. Then the
>> buffer of the spilled input is refilled with spilled records and the
>> records are crossed again. This is done until one pass over the spilled
>> records is complete. Then the buffer of the streamed input is filled
>> with the next records.
>>
>> You should be aware that cross is a super expensive operation, especially
>> if you evaluate a complex condition for each pair of records. So cross
>> can easily become too expensive to compute.
>> For such use cases it is usually better to apply a coarse-grained spatial
>> partitioning and do a key-based join on the partitions. Within each
>> partition you'd perform a cross.
>>
>> Best, Fabian
>>
>>
>> 2017-02-21 18:34 GMT+01:00 Gwenhael Pasquiers <
>> gwenhael.pasqui...@ericsson.com>:
>>
>>> Hi,
>>>
>>>
>>>
>>> I need (or at least I think I do) to do a cross operation between two
>>> huge datasets. One dataset is a list of points. The other one is a list of
>>> shapes (areas).
>>>
>>>
>>>
>>> I want to know, for each point, the areas (they might overlap so a point
>>> can be in multiple areas) it belongs to so I thought I’d “cross” my points
>>> and areas since I need to test each point against each area.
>>>
>>>
>>>
>>> I tried it and my job seems to work for a few seconds, then, at some
>>> point, it gets stuck.
>>>
>>>
>>>
>>> I’m wondering if Flink, for cross operations, tries to load one of the
>>> two datasets into RAM or if it’s able to split the job into multiple
>>> iterations (even if it means reading one of the two datasets multiple
>>> times).
>>>
>>>
>>>
>>> Or maybe I’m going at it the wrong way, or missing some parameters, feel
>>> free to correct me :)
>>>
>>>
>>>
>>> I’m using flink 1.0.1.
>>>
>>>
>>>
>>> Thanks in advance
>>>
>>>
>>>
>>> Gwen’
>>>
>>
>>
>
