RE: Cross operation on two huge datasets

2017-03-03 Thread Gwenhael Pasquiers
...@ericsson.com] Sent: vendredi 3 mars 2017 18:10 To: user@flink.apache.org Subject: RE: Cross operation on two huge datasets To answer Ankit, It is a batch application. Yes, I admit I did broadcasting by hand. I did it that way because the only other way I found to “broadcast” a DataSet was to use

RE: Cross operation on two huge datasets

2017-03-03 Thread Gwenhael Pasquiers
broadcasting it rather than sending its elements 1 by 1. I’ll try to use it, I’ll take anything that will make my code cleaner ! From: Gwenhael Pasquiers [mailto:gwenhael.pasqui...@ericsson.com] Sent: vendredi 3 mars 2017 17:55 To: user@flink.apache.org Subject: RE: Cross operation on two huge datasets

RE: Cross operation on two huge datasets

2017-03-03 Thread Gwenhael Pasquiers
n.com>> Date: Thursday, March 2, 2017 at 7:28 AM To: "user@flink.apache.org<mailto:user@flink.apache.org>" mailto:user@flink.apache.org>> Subject: RE: Cross operation on two huge datasets I made it so that I don’t care where the next operator will be scheduled. I confi

RE: Cross operation on two huge datasets

2017-03-03 Thread Gwenhael Pasquiers
with only 1 parititon, not even on yarn). From: Till Rohrmann [mailto:trohrm...@apache.org] Sent: jeudi 2 mars 2017 15:40 To: user@flink.apache.org<mailto:user@flink.apache.org> Subject: Re: Cross operation on two huge datasets Yes you’re right about the “split” and broadcasting. Storing

Re: Cross operation on two huge datasets

2017-03-02 Thread Xingcan Cui
– so take my > suggestions with caution J > > > > Thanks > > Ankit > > > > *From: *Gwenhael Pasquiers > *Date: *Thursday, March 2, 2017 at 7:28 AM > *To: *"user@flink.apache.org" > *Subject: *RE: Cross operation on two huge datasets > >

Re: Cross operation on two huge datasets

2017-03-02 Thread Jain, Ankit
till only learning – so take my suggestions with caution ☺ Thanks Ankit From: Gwenhael Pasquiers Date: Thursday, March 2, 2017 at 7:28 AM To: "user@flink.apache.org" Subject: RE: Cross operation on two huge datasets I made it so that I don’t care where the next operator will be

RE: Cross operation on two huge datasets

2017-03-02 Thread Gwenhael Pasquiers
, not even on yarn). From: Till Rohrmann [mailto:trohrm...@apache.org] Sent: jeudi 2 mars 2017 15:40 To: user@flink.apache.org Subject: Re: Cross operation on two huge datasets Yes you’re right about the “split” and broadcasting. Storing it in the JVM is not a good approach, since you don’t know

Re: Cross operation on two huge datasets

2017-03-02 Thread Till Rohrmann
r broadcasting, wouldn’t broadcasting the variable cancel the efforts > I did to “split” the dataset parsing over the nodes ? > > > > > > *From:* Till Rohrmann [mailto:trohrm...@apache.org] > *Sent:* jeudi 2 mars 2017 14:42 > > *To:* user@flink.apache.org > *Subject:* Re:

RE: Cross operation on two huge datasets

2017-03-02 Thread Gwenhael Pasquiers
nodes ? From: Till Rohrmann [mailto:trohrm...@apache.org] Sent: jeudi 2 mars 2017 14:42 To: user@flink.apache.org Subject: Re: Cross operation on two huge datasets Hi Gwenhael, if you want to persist operator state, then you would have to persist it (e.g. writing to a shared directory or

Re: Cross operation on two huge datasets

2017-03-02 Thread Till Rohrmann
> I would recommend looking into a data structure called RTree that is > designed specifically for this use case, i.e matching point to a region. > > > > Thanks > > Ankit > > > > *From: *Fabian Hueske > *Date: *Wednesday, February 22, 2017 at 2:41 PM > *To: * > *Sub

RE: Cross operation on two huge datasets

2017-03-02 Thread Gwenhael Pasquiers
To: user@flink.apache.org Cc: Fabian Hueske Subject: Re: Cross operation on two huge datasets Hi Gwen, I would recommend looking into a data structure called RTree that is designed specifically for this use case, i.e matching point to a region. Thanks Ankit From: Fabian Hueske mailto:fhue

Re: Cross operation on two huge datasets

2017-02-23 Thread Jain, Ankit
Hi Gwen, I would recommend looking into a data structure called RTree that is designed specifically for this use case, i.e matching point to a region. Thanks Ankit From: Fabian Hueske Date: Wednesday, February 22, 2017 at 2:41 PM To: Subject: Re: Cross operation on two huge datasets Hi Gwen

Re: Cross operation on two huge datasets

2017-02-23 Thread Xingcan Cui
Hi, @Gwen, sorry that I missed the cross function and showed you the wrong way. @Fabian's answers are what I mean. Considering that the cross function is so expensive, can we find a way to restrict the broadcast. That is, if the groupBy function is a many-to-one mapping, the cross function is an

Re: Cross operation on two huge datasets

2017-02-23 Thread Fabian Hueske
Hi Gwen, sorry I didn't read your answer, I was still writing mine when you sent yours ;-) Regarding your strategy, this is basically what Cross does: It keeps on input partitioned and broadcasts (replicates) the other one. On each partition, it combines the records of the partition of the first

Re: Cross operation on two huge datasets

2017-02-23 Thread Fabian Hueske
Hi, Flink's batch DataSet API does already support (manual) theta-joins via the CrossFunction. It combines each pair of records of two input data sets. This is done by broadcasting (and hence replicating) one of the inputs. @Xingcan, so I think what you describe is already there. However, as I sai

RE: Cross operation on two huge datasets

2017-02-23 Thread Gwenhael Pasquiers
… sorry for my rambling if I’m not clear. B.R. From: Xingcan Cui [mailto:xingc...@gmail.com] Sent: jeudi 23 février 2017 06:00 To: user@flink.apache.org Subject: Re: Cross operation on two huge datasets Hi all, @Gwen From the database's point of view, the only way to avoid Cartesian pr

Re: Cross operation on two huge datasets

2017-02-22 Thread Xingcan Cui
Hi all, @Gwen From the database's point of view, the only way to avoid Cartesian product in join is to use index, which exhibits as key grouping in Flink. However, it only supports many-to-one mapping now, i.e., a shape or a point can only be distributed to a single group. Only points and shapes b

Re: Cross operation on two huge datasets

2017-02-22 Thread Fabian Hueske
Hi Gwen, Flink usually performs a block nested loop join to cross two data sets. This algorithm spills one input to disk and streams the other input. For each input it fills a memory buffer and to perform the cross. Then the buffer of the spilled input is refilled with spilled records and records