From: Gwenhael Pasquiers [mailto:gwenhael.pasqui...@ericsson.com]
Sent: Friday, March 3, 2017 18:10
To: user@flink.apache.org
Subject: RE: Cross operation on two huge datasets
To answer Ankit,
It is a batch application.
Yes, I admit I did the broadcasting by hand. I did it that way because the
only other way I found to “broadcast” a DataSet was to use …
broadcasting it rather than sending its elements one by one.
I’ll try to use it; I’ll take anything that will make my code cleaner!
From: Gwenhael Pasquiers [mailto:gwenhael.pasqui...@ericsson.com]
Sent: Friday, March 3, 2017 17:55
To: user@flink.apache.org
Subject: RE: Cross operation on two huge datasets
From: Gwenhael Pasquiers
Date: Thursday, March 2, 2017 at 7:28 AM
To: "user@flink.apache.org"
Subject: RE: Cross operation on two huge datasets
I made it so that I don’t care where the next operator will be scheduled.
I confi… with only 1 partition, not even on yarn).
I’m still only learning – so take my suggestions with caution ☺
Thanks
Ankit
From: Gwenhael Pasquiers
Date: Thursday, March 2, 2017 at 7:28 AM
To: "user@flink.apache.org"
Subject: RE: Cross operation on two huge datasets
From: Till Rohrmann [mailto:trohrm...@apache.org]
Sent: Thursday, March 2, 2017 15:40
To: user@flink.apache.org
Subject: Re: Cross operation on two huge datasets
Yes you’re right about the “split” and broadcasting.
Storing it in the JVM is not a good approach, since you don’t know
As for broadcasting, wouldn’t broadcasting the variable cancel the efforts
I did to “split” the dataset parsing over the nodes?
From: Till Rohrmann [mailto:trohrm...@apache.org]
Sent: Thursday, March 2, 2017 14:42
To: user@flink.apache.org
Subject: Re: Cross operation on two huge datasets
Hi Gwenhael,
if you want to persist operator state, then you would have to persist it (e.g.
writing to a shared directory or
> I would recommend looking into a data structure called RTree that is
> designed specifically for this use case, i.e matching point to a region.
>
>
>
> Thanks
>
> Ankit
>
>
>
> *From: *Fabian Hueske
> *Date: *Wednesday, February 22, 2017 at 2:41 PM
> *To: *
> *Sub
To: user@flink.apache.org
Cc: Fabian Hueske
Subject: Re: Cross operation on two huge datasets
Hi Gwen,
I would recommend looking into a data structure called RTree that is designed
specifically for this use case, i.e matching point to a region.
Thanks
Ankit
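[Editor's note: a minimal, self-contained sketch of the idea behind such a spatial index. It uses a uniform grid instead of a real RTree (libraries such as the JTS Topology Suite provide proper R-trees); the cell width of 10 and the axis-aligned rectangular regions are illustrative assumptions, not anything from this thread.]

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal uniform-grid spatial index, standing in for the suggested RTree.
// Regions are registered in every grid cell their bounding box overlaps, so
// a point is only tested against regions sharing its cell instead of against
// the whole region set (which is what a full Cross would do).
public class GridIndexSketch {
    static final double CELL = 10.0; // illustrative cell width

    // Axis-aligned rectangle; a real RTree handles arbitrary shapes through
    // their bounding boxes in the same spirit.
    static class Region {
        final String name;
        final double minX, minY, maxX, maxY;
        Region(String name, double minX, double minY, double maxX, double maxY) {
            this.name = name; this.minX = minX; this.minY = minY;
            this.maxX = maxX; this.maxY = maxY;
        }
        boolean contains(double x, double y) {
            return x >= minX && x <= maxX && y >= minY && y <= maxY;
        }
    }

    private final Map<Long, List<Region>> cells = new HashMap<>();

    private static long key(int cx, int cy) {
        return ((long) cx << 32) ^ (cy & 0xffffffffL);
    }

    void insert(Region r) {
        for (int cx = (int) Math.floor(r.minX / CELL); cx <= (int) Math.floor(r.maxX / CELL); cx++)
            for (int cy = (int) Math.floor(r.minY / CELL); cy <= (int) Math.floor(r.maxY / CELL); cy++)
                cells.computeIfAbsent(key(cx, cy), k -> new ArrayList<>()).add(r);
    }

    // Only regions indexed in the point's cell are candidates.
    List<String> match(double x, double y) {
        List<String> hits = new ArrayList<>();
        long k = key((int) Math.floor(x / CELL), (int) Math.floor(y / CELL));
        for (Region r : cells.getOrDefault(k, new ArrayList<>()))
            if (r.contains(x, y)) hits.add(r.name);
        return hits;
    }
}
```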
From: Fabian Hueske
Date: Wednesday, February 22, 2017 at 2:41 PM
To:
Subject: Re: Cross operation on two huge datasets
Hi Gwen
Hi,
@Gwen, sorry that I missed the cross function and showed you the wrong way.
@Fabian's answers are what I mean.
Considering that the cross function is so expensive, can we find a way to
restrict the broadcast? That is, if the groupBy function is a many-to-one
mapping, the cross function is an
Hi Gwen,
sorry I didn't read your answer, I was still writing mine when you sent
yours ;-)
Regarding your strategy, this is basically what Cross does:
it keeps one input partitioned and broadcasts (replicates) the other one. On
each partition, it combines the records of the partition of the first
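[Editor's note: a plain-Java model of the data movement described above, with no Flink involved. One input stays partitioned, a full copy of the other is shipped to every partition, and each partition emits all local pairs. The string concatenation stands in for an arbitrary combine function.]

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Cross mechanics: input A stays partitioned, input B is
// replicated (broadcast) to every partition, and each partition emits one
// combined record per local (a, b) pair.
public class CrossSketch {
    static List<String> cross(List<List<String>> partitionsOfA, List<String> replicatedB) {
        List<String> out = new ArrayList<>();
        for (List<String> partition : partitionsOfA) { // each runs in parallel in Flink
            for (String a : partition)
                for (String b : replicatedB)           // full copy of B on every partition
                    out.add(a + "x" + b);              // the "combine" step of Cross
        }
        return out;
    }
}
```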
Hi,
Flink's batch DataSet API does already support (manual) theta-joins via the
CrossFunction. It combines each pair of records of two input data sets.
This is done by broadcasting (and hence replicating) one of the inputs.
@Xingcan, so I think what you describe is already there.
However, as I sai
…
sorry for my rambling if I’m not clear.
B.R.
From: Xingcan Cui [mailto:xingc...@gmail.com]
Sent: Thursday, February 23, 2017 06:00
To: user@flink.apache.org
Subject: Re: Cross operation on two huge datasets
Hi all,
@Gwen From the database's point of view, the only way to avoid Cartesian
product in join is to use index, which exhibits as key grouping in Flink.
However, it only supports many-to-one mapping now, i.e., a shape or a point
can only be distributed to a single group. Only points and shapes b
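[Editor's note: a toy sketch of the many-to-one limitation described above. Every record is assigned to exactly one group key (a 1-D grid cell here), so a shape spanning several cells only meets the points of the single cell it was keyed to. The cell width and the shape spans are made-up values for illustration.]

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Join-by-key-grouping sketch: shapes are keyed by ONE cell only (the cell
// of their minimum coordinate), so points falling in a different cell of the
// same shape's span are silently missed. This is the many-to-one limitation.
public class KeyGroupingSketch {
    static int cellOf(double x) { return (int) Math.floor(x / 10.0); }

    // shapeSpans: shape name -> {minX, maxX}
    static List<String> joinByCell(Map<String, double[]> shapeSpans, List<Double> points) {
        Map<Integer, List<String>> shapesByCell = new HashMap<>();
        for (Map.Entry<String, double[]> e : shapeSpans.entrySet())
            shapesByCell.computeIfAbsent(cellOf(e.getValue()[0]), k -> new ArrayList<>())
                        .add(e.getKey());
        List<String> matches = new ArrayList<>();
        for (double p : points)
            for (String s : shapesByCell.getOrDefault(cellOf(p), new ArrayList<>())) {
                double[] span = shapeSpans.get(s);
                if (p >= span[0] && p <= span[1]) matches.add(s + ":" + p);
            }
        return matches;
    }
}
```

With a shape spanning [5, 15] (keyed to cell 0), the point 7.0 matches but the point 12.0 (cell 1) is missed even though it lies inside the span, which is exactly why a single key group is not enough.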
Hi Gwen,
Flink usually performs a block nested loop join to cross two data sets.
This algorithm spills one input to disk and streams the other input. For
each input it fills a memory buffer to perform the cross. Then the
buffer of the spilled input is refilled with spilled records and records
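[Editor's note: a simplified, in-memory sketch of that algorithm, not Flink's actual implementation. The "spilled" input is consumed block by block into a fixed-size buffer, and the other input is streamed in full against each block; the buffer size and string records are illustrative.]

```java
import java.util.ArrayList;
import java.util.List;

// Block nested loop cross: read the spilled input in buffer-sized blocks,
// and for each block stream the whole other input, emitting every pair.
public class BlockNestedLoopSketch {
    static List<String> blockCross(List<String> spilled, List<String> streamed, int bufferSize) {
        List<String> out = new ArrayList<>();
        // Refill the buffer with the next block of spilled records...
        for (int start = 0; start < spilled.size(); start += bufferSize) {
            List<String> buffer = spilled.subList(start, Math.min(start + bufferSize, spilled.size()));
            // ...then stream the whole other input against the buffered block.
            for (String s : streamed)
                for (String b : buffer)
                    out.add(b + "x" + s);
        }
        return out;
    }
}
```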