Can your domain list fit in the memory of one executor? If so, you can use
a broadcast join.
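To illustrate the idea, here is a minimal plain-Python sketch (not the
Spark API) of what a broadcast join does: the small domain-rule table is
shipped to every executor, so each partition of the large page list can be
joined with a local hash lookup and the large side never shuffles. The
names `domain_rules` and `pages` and the sample data are illustrative
assumptions.

```python
# Small side: the domain-rule table, assumed to fit in one executor's memory.
# In Spark this would be broadcast to all executors.
domain_rules = {
    "example.com": "rule-A",
    "seznam.cz": "rule-B",
}

# Large side: one partition's worth of (domain, page) records.
pages = [
    ("example.com", "/index.html"),
    ("seznam.cz", "/zpravy"),
    ("unknown.org", "/page"),
]

# Map-side join: each page is matched against the local copy of the
# broadcast table; no shuffle of `pages` is required.
joined = [(domain, path, domain_rules.get(domain)) for domain, path in pages]
```

Domains with no matching rule simply come out with `None`, mimicking the
outer side of the join.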

You can always narrow it down to an inner join and derive the rest from the
original set if memory is an issue there. If you are just concerned about
shuffle memory, then to reduce the amount of shuffle you can do the
following:
1) partition both RDDs (DataFrames) with the same partitioner and the same
partition count, so corresponding data will at least be on the same node
2) increase spark.shuffle.memoryFraction
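To make point 1 concrete, here is a small plain-Python sketch (again not
the Spark API) of why co-partitioning helps: when both sides are bucketed
by the same hash function into the same number of partitions, matching
keys land in the same partition index on both sides, so the join needs no
further cross-node movement. `NUM_PARTITIONS` and the sample records are
illustrative assumptions.

```python
NUM_PARTITIONS = 4

def partition_of(key: str, n: int = NUM_PARTITIONS) -> int:
    # Deterministic stand-in for Spark's HashPartitioner (hash(key) mod n).
    # Python's built-in hash() on strings is randomized per process, so we
    # sum the key's bytes instead to keep the sketch reproducible.
    return sum(key.encode("utf-8")) % n

def partition(records, n: int = NUM_PARTITIONS):
    # Bucket (key, value) records by the partition index of their key.
    parts = [[] for _ in range(n)]
    for key, value in records:
        parts[partition_of(key, n)].append((key, value))
    return parts

pages = [("example.com", "/index.html"), ("seznam.cz", "/zpravy")]
rules = [("example.com", "rule-A"), ("seznam.cz", "rule-B")]

# Both sides partitioned with the same function and the same count:
page_parts = partition(pages)
rule_parts = partition(rules)
```

Because `partition_of` is shared, every domain's pages and its rule end up
in the same bucket, which is exactly what using the same partitioner with
the same count buys you in Spark.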

You can use DataFrames with Spark 1.6 or greater to further reduce the
memory footprint. I haven't tested that myself, though.


On Tue, Jun 21, 2016 at 6:16 AM, Rychnovsky, Dusan <
dusan.rychnov...@firma.seznam.cz> wrote:

> Hi,
>
>
> can somebody please explain how FullOuterJoin works in Spark? Does
> each intersection get fully loaded into memory?
>
> My problem is as follows:
>
>
> I have two large data-sets:
>
>
> * a list of web pages,
>
> * a list of domain-names with specific rules for processing pages from
> that domain.
>
>
> I am joining these web-pages with processing rules.
>
>
> For certain domains there are millions of web-pages.
>
>
> Based on the memory demands the join is making, it looks like the whole
> intersection (i.e. a domain + all corresponding pages) is kept in memory
> while processing.
>
>
> What I really need in this case, though, is to hold just the domain and
> iterate over all corresponding pages, one at a time.
>
>
> What would be the best way to do this on Spark?
>
> Thank you,
>
> Dusan Rychnovsky
>
>
>
