Hi,

can somebody please explain how fullOuterJoin works in Spark? Does each 
intersection get fully loaded into memory?

My problem is as follows:


I have two large datasets:


* a list of web pages,

* a list of domain names with specific rules for processing pages from that 
domain.


I am joining these web pages with their processing rules, keyed by domain name.
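
To make the setup concrete, here is a minimal sketch of roughly what my join
looks like. All names and the toy data below are made up for illustration;
pages and rules are simplified to plain strings:

import org.apache.spark.{SparkConf, SparkContext}

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("JoinSketch").setMaster("local[*]"))

    // Toy stand-ins for my real data; both RDDs are keyed by domain name.
    val pages = sc.parallelize(Seq(
      ("example.com", "/a"), ("example.com", "/b"), ("other.org", "/x")))
    val rules = sc.parallelize(Seq(
      ("example.com", "strip-ads"), ("unmatched.net", "noop")))

    // The join as I currently write it: fullOuterJoin on the domain key,
    // yielding RDD[(String, (Option[String], Option[String]))].
    val joined = pages.fullOuterJoin(rules)
    joined.collect().foreach(println)

    sc.stop()
  }
}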


For certain domains there are millions of web pages.


Judging by the memory demands of the join, it looks like the whole 
intersection (i.e. a domain plus all of its corresponding pages) is kept in 
memory during processing.


What I really need in this case, though, is to hold just the domain and its 
rules in memory and iterate over all corresponding pages one at a time.
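
To illustrate the access pattern I mean, here is a sketch that approximates it
by broadcasting the rules table and streaming over the pages with a plain map.
This assumes the rules fit in executor memory, which may not be a safe
assumption in my case, and all names here are again made up:

import org.apache.spark.{SparkConf, SparkContext}

object DesiredAccessPattern {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("DesiredAccessPattern").setMaster("local[*]"))

    val pages = sc.parallelize(Seq(
      ("example.com", "/a"), ("example.com", "/b"), ("other.org", "/x")))
    val rules = sc.parallelize(Seq(("example.com", "strip-ads")))

    // Ship the (hopefully small) rules table to every executor once...
    val ruleByDomain = sc.broadcast(rules.collectAsMap())

    // ...then stream over the pages one at a time, looking up each page's
    // rule. Only the rules are held in memory; pages are never grouped
    // per domain, so no per-key buffer builds up.
    val processed = pages.map { case (domain, page) =>
      val rule = ruleByDomain.value.getOrElse(domain, "default-rule")
      s"$domain$page -> $rule"
    }
    processed.collect().foreach(println)

    sc.stop()
  }
}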


What would be the best way to do this in Spark?

Thank you,

Dusan Rychnovsky
