Re: Spark joins using row id

2016-11-13 Thread Yan Facai
A pairRDD can use its (hash) partitioner information to apply some optimizations when joined; I am not sure whether a Dataset can.

On Sat, Nov 12, 2016 at 7:11 PM, Rohit Verma wrote:
> For datasets structured as
>
> ds1
> rowN col1
> 1    A
> 2    B
> 3    C
> 4    C
> …
>
> and
>
> ds2
> rowN
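To illustrate the RDD-side behaviour mentioned above, here is a minimal sketch (hypothetical data, local master — not code from the thread) in which both pair RDDs are pre-partitioned with the same HashPartitioner, so the subsequent join is narrow and needs no shuffle:

```java
import java.util.Arrays;

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class CoPartitionedJoin {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "co-partitioned-join");

        HashPartitioner partitioner = new HashPartitioner(4);

        // Partition both RDDs with the same partitioner and cache them,
        // so matching keys already live in the same partition.
        JavaPairRDD<Long, String> left = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>(1L, "A"), new Tuple2<>(2L, "B"),
                new Tuple2<>(3L, "C"), new Tuple2<>(4L, "C")))
            .partitionBy(partitioner).cache();

        JavaPairRDD<Long, String> right = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>(1L, "X"), new Tuple2<>(2L, "Y"),
                new Tuple2<>(3L, "Z")))
            .partitionBy(partitioner).cache();

        // Both sides share the partitioner, so Spark can join
        // partition-by-partition without shuffling either side.
        System.out.println(left.leftOuterJoin(right).collect());

        sc.stop();
    }
}
```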

Re: Spark joins using row id

2016-11-12 Thread Rohit Verma
The result of explain is as follows:

*BroadcastHashJoin [rowN#0], [rowN#39], LeftOuter, BuildRight
:- *Project [rowN#0, informer_code#22]
:  +- Window [rownumber() windowspecdefinition(informer_code#22 ASC, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rowN#0], [informer_code#22 ASC]
:     +- *
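For context, Spark picks a BroadcastHashJoin like the one above when it estimates one side fits under spark.sql.autoBroadcastJoinThreshold; the hint can also be given explicitly. A sketch, not from the thread — it assumes ds1 and ds2 already exist as in the question:

```java
// Fragment only: assumes Dataset<Row> ds1, ds2 and a running SparkSession.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.broadcast;

Dataset<Row> joined = ds1.join(broadcast(ds2), "rowN");  // hint: broadcast ds2
joined.explain();  // prints a physical plan like the *BroadcastHashJoin above
```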

Spark joins using row id

2016-11-12 Thread Rohit Verma
For datasets structured as

ds1
rowN col1
1    A
2    B
3    C
4    C
…

and

ds2
rowN col2
1    X
2    Y
3    Z
…

I want to do a left join:

Dataset joined = ds1.join(ds2, "rowN", "left_outer");

I somewhere read on SO or this mailing list that if spark is aware of datasets b
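The join described above can be sketched end to end as follows (a minimal, self-contained example with the thread's sample data and a local SparkSession; the Column-based join overload is used here, which works across Spark versions from the Java API):

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class RowIdLeftJoin {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("row-id-left-join").getOrCreate();

        StructType s1 = new StructType()
                .add("rowN", DataTypes.LongType)
                .add("col1", DataTypes.StringType);
        Dataset<Row> ds1 = spark.createDataFrame(Arrays.asList(
                RowFactory.create(1L, "A"), RowFactory.create(2L, "B"),
                RowFactory.create(3L, "C"), RowFactory.create(4L, "C")), s1);

        StructType s2 = new StructType()
                .add("rowN", DataTypes.LongType)
                .add("col2", DataTypes.StringType);
        Dataset<Row> ds2 = spark.createDataFrame(Arrays.asList(
                RowFactory.create(1L, "X"), RowFactory.create(2L, "Y"),
                RowFactory.create(3L, "Z")), s2);

        // Left outer join on rowN; row 4 has no match, so its col2 is null.
        Dataset<Row> joined = ds1.join(ds2,
                ds1.col("rowN").equalTo(ds2.col("rowN")), "left_outer");
        joined.show();

        spark.stop();
    }
}
```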