A pairRDD can use its (hash) partitioner information to optimize joins, e.g. avoiding a shuffle when both sides are already co-partitioned; I am not sure whether a Dataset can do the same.
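The co-partitioning idea above can be sketched in plain Java (this is a toy illustration, not Spark code; the class and method names are made up): if both key/value collections are split with the same hash function and the same partition count, a given key can only live in the same partition index on both sides, so each partition pair can be joined locally without moving data.

```java
import java.util.*;

// Toy sketch of a co-partitioned hash join (not Spark code).
public class CoPartitionedJoin {

    // Same rule on both sides: non-negative hash modulo partition count.
    static int partitionOf(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    static List<Map<String, String>> partitionBy(Map<String, String> data, int n) {
        List<Map<String, String>> parts = new ArrayList<>();
        for (int i = 0; i < n; i++) parts.add(new HashMap<>());
        data.forEach((k, v) -> parts.get(partitionOf(k, n)).put(k, v));
        return parts;
    }

    // Join partition i of the left only with partition i of the right:
    // because both sides used the same partitioner, no other partition
    // can contain a matching key, so no shuffle is needed.
    static Map<String, String[]> join(Map<String, String> left,
                                      Map<String, String> right, int n) {
        List<Map<String, String>> lp = partitionBy(left, n);
        List<Map<String, String>> rp = partitionBy(right, n);
        Map<String, String[]> out = new HashMap<>();
        for (int i = 0; i < n; i++) {
            for (Map.Entry<String, String> e : lp.get(i).entrySet()) {
                String rv = rp.get(i).get(e.getKey());
                if (rv != null) out.put(e.getKey(), new String[]{e.getValue(), rv});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> ds1 = Map.of("1", "A", "2", "B", "3", "C");
        Map<String, String> ds2 = Map.of("1", "X", "2", "Y");
        // Keys 1 and 2 match; key 3 has no partner on the right.
        System.out.println(join(ds1, ds2, 4).keySet());
    }
}
```

When Spark knows both RDDs share a partitioner, the real join works the same way at partition granularity; without that information it must hash-repartition (shuffle) at least one side first.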
On Sat, Nov 12, 2016 at 7:11 PM, Rohit Verma wrote:
> For datasets structured as
>
> ds1
> rowN col1
> 1 A
> 2 B
> 3 C
> 4 C
> …
>
> and
>
> ds2
> rowN col2
> 1 X
> 2 Y
> 3 Z
> …
The result of explain is as follows:
*BroadcastHashJoin [rowN#0], [rowN#39], LeftOuter, BuildRight
:- *Project [rowN#0, informer_code#22]
:  +- Window [rownumber() windowspecdefinition(informer_code#22 ASC, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rowN#0], [informer_code#22 ASC]
:     +- *
For datasets structured as
ds1
rowN col1
1 A
2 B
3 C
4 C
…
and
ds2
rowN col2
1 X
2 Y
3 Z
…
I want to do a left join
Dataset<Row> joined = ds1.join(ds2, "rowN", "left_outer");
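For reference, the left outer join semantics being asked for can be sketched in plain Java (no Spark dependency; the class and helper names here are hypothetical): every row of ds1 appears exactly once in the result, and col2 is null wherever ds2 has no matching rowN.

```java
import java.util.*;

// Plain-Java sketch of left outer join semantics on the toy tables above.
public class LeftOuterJoinSketch {

    // Returns one {rowN, col1, col2} row per ds1 entry; col2 is null
    // when ds2 has no entry for that rowN.
    static List<String[]> leftOuterJoin(LinkedHashMap<Integer, String> ds1,
                                        Map<Integer, String> ds2) {
        List<String[]> out = new ArrayList<>();
        ds1.forEach((rowN, col1) ->
            out.add(new String[]{rowN.toString(), col1, ds2.get(rowN)}));
        return out;
    }

    public static void main(String[] args) {
        LinkedHashMap<Integer, String> ds1 = new LinkedHashMap<>();
        ds1.put(1, "A"); ds1.put(2, "B"); ds1.put(3, "C"); ds1.put(4, "C");
        Map<Integer, String> ds2 = Map.of(1, "X", 2, "Y", 3, "Z");
        for (String[] row : leftOuterJoin(ds1, ds2))
            System.out.println(Arrays.toString(row));
        // rowN 4 has no match in ds2, so its col2 is null
    }
}
```

Note that Spark expects the join type as "left_outer" (or "left"); a string with a space, such as "left outer", is rejected.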
I read somewhere, on SO or this mailing list, that if Spark is aware of datasets
b