Re: Implement customized Join for SparkSQL

Rishi Yadav Thu, 08 Jan 2015 14:54:59 -0800

Hi Kevin,

Say A has 10 ids. Are you then pulling data from B's data source only for
those 10 ids?


What if you load A and B as separate SchemaRDDs and then join them? Spark
will optimize the plan anyway when an action is fired.
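
For comparison, here is a minimal sketch of the lookup-driven flow described
in the quoted message: iterate over A's rows, fetch the matching B row by id,
and merge the two. It is plain Python rather than Spark SQL, and everything
in it is an assumption made for illustration: the names `lookup_join`,
`fetch_b`, `A_ROWS` and `B_INDEX` are hypothetical, rows are modelled as
dicts, and an in-memory dict stands in for the indexed database behind B.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional

Row = Dict[str, object]


def lookup_join(rows_a: Iterable[Row],
                fetch_b: Callable[[object], Optional[Row]]) -> Iterator[Row]:
    """Inner-join rows of A against B by looking up each A.id in B's index."""
    for a in rows_a:
        # Hypothetical indexed lookup into B; returns None when the id is absent.
        b = fetch_b(a["id"])
        if b is not None:
            # Inner join: ids missing from B are dropped.
            # Apart from the shared "id", column names are assumed not to collide.
            yield {**b, **a}


# Toy stand-ins for the two sources (illustrative data only).
A_ROWS = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}, {"id": 3, "name": "z"}]
B_INDEX = {1: {"id": 1, "city": "Paris"}, 3: {"id": 3, "city": "Lima"}}

joined = list(lookup_join(A_ROWS, B_INDEX.get))
for row in joined:
    print(row)
```

On an RDD, a function of this shape could be applied per partition with
`mapPartitions`, for example
`a_rdd.mapPartitions(lambda part: lookup_join(part, fetch_b))`, so that each
partition can reuse one connection to B. That step is not exercised here and
bypasses the Spark SQL join planner entirely.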

On Mon, Jan 5, 2015 at 2:28 AM, Dai, Kevin <[email protected]> wrote:

>  Hi, All
>
>
>
> Suppose I want to join two tables A and B as follows:
>
>
>
> Select * from A join B on A.id = B.id
>
>
>
> A is a file, while B is a database that is indexed by id and that I wrapped
> with the Data Source API.
>
> The desired join flow is:
>
> 1. Generate A’s RDD[Row]
>
> 2. Generate B’s RDD[Row] from A, using A’s ids and B’s data source API to
> fetch the matching rows from the database
>
> 3. Merge these two RDDs into the final RDD[Row]
>
>
>
> However, it seems the existing join strategies don’t support this.
>
>
>
> Is there any way to achieve it?
>
>
>
> Best Regards,
>
> Kevin.
>
