When I do the steps above and run a query like this:
sqlContext.sql("select * from ...")
I get this exception:
org.apache.spark.sql.AnalysisException: Non-local session path expected to
be non-null;
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:260)
.....
I cannot paste the entire stack trace since it's on a company laptop and I am
not allowed to copy-paste things! Though if it's absolutely needed, I can
figure out some way to provide it.
On Sat, Oct 17, 2015 at 1:13 AM, Xiao Li <[email protected]> wrote:
> Hi, Shyam,
>
> The method registerTempTable registers a DataFrame as a temporary
> table in the Catalog under the given table name.
>
> In the Catalog, Spark maintains a concurrent hash map that maps table
> names to their logical plans.
>
> For example, when we submit the following query,
>
> SELECT * FROM inMemoryDF
>
> The concurrent hash map contains one entry mapping the table name to its
> logical plan:
>
> "inMemoryDF" -> "LogicalRDD [c1#0,c2#1,c3#2,c4#3], MapPartitionsRDD[1] at
> createDataFrame at SimpleApp.scala:42
>
> Therefore, using SQL will not hurt your performance. The actual physical
> plan for executing your SQL query is generated by the Catalyst optimizer.
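Xiao's point is that registering a temp table only records a name-to-plan entry; no data is materialized. A plain-Scala toy sketch of that idea (this is illustrative only, not Spark's actual Catalog class; the names `registerTempTable`/`lookup` and the plan strings here are made up for the example):

```scala
import scala.collection.concurrent.TrieMap

// Toy "catalog": a concurrent map from table name to a (stringified)
// logical plan. Registration just inserts an entry; nothing is computed.
val catalog = TrieMap.empty[String, String]

def registerTempTable(name: String, plan: String): Unit =
  catalog.put(name, plan)

def lookup(name: String): Option[String] = catalog.get(name)

// Registering records the pair; a later query would resolve the name
// through this map to recover the plan.
registerTempTable("inMemoryDF", "LogicalRDD [c1#0,c2#1,c3#2,c4#3]")
```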
>
> Good luck,
>
> Xiao Li
>
>
>
> 2015-10-16 20:53 GMT-07:00 Shyam Parimal Katti <[email protected]>:
>
>> Thanks Xiao! Question about the internals: would you know what happens
>> when createTempTable() is called? I.e., does it create an RDD internally,
>> or some internal representation that lets it join with an RDD?
>>
>> Again, thanks for the answer.
>> On Oct 16, 2015 8:15 PM, "Xiao Li" <[email protected]> wrote:
>>
>>> Hi, Shyam,
>>>
>>> You can still use SQL to do the same thing in Spark:
>>>
>>> For example,
>>>
>>> val df1 = sqlContext.createDataFrame(rdd)
>>> val df2 = sqlContext.createDataFrame(rdd2)
>>> val df3 = sqlContext.createDataFrame(rdd3)
>>> df1.registerTempTable("tab1")
>>> df2.registerTempTable("tab2")
>>> df3.registerTempTable("tab3")
>>> val exampleSQL = sqlContext.sql("select * from tab1, tab2, tab3
>>> where tab1.name = tab2.name and tab2.id = tab3.id")
>>>
>>> Good luck,
>>>
>>> Xiao Li
>>>
>>> 2015-10-16 17:01 GMT-07:00 Shyam Parimal Katti <[email protected]>:
>>>
>>>> Hello All,
>>>>
>>>> I have the following SQL query:
>>>>
>>>> select a.a_id, b.b_id, c.c_id from table_a a join table_b b on a.a_id =
>>>> b.a_id join table_c c on b.b_id = c.b_id
>>>>
>>>> In Scala I have done this so far:
>>>>
>>>> val table_a_rdd = sc.textFile(...)
>>>> val table_b_rdd = sc.textFile(...)
>>>> val table_c_rdd = sc.textFile(...)
>>>>
>>>> val table_a_rowRDD = table_a_rdd.map(_.split("\\x07")).map(line =>
>>>> (line(0), line))
>>>> val table_b_rowRDD = table_b_rdd.map(_.split("\\x07")).map(line =>
>>>> (line(0), line))
>>>> val table_c_rowRDD = table_c_rdd.map(_.split("\\x07")).map(line =>
>>>> (line(0), line))
>>>>
>>>> Each line has the first value as its primary key.
>>>>
>>>> While I can join 2 RDDs using table_a_rowRDD.join(table_b_rowRDD), is it
>>>> possible to join multiple RDDs in a single expression, like
>>>> table_a_rowRDD.join(table_b_rowRDD).join(table_c_rowRDD)? Also, how can I
>>>> specify the column on which to join multiple RDDs?
>>>>
>>>>
>>>>
>>>>
>>>>
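On the chaining question in the original message: keyed joins do compose, and `RDD.join` always joins on the key (the first element of each `(K, V)` pair), so "which column" is controlled by how you key the pairs, re-keying between joins if the join column changes. A plain-Scala sketch of those semantics (illustrative names and data; `RDD.join` behaves the same way on pair RDDs):

```scala
// A keyed inner join over plain Seq[(K, V)] pairs, mirroring the
// semantics of RDD.join: match rows whose keys are equal.
def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
  for {
    (k1, a) <- left
    (k2, b) <- right
    if k1 == k2
  } yield (k1, (a, b))

// Toy tables, keyed by the column to join on.
val tableA = Seq(("id1", "a-row"), ("id2", "a-row2"))
val tableB = Seq(("id1", "b-row"))
val tableC = Seq(("id1", "c-row"))

// Chained joins compose; each step joins on the current key, so you
// would re-key (map to a new (key, value) shape) if the next join
// should use a different column, e.g. b_id instead of a_id.
val joined = join(join(tableA, tableB), tableC)
```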
>>>
>