Found the solution! Instead of using 'val sqlContext = new org.apache.spark.sql.SQLContext(sc)', I need to use the 'sqlContext' instance that Zeppelin already provides.
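A minimal sketch of the corrected first paragraph, assuming it runs inside a Zeppelin note where `sc` and `sqlContext` are pre-injected by the Spark interpreter (the variable names below otherwise follow the code quoted further down in this thread):

```scala
// Zeppelin injects `sc` and `sqlContext` into every Spark paragraph.
// Creating a second SQLContext keeps its temp tables invisible to the
// %sql interpreter, which binds to the injected one.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

@transient val suffixesStr = "hello\nworld"
@transient val suffList = suffixesStr.lines.filter(line =>
  !line.startsWith("//") && line.trim() != "")
val suffListRDD = sc.parallelize(suffList.toList)

val schema = StructType(Seq(StructField("suffix", StringType, true)))
// split with -1 to preserve trailing empty tokens
val rowRDD = suffListRDD.map(_.split("\\t", -1)).map(p => Row(p(0)))

// Register against the Zeppelin-provided sqlContext, NOT a new one
val eventsDF = sqlContext.createDataFrame(rowRDD, schema)
eventsDF.registerTempTable("events")
```

With this change, a separate `%sql SELECT count(*) FROM events` paragraph can resolve the table, because both paragraphs now share the same SQLContext.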
Based on http://stackoverflow.com/questions/30819075/zeppelin-sqlcontext-registertemptable-issue

2015-09-09 16:40 GMT+04:00 Eugene <[email protected]>:

> I use spark 1.4.1 on EMR cluster with YARN
>
> Sharing my code:
>
> PARAGRAPH #1
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import scala.io.Source
> @transient val suffixesStr = "hello\nworld"
> @transient val suffList = suffixesStr.lines.filter(line =>
>   !line.startsWith("//") && line.trim() != "")
> val suffListRDD = sc.parallelize(suffList.toList)
> val schemaString = "suffix"
> val schemaItemsCount = schemaString.count(_ == ' ') + 1
> // filtering to get only valid rows
> val eventsRDD = suffListRDD.filter(x =>
>   "\\t".r.findAllIn(x).length + 1 == schemaItemsCount)
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StructType, StructField, StringType}
> val schema = StructType(schemaString.split(" ").map(fieldName =>
>   StructField(fieldName, StringType, true)))
> // Convert records of the RDD (eventsRDD) to Rows.
> // split with preserve tokens
> val rowRDD = eventsRDD.map(_.split("\\t", -1)).map(p => Row(p(0)))
> val eventsDF = sqlContext.createDataFrame(rowRDD, schema)
> eventsDF.registerTempTable("events")
> // Uncomment these if you would like to verify the logic in the same paragraph
> //val cnt = sqlContext.sql("SELECT count(*) from events")
> //cnt.take(10).foreach(println)
>
> ----------------OUTPUT---------------
> sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@36e407de
> import scala.io.Source
> suffixesStr: String = hello world
> suffList: Iterator[String] = non-empty iterator
> suffListRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[3] at parallelize at <console>:28
> schemaString: String = suffix
> schemaItemsCount: Int = 1
> eventsRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at filter at <console>:36
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StructType, StructField, StringType}
> schema: org.apache.spark.sql.types.StructType = StructType(StructField(suffix,StringType,true))
> rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[6] at map at <console>:41
> eventsDF: org.apache.spark.sql.DataFrame = [suffix: string]
>
> PARAGRAPH #2
> %sql SELECT count(*) from events
>
> ----------------OUTPUT---------------
> org.apache.spark.sql.AnalysisException: no such table events; line 1 pos 21
>   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>
> As you asked, I also checked the bank example.
It works just fine:

> PARAGRAPH #1
> @transient val suffixesStr = "hello;world"
> @transient val suffList = suffixesStr.lines.filter(line =>
>   !line.startsWith("//") && line.trim() != "")
> val bankText = sc.parallelize(suffList.toList)
> case class Bank(f1: String, f2: String)
> val bank = bankText.map(s => s.split(";")).map(s => Bank(s(0), s(1)))
> bank.toDF().registerTempTable("bank")
>
> ----------------OUTPUT---------------
> suffixesStr: String = hello;world
> suffList: Iterator[String] = non-empty iterator
> bankText: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[11] at parallelize at <console>:30
> defined class Bank
> bank: org.apache.spark.rdd.RDD[Bank] = MapPartitionsRDD[13] at map at <console>:34
>
> PARAGRAPH #2
> %sql select count(*) from bank
>
> ----------------OUTPUT---------------
> 1
>
> What could be the root cause?
>
> 2015-09-09 15:50 GMT+04:00 IT CTO <[email protected]>:
>
>> I just tested it in my code and had no problem doing so.
>> Can you share the code before the line eventsDF.registerTempTable("events")?
>> Can you run the "Load Data Into Table" paragraph in the Zeppelin Tutorial
>> and then try to run the select from bank in another note and see if it
>> works?
>>
>> Generally speaking, since they both run in the same JVM and share the
>> same Spark context, you shouldn't have a problem doing what you described.
>>
>> Thanks,
>> Eran
>>
>> On Wed, Sep 9, 2015 at 2:30 PM Eugene <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> In the promo video tutorial for Zeppelin
>>> (http://www.youtube.com/watch?v=_PQbVH_aO5E&feature=youtu.be) I see a
>>> demonstration of how to declare a table 'bank' in one note and query
>>> that table in another note.
>>> I tried doing this and it did not work.
>>>
>>> I have one notebook with two notes:
>>>
>>> Note #1: // basically all the data definition for the second note
>>> eventsDF.registerTempTable("events")
>>>
>>> Note #2: // querying the defined table
>>> %sql select * from events limit 10
>>>
>>> Note #1 executes okay; Note #2 fails with
>>> org.apache.spark.sql.AnalysisException: no such table events;
>>>
>>> How should I proceed to make my table available to Note #2?
>>>
>>> Thanks!
>>>
>>> --
>>> Best regards,
>>> Eugene.
>>
>> --
>> Eran | "You don't need eyes to see, you need vision" (Faithless)
>
> --
> Best regards,
> Eugene.

--
Best regards,
Eugene.
