Okay, I'm really confused, haha; some things are working, but I'm not sure
why.


So this works:

hc.sql("CREATE EXTERNAL TABLE IF NOT EXISTS *test3*(date int, date_time
string, time string, sensor int, value int) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' LOCATION
'hdfs://ip-10-0-2-216.us-west-1.compute.internal:8020/user/flume/'")

val results = hc.sql("select * from *test3*")

val schemaString = "x, y, z, a, value"

import org.apache.spark.sql._

val schema = StructType(schemaString.split(", ").map(fieldName =>
StructField(fieldName, StringType, true)))

val wikiSchemaRDD = sqlContext.applySchema(results, schema)

wikiSchemaRDD.registerTempTable("*test4*")


*z.show(sqlContext.sql("select * from test4 limit 1001"))*

Why does this last line work, but %sql select * from test4 does not?
Zeppelin is able to recognize test4, which was only registered as a temp
table, so why can't I use %sql on it?

Why can't I do *z.show(sqlContext.sql("select * from results limit 1001"))*?

The only difference I see between test4 and results is that test4 had a
schema applied to it (though results is of type org.apache.spark.sql.Row, not
catalyst.Row).
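
For reference, here's roughly what I'd expect to need in order to query
results directly (just a sketch, untested; "results_tbl" is a name I made
up):

results.registerTempTable("results_tbl")
z.show(hc.sql("select * from results_tbl limit 1001"))

Since results came from hc, I'm assuming that temp table would land in hc's
catalog rather than sqlContext's.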

Sorry for all the questions, thanks for the answers!


On Tue, Jun 30, 2015 at 6:38 PM, Su She <suhsheka...@gmail.com> wrote:

> Thanks for the suggestion Moon; unfortunately, I got the
> InvocationTargetException:
>
> *sqlContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS test3(b int, w
> string, a string, x int, y int) ROW FORMAT DELIMITED FIELDS TERMINATED BY
> ',' LOCATION 'hdfs://ip:8020/user/flume/'")*
>
>
> *sqlContext.sql("select * from test3").registerTempTable("test1")*
>
>
> *%sql select * from test1*
>
> *java.lang.reflect.InvocationTargetException*
>
> To clarify, my data is in a csv format within the directory /user/flume/
>
> so:
>
> user/flume/csv1.csv
> user/flume/csv2.csv
>
> The reason I created HiveContext is that when I do:
>
> *hc.sql("CREATE EXTERNAL TABLE IF NOT EXISTS test3(date int, date_time
> string, time string, sensor int, value int) ROW FORMAT DELIMITED FIELDS
> TERMINATED BY ',' LOCATION 'hdfs://ip:8020/user/flume/'")*
>
> *val results = hc.sql("select * from test3")*
>
> *results.take(10) // I am able to get the results, but when I replace hc
> with sqlContext, I can't do results.take()*
>
> *This last line returns an Array of Rows*
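>
> My guess (not verified) is that the plain SQLContext has no Hive metastore,
> so the table created through hc simply isn't visible to it, something like:
>
> sqlContext.sql("select * from test3").take(10) // presumably fails: table not found
> hc.sql("select * from test3").take(10)         // works, returns an Array of Rows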
>
> Please let me know if I am doing anything wrong, thanks!
>
>
>
>
> On Tue, Jun 30, 2015 at 11:07 AM, moon soo Lee <m...@apache.org> wrote:
>
>> Hi,
>>
>> Please try not to create hc or sqlContext manually; use the sqlContext
>> that Zeppelin creates. After running
>>
>> sqlContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS test1(x string, y
>> string, time string, z int, v int) ROW FORMAT DELIMITED FIELDS TERMINATED
>> BY ',' LOCATION 'hdfs://.us-west-1.compute.internal:8020/user/flume/'")
>>
>> you should be able to access table 'test1' in your query.
>>
>> You can also call registerTempTable("test2") to access a table 'test2',
>> but it has to be called on a valid DataFrame. So not
>>
>> sqlContext.sql("CREATE EXTERNAL TABLE .... ").registerTempTable("test2")
>>
>> but like this
>>
>> sqlContext.sql("select * from test1").registerTempTable("test2")
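>>
>> In other words, a minimal sequence (just a sketch, reusing your table
>> names) would be: run the CREATE EXTERNAL TABLE statement above with the
>> zeppelin sqlContext, then
>>
>> sqlContext.sql("select * from test1").registerTempTable("test2")
>>
>> and then, in a separate paragraph,
>>
>> %sql select * from test2
>>
>> should be able to find 'test2'.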
>>
>> Tell me if it helps.
>>
>> Best,
>> moon
>>
>>
>> On Mon, Jun 29, 2015 at 12:05 PM Su She <suhsheka...@gmail.com> wrote:
>>
>>> Hey Moon/All,
>>>
>>> sorry for the late reply.
>>>
>>> This is the problem I'm encountering when trying to register a Hive table
>>> as a temp table. It seems that it cannot find a table; I have bolded this in the
>>> error message that I've copied below. Please let me know if this is the best
>>> way of doing this. My end goal is to execute:
>>>
>>> *z.show(hc.sql("select * from test1"))*
>>>
>>> Thank you for the help!
>>>
>>> *//Code:*
>>> import sys.process._
>>> import org.apache.spark.sql.hive._
>>> val hc = new HiveContext(sc)
>>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>>
>>> hc.sql("CREATE EXTERNAL TABLE IF NOT EXISTS test1(x string, y string,
>>> time string, z int, v int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
>>> LOCATION
>>> 'hdfs://.us-west-1.compute.internal:8020/user/flume/'").registerTempTable("test2")
>>>
>>> val results = hc.sql("select * from test2 limit 100") //have also tried
>>> test1
>>>
>>> *//everything works fine up to here, but due to lazy evaluation, I guess
>>> that doesn't mean much*
>>> results.map(t => "Name: " + t(0)).collect().foreach(println)
>>>
>>> results: org.apache.spark.sql.SchemaRDD = SchemaRDD[41] at RDD at
>>> SchemaRDD.scala:108 == Query Plan == == Physical Plan == Limit 100 !Project
>>> [result#105] NativeCommand CREATE EXTERNAL TABLE IF NOT EXISTS test1(date
>>> int, date_time string, time string, sensor int, value int) ROW FORMAT
>>> DELIMITED FIELDS TERMINATED BY ',' LOCATION
>>> 'hdfs://ip-10-0-2-216.us-west-1.compute.internal:8020/user/flume/',
>>> [result#112] org.apache.spark.SparkException: Job aborted due to stage
>>> failure: Task 0 in stage 7.0 failed 1 times, most recent failure: Lost task
>>> 0.0 in stage 7.0 (TID 4, localhost):
>>> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding
>>> attribute, tree: result#105 at
>>> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
>>> at
>>> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:47)
>>> at
>>> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:46)
>>> at
>>> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
>>> at
>>> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
>>> at
>>> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:46)
>>> at
>>> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:54)
>>> at
>>> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:54)
>>> at
>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>> at
>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>> at
>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at
>>> scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at
>>> scala.collection.AbstractTraversable.map(Traversable.scala:105) at
>>> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.<init>(Projection.scala:54)
>>> at
>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$newMutableProjection$1.apply(SparkPlan.scala:105)
>>> at
>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$newMutableProjection$1.apply(SparkPlan.scala:105)
>>> at
>>> org.apache.spark.sql.execution.Project$$anonfun$1.apply(basicOperators.scala:44)
>>> at
>>> org.apache.spark.sql.execution.Project$$anonfun$1.apply(basicOperators.scala:43)
>>> at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618) at
>>> org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618) at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at
>>> org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at
>>> org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>> at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:56) at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:745) *Caused by:
>>> java.lang.RuntimeException: Couldn't find result#105 in [result#112]*
>>> at scala.sys.package$.error(package.scala:27) at
>>> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:53)
>>> at
>>> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:47)
>>> at
>>> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
>>> ... 33 more
>>>
>>> Thank you!
>>>
>>> On Thu, Jun 25, 2015 at 11:51 AM, moon soo Lee <m...@apache.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> Yes, the %sql function only works on tables that have been registered.
>>>> Using DataFrames is basically similar to what you're currently doing; it
>>>> still needs registerTempTable.
>>>>
>>>> Could you share a little bit about your problem when registering tables?
>>>>
>>>> And I really appreciate you reporting the bug!
>>>>
>>>> Thanks,
>>>> moon
>>>>
>>>> On Wed, Jun 24, 2015 at 11:28 PM Corneau Damien <cornead...@apache.org>
>>>> wrote:
>>>>
>>>>> Yes, you can change the number of records. The default value is 1000
>>>>>
>>>>> On Thu, Jun 25, 2015 at 2:32 PM, Nihal Bhagchandani <
>>>>> nihal_bhagchand...@yahoo.com> wrote:
>>>>>
>>>>>> Hi Su,
>>>>>>
>>>>>> As per my understanding, you can change the 1000-record limit from the
>>>>>> interpreter settings by setting the value of the property
>>>>>> "zeppelin.spark.maxResult".
>>>>>> Moon, could you please confirm my understanding?
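>>>>>>
>>>>>> For example (just illustrative, the value is arbitrary): in the
>>>>>> Interpreter menu, edit the spark interpreter, set
>>>>>>
>>>>>> zeppelin.spark.maxResult = 10000
>>>>>>
>>>>>> and restart the interpreter so the new value takes effect.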
>>>>>>
>>>>>> Regards
>>>>>> Nihal
>>>>>>
>>>>>>
>>>>>>
>>>>>>   On Thursday, 25 June 2015 10:00 AM, Su She <suhsheka...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Hello Everyone,
>>>>>>
>>>>>> Excited to be making progress, and thanks to the community for
>>>>>> providing help along the way. This stuff is all really cool.
>>>>>>
>>>>>>
>>>>>> *Questions:*
>>>>>>
>>>>>> *1) *I noticed that the limit for the visual representation is 1000
>>>>>> results. Are there any short-term plans to expand the limit? It seems a
>>>>>> little on the low side, since one of the main reasons for working with
>>>>>> Spark/Hadoop is to work with large datasets.
>>>>>>
>>>>>> *2) *When can I use the %sql function? Is it only on tables that
>>>>>> have been registered? I have been having trouble registering tables 
>>>>>> unless
>>>>>> I do:
>>>>>>
>>>>>> // Apply the schema to the RDD.
>>>>>> val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
>>>>>> // Register the SchemaRDD as a table.
>>>>>> peopleSchemaRDD.registerTempTable("people")
>>>>>>
>>>>>>
>>>>>> I am having lots of trouble registering tables through HiveContext, or
>>>>>> even duplicating the Zeppelin tutorial; is this issue mitigated by using
>>>>>> DataFrames (I am planning to move to 1.3 very soon)?
>>>>>>
>>>>>>
>>>>>> *Bug:*
>>>>>>
>>>>>> When I do this:
>>>>>> z.show(sqlContext.sql("select * from sensortable limit 100"))
>>>>>>
>>>>>> I get the table, but I also get text results at the bottom; please see
>>>>>> the attached image. In case the image doesn't go through: I basically get
>>>>>> the table, and everything works well, but the select statement also
>>>>>> returns text (regardless of whether it's 100 results or all of them).
>>>>>>
>>>>>>
>>>>>> Thank you !
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Su
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>
