Hi again!

Spark works, Hive works, %sh works!

But when I try to use %pyspark:

%pyspark
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
people = sqlContext.read.format("orc").load("peoplePartitioned")
people.filter(people.age < 15).select("name").show()

the following error comes up:
Traceback (most recent call last):
 File "/tmp/zeppelin_pyspark.py", line 178, in <module>
   eval(compiledCode)
 File "<string>", line 1, in <module>
 File "/usr/spark/python/pyspark/sql/context.py", line 632, in read
   return DataFrameReader(self)
 File "/usr/spark/python/pyspark/sql/readwriter.py", line 49, in __init__
   self._jreader = sqlContext._ssql_ctx.read()
 File "/usr/spark/python/pyspark/sql/context.py", line 660, in _ssql_ctx
   "build/sbt assembly", e)
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and
run build/sbt assembly", Py4JJavaError(u'An error occurred while calling
None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o56))


Is there some specific name for sqlContext in %pyspark?

Or should I really rebuild Spark?
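Meanwhile, since moon mentioned that only the first HiveContext works when Derby is the metastore, I will also check for leftover interpreter processes before restarting. A small sketch (plain Python, written just for this thread; it assumes the standard `ps -ef` column layout with the PID in the second column):

```python
import subprocess

def stale_interpreter_pids(ps_output=None):
    """Return PIDs of RemoteInterpreterServer processes in `ps -ef` output.

    With Derby as the Hive metastore, a leftover interpreter process from a
    previous Zeppelin run can hold the metastore lock and make every new
    HiveContext fail, so these PIDs should be stopped before restarting.
    """
    if ps_output is None:
        ps_output = subprocess.check_output(["ps", "-ef"]).decode()
    pids = []
    for line in ps_output.splitlines():
        # Skip grep's own line; the PID is the second whitespace-separated field.
        if "RemoteInterpreterServer" in line and "grep" not in line:
            fields = line.split()
            if len(fields) > 1 and fields[1].isdigit():
                pids.append(int(fields[1]))
    return pids
```

Running print(type(sqlContext)) in a %pyspark paragraph should also show whether Zeppelin injected a HiveContext or fell back to a plain SQLContext.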

Best regards.
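P.S. Before rebuilding, I will check whether the current Spark assembly already contains Hive support, which is what the exception complains about. A sketch in plain Python (the jar path is a guess based on SPARK_HOME=/usr/spark from earlier in this thread; the Maven profiles -Phive -Phive-thriftserver are the usual ones for Spark 1.x builds, not something confirmed here):

```python
import glob
import zipfile

def spark_has_hive(assembly_glob="/usr/spark/lib/spark-assembly-*.jar"):
    """True if a matching Spark assembly jar contains the HiveContext class.

    If this returns False, Spark was built without Hive support and needs a
    rebuild (for Spark 1.x, typically with -Phive -Phive-thriftserver).
    """
    for jar_path in glob.glob(assembly_glob):
        # A jar is just a zip archive; scan its entry names for HiveContext.
        with zipfile.ZipFile(jar_path) as jar:
            if any(name.startswith("org/apache/spark/sql/hive/HiveContext")
                   for name in jar.namelist()):
                return True
    return False
```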


On Tue, Nov 24, 2015 at 10:51 PM, moon soo Lee <m...@apache.org> wrote:

> Really appreciate you trying it out.
>
> About HiveContext (sqlContext):
> Zeppelin creates sqlContext and injects it by default,
> so you don't need to create it manually.
>
> If multiple sqlContexts (HiveContexts) are created with Derby as the
> metastore, only the first one works and all the others fail.
>
> Therefore, it would help to
>  - make sure no unnecessary interpreter processes (ps -ef | grep
> RemoteInterpreterServer) remain from a previous Zeppelin run.
>  - avoid creating sqlContext manually.
>
> Thanks,
> moon
>
> On Wed, Nov 25, 2015 at 3:32 AM tsh <t...@timshenkao.su> wrote:
>
>> Hi!
>> A couple of days ago I tested Zeppelin on my laptop, with Cloudera
>> Hadoop in pseudodistributed mode and Spark Standalone. I ran into a
>> fasterxml.jackson problem. Eric Charles said he had a similar problem
>> and advised removing the jackson-*.jar libraries from the lib folder,
>> so I did. I also tweaked the parameters in zeppelin-env.sh to make
>> Zeppelin work locally.
>>
>> On Monday, when I got to work, it became clear that the configuration
>> parameters for a local installation and for a real cluster
>> installation differ greatly. And I got this Thrift Transport Exception.
>> Over two days I rebuilt Zeppelin several times, checked all the
>> parameters, and checked & changed my network. At last, when I received
>> your letter, I checked the MASTER variable. And I remembered those
>> deleted *.jar files; I realized they were links in the chain. I copied
>> them back to the lib folder. And Spark began to work!
>> But Spark SQL doesn't work: DataFrames can't load & write ORC files. It
>> gives some HiveContext error connected to metastore_db (Derby). Either
>> Hive itself (which sits on the same edge node as Zeppelin) has its own
>> Derby metastore_db, or I should delete metastore_db from
>> $ZEPPELIN_HOME/bin. Should I?
>> The code is
>> %spark
>> import org.apache.spark.sql._
>> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>
>> The import succeeds; then I get the error.
>>
>>
>>
>>
>> On 11/24/2015 07:39 PM, moon soo Lee wrote:
>>
>> Basically, if SPARK_HOME/bin/spark-shell works, then exporting
>> SPARK_HOME in conf/zeppelin-env.sh and setting the 'master' property in
>> the Interpreter menu on the Zeppelin GUI should be enough to connect to
>> your Spark standalone cluster.
>>
>> Do you see any new exception in your log file when you set the 'master'
>> property in the Interpreter menu on the Zeppelin GUI and get the
>> 'Scheduler already Terminated' error? If you can share it, that would
>> be helpful.
>>
>> Once built, Zeppelin does not use HiveThriftServer2 and needs no
>> dependency other than a JVM to run.
>>
>>
>> Thanks,
>> moon
>>
>> On Tue, Nov 24, 2015 at 11:37 PM Timur Shenkao <t...@timshenkao.su> wrote:
>>
>>> One more question: what should be installed on the server? What are
>>> Zeppelin's dependencies? Node.js, npm, bower? Scala?
>>>
>>> On Tue, Nov 24, 2015 at 5:34 PM, Timur Shenkao <t...@timshenkao.su> wrote:
>>>
>>> > I also checked the Spark workers. There are no Zeppelin traces,
>>> > folders, or logs on them; Zeppelin logs exist only on the Spark
>>> > Master server, where Zeppelin is launched.
>>> >
>>> > For example, H2O creates logs on every worker in folders
>>> > /usr/spark/work/app-.....-... Is that correct?
>>> >
>>> > I also launched the Thrift server via
>>> > /usr/spark/sbin/start-thriftserver.sh on the Spark Master. Does
>>> > Zeppelin use org.apache.spark.sql.hive.thriftserver.HiveThriftServer2?
>>> > For the terminated scheduler, I got
>>> > INFO [2015-11-24 16:26:16,610] ({pool-1-thread-2} SchedulerFactory.java[jobFinished]:138) - Job paragraph_1448346$
>>> > ERROR [2015-11-24 16:26:17,658] ({Thread-34} JobProgressPoller.java[run]:57) - Can not get or update progress
>>> > org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException
>>> >         at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getProgress(RemoteInterpreter.java:302)
>>> >         at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getProgress(LazyOpenInterpreter.java:110)
>>> >         at org.apache.zeppelin.notebook.Paragraph.progress(Paragraph.java:174)
>>> >         at org.apache.zeppelin.scheduler.JobProgressPoller.run(JobProgressPoller.java:54)
>>> > Caused by: org.apache.thrift.transport.TTransportException
>>> >         at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
>>> >         at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
>>> >         at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
>>> >         at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
>>> >         at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
>>> >         at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
>>> >         at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_getProgress(RemoteInterpret$
>>> >         at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.getProgress(RemoteInterpreterSer$
>>> > INFO [2015-11-24 16:26:52,617] ({qtp982007015-52} InterpreterRestApi.java[updateSetting]:104) - Update interprete$
>>> > INFO [2015-11-24 16:27:56,319] ({qtp982007015-48} InterpreterRestApi.java[restartSetting]:143) - Restart interpre$
>>> > ERROR [2015-11-24 16:28:09,603] ({qtp982007015-48} NotebookServer.java[runParagraph]:661) - Exception from run
>>> > java.lang.RuntimeException: Scheduler already terminated
>>> >         at org.apache.zeppelin.scheduler.RemoteScheduler.submit(RemoteScheduler.java:124)
>>> >         at org.apache.zeppelin.notebook.Note.run(Note.java:326)
>>> >         at org.apache.zeppelin.socket.NotebookServer.runParagraph(NotebookServer.java:659)
>>> >         at org.apache.zeppelin.socket.NotebookServer.onMessage(NotebookServer.java:126)
>>> >         at org.apache.zeppelin.socket.NotebookSocket.onMessage(NotebookSocket.java:56)
>>> >         at org.eclipse.jetty.websocket.WebSocketConnectionRFC6455$WSFrameHandler.onFrame(WebSocketConnectionRFC645$
>>> >         at org.eclipse.jetty.websocket.WebSocketParserRFC6455.parseNext(WebSocketParserRFC6455.java:349)
>>> >         at org.eclipse.jetty.websocket.WebSocketConnectionRFC6455.handle(WebSocketConnectionRFC6455.java:225)
>>> >         at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
>>> >         at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
>>> >         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>>> >         at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>>> >         at java.lang.Thread.run(Thread.java:745)
>>> > ERROR [2015-11-24 16:28:36,906] ({qtp982007015-50} NotebookServer.java[runParagraph]:661) - Exception from run
>>> > java.lang.RuntimeException: Scheduler already terminated
>>> >         at org.apache.zeppelin.scheduler.RemoteScheduler.submit(RemoteScheduler.java:124)
>>> >         at org.apache.zeppelin.notebook.Note.run(Note.java:326)
>>> >         at org.apache.zeppelin.socket.NotebookServer.runParagraph(NotebookServer.java:659)
>>> >         at org.apache.zeppelin.socket.NotebookServer.onMessage(NotebookServer.java:126)
>>> >         at org.apache.zeppelin.socket.NotebookSocket.onMessage(NotebookSocket.java:56)
>>> >
>>> >
>>> >
>>> >
>>> > On Tue, Nov 24, 2015 at 4:50 PM, Timur Shenkao <t...@timshenkao.su>
>>> wrote:
>>> >
>>> >> Hello!
>>> >>
>>> >> There is no Kerberos, no security in my cluster. It's in an internal
>>> >> network.
>>> >>
>>> >> The %hive and %sh interpreters work: I can create tables, drop
>>> >> them, run pwd, etc. So the problem is in the integration with Spark.
>>> >>
>>> >> In /usr/spark/conf/spark-env.sh on the master node I set / unset in
>>> >> turn MASTER=spark://localhost:7077, MASTER=spark://192.168.58.10:7077,
>>> >> and MASTER=spark://127.0.0.1:7077. On the slaves I set / unset
>>> >> MASTER=spark://192.168.58.10:7077 in different combinations.
>>> >>
>>> >> Zeppelin is installed on the same machine as the Spark Master. So
>>> >> in zeppelin-env.sh I set / unset MASTER=spark://localhost:7077,
>>> >> MASTER=spark://192.168.58.10:7077, and MASTER=spark://127.0.0.1:7077.
>>> >> Yes, I can connect to 192.168.58 and see URL spark://192.168.58:7077
>>> >> and REST URL spark://192.168.58:6066 (cluster mode).
>>> >>
>>> >> Does the TCP socket type matter? On my laptop, in pseudodistributed
>>> >> mode, all connections are IPv4 (tcp), and there are only IPv4 lines
>>> >> in /etc/hosts. On the cluster, Spark automatically, for unknown
>>> >> reasons, uses IPv6 (tcp6), and there are IPv6 lines in /etc/hosts.
>>> >> Right now I am trying to make Spark use IPv4.
>>> >>
>>> >> I switched Spark to IPv4 via -Djava.net.preferIPv4Stack=true
>>> >>
>>> >> It seems that Zeppelin uses / answers on the following ports on the
>>> >> master server (ps axu | grep zeppelin; then for each PID,
>>> >> netstat -natp | grep ...):
>>> >> 41303
>>> >> 46971
>>> >> 59007
>>> >> 35781
>>> >> 53637
>>> >> 34860
>>> >> 59793
>>> >> 46971
>>> >> 50676
>>> >> 50677
>>> >>
>>> >> 44341
>>> >> 50805
>>> >> 50803
>>> >> 50802
>>> >>
>>> >> 60886
>>> >> 43345
>>> >> 48415
>>> >> 48417
>>> >> 10000
>>> >> 48416
>>> >>
>>> >> Best regards
>>> >>
>>> >> P.S. I put the precise address from the Spark page into
>>> >> zeppelin-env.sh and into the Spark interpreter configuration in the
>>> >> web UI: MASTER=spark://192.168.58.10:7077.
>>> >> Earlier I got a Java stack trace in the web UI. Now I have BEGUN to
>>> >> receive "Scheduler already terminated".
>>> >>
>>> >> On Tue, Nov 24, 2015 at 12:56 PM, moon soo Lee <m...@apache.org>
>>> wrote:
>>> >>
>>> >>> Thanks for sharing the problem.
>>> >>>
>>> >>> Based on your log file, it looks like your Spark master address is
>>> >>> somehow not configured correctly.
>>> >>>
>>> >>> Can you confirm that you have also set the 'master' property in the
>>> >>> Interpreter menu on the GUI, in the spark section?
>>> >>>
>>> >>> If not, you can open the Spark Master UI in your web browser and
>>> >>> look at the first line, "Spark Master at spark://....". That value
>>> >>> should go into the 'master' property in the Interpreter menu on the
>>> >>> GUI, in the spark section.
>>> >>>
>>> >>> Hope this helps
>>> >>>
>>> >>> Best,
>>> >>> moon
>>> >>>
>>> >>> On Tue, Nov 24, 2015 at 3:07 AM Timur Shenkao <t...@timshenkao.su>
>>> wrote:
>>> >>>
>>> >>>> Hi!
>>> >>>>
>>> >>>> A new error has appeared: TTransportException.
>>> >>>> I use CentOS 6.7 + Spark 1.5.2 Standalone + Cloudera Hadoop 5.4.8
>>> >>>> on the same cluster. I can't use Mesos or Spark on YARN.
>>> >>>> I built Zeppelin 0.6.0 like this:
>>> >>>> mvn clean package -DskipTests -Pspark-1.5 -Phadoop-2.6 -Pyarn
>>> >>>> -Ppyspark -Pbuild-distr
>>> >>>>
>>> >>>> I constantly get errors like
>>> >>>> ERROR [2015-11-23 18:14:33,404] ({pool-1-thread-4} Job.java[run]:183) - Job failed
>>> >>>> org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException
>>> >>>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:237)
>>> >>>>
>>> >>>>
>>> >>>> or
>>> >>>>
>>> >>>> ERROR [2015-11-23 18:07:26,535] ({Thread-11}
>>> >>>> RemoteInterpreterEventPoller.java[run]:72) - Can't get
>>> >>>> RemoteInterpreterEvent
>>> >>>> org.apache.thrift.transport.TTransportException
>>> >>>>
>>> >>>> I changed several parameters in zeppelin-env.sh and in the Spark
>>> >>>> configs. Whatever I do, these errors keep coming. At the same
>>> >>>> time, when I use local Zeppelin with Hadoop in pseudodistributed
>>> >>>> mode + Spark Standalone (Master + workers on the same machine),
>>> >>>> everything works.
>>> >>>>
>>> >>>> What configuration (memory, network, CPU cores) is required for
>>> >>>> Zeppelin to work?
>>> >>>>
>>> >>>> I launch H2O on this cluster. And it works.
>>> >>>> Spark Master config:
>>> >>>> SPARK_MASTER_WEBUI_PORT=18080
>>> >>>> HADOOP_CONF_DIR=/etc/hadoop/conf
>>> >>>> SPARK_HOME=/usr/spark
>>> >>>>
>>> >>>> Spark Worker config:
>>> >>>>    export HADOOP_CONF_DIR=/etc/hadoop/conf
>>> >>>>    export MASTER=spark://192.168.58.10:7077
>>> >>>>    export SPARK_HOME=/usr/spark
>>> >>>>
>>> >>>>    SPARK_WORKER_INSTANCES=1
>>> >>>>    SPARK_WORKER_CORES=4
>>> >>>>    SPARK_WORKER_MEMORY=32G
>>> >>>>
>>> >>>>
>>> >>>> I attach the Spark configs, plus the Zeppelin configs & logs for
>>> >>>> local mode, plus the Zeppelin configs & logs from when I defined
>>> >>>> the Spark Master's IP address explicitly.
>>> >>>> Thank you.
>>> >>>>
>>> >>>
>>> >>
>>> >
>>>
>>
>>
