Hi again! Spark works, Hive works, %sh works!
But when I try to use %pyspark:

%pyspark
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
people = sqlContext.read.format("orc").load("peoplePartitioned")
people.filter(people.age < 15).select("name").show()

this error comes:

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark.py", line 178, in <module>
    eval(compiledCode)
  File "<string>", line 1, in <module>
  File "/usr/spark/python/pyspark/sql/context.py", line 632, in read
    return DataFrameReader(self)
  File "/usr/spark/python/pyspark/sql/readwriter.py", line 49, in __init__
    self._jreader = sqlContext._ssql_ctx.read()
  File "/usr/spark/python/pyspark/sql/context.py", line 660, in _ssql_ctx
    "build/sbt assembly", e)
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o56))

Is there some specific name for sqlContext in %pyspark? Or should I really rebuild Spark?

Best regards.

On Tue, Nov 24, 2015 at 10:51 PM, moon soo Lee <m...@apache.org> wrote:

> Really appreciate you trying this.
>
> About HiveContext (sqlContext):
> Zeppelin creates sqlContext and injects it by default,
> so you don't need to create it manually.
>
> If multiple sqlContexts (HiveContexts) are created with Derby as the
> metastore, only the first one works; all the others will fail.
>
> Therefore, it would help to:
> - make sure no Interpreter processes (ps -ef | grep
> RemoteInterpreterServer) are left over from a previous Zeppelin execution;
> - avoid creating sqlContext manually.
>
> Thanks,
> moon
>
> On Wed, Nov 25, 2015 at 3:32 AM tsh <t...@timshenkao.su> wrote:
>
>> Hi!
>> A couple of days ago I tested Zeppelin on my laptop, with Cloudera
>> Hadoop in pseudo-distributed mode and Spark Standalone. I ran into a
>> fasterxml.jackson problem. Eric Charles said he had had a similar
>> problem and advised removing the jackson-*.jar libraries from the lib
>> folder, so I did. I also adjusted the parameters in zeppelin-env.sh to
>> make Zeppelin work locally.
>>
>> On Monday, when I got to work, it became clear that the configuration
>> parameters for a local installation and a real cluster installation
>> differ greatly. And I got this Thrift Transport Exception.
>> Over two days I rebuilt Zeppelin several times, checked all parameters,
>> and checked and changed my network. Finally, when I received your
>> letter, I checked the MASTER variable. And I remembered those deleted
>> *.jar files; I suspected they were links in the chain. I copied them
>> back to the lib folder, and Spark began to work!
>> But Spark SQL doesn't work: DataFrames can't load or write ORC files.
>> It gives a HiveContext error connected to the metastore_db (Derby).
>> Either Hive itself (which sits on the same edge node as Zeppelin) has
>> its own Derby metastore_db, or I should delete metastore_db from
>> $ZEPPELIN_HOME/bin. Should I?
>> The code is
>> %spark
>> import org.apache.spark.sql._
>> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>
>> The import succeeds; then I get the error.
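
What moon describes above suggests the fix for the %pyspark paragraph as well: skip the manual HiveContext and use the sqlContext that Zeppelin injects. A minimal sketch of that pattern (assuming the default injected sqlContext and the same "peoplePartitioned" path as in the failing paragraph at the top):

%pyspark
# Sketch: rely on the Zeppelin-injected sqlContext; constructing a second
# HiveContext would fail to lock the Derby metastore (see moon's note above).
print(type(sqlContext))  # a HiveContext when Spark is built with Hive support
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
people = sqlContext.read.format("orc").load("peoplePartitioned")
people.filter(people.age < 15).select("name").show()
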
>> On 11/24/2015 07:39 PM, moon soo Lee wrote:
>>
>> Basically, if SPARK_HOME/bin/spark-shell works, then exporting
>> SPARK_HOME in conf/zeppelin-env.sh and setting the 'master' property in
>> the Interpreter menu on the Zeppelin GUI should be enough to make a
>> successful connection to the Spark standalone cluster.
>>
>> Do you see any new exception in your log file when you set the 'master'
>> property in the Interpreter menu on the Zeppelin GUI and get the
>> 'Scheduler already terminated' error? If you can share it, that would
>> be helpful.
>>
>> Zeppelin does not use HiveThriftServer2 and, once it has been built,
>> does not need any dependency other than a JVM to run.
>>
>> Thanks,
>> moon
>>
>> On Tue, Nov 24, 2015 at 11:37 PM Timur Shenkao <t...@timshenkao.su> wrote:
>>
>>> One more question: what should be installed on the server? What are
>>> Zeppelin's dependencies? Node.js, npm, bower? Scala?
>>>
>>> On Tue, Nov 24, 2015 at 5:34 PM, Timur Shenkao <t...@timshenkao.su> wrote:
>>>
>>> > I also checked the Spark workers. There are no traces, folders, or
>>> > logs related to Zeppelin on them. Zeppelin logs exist only on the
>>> > Spark Master server, where Zeppelin is launched.
>>> >
>>> > For example, H2O creates logs on every worker in folders
>>> > /usr/spark/work/app-.....-... Is this correct?
>>> >
>>> > I also launched the Thrift server via
>>> > /usr/spark/sbin/start-thriftserver.sh on the Spark Master. Does
>>> > Zeppelin use org.apache.spark.sql.hive.thriftserver.HiveThriftServer2?
>>> >
>>> > For the terminated scheduler, I got:
>>> >
>>> > INFO [2015-11-24 16:26:16,610] ({pool-1-thread-2} SchedulerFactory.java[jobFinished]:138) - Job paragraph_1448346$
>>> > ERROR [2015-11-24 16:26:17,658] ({Thread-34} JobProgressPoller.java[run]:57) - Can not get or update progress
>>> > org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException
>>> >         at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getProgress(RemoteInterpreter.java:302)
>>> >         at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getProgress(LazyOpenInterpreter.java:110)
>>> >         at org.apache.zeppelin.notebook.Paragraph.progress(Paragraph.java:174)
>>> >         at org.apache.zeppelin.scheduler.JobProgressPoller.run(JobProgressPoller.java:54)
>>> > Caused by: org.apache.thrift.transport.TTransportException
>>> >         at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
>>> >         at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
>>> >         at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
>>> >         at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
>>> >         at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
>>> >         at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
>>> >         at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_getProgress(RemoteInterpret$
>>> >         at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.getProgress(RemoteInterpreterSer$
>>> > INFO [2015-11-24 16:26:52,617] ({qtp982007015-52} InterpreterRestApi.java[updateSetting]:104) - Update interprete$
>>> > INFO [2015-11-24 16:27:56,319] ({qtp982007015-48} InterpreterRestApi.java[restartSetting]:143) - Restart interpre$
>>> > ERROR [2015-11-24 16:28:09,603] ({qtp982007015-48} NotebookServer.java[runParagraph]:661) - Exception from run
>>> > java.lang.RuntimeException: Scheduler already terminated
>>> >         at org.apache.zeppelin.scheduler.RemoteScheduler.submit(RemoteScheduler.java:124)
>>> >         at org.apache.zeppelin.notebook.Note.run(Note.java:326)
>>> >         at org.apache.zeppelin.socket.NotebookServer.runParagraph(NotebookServer.java:659)
>>> >         at org.apache.zeppelin.socket.NotebookServer.onMessage(NotebookServer.java:126)
>>> >         at org.apache.zeppelin.socket.NotebookSocket.onMessage(NotebookSocket.java:56)
>>> >         at org.eclipse.jetty.websocket.WebSocketConnectionRFC6455$WSFrameHandler.onFrame(WebSocketConnectionRFC645$
>>> >         at org.eclipse.jetty.websocket.WebSocketParserRFC6455.parseNext(WebSocketParserRFC6455.java:349)
>>> >         at org.eclipse.jetty.websocket.WebSocketConnectionRFC6455.handle(WebSocketConnectionRFC6455.java:225)
>>> >         at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
>>> >         at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
>>> >         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>>> >         at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>>> >         at java.lang.Thread.run(Thread.java:745)
>>> > ERROR [2015-11-24 16:28:36,906] ({qtp982007015-50} NotebookServer.java[runParagraph]:661) - Exception from run
>>> > java.lang.RuntimeException: Scheduler already terminated
>>> >         at org.apache.zeppelin.scheduler.RemoteScheduler.submit(RemoteScheduler.java:124)
>>> >         at org.apache.zeppelin.notebook.Note.run(Note.java:326)
>>> >         at org.apache.zeppelin.socket.NotebookServer.runParagraph(NotebookServer.java:659)
>>> >         at org.apache.zeppelin.socket.NotebookServer.onMessage(NotebookServer.java:126)
>>> >         at org.apache.zeppelin.socket.NotebookSocket.onMessage(NotebookSocket.java:56)
>>> >
>>> > On Tue, Nov 24, 2015 at 4:50 PM, Timur Shenkao <t...@timshenkao.su> wrote:
>>> >
>>> >> Hello!
>>> >>
>>> >> There is no Kerberos, no security in my cluster. It's on an internal
>>> >> network.
>>> >>
>>> >> The %hive and %sh interpreters work: I can create tables, drop them,
>>> >> run pwd, etc. So the problem is in the integration with Spark.
>>> >>
>>> >> In /usr/spark/conf/spark-env.sh on the master node I set/unset, in
>>> >> turn, MASTER=spark://localhost:7077, MASTER=spark://192.168.58.10:7077,
>>> >> and MASTER=spark://127.0.0.1:7077. On the slaves I set/unset
>>> >> MASTER=spark://192.168.58.10:7077 in different combinations.
>>> >>
>>> >> Zeppelin is installed on the same machine as the Spark Master. So, in
>>> >> zeppelin-env.sh I set/unset, in turn, MASTER=spark://localhost:7077,
>>> >> MASTER=spark://192.168.58.10:7077, and MASTER=spark://127.0.0.1:7077.
>>> >> Yes, I can connect to 192.168.58.10 and see:
>>> >> URL: spark://192.168.58.10:7077
>>> >> REST URL: spark://192.168.58.10:6066 (cluster mode)
>>> >>
>>> >> Does the TCP socket type matter? On my laptop, in pseudo-distributed
>>> >> mode, all connections are IPv4 (tcp), and there are only IPv4 lines
>>> >> in /etc/hosts. On the cluster, Spark automatically, for unknown
>>> >> reasons, uses IPv6 (tcp6), and there are IPv6 lines in /etc/hosts.
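
One quick way to check whether the master is actually reachable over IPv4 from the Zeppelin host is a short probe run from a notebook paragraph. A sketch, assuming the master address quoted elsewhere in this thread (192.168.58.10:7077):

%pyspark
# Sketch: confirm the standalone master's RPC port answers over IPv4 from
# the Zeppelin host before changing MASTER yet again.
import socket
addr = ("192.168.58.10", 7077)
try:
    sock = socket.create_connection(addr, timeout=5)
    print("reachable over IPv4: %s:%d" % sock.getpeername())
    sock.close()
except socket.error as e:
    print("cannot reach %s:%d -> %s" % (addr[0], addr[1], e))
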
>>> >> Right now I am trying to make Spark use IPv4.
>>> >>
>>> >> I switched Spark to IPv4 via -Djava.net.preferIPv4Stack=true.
>>> >>
>>> >> It seems that Zeppelin uses/answers on the following ports on the
>>> >> Master server (ps axu | grep zeppelin; then, for each PID,
>>> >> netstat -natp | grep ...):
>>> >> 41303
>>> >> 46971
>>> >> 59007
>>> >> 35781
>>> >> 53637
>>> >> 34860
>>> >> 59793
>>> >> 46971
>>> >> 50676
>>> >> 50677
>>> >>
>>> >> 44341
>>> >> 50805
>>> >> 50803
>>> >> 50802
>>> >>
>>> >> 60886
>>> >> 43345
>>> >> 48415
>>> >> 48417
>>> >> 10000
>>> >> 48416
>>> >>
>>> >> Best regards
>>> >>
>>> >> P.S. I put the exact address from the Spark page,
>>> >> MASTER=spark://192.168.58.10:7077, into zeppelin-env.sh and into the
>>> >> spark interpreter configuration in the web UI.
>>> >> Earlier, I got a Java error stacktrace in the Web UI. After that I
>>> >> BEGAN to receive "Scheduler already terminated".
>>> >>
>>> >> On Tue, Nov 24, 2015 at 12:56 PM, moon soo Lee <m...@apache.org> wrote:
>>> >>
>>> >>> Thanks for sharing the problem.
>>> >>>
>>> >>> Based on your log file, it looks like your Spark master address is
>>> >>> somehow not configured correctly.
>>> >>>
>>> >>> Can you confirm that you have also set the 'master' property in the
>>> >>> Interpreter menu on the GUI, in the spark section?
>>> >>>
>>> >>> If you have not, you can open the Spark Master UI in your web
>>> >>> browser and look at the first line, "Spark Master at spark://....".
>>> >>> That value should go into the 'master' property in the Interpreter
>>> >>> menu on the GUI, in the spark section.
>>> >>>
>>> >>> Hope this helps.
>>> >>>
>>> >>> Best,
>>> >>> moon
>>> >>>
>>> >>> On Tue, Nov 24, 2015 at 3:07 AM Timur Shenkao <t...@timshenkao.su> wrote:
>>> >>>
>>> >>>> Hi!
>>> >>>>
>>> >>>> A new error has appeared: TTransportException.
>>> >>>> I use CentOS 6.7 + Spark 1.5.2 Standalone + Cloudera Hadoop 5.4.8
>>> >>>> on the same cluster. I can't use Mesos or Spark on YARN.
>>> >>>> I built Zeppelin 0.6.0 like this:
>>> >>>> mvn clean package -DskipTests -Pspark-1.5 -Phadoop-2.6 -Pyarn
>>> >>>> -Ppyspark -Pbuild-distr
>>> >>>>
>>> >>>> I constantly get errors like
>>> >>>>
>>> >>>> ERROR [2015-11-23 18:14:33,404] ({pool-1-thread-4} Job.java[run]:183) - Job failed
>>> >>>> org.apache.zeppelin.interpreter.InterpreterException:
>>> >>>> org.apache.thrift.transport.TTransportException
>>> >>>>         at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:237)
>>> >>>>
>>> >>>> or
>>> >>>>
>>> >>>> ERROR [2015-11-23 18:07:26,535] ({Thread-11}
>>> >>>> RemoteInterpreterEventPoller.java[run]:72) - Can't get RemoteInterpreterEvent
>>> >>>> org.apache.thrift.transport.TTransportException
>>> >>>>
>>> >>>> I changed several parameters in zeppelin-env.sh and in the Spark
>>> >>>> configs. Whatever I do, these errors come back. At the same time,
>>> >>>> when I use local Zeppelin with Hadoop in pseudo-distributed mode +
>>> >>>> Spark Standalone (Master + workers on the same machine), everything
>>> >>>> works.
>>> >>>>
>>> >>>> What configuration (memory, network, CPU cores) does Zeppelin need
>>> >>>> in order to work?
>>> >>>>
>>> >>>> I launch H2O on this cluster, and it works.
>>> >>>> Spark Master config:
>>> >>>> SPARK_MASTER_WEBUI_PORT=18080
>>> >>>> HADOOP_CONF_DIR=/etc/hadoop/conf
>>> >>>> SPARK_HOME=/usr/spark
>>> >>>>
>>> >>>> Spark Worker config:
>>> >>>> export HADOOP_CONF_DIR=/etc/hadoop/conf
>>> >>>> export MASTER=spark://192.168.58.10:7077
>>> >>>> export SPARK_HOME=/usr/spark
>>> >>>>
>>> >>>> SPARK_WORKER_INSTANCES=1
>>> >>>> SPARK_WORKER_CORES=4
>>> >>>> SPARK_WORKER_MEMORY=32G
>>> >>>>
>>> >>>> I am attaching the Spark configs, plus the Zeppelin configs & logs
>>> >>>> for local mode and the Zeppelin configs & logs from when I defined
>>> >>>> the IP address of the Spark Master explicitly.
>>> >>>> Thank you.
>>> >>>
>>> >>
>>> >
>>>
>>
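
Once the 'master' property is set the way moon describes, a quick sanity check from a notebook paragraph confirms that the interpreter really attached to the standalone cluster. A sketch; the expected master URL and Spark version are the ones used throughout this thread:

%pyspark
# Sketch: verify the interpreter's SparkContext points at the standalone
# master and can run a real job on the workers.
print(sc.master)                            # expected: spark://192.168.58.10:7077
print(sc.version)                           # expected: 1.5.2
print(sc.parallelize(range(100), 4).sum())  # 4950 -> the executors did the work
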