Plain Spark code works fine:

val linesText = sc.textFile("hdfs://my/file/onhdfs.txt")
case class Line(id: Long, firstField: String, secondField: String)

val lines = linesText.map { line =>
  val splitted = line.split(" ")
  println("splitted => " + splitted)
  Line(splitted(0).toLong, splitted(1), splitted(2))
}
lines.collect().foreach(println)

This prints the file contents to the UI. The trouble starts with SQL...

2017-05-02 13:57 GMT+02:00 Serega Sheypak <serega.shey...@gmail.com>:

> Here is my sample notebook:
>
> %spark
> val linesText = sc.textFile("hdfs://cluster/user/me/lines.txt")
>
> case class Line(id:Long, firstField:String, secondField:String)
>
> val lines = linesText.map{ line =>
>   val splitted = line.split(" ")
>   println("splitted => " + splitted)
>   Line(splitted(0).toLong, splitted(1), splitted(2))
> }
>
> lines.toDF().registerTempTable("lines")
>
> %sql select firstField, secondField, count(1) from lines group by
> firstField, secondField order by firstField, secondField
>
> 1. I can see that the Spark job was started on my YARN cluster.
>
> 2. It failed. The UI shows an exception, and I can't understand what I'm
> doing wrong:
>
> %sql select firstField, secondField, count(1) from lines group by
> firstField, secondField order by firstField, secondField ^
>
> 3. There is suspicious output in the Zeppelin log:
>
> INFO [2017-05-02 11:50:02,846] ({pool-2-thread-8} SchedulerFactory.java[jobFinished]:137)
>   - Job paragraph_1493724118696_868476558 finished by scheduler
>   org.apache.zeppelin.interpreter.remote.RemoteInterpretershared_session1712472970
> ERROR [2017-05-02 11:50:18,809] ({qtp1286783232-166} NotebookServer.java[onMessage]:380)
>   - Can't handle message
> java.lang.NullPointerException
>   at org.apache.zeppelin.socket.NotebookServer.addNewParagraphIfLastParagraphIsExecuted(NotebookServer.java:1713)
>   at org.apache.zeppelin.socket.NotebookServer.persistAndExecuteSingleParagraph(NotebookServer.java:1741)
>   at org.apache.zeppelin.socket.NotebookServer.runAllParagraphs(NotebookServer.java:1641)
>   at org.apache.zeppelin.socket.NotebookServer.onMessage(NotebookServer.java:291)
>   at org.apache.zeppelin.socket.NotebookSocket.onWebSocketText(NotebookSocket.java:59)
>   at org.eclipse.jetty.websocket.common.events.JettyListenerEventDriver.onTextMessage(JettyListenerEventDriver.java:128)
>   at org.eclipse.jetty.websocket.common.message.SimpleTextMessage.messageComplete(SimpleTextMessage.java:69)
>   at org.eclipse.jetty.websocket.common.events.AbstractEventDriver.appendMessage(AbstractEventDriver.java:65)
>   at org.eclipse.jetty.websocket.common.events.JettyListenerEventDriver.onTextFrame(JettyListenerEventDriver.java:122)
>   at org.eclipse.jetty.websocket.common.events.AbstractEventDriver.incomingFrame(AbstractEventDriver.java:161)
>   at org.eclipse.jetty.websocket.common.WebSocketSession.incomingFrame(WebSocketSession.java:309)
>   at org.eclipse.jetty.websocket.common.extensions.ExtensionStack.incomingFrame(ExtensionStack.java:214)
>   at org.eclipse.jetty.websocket.common.Parser.notifyFrame(Parser.java:220)
>   at org.eclipse.jetty.websocket.common.Parser.parse(Parser.java:258)
>   at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.readParse(AbstractWebSocketConnection.java:632)
>   at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:480)
>   at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
>   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>   at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>   at java.lang.Thread.run(Thread.java:745)
> INFO [2017-05-02 11:50:18,811] ({pool-2-thread-14} SchedulerFactory.java[jobStarted]:131)
>   - Job paragraph_1493724118696_868476558 started by scheduler
>   org.apache.zeppelin.interpreter.remote.RemoteInterpretershared_session1712472970
> INFO [2017-05-02 11:50:18,812] ({pool-2-thread-14} Paragraph.java[jobRun]:363)
>   - run paragraph 20170502-112158_458502255 using spark
>   org.apache.zeppelin.interpreter.LazyOpenInterpreter@24ca4045
> WARN [2017-05-02 11:50:19,810] ({pool-2-thread-14} NotebookServer.java[afterStatusChange]:2162)
>   - Job 20170502-112158_458502255 is finished, status: ERROR, exception: null, result: %text
>   linesText: org.apache.spark.rdd.RDD[String] = hdfs://path/to/my/file.txt MapPartitionsRDD[21] at textFile at <console>:27
>   defined class Line
>   lines: org.apache.spark.rdd.RDD[Line] = MapPartitionsRDD[22] at map at <console>:31
>   warning: there was one deprecation warning; re-run with -deprecation for details
>   <console>:1: error: ';' expected but ',' found.
>   %sql select firstField, secondField, count(1) from lines group by
>   firstField, secondField order by firstField, secondField
>
> 4. The interpreter log is also confusing:
>
> INFO [2017-05-02 11:41:52,706] ({pool-2-thread-2} Logging.scala[logInfo]:54)
>   - Warehouse location for Hive client (version 1.1.0) is file:/spark-warehouse
> INFO [2017-05-02 11:41:52,983] ({pool-2-thread-2} PerfLogger.java[PerfLogBegin]:122)
>   - <PERFLOG method=create_database from=org.apache.hadoop.hive.metastore.RetryingHMSHandler>
> INFO [2017-05-02 11:41:52,984] ({pool-2-thread-2} HiveMetaStore.java[logInfo]:795)
>   - 0: create_database: Database(name:default, description:default database,
>   locationUri:file:/spark-warehouse, parameters:{})
> INFO [2017-05-02 11:41:52,984] ({pool-2-thread-2} HiveMetaStore.java[logAuditEvent]:388)
>   - ugi=zblenessy ip=unknown-ip-addr cmd=create_database:
>   Database(name:default, description:default database,
>   locationUri:file:/spark-warehouse, parameters:{})
> ERROR [2017-05-02 11:41:52,992] ({pool-2-thread-2} RetryingHMSHandler.java[invokeInternal]:189)
>   - AlreadyExistsException(message:Database default already exists)
>   at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:944)
>
> What should I do with the failed metastore DB creation? Is it fine?
>
> I'm stuck, can you give me some ideas please?
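
One way to sanity-check the failing query without involving the %sql interpreter at all is to run it through sqlContext.sql inside the %spark paragraph. This is only a sketch, assuming the sc and sqlContext instances that Zeppelin's Spark interpreter injects, and reusing the HDFS path from the quoted notebook:

%spark
// Sketch: same aggregation as the %sql paragraph, issued directly from Scala.
import sqlContext.implicits._   // needed for .toDF() on an RDD

case class Line(id: Long, firstField: String, secondField: String)

val lines = sc.textFile("hdfs://cluster/user/me/lines.txt").map { line =>
  val splitted = line.split(" ")
  Line(splitted(0).toLong, splitted(1), splitted(2))
}

// registerTempTable is deprecated in Spark 2.x (likely the deprecation
// warning in the log); createOrReplaceTempView is its replacement there.
lines.toDF().registerTempTable("lines")

sqlContext.sql(
  """select firstField, secondField, count(1)
    |from lines
    |group by firstField, secondField
    |order by firstField, secondField""".stripMargin
).show()

If this succeeds, the SQL itself is fine, and the "<console>:1: error: ';' expected but ',' found." in the log may mean the %sql text never reached the SQL interpreter but was compiled as Scala instead.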