AV created ZEPPELIN-3927:
----------------------------

             Summary: Unstable State running Code
                 Key: ZEPPELIN-3927
                 URL: https://issues.apache.org/jira/browse/ZEPPELIN-3927
             Project: Zeppelin
          Issue Type: Bug
          Components: zeppelin-interpreter
    Affects Versions: 0.9.0
            Reporter: AV
Executing the tutorial notebook code produces weird results using Spark 2.4.0:

> import org.apache.commons.io.IOUtils
> import java.net.URL
> import java.nio.charset.Charset
>
> // Zeppelin creates and injects sc (SparkContext) and sqlContext (HiveContext or SqlContext)
> // so you don't need to create them manually
>
> // Remote address
> val csvURL = "https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv"
>
> // Parallel processing
> val bankText = sc.parallelize(IOUtils.toString(new URL(csvURL), Charset.forName("UTF-8")).split("\n"))
>
> case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
>
> val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
>   s => Bank(s(0).toInt,
>     s(1).replaceAll("\"", ""),
>     s(2).replaceAll("\"", ""),
>     s(3).replaceAll("\"", ""),
>     s(5).replaceAll("\"", "").toInt
>   )
> ).toDF()
>
> bank.registerTempTable("bank")

On the first run (after a Spark interpreter restart) everything works fine; the output is:

> warning: there was one deprecation warning; re-run with -deprecation for details
> import sqlContext.implicits._
> import org.apache.commons.io.IOUtils
> import java.net.URL
> import java.nio.charset.Charset
> csvURL: String = https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv
> bankText: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:28
> defined class Bank
> bank: org.apache.spark.sql.DataFrame = [age: int, job: string ... 3 more fields]

Once the code has been executed, any re-run fails:

> warning: there was one deprecation warning; re-run with -deprecation for details
> java.lang.IllegalArgumentException: URI is not absolute
>   at java.net.URI.toURL(URI.java:1088)
>   at org.apache.hadoop.fs.http.AbstractHttpFileSystem.open(AbstractHttpFileSystem.java:60)
>   at org.apache.hadoop.fs.http.HttpsFileSystem.open(HttpsFileSystem.java:23)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at org.apache.hadoop.fs.FsUrlConnection.connect(FsUrlConnection.java:50)
>   at org.apache.hadoop.fs.FsUrlConnection.getInputStream(FsUrlConnection.java:59)
>   at java.net.URL.openStream(URL.java:1045)
>   at org.apache.commons.io.IOUtils.toString(IOUtils.java:894)
>   ... 39 elided

The message behind the deprecation warning is actually a compile error:

> <console>:36: error: value toDF is not a member of org.apache.spark.rdd.RDD[Bank]
> possible cause: maybe a semicolon is missing before `value toDF'?
>        ).toDF()

Any ideas?

P.S.: I'm a bit curious why there are no other reports of this problem. Compiling from source against the latest stable Spark/Hadoop releases seems natural to me.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
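For what it's worth, the stack trace suggests a possible mechanism: once Hadoop's URL stream handler takes over `https://` URLs, `FsUrlConnection` appears to hand only the *path* component of the URL to `AbstractHttpFileSystem.open`, which then calls `URI.toURL` on a scheme-less (relative) URI. That is exactly the exception reported. This is a reading of the trace, not a confirmed diagnosis; the snippet below is a minimal, Spark-free sketch of the relative-URI failure:

```scala
import java.net.URI

// The full URL has a scheme, so it is absolute and toURL succeeds.
val full = new URI("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv")

// The path component alone has no scheme; toURL on it throws the
// "URI is not absolute" IllegalArgumentException from the trace above.
val pathOnly = new URI(full.getPath) // "/apache-zeppelin/tutorial/bank/bank.csv"

println(full.isAbsolute)     // true
println(pathOnly.isAbsolute) // false
```

If that reading is right, a workaround might be to avoid the driver-side `java.net.URL` fetch altogether and let the DataFrame reader pull the file (e.g. `spark.read.option("sep", ";").option("header", "true").csv(csvURL)`), since the reader receives the full `https` URI rather than a bare path. Untested speculation on my side.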