spark1.6.2 ClassNotFoundException: org.apache.parquet.hadoop.ParquetOutputCommitter

2016-07-07 Thread kevin
hi, all:
I built Spark 1.6.2 from source with:
./make-distribution.sh --name "hadoop2.7.1" --tgz
"-Pyarn,hadoop-2.6,parquet-provided,hive,hive-thriftserver" -DskipTests
-Dhadoop.version=2.7.1

When I try to run:
./bin/run-example sql.RDDRelation
or
./spark-shell

I get the error below (though I can run the org.apache.spark.examples.SparkPi example):

java.lang.NoClassDefFoundError:
org/apache/parquet/hadoop/ParquetOutputCommitter
at org.apache.spark.sql.SQLConf$.(SQLConf.scala:319)
at org.apache.spark.sql.SQLConf$.(SQLConf.scala)
at org.apache.spark.sql.SQLContext.(SQLContext.scala:85)
at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
at $iwC$$iwC.(:15)
at $iwC.(:24)
at (:26)
at .(:30)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at
org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132)
at
org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124)
at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
at
org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:124)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)
at
org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:159)
at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
at
org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:108)
at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org
$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException:
org.apache.parquet.hadoop.ParquetOutputCommitter
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 57 more

:16: error: not found: value sqlContext
 import sqlContext.implicits._
^
:16: error: not found: value sqlContext
 import sqlContext.sql


*What should I do?*
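A likely explanation, assuming the -Pparquet-provided profile behaves like Spark's other *-provided build profiles and leaves the Parquet jars out of the assembly, is that org.apache.parquet.hadoop.ParquetOutputCommitter is simply not on the runtime classpath; SparkPi works because it never touches the SQL code paths that load it (SQLConf, per the trace above). A sketch of a rebuild that bundles Parquet, using the same command as above minus that profile:

# Sketch only: drop -Pparquet-provided so the Parquet classes are packaged
# into the Spark assembly instead of being expected on the cluster classpath.
./make-distribution.sh --name "hadoop2.7.1" --tgz \
  "-Pyarn,hadoop-2.6,hive,hive-thriftserver" -DskipTests \
  -Dhadoop.version=2.7.1

The alternative is to keep the profile and put a matching parquet-hadoop jar on the driver and executor classpath.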


where I can find spark-streaming-kafka for spark2.0

2016-07-24 Thread kevin
hi, all:
I tried to run the example org.apache.spark.examples.streaming.KafkaWordCount, and I got this error:
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/streaming/kafka/KafkaUtils$
at
org.apache.spark.examples.streaming.KafkaWordCount$.main(KafkaWordCount.scala:57)
at
org.apache.spark.examples.streaming.KafkaWordCount.main(KafkaWordCount.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:724)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.streaming.kafka.KafkaUtils$
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 11 more

So where can I find spark-streaming-kafka for Spark 2.0?


Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread kevin
I have compiled it from the source code.

2016-07-25 12:05 GMT+08:00 kevin :

> hi,all :
> I try to run example org.apache.spark.examples.streaming.KafkaWordCount ,
> I got error :
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/spark/streaming/kafka/KafkaUtils$
> at
> org.apache.spark.examples.streaming.KafkaWordCount$.main(KafkaWordCount.scala:57)
> at
> org.apache.spark.examples.streaming.KafkaWordCount.main(KafkaWordCount.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:724)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.spark.streaming.kafka.KafkaUtils$
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 11 more
>
> so where I can find spark-streaming-kafka for spark2.0
>


spark2.0 can't run SqlNetworkWordCount

2016-07-25 Thread kevin
hi, all:
I downloaded the Spark 2.0 pre-built package. I can run the SqlNetworkWordCount test with:
bin/run-example org.apache.spark.examples.streaming.SqlNetworkWordCount
master1

But when I take the Spark 2.0 example source code SqlNetworkWordCount.scala,
build it into a jar package with dependencies (JDK 1.8 and Scala 2.10), and run it
with spark-submit, I get this error:

16/07/25 17:28:30 INFO scheduler.JobScheduler: Starting job streaming job
146943891 ms.0 from job set of time 146943891 ms
Exception in thread "streaming-job-executor-2" java.lang.NoSuchMethodError:
scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
at
main.SqlNetworkWordCount$$anonfun$main$1.apply(SqlNetworkWordCount.scala:67)
at
main.SqlNetworkWordCount$$anonfun$main$1.apply(SqlNetworkWordCount.scala:61)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:247)
at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:247)
at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:247)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:246)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread kevin
Thank you. I couldn't find a spark-streaming-kafka_2.10 jar for Spark 2 on Maven
Central, so I tried version 1.6.2, but it doesn't work: it needs the class
org.apache.spark.Logging, which no longer exists in Spark 2. So I built the
spark-streaming-kafka_2.10 jar for Spark 2 from the source code, and it works now.
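For anyone hitting the same wall, a minimal build.sbt sketch based on Cody's pointer below; the artifact IDs follow the subproject names he lists, while the version number and the %% Scala-suffix convention are illustrative assumptions:

// Pick the Kafka integration that matches your broker version:
// spark-streaming-kafka-0-8 for 0.8+ brokers, spark-streaming-kafka-0-10
// for 0.10+ brokers. The Spark version shown is illustrative.
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"
// or, for Kafka 0.10+ brokers:
// libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"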

2016-07-26 2:12 GMT+08:00 Cody Koeninger :

> For 2.0, the kafka dstream support is in two separate subprojects
> depending on which version of Kafka you are using
>
> spark-streaming-kafka-0-10
> or
> spark-streaming-kafka-0-8
>
> corresponding to brokers that are version 0.10+ or 0.8+
>
> On Mon, Jul 25, 2016 at 12:29 PM, Reynold Xin  wrote:
> > The presentation at Spark Summit SF was probably referring to Structured
> > Streaming. The existing Spark Streaming (dstream) in Spark 2.0 has the
> same
> > production stability level as Spark 1.6. There is also Kafka 0.10
> support in
> > dstream.
> >
> > On July 25, 2016 at 10:26:49 AM, Andy Davidson
> > (a...@santacruzintegration.com) wrote:
> >
> > Hi Kevin
> >
> > Just a heads up at the recent spark summit in S.F. There was a
> presentation
> > about streaming in 2.0. They said that streaming was not going to
> production
> > ready in 2.0.
> >
> > I am not sure if the older 1.6.x version will be supported. My project
> will
> > not be able to upgrade with streaming support. We also use kafka
> >
> > Andy
> >
> > From: Marco Mistroni 
> > Date: Monday, July 25, 2016 at 2:33 AM
> > To: kevin 
> > Cc: "user @spark" , "dev.spark"
> > 
> > Subject: Re: where I can find spark-streaming-kafka for spark2.0
> >
> > Hi Kevin
> >   you should not need to rebuild everything.
> > Instead, i believe you should launch spark-submit by specifying the kafka
> > jar file in your --packages... i had to follow same when integrating
> spark
> > streaming with flume
> >
> >   have you checked this link ?
> > https://spark.apache.org/docs/latest/streaming-kafka-integration.html
> >
> >
> > hth
> >
> >
> >
> > On Mon, Jul 25, 2016 at 10:20 AM, kevin  wrote:
> >>
> >> I have compile it from source code
> >>
> >> 2016-07-25 12:05 GMT+08:00 kevin :
> >>>
> >>> hi,all :
> >>> I try to run example
> org.apache.spark.examples.streaming.KafkaWordCount ,
> >>> I got error :
> >>> Exception in thread "main" java.lang.NoClassDefFoundError:
> >>> org/apache/spark/streaming/kafka/KafkaUtils$
> >>> at
> >>>
> org.apache.spark.examples.streaming.KafkaWordCount$.main(KafkaWordCount.scala:57)
> >>> at
> >>>
> org.apache.spark.examples.streaming.KafkaWordCount.main(KafkaWordCount.scala)
> >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>> at
> >>>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> >>> at
> >>>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >>> at java.lang.reflect.Method.invoke(Method.java:498)
> >>> at
> >>>
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:724)
> >>> at
> >>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> >>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> >>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
> >>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> >>> Caused by: java.lang.ClassNotFoundException:
> >>> org.apache.spark.streaming.kafka.KafkaUtils$
> >>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> >>> ... 11 more
> >>>
> >>> so where I can find spark-streaming-kafka for spark2.0
> >>
> >>
> >
>
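As a concrete illustration of the --packages approach Marco describes above, something along these lines should work; the artifact coordinates, jar path, and example arguments are assumptions for illustration, not taken from the thread:

# Pull the Kafka 0.8 integration at submit time instead of rebuilding Spark.
# Coordinates, jar location, and the zkQuorum/group/topics/numThreads args
# below are placeholders.
./bin/spark-submit \
  --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0 \
  --class org.apache.spark.examples.streaming.KafkaWordCount \
  examples/jars/spark-examples_2.11-2.0.0.jar \
  zkhost:2181 my-group my-topic 1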


Re: Odp.: spark2.0 can't run SqlNetworkWordCount

2016-07-25 Thread kevin
Thanks a lot. After changing to Scala 2.11, it works.
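For completeness, a build.sbt sketch of the change Tomasz suggests below; the version strings are illustrative, and the point is simply that the application must be compiled with Scala 2.11 to match the pre-built Spark 2.0 binaries:

// Compile the application with Scala 2.11 so it matches the Scala version
// the Spark 2.0 pre-built packages were compiled with. Versions are illustrative.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-sql"       % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "2.0.0" % "provided"
)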

2016-07-25 17:40 GMT+08:00 Tomasz Gawęda :

> Hi,
>
> Please change Scala version to 2.11.  As far as I know, Spark packages are
> now build with Scala 2.11 and I've got other - 2.10 - version
>
>
>
> ------
> *From:* kevin 
> *Sent:* 25 July 2016 11:33
> *To:* user.spark; dev.spark
> *Subject:* spark2.0 can't run SqlNetworkWordCount
>
> hi,all:
> I download spark2.0 per-build. I can run SqlNetworkWordCount test use :
> bin/run-example org.apache.spark.examples.streaming.SqlNetworkWordCount
> master1 
>
> but when I use spark2.0 example source code SqlNetworkWordCount.scala and
> build it to a jar bao with dependencies ( JDK 1.8 AND SCALA2.10)
> when I use spark-submit to run it I got error:
>
> 16/07/25 17:28:30 INFO scheduler.JobScheduler: Starting job streaming job
> 146943891 ms.0 from job set of time 146943891 ms
> Exception in thread "streaming-job-executor-2"
> java.lang.NoSuchMethodError:
> scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
> at
> main.SqlNetworkWordCount$$anonfun$main$1.apply(SqlNetworkWordCount.scala:67)
> at
> main.SqlNetworkWordCount$$anonfun$main$1.apply(SqlNetworkWordCount.scala:61)
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
> at
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
> at scala.util.Try$.apply(Try.scala:192)
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
> at
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:247)
> at
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:247)
> at
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:247)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:246)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
>
>
>
>


spark2.0 how to use sparksession and StreamingContext same time

2016-07-25 Thread kevin
hi, all:
I want to read data from Kafka, register it as a table, and then join it with a JDBC table.
My sample looks like this:

val spark = SparkSession
  .builder
  .config(sparkConf)
  .getOrCreate()

val jdbcDF = spark.read.format("jdbc").options(Map("url" ->
"jdbc:mysql://master1:3306/demo", "driver" -> "com.mysql.jdbc.Driver",
"dbtable" -> "i_user", "user" -> "root", "password" -> "passok")).load()
jdbcDF.cache().createOrReplaceTempView("black_book")
  val df = spark.sql("select * from black_book")
  df.show()

val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("checkpoint")

val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group,
topicMap).map(_._2)
val words = lines.flatMap(_.split(" "))

*I got this error:*

16/07/26 11:18:07 WARN AbstractHandler: No Server set for
org.spark_project.jetty.server.handler.ErrorHandler@6f0ca692
++++
|  id|username|password|
++++
|e6faca36-8766-4dc...|   a|   a|
|699285a3-a108-457...|   admin| 123|
|e734752d-ac98-483...|test|test|
|c0245226-128d-487...|   test2|   test2|
|4f1bbdb2-89d1-4cc...| 119| 911|
|16a9a360-13ee-4b5...|1215|1215|
|bf7d6a0d-2949-4c3...|   demo3|   demo3|
|de30747c-c466-404...| why| why|
|644741c9-8fd7-4a5...|   scala|   p|
|cda1e44d-af4b-461...| 123| 231|
|6e409ed9-c09b-4e7...| 798|  23|
++++

Exception in thread "main" org.apache.spark.SparkException: Only one
SparkContext may be running in this JVM (see SPARK-2243). To ignore this
error, set spark.driver.allowMultipleContexts = true. The currently running
SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:749)
main.POC$.main(POC.scala:43)
main.POC.main(POC.scala)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:724)
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
at
org.apache.spark.SparkContext$$anonfun$assertNoOtherContextIsRunning$2.apply(SparkContext.scala:2211)
at
org.apache.spark.SparkContext$$anonfun$assertNoOtherContextIsRunning$2.apply(SparkContext.scala:2207)
at scala.Option.foreach(Option.scala:257)
at
org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2207)
at
org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2277)
at org.apache.spark.SparkContext.(SparkContext.scala:91)
at
org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:837)
at
org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:84)
at main.POC$.main(POC.scala:50)
at main.POC.main(POC.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:724)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


Re: spark2.0 how to use sparksession and StreamingContext same time

2016-07-25 Thread kevin
Thanks a lot, Terry.
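A minimal sketch combining Terry's suggestion (quoted below) with the original code: build the StreamingContext from the SparkSession's existing SparkContext, so only one SparkContext ever exists in the JVM. The object and application names here are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWithSession {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("StreamingWithSession")
    val spark = SparkSession.builder.config(sparkConf).getOrCreate()

    // Key point from Terry's reply: reuse spark.sparkContext instead of passing
    // sparkConf again, which would try to create a second SparkContext and fail.
    val ssc = new StreamingContext(spark.sparkContext, Seconds(2))
    ssc.checkpoint("checkpoint")

    // ... define the Kafka DStream and the SQL joins as in the original post ...

    ssc.start()
    ssc.awaitTermination()
  }
}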

2016-07-26 12:03 GMT+08:00 Terry Hoo :

> Kevin,
>
> Try to create the StreamingContext as following:
>
> val ssc = new StreamingContext(spark.sparkContext, Seconds(2))
>
>
>
> On Tue, Jul 26, 2016 at 11:25 AM, kevin  wrote:
>
>> hi,all:
>> I want to read data from kafka and regist as a table then join a jdbc
>> table.
>> My sample like this :
>>
>> val spark = SparkSession
>>   .builder
>>   .config(sparkConf)
>>   .getOrCreate()
>>
>> val jdbcDF = spark.read.format("jdbc").options(Map("url" ->
>> "jdbc:mysql://master1:3306/demo", "driver" -> "com.mysql.jdbc.Driver",
>> "dbtable" -> "i_user", "user" -> "root", "password" -> "passok")).load()
>> jdbcDF.cache().createOrReplaceTempView("black_book")
>>   val df = spark.sql("select * from black_book")
>>   df.show()
>>
>> val ssc = new StreamingContext(sparkConf, Seconds(2))
>> ssc.checkpoint("checkpoint")
>>
>> val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
>> val lines = KafkaUtils.createStream(ssc, zkQuorum, group,
>> topicMap).map(_._2)
>> val words = lines.flatMap(_.split(" "))
>>
>> *I got error :*
>>
>> 16/07/26 11:18:07 WARN AbstractHandler: No Server set for
>> org.spark_project.jetty.server.handler.ErrorHandler@6f0ca692
>> ++++
>> |  id|username|password|
>> ++++
>> |e6faca36-8766-4dc...|   a|   a|
>> |699285a3-a108-457...|   admin| 123|
>> |e734752d-ac98-483...|test|test|
>> |c0245226-128d-487...|   test2|   test2|
>> |4f1bbdb2-89d1-4cc...| 119| 911|
>> |16a9a360-13ee-4b5...|1215|1215|
>> |bf7d6a0d-2949-4c3...|   demo3|   demo3|
>> |de30747c-c466-404...| why| why|
>> |644741c9-8fd7-4a5...|   scala|   p|
>> |cda1e44d-af4b-461...| 123| 231|
>> |6e409ed9-c09b-4e7...| 798|  23|
>> ++++
>>
>> Exception in thread "main" org.apache.spark.SparkException: Only one
>> SparkContext may be running in this JVM (see SPARK-2243). To ignore this
>> error, set spark.driver.allowMultipleContexts = true. The currently running
>> SparkContext was created at:
>>
>> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:749)
>> main.POC$.main(POC.scala:43)
>> main.POC.main(POC.scala)
>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> java.lang.reflect.Method.invoke(Method.java:498)
>>
>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:724)
>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
>> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>> at
>> org.apache.spark.SparkContext$$anonfun$assertNoOtherContextIsRunning$2.apply(SparkContext.scala:2211)
>> at
>> org.apache.spark.SparkContext$$anonfun$assertNoOtherContextIsRunning$2.apply(SparkContext.scala:2207)
>> at scala.Option.foreach(Option.scala:257)
>> at
>> org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2207)
>> at
>> org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2277)
>> at org.apache.spark.SparkContext.(SparkContext.scala:91)
>> at
>> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:837)
>> at
>> org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:84)
>> at main.POC$.main(POC.scala:50)
>> at main.POC.main(POC.scala)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:498)
>> at
>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:724)
>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>>
>>
>


tpcds for spark2.0

2016-07-27 Thread kevin
hi, all:
I want to run the 99 TPC-DS SQL queries on Spark 2.0.
I am using https://github.com/databricks/spark-sql-perf

With the master version, when I run val tpcds = new TPCDS(sqlContext = sqlContext)
I get this error:

scala> val tpcds = new TPCDS (sqlContext = sqlContext)
error: missing or invalid dependency detected while loading class file
'Benchmarkable.class'.
Could not access term typesafe in package com,
because it (or its dependencies) are missing. Check your build definition
for
missing or conflicting dependencies. (Re-run with -Ylog-classpath to see
the problematic classpath.)
A full rebuild may help if 'Benchmarkable.class' was compiled against an
incompatible version of com.
error: missing or invalid dependency detected while loading class file
'Benchmarkable.class'.
Could not access term scalalogging in value com.typesafe,
because it (or its dependencies) are missing. Check your build definition
for
missing or conflicting dependencies. (Re-run with -Ylog-classpath to see
the problematic classpath.)
A full rebuild may help if 'Benchmarkable.class' was compiled against an
incompatible version of com.typesafe.

With spark-sql-perf-0.4.3, when I run
tables.genData("hdfs://master1:9000/tpctest", "parquet", true, false, false, false, false)
I get this error:

Generating table catalog_sales in database to
hdfs://master1:9000/tpctest/catalog_sales with save mode Overwrite.
16/07/27 18:59:59 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
slave1): java.lang.ClassCastException: cannot assign instance of
scala.collection.immutable.List$SerializationProxy to field
org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type
scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD


Re: tpcds for spark2.0

2016-08-01 Thread kevin
Finally, I got it working with spark-sql-perf-0.4.3:
./bin/spark-shell --jars
/home/dcos/spark-sql-perf-0.4.3/target/scala-2.11/spark-sql-perf_2.11-0.4.3.jar
--executor-cores 4 --executor-memory 10G --master spark://master1:7077
If I don't use "--jars", I get the error I mentioned.

2016-07-29 21:17 GMT+08:00 Olivier Girardot :

> I have the same kind of issue (not using spark-sql-perf), just trying to
> deploy 2.0.0 on mesos.
> I'll keep you posted as I investigate
>
>
>
> On Wed, Jul 27, 2016 1:06 PM, kevin kiss.kevin...@gmail.com wrote:
>
>> hi,all:
>> I want to have a test about tpcds99 sql run on spark2.0.
>> I user https://github.com/databricks/spark-sql-perf
>>
>> about the master version ,when I run :val tpcds = new TPCDS (sqlContext =
>> sqlContext) I got error:
>>
>> scala> val tpcds = new TPCDS (sqlContext = sqlContext)
>> error: missing or invalid dependency detected while loading class file
>> 'Benchmarkable.class'.
>> Could not access term typesafe in package com,
>> because it (or its dependencies) are missing. Check your build definition
>> for
>> missing or conflicting dependencies. (Re-run with -Ylog-classpath to see
>> the problematic classpath.)
>> A full rebuild may help if 'Benchmarkable.class' was compiled against an
>> incompatible version of com.
>> error: missing or invalid dependency detected while loading class file
>> 'Benchmarkable.class'.
>> Could not access term scalalogging in value com.typesafe,
>> because it (or its dependencies) are missing. Check your build definition
>> for
>> missing or conflicting dependencies. (Re-run with -Ylog-classpath to see
>> the problematic classpath.)
>> A full rebuild may help if 'Benchmarkable.class' was compiled against an
>> incompatible version of com.typesafe.
>>
>> about spark-sql-perf-0.4.3 when I run
>> :tables.genData("hdfs://master1:9000/tpctest", "parquet", true, false,
>> false, false, false) I got error:
>>
>> Generating table catalog_sales in database to
>> hdfs://master1:9000/tpctest/catalog_sales with save mode Overwrite.
>> 16/07/27 18:59:59 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
>> slave1): java.lang.ClassCastException: cannot assign instance of
>> scala.collection.immutable.List$SerializationProxy to field
>> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type
>> scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
>>
>>
>
> *Olivier Girardot* | Associé
> o.girar...@lateral-thoughts.com
> +33 6 24 09 17 94
>


Re: welcoming Xiao Li as a committer

2016-10-04 Thread Kevin
Congratulations Xiao!!

Sent from my iPhone

> On Oct 4, 2016, at 3:59 AM, Tarun Kumar  wrote:
> 
> Congrats Xiao.
> 
> Thanks
> Tarun
>> On Tue, 4 Oct 2016 at 12:57 PM, Cheng Lian  wrote:
>> Congratulations!!!
>> 
>> 
>> Cheng
>> 
>> On Tue, Oct 4, 2016 at 1:46 PM, Reynold Xin  wrote:
>> Hi all,
>> 
>> Xiao Li, aka gatorsmile, has recently been elected as an Apache Spark 
>> committer. Xiao has been a super active contributor to Spark SQL. Congrats 
>> and welcome, Xiao!
>> 
>> - Reynold


Java Code Style

2021-02-20 Thread Pis Kevin
Hi,

I use the Google Java code style in IntelliJ IDEA. But when I reformat the following
code, it's inconsistent with the code in Spark.

Before reformat:
(inline screenshot not preserved in the archive)

After reformat:
(inline screenshot not preserved in the archive)

Why? And how can I fix this?


Re: Java Code Style

2021-02-21 Thread Kevin Pis
Ok, thanks.

Sean Owen wrote on Sunday, February 21, 2021 at 12:33 AM:

> Do you just mean you want to adjust the code style rules? Yes you can do
> that in IJ, just a matter of finding the indent rule to adjust.
> The Spark style is pretty normal stuff, though not 100% consistent. I
> prefer the first style in this case. Sometimes it's a matter of judgment
> when to differ from a standard style for better readability.
>
> On Sat, Feb 20, 2021 at 8:53 AM Pis Kevin  wrote:
>
>> Hi,
>>
>>
>>
>> I use google java code style in intellj idea. But when I reformat the
>> following codes, its  inconsistent with the code in spark.
>>
>>
>>
>> Before reformat:
>>
>>
>>
>> After reformat:
>>
>>
>>
>>
>>
>> Why? And how to fix the issue.
>>
>

-- 

Best,

Kevin Pis


Fail to run benchmark in Github Action

2021-06-25 Thread Kevin Su
Hi all,

I tried to run a benchmark test in GitHub Actions in my fork, and I hit the error below.
https://github.com/pingsutw/spark/runs/2867617238?check_suite_focus=true
java.lang.AssertionError: assertion failed: spark.test.home is not set!
at scala.Predef$.assert(Predef.scala:223)
at org.apache.spark.deploy.worker.Worker.(Worker.scala:148)
at org.apache.spark.deploy.worker.Worker$.startRpcEnvAndEndpoint(Worker.scala:954)
at org.apache.spark.deploy.LocalSparkCluster.$anonfun$start$2(LocalSparkCluster.scala:68)
at org.apache.spark.deploy.LocalSparkCluster.$anonfun$start$2$adapted(LocalSparkCluster.scala:65)
at scala.collection.immutable.Range.foreach(Range.scala:158)

After I added --driver-java-options "-Dspark.test.home=$GITHUB_WORKSPACE" to
benchmark.yml, I still got the error below.
https://github.com/pingsutw/spark/runs/2911027350?check_suite_focus=true
Do I need to set something up in my fork?
after 1900, vec on, rebase EXCEPTION 7474 7511 58 13.4 74.7 2.7X
after 1900, vec on, rebase LEGACY 9228 9296 60 10.8 92.3 2.2X
after 1900, vec on, rebase CORRECTED 7553 7678 128 13.2 75.5 2.7X
before 1900, vec off, rebase LEGACY 23280 23362 71 4.3 232.8 0.9X
before 1900, vec off, rebase CORRECTED 20548 20630 119 4.9 205.5 1.0X
before 1900, vec on, rebase LEGACY 12210 12239 37 8.2 122.1 1.7X
before 1900, vec on, rebase CORRECTED 7486 7489 2 13.4 74.9 2.7X

Running benchmark: Save TIMESTAMP_MICROS to parquet
Running case: after 1900, noop
Stopped after 1 iterations, 4003 ms
Running case: before 1900, noop
Stopped after 1 iterations, 3965 ms
Running case: after 1900, rebase EXCEPTION
Stopped after 1 iterations, 18339 ms
Running case: after 1900, rebase LEGACY
Stopped after 1 iterations, 18375 ms
Running case: after 1900, rebase CORRECTED
Stopped after 1 iterations, 18716 ms
Running case: before 1900, rebase LEGACY
Error: The operation was canceled.


How to run spark benchmark on standalone cluster?

2021-07-02 Thread Kevin Su
Hi all,

I want to run a Spark benchmark on a standalone cluster, so I changed the
DataSourceReadBenchmark.scala setup to remove the hard-coded "spark.master":

--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala
@@ -48,7 +48,6 @@ object DataSourceReadBenchmark extends SqlBasedBenchmark {
val conf = new SparkConf()
  .setAppName("DataSourceReadBenchmark")
  // Since `spark.master` always exists, overrides this value
-  .set("spark.master", "local[1]")
  .setIfMissing("spark.driver.memory", "3g")
  .setIfMissing("spark.executor.memory", "3g")

I ran the benchmark using the command below:

bin/spark-submit \
  --driver-memory 16g \
  --master spark://kobe-pc:7077 \
  --class org.apache.spark.benchmark.Benchmarks \
  --jars "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | paste -sd ',' -`" \
  "`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \
  "org.apache.spark.sql.execution.datasources.*"

I got the error below:

Driver stacktrace:
21/07/02 22:35:13 INFO DAGScheduler: Job 0 failed: apply at
BenchmarkBase.scala:42, took 1.374943 s
21/07/02 22:35:13 ERROR FileFormatWriter: Aborting job
a6ceeb0c-5f9d-44ca-a896-65d4a7b8b948.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in
stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0
(TID 7) (192.168.103.14 executor 0): java.lang.NoClassDefFoundError: Could
not initialize class
org.apache.spark.sql.execution.datasources.csv.CSVBenchmark$
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230)
at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1274)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2093)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1655)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2093)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1655)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2093)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1655)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)


How can I get the same spark context in two different python processes

2022-12-12 Thread Kevin Su
Hey there, How can I get the same spark context in two different python
processes?
Let’s say I create a context in Process A, and then I want to use python
subprocess B to get the spark context created by Process A. How can I
achieve that?

I've tried pyspark.sql.SparkSession.builder.appName("spark").getOrCreate(),
but it will create a new spark context.


Re: How can I get the same spark context in two different python processes

2022-12-12 Thread Kevin Su
Also, is there any way to work around this issue without using Spark Connect?

Kevin Su wrote on Mon, Dec 12, 2022 at 2:52 PM:

> nvm, I found the ticket.
> Also, is there any way to workaround this issue without using Spark
> connect?
>
> Kevin Su wrote on Mon, Dec 12, 2022 at 2:42 PM:
>
>> Thanks for the quick response? Do we have any PR or Jira ticket for it?
>>
>> Reynold Xin wrote on Mon, Dec 12, 2022 at 2:39 PM:
>>
>>> Spark Connect :)
>>>
>>> (It’s work in progress)
>>>
>>>
>>> On Mon, Dec 12 2022 at 2:29 PM, Kevin Su  wrote:
>>>
>>>> Hey there, How can I get the same spark context in two different python
>>>> processes?
>>>> Let’s say I create a context in Process A, and then I want to use
>>>> python subprocess B to get the spark context created by Process A. How can
>>>> I achieve that?
>>>>
>>>> I've
>>>> tried pyspark.sql.SparkSession.builder.appName("spark").getOrCreate(), but
>>>> it will create a new spark context.
>>>>
>>>


Re: How can I get the same spark context in two different python processes

2022-12-12 Thread Kevin Su
I ran my Spark job as a Databricks job with a single Python script.
IIUC, the Databricks platform will create a Spark context for this Python script.
However, I create a new subprocess in this script and run some Spark code in the
subprocess, and the subprocess can't find the context created by Databricks.
I'm not sure if there is any API I can use to get the default context.

bo yang wrote on Mon, Dec 12, 2022 at 3:27 PM:

> In theory, maybe a Jupyter notebook or something similar could achieve
> this? e.g. running some Jypyter kernel inside Spark driver, then another
> Python process could connect to that kernel.
>
> But in the end, this is like Spark Connect :)
>
>
> On Mon, Dec 12, 2022 at 2:55 PM Kevin Su  wrote:
>
>> Also, is there any way to workaround this issue without using Spark
>> connect?
>>
>> Kevin Su wrote on Mon, Dec 12, 2022 at 2:52 PM:
>>
>>> nvm, I found the ticket.
>>> Also, is there any way to workaround this issue without using Spark
>>> connect?
>>>
>>> Kevin Su wrote on Mon, Dec 12, 2022 at 2:42 PM:
>>>
>>>> Thanks for the quick response? Do we have any PR or Jira ticket for it?
>>>>
>>>> Reynold Xin wrote on Mon, Dec 12, 2022 at 2:39 PM:
>>>>
>>>>> Spark Connect :)
>>>>>
>>>>> (It’s work in progress)
>>>>>
>>>>>
>>>>> On Mon, Dec 12 2022 at 2:29 PM, Kevin Su  wrote:
>>>>>
>>>>>> Hey there, How can I get the same spark context in two different
>>>>>> python processes?
>>>>>> Let’s say I create a context in Process A, and then I want to use
>>>>>> python subprocess B to get the spark context created by Process A. How 
>>>>>> can
>>>>>> I achieve that?
>>>>>>
>>>>>> I've
>>>>>> tried pyspark.sql.SparkSession.builder.appName("spark").getOrCreate(), 
>>>>>> but
>>>>>> it will create a new spark context.
>>>>>>
>>>>>


Re: How can I get the same spark context in two different python processes

2022-12-12 Thread Kevin Su
Maciej, Thanks for the reply.
Could you share an example to achieve it?

Maciej wrote on Mon, Dec 12, 2022 at 4:41 PM:

> Technically speaking, it is possible in stock distribution (can't speak
> for Databricks) and not super hard to do (just check out how we
> initialize sessions), but definitely not something that we test or
> support, especially in a scenario you described.
>
> If you want to achieve concurrent execution, multithreading is normally
> more than sufficient and avoids problems with the context.
>
>
>
> On 12/13/22 00:40, Kevin Su wrote:
> > I ran my spark job by using databricks job with a single python script.
> > IIUC, the databricks platform will create a spark context for this
> > python script.
> > However, I create a new subprocess in this script and run some spark
> > code in this subprocess, but this subprocess can't find the
> > context created by databricks.
> > Not sure if there is any api I can use to get the default context.
> >
> > bo yang (bobyan...@gmail.com) wrote on Mon, Dec 12, 2022 at 3:27 PM:
> >
> > In theory, maybe a Jupyter notebook or something similar could
> > achieve this? e.g. running some Jypyter kernel inside Spark driver,
> > then another Python process could connect to that kernel.
> >
> > But in the end, this is like Spark Connect :)
> >
> >
> > On Mon, Dec 12, 2022 at 2:55 PM Kevin Su (pings...@gmail.com) wrote:
> >
> > Also, is there any way to workaround this issue without
> > using Spark connect?
> >
> > Kevin Su (pings...@gmail.com) wrote on Mon, Dec 12, 2022 at 2:52 PM:
> >
> > nvm, I found the ticket.
> > Also, is there any way to workaround this issue without
> > using Spark connect?
> >
> > Kevin Su (pings...@gmail.com) wrote on Mon, Dec 12, 2022 at 2:42 PM:
> >
> > Thanks for the quick response? Do we have any PR or Jira
> > ticket for it?
> >
> > Reynold Xin (r...@databricks.com) wrote on Mon, Dec 12, 2022 at 2:39 PM:
> >
> > Spark Connect :)
> >
> > (It’s work in progress)
> >
> >
> > On Mon, Dec 12 2022 at 2:29 PM, Kevin Su
> > mailto:pings...@gmail.com>>
> wrote:
> >
> > Hey there, How can I get the same spark context
> > in two different python processes?
> > Let’s say I create a context in Process A, and
> > then I want to use python subprocess B to get
> > the spark context created by Process A. How can
> > I achieve that?
> >
> > I've
> >
>  tried pyspark.sql.SparkSession.builder.appName("spark").getOrCreate(), but
> it will create a new spark context.
> >
>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>
>


Re: How can I get the same spark context in two different python processes

2022-12-12 Thread Kevin Su
Hi Jack,

My use case is a bit different: I create a subprocess instead of a thread, so I
can't pass the args to the subprocess.

Jack Goodson wrote on Mon, Dec 12, 2022 at 8:03 PM:

> apologies, the code should read as below
>
> from threading import Thread
>
> context = pyspark.sql.SparkSession.builder.appName("spark").getOrCreate()
>
> t1 = Thread(target=my_func, args=(context,))
> t1.start()
>
> t2 = Thread(target=my_func, args=(context,))
> t2.start()
>
> On Tue, Dec 13, 2022 at 4:10 PM Jack Goodson 
> wrote:
>
>> Hi Kevin,
>>
>> I had a similar use case (see below code) but with something that wasn’t
>> spark related. I think the below should work for you, you may need to edit
>> the context variable to suit your needs but hopefully it gives the general
>> idea of sharing a single object between multiple threads.
>>
>> Thanks
>>
>>
>> from threading import Thread
>>
>> context = pyspark.sql.SparkSession.builder.appName("spark").getOrCreate()
>>
>> t1 = Thread(target=order_creator, args=(app_id, sleep_time,))
>> t1.start(target=my_func, args=(context,))
>>
>> t2 = Thread(target=order_creator, args=(app_id, sleep_time,))
>> t2.start(target=my_func, args=(context,))
>>
>


RE: Kicking off the process around Spark 2.2.1

2017-11-02 Thread Kevin Grealish
Any update on expected 2.2.1 (or 2.3.0) release process?

From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Thursday, October 26, 2017 10:04 AM
To: Sean Owen ; Holden Karau 
Cc: dev@spark.apache.org
Subject: Re: Kicking off the process around Spark 2.2.1

Yes! I can take on RM for 2.2.1.

We are still working out what to do with temp files created by Hive and Java 
that cause the policy issue with CRAN and will report back shortly, hopefully.


From: Sean Owen mailto:so...@cloudera.com>>
Sent: Wednesday, October 25, 2017 4:39:15 AM
To: Holden Karau
Cc: Felix Cheung; dev@spark.apache.org
Subject: Re: Kicking off the process around Spark 2.2.1

It would be reasonably consistent with the timing of other x.y.1 releases, and 
more release managers sounds useful, yeah.

Note also that in theory the code freeze for 2.3.0 starts in about 2 weeks.

On Wed, Oct 25, 2017 at 12:29 PM Holden Karau 
mailto:hol...@pigscanfly.ca>> wrote:
Now that Spark 2.1.2 is out it seems like now is a good time to get started on 
the Spark 2.2.1 release. There are some streaming fixes I'm aware of that would 
be good to get into a release, is there anything else people are working on for 
2.2.1 we should be tracking?

To switch it up I'd like to suggest Felix to be the RM for this since there are 
also likely some R packaging changes to be included in the release. This also 
gives us a chance to see if my updated release documentation is enough for a 
new RM to get started from.

What do folks think?
--
Twitter: 
https://twitter.com/holdenkarau


regression: no longer able to use HDFS wasbs:// path for additional python files on LIVY batch submit

2016-09-30 Thread Kevin Grealish
I'm seeing a regression when submitting a batch PySpark program with additional 
files using LIVY. This is YARN cluster mode. The program files are placed into 
the mounted Azure Storage before making the call to LIVY. This is happening 
from an application which has credentials for the storage and the LIVY 
endpoint, but not local file systems on the cluster. This previously worked but 
now I'm getting the error below.

Seems this restriction was introduced with 
https://github.com/apache/spark/commit/5081a0a9d47ca31900ea4de570de2cbb0e063105 
(new in 1.6.2 and 2.0.0).

How should the scenario above be achieved now? Am I missing something?


Exception in thread "main" java.lang.IllegalArgumentException: Launching Python 
applications through spark-submit is currently only supported for local files: 
wasb://kevingreclust...@.blob.core.windows.net/x/xxx.py
at 
org.apache.spark.deploy.PythonRunner$.formatPath(PythonRunner.scala:104)
at 
org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136)
at 
org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at 
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.deploy.PythonRunner$.formatPaths(PythonRunner.scala:136)
at 
org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$11.apply(SparkSubmit.scala:639)
at 
org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$11.apply(SparkSubmit.scala:637)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:637)
at 
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:154)
at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
java.lang.Exception: spark-submit exited with code 1}.



RE: regression: no longer able to use HDFS wasbs:// path for additional python files on LIVY batch submit

2016-10-03 Thread Kevin Grealish
Great. Thanks for the pointer. I see the fix is in 2.0.1-rc4.

Will there be a 1.6.3? If so, how are fixes considered for backporting?

From: Steve Loughran [mailto:ste...@hortonworks.com]
Sent: Monday, October 3, 2016 5:40 AM
To: Kevin Grealish 
Cc: Apache Spark Dev 
Subject: Re: regression: no longer able to use HDFS wasbs:// path for 
additional python files on LIVY batch submit


On 1 Oct 2016, at 02:49, Kevin Grealish (kevin...@microsoft.com) wrote:

I’m seeing a regression when submitting a batch PySpark program with additional 
files using LIVY. This is YARN cluster mode. The program files are placed into 
the mounted Azure Storage before making the call to LIVY. This is happening 
from an application which has credentials for the storage and the LIVY 
endpoint, but not local file systems on the cluster. This previously worked but 
now I’m getting the error below.

Seems this restriction was introduced with 
https://github.com/apache/spark/commit/5081a0a9d47ca31900ea4de570de2cbb0e063105
(new in 1.6.2 and 2.0.0).

How should the scenario above be achieved now? Am I missing something?

This has been fixed in https://issues.apache.org/jira/browse/SPARK-17512; I don't
know if it's in 2.0.1 though.



Exception in thread "main" java.lang.IllegalArgumentException: Launching Python 
applications through spark-submit is currently only supported for local files: 
wasb://kevingreclust...@.blob.core.windows.net/x/xxx.py
at 
org.apache.spark.deploy.PythonRunner$.formatPath(PythonRunner.scala:104)
at 
org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136)
at 
org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at 
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.deploy.PythonRunner$.formatPaths(PythonRunner.scala:136)
at 
org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$11.apply(SparkSubmit.scala:639)
at 
org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$11.apply(SparkSubmit.scala:637)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:637)
at 
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:154)
at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
java.lang.Exception: spark-submit exited with code 1}.



Re: Welcoming Saisai (Jerry) Shao as a committer

2017-08-29 Thread Kevin Yu
Congratulations, Jerry!

On Tue, Aug 29, 2017 at 6:35 AM, Meisam Fathi 
wrote:

> Congratulations, Jerry!
>
> Thanks,
> Meisam
>
> On Tue, Aug 29, 2017 at 1:13 AM Wang, Carson 
> wrote:
>
>> Congratulations, Saisai!
>>
>>
>> -Original Message-
>> From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
>> Sent: Tuesday, August 29, 2017 9:29 AM
>> To: dev 
>> Cc: Saisai Shao 
>> Subject: Welcoming Saisai (Jerry) Shao as a committer
>>
>> Hi everyone,
>>
>> The PMC recently voted to add Saisai (Jerry) Shao as a committer. Saisai
>> has been contributing to many areas of the project for a long time, so it’s
>> great to see him join. Join me in thanking and congratulating him!
>>
>> Matei
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-17 Thread Kevin Markey

+0 (non-binding)

Compiled Spark, recompiled and ran application with 1.1.1 RC1 with Yarn, 
plain-vanilla Hadoop 2.3.0. No regressions.


However, 12% to 22% increase in run time relative to 1.0.0 release.  (No 
other environment or configuration changes.)  Would have recommended +1 
were it not for added latency.


Not sure if the added latency is a function of 1.0 vs 1.1 or 1.0 vs 1.1.1 
changes, as we've never tested with 1.1.0. But I thought I'd share the 
results.  (This is somewhat disappointing.)


Kevin Markey

On 11/17/2014 11:42 AM, Debasish Das wrote:

Andrew,

I put up 1.1.1 branch and I am getting shuffle failures while doing flatMap
followed by groupBy...My cluster memory is less than the memory I need and
therefore flatMap does around 400 GB of shuffle...memory is around 120 GB...

14/11/13 23:10:49 WARN TaskSetManager: Lost task 22.1 in stage 191.0 (TID
4084, istgbd020.hadoop.istg.verizon.com): FetchFailed(null, shuffleId=4,
mapId=-1, reduceId=22)

I searched on user-list and this issue has been found over there:

http://apache-spark-user-list.1001560.n3.nabble.com/Issues-with-partitionBy-FetchFailed-td14760.html

I wanted to make sure whether 1.1.1 does not have the same bug...-1 from me
till we figure out the root cause...

Thanks.

Deb

On Mon, Nov 17, 2014 at 10:33 AM, Andrew Or  wrote:


This seems like a legitimate blocker. We will cut another RC to include the
revert.

2014-11-16 17:29 GMT-08:00 Kousuke Saruta :


Now I've finished to revert for SPARK-4434 and opened PR.


(2014/11/16 17:08), Josh Rosen wrote:


-1

I found a potential regression in 1.1.1 related to spark-submit and
cluster
deploy mode: https://issues.apache.org/jira/browse/SPARK-4434

I think that this is worth fixing.

On Fri, Nov 14, 2014 at 7:28 PM, Cheng Lian 
wrote:

  +1


Tested HiveThriftServer2 against Hive 0.12.0 on Mac OS X. Known issues
are
fixed. Hive version inspection works as expected.


On 11/15/14 8:25 AM, Zach Fry wrote:

  +0


I expect to start testing on Monday but won't have enough results to
change
my vote from +0
until Monday night or Tuesday morning.

Thanks,
Zach



--
View this message in context: http://apache-spark-
developers-list.1001551.n3.nabble.com/VOTE-Release-
Apache-Spark-1-1-1-RC1-tp9311p9370.html
Sent from the Apache Spark Developers List mailing list archive at
Nabble.com.




Re: enum-like types in Spark

2015-03-16 Thread Kevin Markey
In some applications, I have rather heavy use of Java enums which are 
needed for related Java APIs that the application uses.  And 
unfortunately, they are also used as keys.  As such, using the native 
hashcodes makes any function over keys unstable and unpredictable, so we 
now use Enum.name() as the key instead.  Oh well.  But it works and 
seems to work well.


Kevin

On 03/05/2015 09:49 PM, Mridul Muralidharan wrote:

   I have a strong dislike for java enum's due to the fact that they
are not stable across JVM's - if it undergoes serde, you end up with
unpredictable results at times [1].
One of the reasons why we prevent enum's from being key : though it is
highly possible users might depend on it internally and shoot
themselves in the foot.

Would be better to keep away from them in general and use something more stable.

Regards,
Mridul

[1] Having had to debug this issue for 2 weeks - I really really hate it.


On Thu, Mar 5, 2015 at 1:08 PM, Imran Rashid  wrote:

I have a very strong dislike for #1 (scala enumerations).   I'm ok with #4
(with Xiangrui's final suggestion, especially making it sealed & available
in Java), but I really think #2, java enums, are the best option.

Java enums actually have some very real advantages over the other
approaches -- you get values(), valueOf(), EnumSet, and EnumMap.  There has
been endless debate in the Scala community about the problems with the
approaches in Scala.  Very smart, level-headed Scala gurus have complained
about their short-comings (Rex Kerr's name is coming to mind, though I'm
not positive about that); there have been numerous well-thought out
proposals to give Scala a better enum.  But the powers-that-be in Scala
always reject them.  IIRC the explanation for rejecting is basically that
(a) enums aren't important enough for introducing some new special feature,
scala's got bigger things to work on and (b) if you really need a good
enum, just use java's enum.
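
For reference, a small Java sketch of the built-ins mentioned above (values(), valueOf(), EnumSet, EnumMap); the Mode enum here is purely illustrative, not part of any Spark API:

import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

public class JavaEnumFeatures {
  // Purely illustrative enum; not part of any Spark API.
  enum Mode { LOCAL, STANDALONE, YARN }

  public static void main(String[] args) {
    Mode[] all = Mode.values();                           // every constant, in declaration order
    Mode parsed = Mode.valueOf("YARN");                   // name -> constant, throws on bad input
    EnumSet<Mode> clusterModes = EnumSet.of(Mode.STANDALONE, Mode.YARN);
    Map<Mode, String> notes = new EnumMap<>(Mode.class);  // compact, ordinal-indexed map
    notes.put(Mode.LOCAL, "single JVM");
    System.out.println(all.length + " modes; parsed=" + parsed
        + "; clusterModes=" + clusterModes + "; notes=" + notes);
  }
}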

I doubt it really matters that much for Spark internals, which is why I
think #4 is fine.  But I figured I'd give my spiel, because every developer
loves language wars :)

Imran



On Thu, Mar 5, 2015 at 1:35 AM, Xiangrui Meng  wrote:


`case object` inside an `object` doesn't show up in Java. This is the
minimal code I found to make everything show up correctly in both
Scala and Java:

sealed abstract class StorageLevel // cannot be a trait

object StorageLevel {
   private[this] case object _MemoryOnly extends StorageLevel
   final val MemoryOnly: StorageLevel = _MemoryOnly

   private[this] case object _DiskOnly extends StorageLevel
   final val DiskOnly: StorageLevel = _DiskOnly
}
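
As a rough sketch of what the pattern above is expected to look like from the Java side (assuming scalac emits its usual static forwarders for the companion object; this compiles against the snippet above, not against Spark's real StorageLevel):

public class StorageLevelFromJava {
  public static void main(String[] args) {
    // Note the trailing "()": from Java the Scala val is a zero-arg getter.
    // If no static forwarder were generated, the equivalent call would be
    // StorageLevel$.MODULE$.MemoryOnly().
    StorageLevel level = StorageLevel.MemoryOnly();
    // The values are singletons, so reference comparison is well defined.
    System.out.println(level == StorageLevel.DiskOnly());  // false
  }
}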

On Wed, Mar 4, 2015 at 8:10 PM, Patrick Wendell 
wrote:

I like #4 as well and agree with Aaron's suggestion.

- Patrick

On Wed, Mar 4, 2015 at 6:07 PM, Aaron Davidson 

wrote:

I'm cool with #4 as well, but make sure we dictate that the values should be
defined within an object with the same name as the enumeration (like we do for
StorageLevel). Otherwise we may pollute a higher namespace.

e.g. we SHOULD do:

trait StorageLevel
object StorageLevel {
   case object MemoryOnly extends StorageLevel
   case object DiskOnly extends StorageLevel
}

On Wed, Mar 4, 2015 at 5:37 PM, Michael Armbrust <

mich...@databricks.com>

wrote:


#4 with a preference for CamelCaseEnums

On Wed, Mar 4, 2015 at 5:29 PM, Joseph Bradley 
wrote:


another vote for #4
People are already used to adding "()" in Java.


On Wed, Mar 4, 2015 at 5:14 PM, Stephen Boesch 

wrote:

#4 but with MemoryOnly (more scala-like)

http://docs.scala-lang.org/style/naming-conventions.html

Constants, Values, Variable and Methods

Constant names should be in upper camel case. That is, if the member is
final, immutable and it belongs to a package object or an object, it may be
considered a constant (similar to Java's static final members):

object Container {
  val MyConstant = ...
}


2015-03-04 17:11 GMT-08:00 Xiangrui Meng :


Hi all,

There are many places where we use enum-like types in Spark, but

in

different ways. Every approach has both pros and cons. I wonder
whether there should be an "official" approach for enum-like

types in

Spark.

1. Scala's Enumeration (e.g., SchedulingMode, WorkerState, etc)

* All types show up as Enumeration.Value in Java.



http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SchedulingMode.html

2. Java's Enum (e.g., SaveMode, IOMode)

* Implementation must be in a Java file.
* Values doesn't show up in the ScalaDoc:



http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.network.util.IOMode

3. Static fields in Java (e.g., TripletFields)

* Implementation must be in a Java file.
* Doesn't need "()" in Java code.
* Values don't show up in the ScalaDoc:



http://spark.apache.org/docs/latest/api/scala/#org.ap

Re: Change for submitting to yarn in 1.3.1

2015-05-12 Thread Kevin Markey

We have the same issue.  As a result, we are stuck back on 1.0.2.

Not being able to programmatically interface directly with the Yarn 
client to obtain the application id is a show stopper for us, which is a 
real shame given the Yarn enhancements in 1.2, 1.3, and 1.4.


I understand that SparkLauncher was supposed to address these issues, 
but it really doesn't.  Yarn already provides indirection and an arm's 
length transaction for starting Spark on a cluster. The launcher 
introduces yet another layer of indirection and dissociates the Yarn 
Client from the application that launches it.


I am still reading the newest code, and we are still researching options 
to move forward.  If there are alternatives, we'd like to know.


Kevin Markey


On 05/11/2015 01:36 AM, Mridul Muralidharan wrote:

That works when it is launched from same process - which is
unfortunately not our case :-)

- Mridul

On Sun, May 10, 2015 at 9:05 PM, Manku Timma  wrote:

sc.applicationId gives the yarn appid.

On 11 May 2015 at 08:13, Mridul Muralidharan  wrote:

We had a similar requirement, and as a stopgap, I currently use a
suboptimal impl specific workaround - parsing it out of the
stdout/stderr (based on log config).
A better means to get to this is indeed required !

Regards,
Mridul

On Sun, May 10, 2015 at 7:33 PM, Ron's Yahoo!
 wrote:

Hi,
   I used to submit my Spark yarn applications by using
org.apache.spark.yarn.deploy.Client api so I can get the application id
after I submit it. The following is the code that I have, but after
upgrading to 1.3.1, the yarn Client class was made into a private class. Is
there a particular reason why this Client class was made private?
   I know that there’s a new SparkSubmit object that can be used, but
it’s not clear to me how I can use it to get the application id after
submitting to the cluster.
   Thoughts?

Thanks,
Ron

class SparkLauncherServiceImpl extends SparkLauncherService {

   override def runApp(conf: Configuration, appName: String, queue:
String): ApplicationId = {
 val ws = SparkLauncherServiceImpl.getWorkspace()
 val params = Array("--class", //
 "com.xyz.sparkdb.service.impl.AssemblyServiceImpl", //
 "--name", appName, //
 "--queue", queue, //
 "--driver-memory", "1024m", //
 "--addJars",
getListOfDependencyJars(s"$ws/ledp/le-sparkdb/target/dependency"), //
 "--jar",
s"file:$ws/ledp/le-sparkdb/target/le-sparkdb-1.0.3-SNAPSHOT.jar")
 System.setProperty("SPARK_YARN_MODE", "true")
 System.setProperty("spark.driver.extraJavaOptions",
"-XX:PermSize=128m -XX:MaxPermSize=128m
-Dsun.io.serialization.extendedDebugInfo=true")
 val sparkConf = new SparkConf()
 val args = new ClientArguments(params, sparkConf)
 new Client(args, conf, sparkConf).runApp()
   }

   private def getListOfDependencyJars(baseDir: String): String = {
 val files = new
File(baseDir).listFiles().filter(!_.getName().startsWith("spark-assembly"))
 val prependedFiles = files.map(x => "file:" + x.getAbsolutePath())
 val result = ((prependedFiles.tail.foldLeft(new
StringBuilder(prependedFiles.head))) {(acc, e) => acc.append(",
").append(e)}).toString()
 result
   }
}





Re: Change for submitting to yarn in 1.3.1

2015-05-21 Thread Kevin Markey
The reason we made the API private was that it was never
intended to be used by third parties programmatically and we don't
intend to support it in its current form as a stable API. We thought
the fact that it was for internal use would be obvious since it
accepts arguments as a string array of CL args. It was always intended
for command line use and the stable API was the command line.

When we migrated the Launcher library we figured we covered most of
the use cases in the off chance someone was using the Client. It
appears we regressed one feature which was a clean way to get the app
ID.

The items you list here 2-6 all seem like new feature requests rather
than a regression caused by us making that API private.

I think the way to move forward is for someone to design a proper
long-term stable API for the things you mentioned here. That could
either be by extension of the Launcher library. Marcelo would be
natural to help with this effort since he was heavily involved in both
YARN support and the launcher. So I'm curious to hear his opinion on
how best to move forward.

I do see how apps that run Spark would benefit from having a control
plane for querying status, both on YARN and elsewhere.

- Patrick

On Wed, May 13, 2015 at 5:44 AM, Chester At Work  wrote:

  
Patrick
 There are several things we need, some of them already mentioned in the mailing list before.

I haven't looked at the SparkLauncher code, but here are a few things we need from our perspective for the Spark Yarn Client:

 1) The Client should not be private (unless an alternative is provided) so we can call it directly.
 2) We need a way to stop a running yarn app programmatically (the PR is already submitted).
 3) Before we start the spark job, we should have a callback to the application that provides the yarn container capacity (number of cores and max memory), so the spark program will not set values beyond those maximums (PR submitted).
 4) Callbacks could take the form of yarn app listeners, invoked on yarn status changes (start, in progress, failure, complete, etc.), so the application can react to those events (in PR).

 5) The yarn client passes arguments to the spark program through its main method, and we have run into problems when passing very large arguments because of the length limit. For example, we serialize the argument to json, encode it, and then parse it as an argument; for wide-column datasets we hit the limit. An alternative way of passing larger arguments is therefore needed. We are experimenting with passing the args via an established akka messaging channel.

6) The spark yarn client in yarn-cluster mode is currently essentially a batch job with no communication once it is launched. We need to establish a communication channel so that logs, errors, status updates, progress bars, execution stages, etc. can be displayed on the application side. We added an akka communication channel for this (working on a PR).

   Combined with other items in this list, we are able to redirect print and error statements to the application log (outside of the hadoop cluster) and to show a spark-UI-equivalent progress bar via a spark listener. We can show yarn progress via the yarn app listener before spark starts, and status can be updated during job execution.

We are also experimenting with long-running jobs that take additional spark commands and interactions via this channel.


 Chester









Sent from my iPad

On May 12, 2015, at 20:54, Patrick Wendell  wrote:



  Hey Kevin and Ron,

So is the main shortcoming of the launcher library the inability to
get an app ID back from YARN? Or are there other issues here that
fundamentally regress things for you.

It seems like adding a way to get back the appID would be a reasonable
addition to the launcher.

- Patrick

On Tue, May 12, 2015 at 12:51 PM, Marcelo Vanzin  wrote:

  
On Tue, May 12, 2015 at 11:34 AM, Kevin Markey 
wrote:



  I understand that SparkLauncher was supposed to address these issues, but
it really doesn't.  Yarn already provides indirection and an arm's length
transaction for starting Spark on a cluster. The launcher introduces yet
another layer of indirection and dissociates the Yarn Client from the
application that launches it.




Well, not fully. The launcher was supposed to solve "how to launch a Spark
app programatically", but in the first version nothing was added to
actually gather information about the running app. It's also limited in the
way it works because of Spark's limitations (one context per JVM, etc).

Still, adding things like this is something that is definitely in the scope
for the launcher library; information such as app id can be useful for the
code launching the app, not just in yarn mode. We just have to find a clean
way to provide that information to the caller.
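
For reference, a minimal Java sketch of what going through the launcher library looks like at this point; the jar path, main class, and conf values are placeholders. Note that nothing returned here exposes the YARN application ID, which is exactly the gap under discussion:

import org.apache.spark.launcher.SparkLauncher;

public class LaunchViaLauncher {
  public static void main(String[] args) throws Exception {
    Process spark = new SparkLauncher()
        .setAppResource("/path/to/my-app.jar")   // placeholder: assembled application jar
        .setMainClass("com.example.MyApp")       // placeholder: application entry point
        .setMaster("yarn-cluster")
        .setConf("spark.driver.memory", "1g")
        .launch();                               // returns a plain java.lang.Process
    int exitCode = spark.waitFor();              // no handle to the YARN app id here
    System.out.println("spark-submit exited with " + exitCode);
  }
}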




  I am still reading the newest code, and we are still researching options to move forward.

Deserializing JSON into Scala objects in Java code

2015-09-08 Thread Kevin Chen
Hello Spark Devs,

 I am trying to use the new Spark API json endpoints at /api/v1/[path]
(added in SPARK-3454).

 In order to minimize maintenance on our end, I would like to use
Retrofit/Jackson to parse the json directly into the Scala classes in
org/apache/spark/status/api/v1/api.scala (ApplicationInfo,
ApplicationAttemptInfo, etc…). However, Jackson does not seem to know how to
handle Scala Seqs, and will throw an error when trying to parse the
attempts: Seq[ApplicationAttemptInfo] field of ApplicationInfo. Our codebase
is in Java.

 My questions are:
1. Do you have any recommendations on how to easily deserialize Scala
objects from json? For example, do you have any current usage examples of
SPARK-3454 with Java?
2. Alternatively, are you committed to the json formats of /api/v1/path? I
would guess so, because of the ‘v1’, but wanted to confirm. If so, I could
deserialize the json into instances of my own Java classes instead, without
worrying about changing the class structure later due to changes in the
Spark API.
Some further information:
* The error I am getting with Jackson when trying to deserialize the json
into ApplicationInfo is Caused by:
com.fasterxml.jackson.databind.JsonMappingException: Can not construct
instance of scala.collection.Seq, problem: abstract types either need to be
mapped to concrete types, have custom deserializer, or be instantiated with
additional type information
* I tried using Jackson’s DefaultScalaModule, which seems to have support
for Scala Seqs, but got no luck.
* Deserialization works if the Scala class does not have any Seq fields, and
works if the fields are Java Lists instead of Seqs.
Thanks very much for your help!
Kevin Chen







Re: Deserializing JSON into Scala objects in Java code

2015-09-08 Thread Kevin Chen
Hi Marcelo,

 Thanks for the quick response. I understand that I can just write my own
Java classes (I will use that as a fallback option), but in order to avoid
code duplication and further possible changes, I was hoping there would be
a way to use the Spark API classes directly, since it seems there should
be.

 I registered the Scala module in the same way (except in Java instead of
Scala),

mapper.registerModule(new DefaultScalaModule());

But I don’t think the module is being used/registered properly? Do you
happen to know whether the above line should work in Java?
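
For concreteness, a minimal Java sketch of the registration pattern in question; ApplicationInfo is the Spark class named earlier in the thread, the JSON string is assumed to be a single record from /api/v1/applications, and whether it binds cleanly is exactly what this thread is trying to establish:

import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.module.scala.DefaultScalaModule;
import org.apache.spark.status.api.v1.ApplicationInfo;

public class ParseApplicationInfo {
  public static void main(String[] args) throws Exception {
    String json = args[0];  // assumed: one application record from /api/v1/applications

    ObjectMapper mapper = new ObjectMapper();
    // The Scala module must be registered on the *same* mapper instance that performs
    // readValue(); a second, unregistered ObjectMapper elsewhere will still fail on Seq.
    mapper.registerModule(new DefaultScalaModule());
    mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);

    ApplicationInfo info = mapper.readValue(json, ApplicationInfo.class);
    System.out.println(info.id() + " has " + info.attempts().size() + " attempt(s)");
  }
}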



On 9/8/15, 12:55 PM, "Marcelo Vanzin"  wrote:

>Hi Kevin,
>
>How did you try to use the Scala module? Spark has this code when
>setting up the ObjectMapper used to generate the output:
>
>  mapper.registerModule(com.fasterxml.jackson.module.scala.DefaultScalaModule)
>
>As for supporting direct serialization to Java objects, I don't think
>that was the goal of the API. The Scala API classes are public mostly
>so that API compatibility checks are performed against them. If you
>don't mind the duplication, you could write your own Java POJOs that
>mirror the Scala API, and use them to deserialize the JSON.
>
>
>On Tue, Sep 8, 2015 at 12:46 PM, Kevin Chen  wrote:
>> Hello Spark Devs,
>>
>>  I am trying to use the new Spark API json endpoints at /api/v1/[path]
>> (added in SPARK-3454).
>>
>>  In order to minimize maintenance on our end, I would like to use
>> Retrofit/Jackson to parse the json directly into the Scala classes in
>> org/apache/spark/status/api/v1/api.scala (ApplicationInfo,
>> ApplicationAttemptInfo, etc…). However, Jackson does not seem to know
>>how to
>> handle Scala Seqs, and will throw an error when trying to parse the
>> attempts: Seq[ApplicationAttemptInfo] field of ApplicationInfo. Our
>>codebase
>> is in Java.
>>
>>  My questions are:
>>
>> Do you have any recommendations on how to easily deserialize Scala
>>objects
>> from json? For example, do you have any current usage examples of
>>SPARK-3454
>> with Java?
>> Alternatively, are you committed to the json formats of /api/v1/path? I
>> would guess so, because of the ‘v1’, but wanted to confirm. If so, I
>>could
>> deserialize the json into instances of my own Java classes instead,
>>without
>> worrying about changing the class structure later due to changes in the
>> Spark API.
>>
>> Some further information:
>>
>> The error I am getting with Jackson when trying to deserialize the json
>>into
>> ApplicationInfo is Caused by:
>> com.fasterxml.jackson.databind.JsonMappingException: Can not construct
>> instance of scala.collection.Seq, problem: abstract types either need
>>to be
>> mapped to concrete types, have custom deserializer, or be instantiated
>>with
>> additional type information
>> I tried using Jackson’s DefaultScalaModule, which seems to have support
>>for
>> Scala Seqs, but got no luck.
>> Deserialization works if the Scala class does not have any Seq fields,
>>and
>> works if the fields are Java Lists instead of Seqs.
>>
>> Thanks very much for your help!
>> Kevin Chen
>>
>
>
>
>-- 
>Marcelo




Re: Deserializing JSON into Scala objects in Java code

2015-09-09 Thread Kevin Chen
Marcelo and Christopher,

 Thanks for your help! The problem turned out to arise from a different part
of the code (we have multiple ObjectMappers), but because I am not very
familiar with Jackson I had thought there was a problem with the Scala
module.

Thank you again,
Kevin

From:  Christopher Currie 
Date:  Wednesday, September 9, 2015 at 10:17 AM
To:  Kevin Chen , "dev@spark.apache.org"

Cc:  Matt Cheah , Mingyu Kim 
Subject:  Fwd: Deserializing JSON into Scala objects in Java code

Kevin,

I'm not a Spark dev, but I maintain the Scala module for Jackson. If you're
continuing to have issues with parsing JSON using the Spark Scala datatypes,
let me know or chime in on the jackson mailing list
(jackson-u...@googlegroups.com) and I'll see what I can do to help.

Christopher Currie

-- Forwarded message --
From: Paul Brown 
Date: Tue, Sep 8, 2015 at 8:58 PM
Subject: Fwd: Deserializing JSON into Scala objects in Java code
To: Christopher Currie 


Passing along. 

-- Forwarded message --
From: Kevin Chen 
Date: Tuesday, September 8, 2015
Subject: Deserializing JSON into Scala objects in Java code
To: "dev@spark.apache.org" 
Cc: Matt Cheah , Mingyu Kim 


Hello Spark Devs,

 I am trying to use the new Spark API json endpoints at /api/v1/[path]
(added in SPARK-3454).

 In order to minimize maintenance on our end, I would like to use
Retrofit/Jackson to parse the json directly into the Scala classes in
org/apache/spark/status/api/v1/api.scala (ApplicationInfo,
ApplicationAttemptInfo, etc…). However, Jackson does not seem to know how to
handle Scala Seqs, and will throw an error when trying to parse the
attempts: Seq[ApplicationAttemptInfo] field of ApplicationInfo. Our codebase
is in Java.

 My questions are:
1. Do you have any recommendations on how to easily deserialize Scala
objects from json? For example, do you have any current usage examples of
SPARK-3454 with Java?
2. Alternatively, are you committed to the json formats of /api/v1/path? I
would guess so, because of the ‘v1’, but wanted to confirm. If so, I could
deserialize the json into instances of my own Java classes instead, without
worrying about changing the class structure later due to changes in the
Spark API.
Some further information:
* The error I am getting with Jackson when trying to deserialize the json
into ApplicationInfo is Caused by:
com.fasterxml.jackson.databind.JsonMappingException: Can not construct
instance of scala.collection.Seq, problem: abstract types either need to be
mapped to concrete types, have custom deserializer, or be instantiated with
additional type information
* I tried using Jackson’s DefaultScalaModule, which seems to have support
for Scala Seqs, but got no luck.
* Deserialization works if the Scala class does not have any Seq fields, and
works if the fields are Java Lists instead of Seqs.
Thanks very much for your help!
Kevin Chen




-- 
(Sent from mobile. Pardon brevity.)







New Spark json endpoints

2015-09-11 Thread Kevin Chen
Hello Spark Devs,

 I noticed that [SPARK-3454], which introduces new json endpoints at
/api/v1/[path] for information previously only shown on the web UI, does not
expose several useful properties about Spark jobs that are exposed on the
web UI and on the unofficial /json endpoint.

 Specific examples include the maximum number of allotted cores per
application, amount of memory allotted to each slave, and number of cores
used by each worker. These are provided at ‘app.cores, app.memoryperslave,
and worker.coresused’ in the /json endpoint, and also all appear on the web
UI page.

 Is there any specific reason that these fields are not exposed in the
public API? If not, would it be reasonable to add them to the json blobs,
possibly in a future /api/v2 API?
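
For context, the v1 endpoints in question are plain HTTP; a minimal Java sketch of reading one of them follows (host and port are assumptions: 4040 is a running driver's default UI port, and a history server would typically listen on 18080 instead):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestApiProbe {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://localhost:4040/api/v1/applications");  // assumed host/port
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    try (BufferedReader in =
             new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);  // raw JSON array of application records
      }
    } finally {
      conn.disconnect();
    }
  }
}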

Thank you,
Kevin Chen







Re: New Spark json endpoints

2015-09-16 Thread Kevin Chen
Just wanted to bring this email up again in case there were any thoughts.
Having all the information from the web UI accessible through a supported
json API is very important to us; are there any objections to us adding a v2
API to Spark?

Thanks!

From:  Kevin Chen 
Date:  Friday, September 11, 2015 at 11:30 AM
To:  "dev@spark.apache.org" 
Cc:  Matt Cheah , Mingyu Kim 
Subject:  New Spark json endpoints

Hello Spark Devs,

 I noticed that [SPARK-3454], which introduces new json endpoints at
/api/v1/[path] for information previously only shown on the web UI, does not
expose several useful properties about Spark jobs that are exposed on the
web UI and on the unofficial /json endpoint.

 Specific examples include the maximum number of allotted cores per
application, amount of memory allotted to each slave, and number of cores
used by each worker. These are provided at ‘app.cores, app.memoryperslave,
and worker.coresused’ in the /json endpoint, and also all appear on the web
UI page.

 Is there any specific reason that these fields are not exposed in the
public API? If not, would it be reasonable to add them to the json blobs,
possibly in a future /api/v2 API?

Thank you,
Kevin Chen







Re: New Spark json endpoints

2015-09-17 Thread Kevin Chen
Thank you all for the feedback. I’ve created a corresponding JIRA ticket at
https://issues.apache.org/jira/browse/SPARK-10565, updated with a summary of
this thread.

From:  Mark Hamstra 
Date:  Thursday, September 17, 2015 at 8:00 AM
To:  Imran Rashid 
Cc:  Kevin Chen , "dev@spark.apache.org"
, Matt Cheah , Mingyu Kim

Subject:  Re: New Spark json endpoints

While we're at it, adding endpoints that get results by jobGroup (cf.
SparkContext#setJobGroup) instead of just for a single Job would also be
very useful to some of us.

On Thu, Sep 17, 2015 at 7:30 AM, Imran Rashid  wrote:
> Hi Kevin, 
> 
> I think it would be great if you added this.  It never got added in the first
> place b/c the original PR was already pretty bloated, and just never got back
> to this.  I agree with Reynold -- you shouldn't need to increase the version
> for just adding new endpoints (or even adding new fields to existing
> endpoints).  See the guarantees we make here:
> 
> http://spark.apache.org/docs/latest/monitoring.html#rest-api
> 
> (Though if you think we should make different guarantees around versions, that
> would be worth discussing as well.)
> 
> Can you file a jira, and we move discussion there?  Please cc me, and maybe
> also Josh Rosen (I'm not sure if he has cycles now but he's been very helpful
> on these issues in the past).
> 
> thanks,
> Imran
> 
> 
> On Wed, Sep 16, 2015 at 9:10 PM, Kevin Chen  wrote:
>> Just wanted to bring this email up again in case there were any thoughts.
>> Having all the information from the web UI accessible through a supported
>> json API is very important to us; are there any objections to us adding a v2
>> API to Spark?
>> 
>> Thanks!
>> 
>> From: Kevin Chen 
>> Date: Friday, September 11, 2015 at 11:30 AM
>> To: "dev@spark.apache.org" 
>> Cc: Matt Cheah , Mingyu Kim 
>> Subject: New Spark json endpoints
>> 
>> Hello Spark Devs,
>> 
>>  I noticed that [SPARK-3454], which introduces new json endpoints at
>> /api/v1/[path] for information previously only shown on the web UI, does not
>> expose several useful properties about Spark jobs that are exposed on the web
>> UI and on the unofficial /json endpoint.
>> 
>>  Specific examples include the maximum number of allotted cores per
>> application, amount of memory allotted to each slave, and number of cores
>> used by each worker. These are provided at ‘app.cores, app.memoryperslave,
>> and worker.coresused’ in the /json endpoint, and also all appear on the web
>> UI page.
>> 
>>  Is there any specific reason that these fields are not exposed in the public
>> API? If not, would it be reasonable to add them to the json blobs, possibly
>> in a future /api/v2 API?
>> 
>> Thank you,
>> Kevin Chen
>> 
> 







Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-03-11 Thread Kevin Markey
Pardon my late entry into the fray, here, but we've just struggled 
though some library conflicts that could have been avoided and whose 
story shed some light on this question.


We have been integrating Spark with a number of other components. We 
discovered several conflicts, most easily eliminated.  But the ASM 
conflicts were not quite so easy to handle because of ASM's API changes 
between 3.x and 4.x (most usually seen first in ClassVisitor which was 
an interface and now is an abstract class).


spark-core_2.10 has a transitive dependency on ASM 4.0.  Hive, Hadoop, 
various Java EE servlets, and other libraries have transitive 
dependencies on 3.2 or earlier.  In one of the applications we are 
developing, there are 10 libraries with ASM dependencies.  Five are 
well-behaved, having shaded ASM.  Another five are poorly behaved, not 
shading it.  The ASM FAQ specifically recommends shading ASM in any tool 
or framework which contains it: http://asm.ow2.org/doc/faq.html#Q15.


ASM has been shaded in the SBT build since June 2013.  However, it was 
not properly shaded in the Maven build until last week.  As result, 
libraries such as spark-core_2.10 pushed to Maven Central haven't 
reflected the SBT build.  This is documented in Jira SPARK-782: 
https://spark-project.atlassian.net/browse/SPARK-782


We cannot use SBT for our overall project.  Maven is our standard. 
Hence, we are dependent on Maven Central and libraries mirrored by our 
corporate repository.


In this context, if both builds are maintained, then they need to have 
the same functionality.


If only one build must be retained, it should be Maven because Maven and 
other tools that use Maven Central are more likely to be used for large 
project integrations.  Also for this reason, the Maven build should be 
given more priority than at present.  It seems a bit odd, if a Maven 
project can be automatically generated from SBT, that it would take 1 
year for ASM shading in Maven to catch up with SBT.


Thanks
Kevin Markey


SBT appears to have syntax for both, just like Maven. Surely these
have the same meanings in SBT, and excluding artifacts is accomplished
with exclude and excludeAll, as seen in the Spark build?

The assembly and shader stuff in Maven is more about controlling
exactly how it's put together into an artifact, at the level of files
even, to stick a license file in or exclude some data file cruft or
rename dependencies.

exclusions and shading are necessary evils to be used as sparingly as
possible. Dependency graphs get nuts fast here, and Spark is already
quite big. (Hence my recent PR to start touching it up -- more coming
for sure.)





Re: Spark 0.9.1 release

2014-03-24 Thread Kevin Markey

1051 is essential!
I'm not sure about the others, but anything that adds stability to 
Spark/Yarn would  be helpful.

Kevin Markey


On 03/20/2014 01:12 PM, Tom Graves wrote:

I'll pull [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on 
YARN - JIRA and  [SPARK-1051] On Yarn, executors don't doAs as submitting user 
- JIRA in.  The pyspark one I would consider more of an enhancement so might 
not be appropriate for a point release.

  
  [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on YARN:
org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:49)
 at org.apache.spark.schedule...

  [SPARK-1051] On Yarn, executors don't doAs as submitting user:
This means that they can't write/read from files that the yarn user doesn't 
have permissions to but the submitting user does.




On Thursday, March 20, 2014 1:35 PM, Bhaskar Dutta  wrote:
  
It will be great if "SPARK-1101 <https://spark-project.atlassian.net/browse/SPARK-1101>:
Umbrella for hardening Spark on YARN" can get into 0.9.1.

Thanks,
Bhaskar


On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das
wrote:


   Hello everyone,

Since the release of Spark 0.9, we have received a number of important bug
fixes and we would like to make a bug-fix release of Spark 0.9.1. We are
going to cut a release candidate soon and we would love it if people test
it out. We have backported several bug fixes into the 0.9 and updated JIRA
accordingly<
https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)

.

Please let me know if there are fixes that were not backported but you
would like to see them in 0.9.1.

Thanks!

TD





Re: Spark 0.9.1 release

2014-03-24 Thread Kevin Markey
Is there any way that [SPARK-782] (Shade ASM) can be included?  I see 
that it is not currently backported to 0.9.  But there is no single 
issue that has caused us more grief as we integrate spark-core with 
other project dependencies.  There are way too many libraries out there, 
in addition to Spark 0.9 and earlier, that are not well behaved (the ASM FAQ 
recommends shading), including some Hive and Hadoop libraries and a 
number of servlet libraries.  We can't control those, but if Spark were 
well behaved in this regard, it would help.  Even for a maintenance 
release, and even if 1.0 is only 6 weeks away!


(For those not following 782, according to Jira comments, the SBT build 
shades it, but it is the Maven build that ends up in Maven Central.)


Thanks
Kevin Markey



On 03/19/2014 06:07 PM, Tathagata Das wrote:

  Hello everyone,

Since the release of Spark 0.9, we have received a number of important bug
fixes and we would like to make a bug-fix release of Spark 0.9.1. We are
going to cut a release candidate soon and we would love it if people test
it out. We have backported several bug fixes into the 0.9 and updated JIRA
accordingly<https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)>.
Please let me know if there are fixes that were not backported but you
would like to see them in 0.9.1.

Thanks!

TD





Re: Spark 0.9.1 release

2014-03-25 Thread Kevin Markey

TD:

A correct shading of ASM should only affect Spark code unless someone is 
relying on ASM 4.0 in unrelated project code, in which case they can add 
org.ow2.asm:asm:4.x as a dependency.


Our short term solution has been to repackage other libraries with a 3.2 
dependency or to exclude ASM when our use of a dependent library really 
doesn't need it.  As you probably know, the real problem arises in 
ClassVisitor, which is an Interface in 3.x and before, but in 4.x it is 
an abstract class that takes a version constant as its constructor.  The 
ASM folks of course had our best interests in mind when they did this, 
attempting to deal with the Java-version dependent  changes from one ASM 
release to the next.  Unfortunately, they didn't change the names or 
locations of their classes and interfaces, which would have helped.


In our particular case, the only library from which we couldn't exclude 
ASM was 
org.glassfish.jersey.containers:jersey-container-servlet:jar:2.5.1. I 
added a new module to our project, including some dummy source code, 
because we needed the library to be self contained, made the servlet -- 
minus some unrelated transitive dependencies -- the only module 
dependency, then used the Maven shade plugin to relocate 
"org.objectweb.asm" to an arbitrary target.  We added the new shaded 
module as a new project dependency, plus the unrelated transitive 
dependencies excluded above.   This solved the problem. At least until 
we added WADL to the project.  Then we needed to deal with it on its own 
terms.


As you can see, we left Spark alone in all its ASM 4.0 glory.  Why? 
Spark is more volatile than the other libraries.  Also, the way in which 
we needed to deploy Spark and other resources on our (Yarn) clusters 
suggested that it would be easier to shade the other libraries.  I 
wanted to avoid having to install a locally patched Spark library into 
our build, updating the cluster and individual developers whenever 
there's a new patch.  Individual developers such as me who are testing 
the impact of patches can handle it, but the main build goes to Maven 
Central via our corporate Artifactory mirror.


If suddenly we had a Spark 0.9.1 with a shaded ASM, it would have no 
negative impact on us.  Only a positive impact.


I just wish that all users of ASM would read FAQ entry 15!!!

Thanks
Kevin


On 03/24/2014 06:30 PM, Tathagata Das wrote:

Hello Kevin,

A fix for SPARK-782 would definitely simplify building against Spark.
However, its possible that a fix for this issue in 0.9.1 will break
the builds (that reference spark) of existing 0.9 users, either due to
a change in the ASM version, or for being incompatible with their
current workarounds for this issue. That is not a good idea for a
maintenance release, especially when 1.0 is not too far away.

Can you (and others) elaborate more on the current workarounds that
you have for this issue? Its best to understand all the implications
of this fix.

Note that in branch 0.9, it is not fixed, neither in SBT nor in Maven.

TD

On Mon, Mar 24, 2014 at 4:38 PM, Kevin Markey  wrote:

Is there any way that [SPARK-782] (Shade ASM) can be included?  I see that
it is not currently backported to 0.9.  But there is no single issue that
has caused us more grief as we integrate spark-core with other project
dependencies.  There are way too many libraries out there in addition to
Spark 0.9 and before that are not well-behaved (ASM FAQ recommends shading),
including some Hive and Hadoop libraries and a number of servlet libraries.
We can't control those, but if Spark were well behaved in this regard, it
would help.  Even for a maintenance release, and even if 1.0 is only 6 weeks
away!

(For those not following 782, according to Jira comments, the SBT build
shades it, but it is the Maven build that ends up in Maven Central.)

Thanks
Kevin Markey




On 03/19/2014 06:07 PM, Tathagata Das wrote:

   Hello everyone,

Since the release of Spark 0.9, we have received a number of important bug
fixes and we would like to make a bug-fix release of Spark 0.9.1. We are
going to cut a release candidate soon and we would love it if people test
it out. We have backported several bug fixes into the 0.9 and updated JIRA

accordingly<https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)>.

Please let me know if there are fixes that were not backported but you
would like to see them in 0.9.1.

Thanks!

TD





Re: [VOTE] Release Apache Spark 0.9.1 (RC3)

2014-03-31 Thread Kevin Markey
I had specifically requested that the ASM shading be included in the RC, 
hence my testing focused on that, but I ran other tests as well.  Tested 
with a build of our project, running one of our applications from that 
build in yarn-standalone on a pseudocluster, and successfully 
redeploying and bringing up a web app that is integrated with Spark.  It 
is the latter where most ASM conflicts have typically occurred.  
Successful build and passed both tests. So, my vote:


+1

One test which I'd like to run but can't because of unrelated library 
conflicts would have been to remove various ASM exclusions from other 
libraries, recompiling and redeploying.  But I'd incur the wrath of the 
rest of my team doing that, especially after a full day of tracking down 
yet another (totally unrelated) library conflict.


Thanks for this maintenance release.

Kevin Markey


On 03/31/2014 12:32 PM, Tathagata Das wrote:

Yes, lets extend the vote for two more days from now. So the vote is open
till *Wednesday, April 02, at 20:00 UTC*

On that note, my +1

TD




On Mon, Mar 31, 2014 at 9:57 AM, Patrick Wendell  wrote:


Yeah good point. Let's just extend this vote another few days?


On Mon, Mar 31, 2014 at 8:12 AM, Tom Graves  wrote:


I should probably pull this off into another thread, but going forward can
we try to not have the release votes end on a weekend? Since we only seem
to give 3 days, it makes it really hard for anyone who is offline for the
weekend to try it out.  Either that or extend the voting for more than 3
days.

Tom
On Monday, March 31, 2014 12:50 AM, Patrick Wendell 
wrote:

TD - I downloaded and did some local testing. Looks good to me!

+1

You should cast your own vote - at that point it's enough to pass.

- Patrick



On Sun, Mar 30, 2014 at 9:47 PM, prabeesh k 

wrote:

+1
tested on Ubuntu12.04 64bit


On Mon, Mar 31, 2014 at 3:56 AM, Matei Zaharia <

matei.zaha...@gmail.com

wrote:
+1 tested on Mac OS X.

Matei

On Mar 27, 2014, at 1:32 AM, Tathagata Das <

tathagata.das1...@gmail.com>

wrote:


Please vote on releasing the following candidate as Apache Spark

version

0.9.1

A draft of the release notes along with the CHANGES.txt file is
attached to this e-mail.

The tag to be voted on is v0.9.1-rc3 (commit 4c43182b):


https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4c43182b6d1b0b7717423f386c0214fe93073208

The release files, including signatures, digests, etc. can be found

at:

http://people.apache.org/~tdas/spark-0.9.1-rc3/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/tdas.asc

The staging repository for this release can be found at:


https://repository.apache.org/content/repositories/orgapachespark-1009/

The documentation corresponding to this release can be found at:
http://people.apache.org/~tdas/spark-0.9.1-rc3-docs/

Please vote on releasing this package as Apache Spark 0.9.1!

The vote is open until Sunday, March 30, at 10:00 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 0.9.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/







Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-21 Thread Kevin Markey

0

Abstaining because I'm not sure if my failures are due to Spark, 
configuration, or other factors...


Compiled and deployed RC10 for YARN, Hadoop 2.3 per Spark 1.0.0 Yarn 
documentation.  No problems.
Rebuilt applications against RC10 and Hadoop 2.3.0 (plain vanilla Apache 
release).

Updated scripts for various applications.
Application had successfully compiled and run against Spark 0.9.1 and 
Hadoop 2.3.0.

Ran in "yarn-cluster" mode.
Application ran to conclusion except that it ultimately failed because 
of an exception when Spark tried to clean up the staging directory.  
Also, where before Yarn would report the running program as "RUNNING", 
it only reported this application as "ACCEPTED".  It appeared to run two 
containers when the first instance never reported that it was RUNNING.


I will post a separate note to the USER list about the specifics.

Thanks
Kevin Markey


On 05/21/2014 10:58 AM, Mark Hamstra wrote:

+1


On Tue, May 20, 2014 at 11:09 PM, Henry Saputra wrote:


Signature and hash for source looks good
No external executable package with source - good
Compiled with git and maven - good
Ran examples and sample programs locally and standalone -good

+1

- Henry



On Tue, May 20, 2014 at 1:13 PM, Tathagata Das
 wrote:

Please vote on releasing the following candidate as Apache Spark version

1.0.0!

This has a few bug fixes on top of rc9:
SPARK-1875: https://github.com/apache/spark/pull/824
SPARK-1876: https://github.com/apache/spark/pull/819
SPARK-1878: https://github.com/apache/spark/pull/822
SPARK-1879: https://github.com/apache/spark/pull/823

The tag to be voted on is v1.0.0-rc10 (commit d8070234):


https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d807023479ce10aec28ef3c1ab646ddefc2e663c

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~tdas/spark-1.0.0-rc10/

The release artifacts are signed with the following key:
https://people.apache.org/keys/committer/tdas.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1018/

The documentation corresponding to this release can be found at:
http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/

The full list of changes in this release can be found at:


https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;f=CHANGES.txt;h=d21f0ace6326e099360975002797eb7cba9d5273;hb=d807023479ce10aec28ef3c1ab646ddefc2e663c

Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until Friday, May 23, at 20:00 UTC and passes if
amajority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== API Changes ==
We welcome users to compile Spark applications against 1.0. There are
a few API changes in this release. Here are links to the associated
upgrade guides - user facing changes have been kept as small as
possible.

Changes to ML vector specification:


http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/mllib-guide.html#from-09-to-10

Changes to the Java API:


http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

Changes to the streaming API:


http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

Changes to the GraphX API:


http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

Other changes:
coGroup and related functions now return Iterable[T] instead of Seq[T]
==> Call toSeq on the result to restore the old behavior

SparkContext.jarOfClass returns Option[String] instead of Seq[String]
==> Call toSeq on the result to restore old behavior




Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-22 Thread Kevin Markey
I've discovered that one of the anomalies I encountered was due to a 
(embarrassing? humorous?) user error.  See the user list thread "Failed 
RC-10 yarn-cluster job for FS closed error when cleaning up staging 
directory" for my discussion.  With the user error corrected, the FS 
closed exception only prevents deletion of the staging directory, but 
does not affect completion with "SUCCESS." The FS closed exception still 
needs some investigation at least by me.


I tried the patch reported by SPARK-1898, but it didn't fix the problem 
without fixing the user error.  I did not attempt to test my fix without 
the patch, so I can't pass judgment on the patch.


Although this is merely a pseudocluster based test -- I can't 
reconfigure our cluster with RC-10 -- I'll now change my vote to...


+1.

Thanks all who helped.
Kevin



On 05/21/2014 09:18 PM, Tom Graves wrote:

I don't think Kevin's issue would be with an api change in YarnClientImpl since 
in both cases he says he is using hadoop 2.3.0.  I'll take a look at his post 
in the user list.

Tom




On Wednesday, May 21, 2014 7:01 PM, Colin McCabe  wrote:
  



Hi Kevin,

Can you try https://issues.apache.org/jira/browse/SPARK-1898 to see if it
fixes your issue?

Running in YARN cluster mode, I had a similar issue where Spark was able to
create a Driver and an Executor via YARN, but then it stopped making any
progress.

Note: I was using a pre-release version of CDH5.1.0, not 2.3 like you were
using.

best,
Colin



On Wed, May 21, 2014 at 3:34 PM, Kevin Markey wrote:


0

Abstaining because I'm not sure if my failures are due to Spark,
configuration, or other factors...

Compiled and deployed RC10 for YARN, Hadoop 2.3

  per Spark 1.0.0 Yarn

documentation.  No problems.
Rebuilt applications against RC10 and Hadoop 2.3.0 (plain vanilla Apache
release).
Updated scripts for various applications.
Application had successfully compiled and run against Spark 0.9.1 and
Hadoop 2.3.0.
Ran in "yarn-cluster" mode.
Application ran to conclusion except that it ultimately failed because of
an exception when Spark tried to clean up the staging directory.  Also,
where before Yarn would report the running program as "RUNNING", it only
reported this application as "ACCEPTED".  It appeared to run two containers
when the first instance never reported that it was RUNNING.

I will post a

  separate note to the USER list about the specifics.

Thanks
Kevin Markey



On 05/21/2014 10:58 AM, Mark Hamstra wrote:


+1


On Tue, May 20, 2014 at 11:09 PM, Henry Saputra 
wrote:

   Signature and hash for source looks good

No external executable package with source - good
Compiled with git and maven - good
Ran examples and sample programs locally and standalone -good

+1

- Henry



On Tue, May 20, 2014 at 1:13 PM, Tathagata Das
 wrote:


Please vote on releasing the following candidate as Apache Spark version


1.0.0!


This has a few bug fixes on top of rc9:
SPARK-1875: https://github.com/apache/spark/pull/824
SPARK-1876: https://github.com/apache/spark/pull/819
SPARK-1878: https://github.com/apache/spark/pull/822
SPARK-1879: https://github.com/apache/spark/pull/823

The tag to be voted on is v1.0.0-rc10 (commit d8070234):

   https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=

d807023479ce10aec28ef3c1ab646ddefc2e663c


The

  release files, including signatures, digests, etc. can be found at:

http://people.apache.org/~tdas/spark-1.0.0-rc10/

The release artifacts are signed with the following key:
https://people.apache.org/keys/committer/tdas.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1018/

The documentation

  corresponding to this release can be found at:

http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/

The full list of changes in this release can be found at:

   https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;

f=CHANGES.txt;h=d21f0ace6326e099360975002797eb7cba9d5273;hb=
d807023479ce10aec28ef3c1ab646ddefc2e663c


Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until

  Friday, May 23, at 20:00 UTC and passes if

amajority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== API Changes ==
We welcome users to compile Spark applications against 1.0. There are
a few API changes in this release. Here are links to the associated
upgrade guides - user facing changes have been kept as small as
possible.

Changes to ML vector specification:

   http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/

mllib-guide.html#from-09-to-10


Changes to the Java API:

   http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/

java-programming-guide.html#upgrading-from-pre-10-versions-of

Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-22 Thread Kevin Markey

I retested several different cases...

1. FS closed exception shows up ONLY in RC-10, not in Spark 0.9.1, with 
both Hadoop 2.2 and 2.3.

2. SPARK-1898 has no effect for my use cases.
3. The failure to report that the underlying application is "RUNNING" 
and that it has succeeded is due ONLY to my user error.


The FS closed exception only affects the cleanup of the staging 
directory, not the final success or failure.  I've not yet tested the 
effect of changing my application's initialization, use, or closing of 
FileSystem.


Thanks again.
Kevin

On 05/22/2014 01:32 AM, Kevin Markey wrote:
I've discovered that one of the anomalies I encountered was due to a 
(embarrassing? humorous?) user error.  See the user list thread 
"Failed RC-10 yarn-cluster job for FS closed error when cleaning up 
staging directory" for my discussion.  With the user error corrected, 
the FS closed exception only prevents deletion of the staging 
directory, but does not affect completion with "SUCCESS." The FS 
closed exception still needs some investigation at least by me.


I tried the patch reported by SPARK-1898, but it didn't fix the 
problem without fixing the user error.  I did not attempt to test my 
fix without the patch, so I can't pass judgment on the patch.


Although this is merely a pseudocluster based test -- I can't 
reconfigure our cluster with RC-10 -- I'll now change my vote to...


+1.

Thanks all who helped.
Kevin



On 05/21/2014 09:18 PM, Tom Graves wrote:
I don't think Kevin's issue would be with an api change in 
YarnClientImpl since in both cases he says he is using hadoop 2.3.0.  
I'll take a look at his post in the user list.


Tom




On Wednesday, May 21, 2014 7:01 PM, Colin McCabe 
 wrote:



Hi Kevin,

Can you try https://issues.apache.org/jira/browse/SPARK-1898 to see 
if it

fixes your issue?

Running in YARN cluster mode, I had a similar issue where Spark was 
able to

create a Driver and an Executor via YARN, but then it stopped making any
progress.

Note: I was using a pre-release version of CDH5.1.0, not 2.3 like you 
were

using.

best,
Colin



On Wed, May 21, 2014 at 3:34 PM, Kevin Markey 
wrote:



0

Abstaining because I'm not sure if my failures are due to Spark,
configuration, or other factors...

Compiled and deployed RC10 for YARN, Hadoop 2.3

  per Spark 1.0.0 Yarn

documentation.  No problems.
Rebuilt applications against RC10 and Hadoop 2.3.0 (plain vanilla 
Apache

release).
Updated scripts for various applications.
Application had successfully compiled and run against Spark 0.9.1 and
Hadoop 2.3.0.
Ran in "yarn-cluster" mode.
Application ran to conclusion except that it ultimately failed 
because of

an exception when Spark tried to clean up the staging directory.  Also,
where before Yarn would report the running program as "RUNNING", it 
only
reported this application as "ACCEPTED".  It appeared to run two 
containers

when the first instance never reported that it was RUNNING.

I will post a

  separate note to the USER list about the specifics.

Thanks
Kevin Markey



On 05/21/2014 10:58 AM, Mark Hamstra wrote:


+1


On Tue, May 20, 2014 at 11:09 PM, Henry Saputra 


wrote:

   Signature and hash for source looks good

No external executable package with source - good
Compiled with git and maven - good
Ran examples and sample programs locally and standalone -good

+1

- Henry



On Tue, May 20, 2014 at 1:13 PM, Tathagata Das
 wrote:

Please vote on releasing the following candidate as Apache Spark 
version



1.0.0!


This has a few bug fixes on top of rc9:
SPARK-1875: https://github.com/apache/spark/pull/824
SPARK-1876: https://github.com/apache/spark/pull/819
SPARK-1878: https://github.com/apache/spark/pull/822
SPARK-1879: https://github.com/apache/spark/pull/823

The tag to be voted on is v1.0.0-rc10 (commit d8070234):

https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=

d807023479ce10aec28ef3c1ab646ddefc2e663c


The

  release files, including signatures, digests, etc. can be found at:

http://people.apache.org/~tdas/spark-1.0.0-rc10/

The release artifacts are signed with the following key:
https://people.apache.org/keys/committer/tdas.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1018/ 



The documentation

  corresponding to this release can be found at:

http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/

The full list of changes in this release can be found at:

https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;

f=CHANGES.txt;h=d21f0ace6326e099360975002797eb7cba9d5273;hb=
d807023479ce10aec28ef3c1ab646ddefc2e663c


Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until

  Friday, May 23, at 20:00 UTC and passes if

amajority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[

Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-22 Thread Kevin Markey

Thank you, all!  This is quite helpful.

We have been arguing about how to handle this issue across a growing 
application.  Unfortunately, the Hadoop FileSystem javadoc should say 
all of this but doesn't!


Kevin

On 05/22/2014 01:48 PM, Aaron Davidson wrote:

In Spark 0.9.0 and 0.9.1, we stopped using the FileSystem cache correctly,
and we just recently resumed using it in 1.0 (and in 0.9.2) when this issue
was fixed: https://issues.apache.org/jira/browse/SPARK-1676

Prior to this fix, each Spark task created and cached its own FileSystems
due to a bug in how the FS cache handles UGIs. The big problem that arose
was that these FileSystems were never closed, so they just kept piling up.
There were two solutions we considered, with the following effects: (1)
Share the FS cache among all tasks and (2) Each task effectively gets its
own FS cache, and closes all of its FSes after the task completes.

We chose solution (1) for 3 reasons:
  - It does not rely on the behavior of a bug in HDFS.
  - It is the most performant option.
  - It is most consistent with the semantics of the (albeit broken) FS cache.

Since this behavior was changed in 1.0, it could be considered a
regression. We should consider the exact behavior we want out of the FS
cache. For Spark's purposes, it seems fine to cache FileSystems across
tasks, as Spark does not close FileSystems. The issue that comes up is that
user code which uses FileSystem.get() but then closes the FileSystem can
screw up Spark processes which were using that FileSystem. The workaround
for users would be to use FileSystem.newInstance() if they want full
control over the lifecycle of their FileSystems.
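
To make the distinction concrete, a small Java sketch of the two acquisition paths being discussed; the path used here is arbitrary:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsCacheExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // FileSystem.get() may hand back a shared instance from the FS cache; calling
    // close() on it can break any other code (Spark included) holding the same object.
    FileSystem shared = FileSystem.get(conf);
    shared.exists(new Path("/tmp"));  // use it, but do not close it here

    // FileSystem.newInstance() bypasses the cache, so the caller owns the lifecycle
    // and can close it without affecting anyone else.
    FileSystem owned = FileSystem.newInstance(conf);
    try {
      owned.exists(new Path("/tmp"));
    } finally {
      owned.close();
    }
  }
}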


On Thu, May 22, 2014 at 12:06 PM, Colin McCabe wrote:


The FileSystem cache is something that has caused a lot of pain over the
years.  Unfortunately we (in Hadoop core) can't change the way it works now
because there are too many users depending on the current behavior.

Basically, the idea is that when you request a FileSystem with certain
options with FileSystem#get, you might get a reference to an FS object that
already exists, from our FS cache cache singleton.  Unfortunately, this
also means that someone else can change the working directory on you or
close the FS underneath you.  The FS is basically shared mutable state, and
you don't know whom you're sharing with.

It might be better for Spark to call FileSystem#newInstance, which bypasses
the FileSystem cache and always creates a new object.  If Spark can hang on
to the FS for a while, it can get the benefits of caching without the
downsides.  In HDFS, multiple FS instances can also share things like the
socket cache between them.

best,
Colin


On Thu, May 22, 2014 at 10:06 AM, Marcelo Vanzin 
wrote:
Hi Kevin,

On Thu, May 22, 2014 at 9:49 AM, Kevin Markey 
wrote:

The FS closed exception only affects the cleanup of the staging directory,
not the final success or failure.  I've not yet tested the effect of
changing my application's initialization, use, or closing of FileSystem.

Without going and reading more of the Spark code, if your app is
explicitly close()'ing the FileSystem instance, it may be causing the
exception. If Spark is caching the FileSystem instance, your app is
probably closing that same instance (which it got from the HDFS
library's internal cache).

It would be nice if you could test that theory; it might be worth
knowing that's the case so that we can tell people not to do that.

--
Marcelo





Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Kevin Markey

+1

Built with -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0
Ran the current version of one of my applications on a 1-node pseudo-cluster
in yarn-cluster mode (sorry, unable to test on a full cluster).
Ran regression tests.

Thanks
Kevin

On 05/28/2014 09:55 PM, Krishna Sankar wrote:

+1
Pulled & built on Mac OS X and EC2 Amazon Linux
Ran test programs on OS X and a 5-node c3.4xlarge cluster
Cheers



On Wed, May 28, 2014 at 7:36 PM, Andy Konwinski wrote:


+1
On May 28, 2014 7:05 PM, "Xiangrui Meng"  wrote:


+1

Tested apps in standalone client mode and in yarn cluster and client modes.

Xiangrui

On Wed, May 28, 2014 at 1:07 PM, Sean McNamara wrote:

Pulled down, compiled, and tested examples on OS X and ubuntu.
Deployed app we are building on spark and poured data through it.

+1

Sean


On May 26, 2014, at 8:39 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:

Please vote on releasing the following candidate as Apache Spark version 1.0.0!

This has a few important bug fixes on top of rc10:
SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
SPARK-1870: https://github.com/apache/spark/pull/848
SPARK-1897: https://github.com/apache/spark/pull/849

The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):


https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~tdas/spark-1.0.0-rc11/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/tdas.asc

The staging repository for this release can be found at:


https://repository.apache.org/content/repositories/orgapachespark-1019/

The documentation corresponding to this release can be found at:
http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/

Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until Thursday, May 29, at 16:00 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== API Changes ==
We welcome users to compile Spark applications against 1.0. There are
a few API changes in this release. Here are links to the associated
upgrade guides - user facing changes have been kept as small as
possible.

Changes to ML vector specification:


http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10

Changes to the Java API:


http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

Changes to the streaming API:


http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

Changes to the GraphX API:


http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

Other changes:
coGroup and related functions now return Iterable[T] instead of Seq[T]
==> Call toSeq on the result to restore the old behavior

SparkContext.jarOfClass returns Option[String] instead of Seq[String]
==> Call toSeq on the result to restore old behavior
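
(A minimal sketch of both adjustments against the 1.0 Scala API; sc here is
assumed to be an existing SparkContext and the pair RDDs are made up for
illustration:)

  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._  // pair-RDD implicits in 1.0

  val a = sc.parallelize(Seq((1, "a"), (2, "b")))
  val b = sc.parallelize(Seq((1, "x"), (2, "y")))

  // cogroup now yields Iterable values per key; toSeq restores the old shape.
  val grouped = a.cogroup(b).mapValues { case (xs, ys) => (xs.toSeq, ys.toSeq) }

  // jarOfClass now returns Option[String]; toSeq gives back a Seq[String].
  val jars: Seq[String] = SparkContext.jarOfClass(this.getClass).toSeq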




Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers

2017-08-07 Thread Kevin Kim (Sangwoo)
Thanks for all of your hard work, Hyukjin and Sameer. Congratulations!!


On Tue, Aug 8, 2017 at 9:44 AM, Hyukjin Kwon wrote:

> Thank you all. Will do my best!
>
> 2017-08-08 8:53 GMT+09:00 Holden Karau :
>
>> Congrats!
>>
>> On Mon, Aug 7, 2017 at 3:54 PM Bryan Cutler  wrote:
>>
>>> Great work Hyukjin and Sameer!
>>>
>>> On Mon, Aug 7, 2017 at 10:22 AM, Mridul Muralidharan 
>>> wrote:
>>>
 Congratulations Hyukjin, Sameer !

 Regards,
 Mridul

 On Mon, Aug 7, 2017 at 8:53 AM, Matei Zaharia 
 wrote:
 > Hi everyone,
 >
 > The Spark PMC recently voted to add Hyukjin Kwon and Sameer Agarwal
 as committers. Join me in congratulating both of them and thanking them for
 their contributions to the project!
 >
 > Matei
 > -
 > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
 >

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>
>