How to kill the spark job using Java API.

2015-11-20 Thread Hokam Singh Chauhan
Hi, I have been running a Spark job on a standalone Spark cluster. I want to kill the Spark job using the Java API. I have the Spark job name and Spark job id. The REST POST call for killing the job is not working. If anyone has explored this, please help me out. -- Thanks and Regards, Hokam Singh
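
A minimal sketch of what the kill call usually looks like against the standalone Master's REST gateway (default port 6066), assuming the job was submitted with --deploy-mode cluster through that same gateway; the host and submission id below are placeholders, and the id must be the driver/submission id shown on the Master UI, not the application name. If the job was launched in client mode, this endpoint will likely not find it and the driver process has to be stopped instead.

    import java.net.{HttpURLConnection, URL}
    import scala.io.Source

    object KillStandaloneDriver {
      def main(args: Array[String]): Unit = {
        // Placeholders: the standalone Master host and the submission/driver id
        // shown on the Master UI (e.g. "driver-20151120123456-0001").
        val masterHost = "spark-master.example.com"
        val submissionId = "driver-20151120123456-0001"

        // The standalone Master's REST gateway exposes
        // POST /v1/submissions/kill/<submissionId> for cluster-mode drivers.
        val url = new URL(s"http://$masterHost:6066/v1/submissions/kill/$submissionId")
        val conn = url.openConnection().asInstanceOf[HttpURLConnection]
        conn.setRequestMethod("POST")
        conn.setDoOutput(true)
        conn.getOutputStream.close()                       // empty POST body

        val code = conn.getResponseCode
        val stream = if (code < 400) conn.getInputStream else conn.getErrorStream
        println(s"HTTP $code: " + Source.fromInputStream(stream).mkString)
        conn.disconnect()
      }
    }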

Error in Saving the MLlib models

2015-11-20 Thread hokam chauhan
Hi, I am exploring MLlib. I have taken the MLlib examples and tried to train an SVM model. I am getting an exception when I save the trained model. When I run the code in local mode it works fine, but when I run the MLlib example in standalone cluster mode it fails to save the model.
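
Since model.save() writes through the SparkContext, on a cluster it needs a path that every node can reach; a minimal sketch of the save/load round trip, assuming an HDFS path and LIBSVM-formatted input (both placeholders), which is the usual cause of "works locally, fails on the cluster":

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
    import org.apache.spark.mllib.util.MLUtils

    object SvmSaveExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SvmSaveExample"))

        // Hypothetical input path in LIBSVM format.
        val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/sample_libsvm_data.txt")
        val model = SVMWithSGD.train(data, 100)

        // On a cluster the save path must be on storage every node can reach
        // (e.g. HDFS or S3); a local path such as /tmp/model will fail or end
        // up scattered across workers.
        model.save(sc, "hdfs:///models/svm-model")
        val loaded = SVMModel.load(sc, "hdfs:///models/svm-model")

        sc.stop()
      }
    }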

Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread varun sharma
I do this in my stop script to kill the application: kill -s SIGTERM `pgrep -f StreamingApp`. To stop it forcefully: pkill -9 -f "StreamingApp". StreamingApp is the name of the class which I submitted. I also have a shutdown hook thread to stop it gracefully. sys.ShutdownHookThread { logInfo("Gracefully s

Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Vikram Kone
I tried adding shutdown hook to my code but it didn't help. Still same issue On Fri, Nov 20, 2015 at 7:08 PM, Ted Yu wrote: > Which Spark release are you using ? > > Can you pastebin the stack trace of the process running on your machine ? > > Thanks > > On Nov 20, 2015, at 6:46 PM, Vikram Kone

Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Ted Yu
Interesting, SPARK-3090 installs shutdown hook for stopping SparkContext. FYI On Fri, Nov 20, 2015 at 7:12 PM, Stéphane Verlet wrote: > I solved the first issue by adding a shutdown hook in my code. The > shutdown hook get call when you exit your script (ctrl-C , kill … but nor > kill -9) > > v

Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Stéphane Verlet
I am not sure; I think it has to do with the signal sent to the process and how the JVM handles it. Ctrl-C sends a SIGINT, versus the SIGTERM sent by the kill command. On Fri, Nov 20, 2015 at 8:21 PM, Vikram Kone wrote: > Thanks for the info Stephane. > Why does CTRL-C in the terminal running spark

Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Vikram Kone
Thanks for the info Stephane. Why does CTRL-C in the terminal running spark-submit kill the app in the Spark master correctly, without any explicit shutdown hooks in the code? Can you explain why we need to add the shutdown hook to kill it when launched via a shell script? For the second issue, I'm not us

Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Stéphane Verlet
I solved the first issue by adding a shutdown hook in my code. The shutdown hook gets called when you exit your script (Ctrl-C, kill…, but not kill -9). val shutdownHook = scala.sys.addShutdownHook { try { sparkContext.stop() //Make sure to kill any other threads or thread pool you may be ru
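
Since the snippet above is cut off, here is a minimal self-contained sketch of the same idea (a hedged reconstruction, not the poster's exact code):

    import org.apache.spark.{SparkConf, SparkContext}

    object GracefulShutdownExample {
      def main(args: Array[String]): Unit = {
        val sparkContext = new SparkContext(new SparkConf().setAppName("GracefulShutdownExample"))

        // Runs on SIGINT (Ctrl-C) and SIGTERM (plain `kill`), but not on `kill -9`,
        // which the JVM cannot intercept.
        val shutdownHook = scala.sys.addShutdownHook {
          try {
            sparkContext.stop()
            // Also stop any thread pools or a StreamingContext you may be running.
          } catch {
            case e: Exception => System.err.println(s"Error during shutdown: ${e.getMessage}")
          }
        }

        // ... application logic ...
      }
    }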

Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Vikram Kone
Spark 1.4.1 On Friday, November 20, 2015, Ted Yu wrote: > Which Spark release are you using ? > > Can you pastebin the stack trace of the process running on your machine ? > > Thanks > > On Nov 20, 2015, at 6:46 PM, Vikram Kone > wrote: > > Hi, > I'm seeing a strange problem. I have a spark clu

Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Ted Yu
Which Spark release are you using ? Can you pastebin the stack trace of the process running on your machine ? Thanks > On Nov 20, 2015, at 6:46 PM, Vikram Kone wrote: > > Hi, > I'm seeing a strange problem. I have a spark cluster in standalone mode. I > submit spark jobs from a remote node as

Initial State

2015-11-20 Thread Bryan
All, Is there a way to introduce an initial RDD without doing updateStateByKey? I have an initial set of counts, and the algorithm I am using requires that I accumulate additional counts from streaming data, age off older counts, and make some calculations on them. The accumulation of counts us
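
One commonly suggested option (though it still goes through updateStateByKey, which the poster may be trying to avoid) is the overload added in Spark 1.3 that accepts an initial state RDD; a minimal sketch, with placeholder host/port, checkpoint path, and seed counts:

    import org.apache.spark.{HashPartitioner, SparkConf}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object InitialStateExample {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("InitialStateExample"), Seconds(10))
        ssc.checkpoint("hdfs:///checkpoints/initial-state")   // required for stateful operations

        // Hypothetical pre-existing counts used to seed the state.
        val initialCounts = ssc.sparkContext.parallelize(Seq(("a", 5L), ("b", 3L)))

        // Hypothetical stream of (key, count) pairs parsed from a socket.
        val increments = ssc.socketTextStream("localhost", 9999).map(line => (line, 1L))

        val updateFunc: (Seq[Long], Option[Long]) => Option[Long] =
          (newValues, state) => Some(state.getOrElse(0L) + newValues.sum)

        // The overload taking an initial RDD seeds the state before the first batch.
        val counts = increments.updateStateByKey[Long](
          updateFunc,
          new HashPartitioner(ssc.sparkContext.defaultParallelism),
          initialCounts)

        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }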

How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Vikram Kone
Hi, I'm seeing a strange problem. I have a spark cluster in standalone mode. I submit spark jobs from a remote node as follows from the terminal: spark-submit --master spark://10.1.40.18:7077 --class com.test.Ping spark-jobs.jar. When the app is running and I press ctrl-C on the console termina

Re: updateStateByKey schedule time

2015-11-20 Thread Tathagata Das
For future readers of this thread, Spark 1.6 adds trackStateByKey that has native support for timeouts. On Tue, Jul 21, 2015 at 12:00 AM, Anand Nalya wrote: > I also ran into a similar use case. Is this possible? > > On 15 July 2015 at 18:12, Michel Hubert wrote: > >> Hi, >> >> >> >> >> >> I wa

Re: Does spark streaming write ahead log writes all received data to HDFS ?

2015-11-20 Thread Tathagata Das
Good question. Write Ahead Logs are used for both - writing data and writing metadata when needed. When the data WAL is enabled using the conf spark.streaming.receiver.writeAheadLog.enable, data received by receivers is written to the data WAL by the executors. In some cases, like Direct Kafka and Kine
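
A minimal sketch of enabling the receiver-side WAL, assuming a receiver-based source and a placeholder checkpoint directory; the log segments live under the checkpoint directory, and old segments are cleaned up automatically as batches are processed, so a separate cleanup job should normally not be needed:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WalEnabledApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("WalEnabledApp")
          // Ask receivers to write incoming blocks to the write ahead log.
          .set("spark.streaming.receiver.writeAheadLog.enable", "true")

        val ssc = new StreamingContext(conf, Seconds(10))
        // The WAL (and its automatic cleanup) lives under the checkpoint directory.
        ssc.checkpoint("hdfs:///checkpoints/wal-enabled-app")

        // With the WAL providing durability, single in-memory/disk replication
        // is usually enough instead of the default *_SER_2 level.
        val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
        lines.count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }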

Re: Creating new Spark context when running in Secure YARN fails

2015-11-20 Thread Hari Shreedharan
Can you try this: https://github.com/apache/spark/pull/9875 . I believe this patch should fix the issue here. Thanks, Hari Shreedharan > On Nov 11, 2015, at 1:59 PM, Ted Yu wrote: > > Please take a look at > yarn/src/main/scala/org/apache/spark/d

Re: How to run two operations on the same RDD simultaneously

2015-11-20 Thread Ali Tajeldin EDU
You can try to use an Accumulator (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulator) to keep count in map1. Note that the final count may be higher than the number of records if there were some retries along the way. -- Ali On Nov 20, 2015, at 3:38 PM, jlua
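
A minimal sketch of that approach, with a placeholder output path and a trivial map standing in for map1; as noted above, the value is only reliable after an action has run, and retries may over-count:

    import org.apache.spark.{SparkConf, SparkContext}

    object CountWhileMapping {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("CountWhileMapping"))

        val rdd = sc.parallelize(1 to 1000)
        val counted = sc.accumulator(0L, "map1 count")

        // Count as a side effect of the first map, instead of a separate count() job.
        val mapped = rdd.map { x =>
          counted += 1L
          x * 2                  // the real map1 logic goes here
        }

        mapped.saveAsTextFile("hdfs:///tmp/count-while-mapping")   // hypothetical output path

        println(s"records seen by map1: ${counted.value}")
        sc.stop()
      }
    }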

How to run two operations on the same RDD simultaneously

2015-11-20 Thread jluan
As far as I understand, operations on RDDs usually come in the form rdd => map1 => map2 => map2 => (maybe collect). If I would like to also count my RDD, is there any way I could include this at map1? So that as Spark runs through map1, it also does a count? Or would count need to be a separate o

FW: starting spark-shell throws /tmp/hive on HDFS should be writable error

2015-11-20 Thread Mich Talebzadeh
From: Mich Talebzadeh [mailto:m...@peridale.co.uk] Sent: 20 November 2015 21:14 To: u...@hive.apache.org Subject: starting spark-shell throws /tmp/hive on HDFS should be writable error Hi, Has this been resolved? I don't think this has anything to do with the /tmp/hive directory permission

question about combining small input splits

2015-11-20 Thread nezih
Hey everyone, I have a Hive table that has a lot of small parquet files and I am creating a data frame out of it to do some processing, but since I have a large number of splits/files my job creates a lot of tasks, which I don't want. Basically what I want is the same functionality that Hive provid
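
This is not the Hive CombineFileInputFormat behaviour, but one common workaround is to coalesce the DataFrame right after reading it; with a narrow (non-shuffle) coalesce, each task reads several small files, so far fewer tasks are created. A minimal sketch, with a placeholder table name and partition count:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object CombineSmallFiles {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("CombineSmallFiles"))
        val sqlContext = new HiveContext(sc)

        // Hypothetical table backed by many small parquet files.
        val df = sqlContext.table("small_files_table")

        // coalesce() merges the many input partitions into fewer, larger ones
        // without a shuffle, so downstream work runs far fewer tasks.
        val compacted = df.coalesce(64)
        compacted.registerTempTable("compacted_view")

        println(compacted.rdd.partitions.length)
        sc.stop()
      }
    }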

RE: Yarn Spark on EMR

2015-11-20 Thread Bozeman, Christopher
Suraj, the Spark History Server is running on 18080 (http://spark.apache.org/docs/latest/monitoring.html), which is not going to give you a real-time update on a running Spark application. Given this is Spark on YARN, you will need to view the Spark UI from the Application Master URL which can

RE: Spark Expand Cluster

2015-11-20 Thread Bozeman, Christopher
Dan, Even though you may be adding more nodes to the cluster, the Spark application has to be requesting additional executors in order to thus use the added resources. Or the Spark application can be using Dynamic Resource Allocation (http://spark.apache.org/docs/latest/job-scheduling.html#dyn
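
A minimal sketch of the dynamic allocation settings referred to above, set programmatically (they can equally go into spark-defaults.conf); the min/max values are placeholders, and the external shuffle service must also be enabled on the YARN NodeManagers:

    import org.apache.spark.{SparkConf, SparkContext}

    object DynamicAllocationApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("DynamicAllocationApp")
          // Let Spark on YARN grow/shrink the executor count with the workload,
          // so newly added nodes actually get used.
          .set("spark.dynamicAllocation.enabled", "true")
          .set("spark.dynamicAllocation.minExecutors", "2")
          .set("spark.dynamicAllocation.maxExecutors", "50")
          // Dynamic allocation requires the external shuffle service.
          .set("spark.shuffle.service.enabled", "true")

        val sc = new SparkContext(conf)
        // ... job logic ...
        sc.stop()
      }
    }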

Corelation between 2 consecutive RDDs in Dstream

2015-11-20 Thread anshu shukla
1 - Is there any way to make pairs of RDDs from a DStream (DStream ---> DStream) so that I can use the already defined correlation function in Spark? *The aim is to find the auto-correlation value in Spark. (As per my knowledge, Spark Streaming does not support this.)* -- Thanks & Regards, Ansh

Does spark streaming write ahead log writes all received data to HDFS ?

2015-11-20 Thread kali.tumm...@gmail.com
Hi All, If write ahead logs are enabled in Spark Streaming, does all the received data get written to the HDFS path, or does it only write the metadata? How does cleanup work - does the HDFS path get bigger and bigger every day? Do I need to write a cleanup job to delete data from the write ahead logs

Re: Hive error after update from 1.4.1 to 1.5.2

2015-11-20 Thread Bryan Jeffrey
Nevermind. I had a library dependency that still had the old Spark version. On Fri, Nov 20, 2015 at 2:14 PM, Bryan Jeffrey wrote: > The 1.5.2 Spark was compiled using the following options: mvn > -Dhadoop.version=2.6.1 -Dscala-2.11 -DskipTests -Pyarn -Phive > -Phive-thriftserver clean package >

Re: getting different results from same line of code repeated

2015-11-20 Thread Ted Yu
Mind trying 1.5.2 release ? Thanks On Fri, Nov 20, 2015 at 10:56 AM, Walrus theCat wrote: > I'm running into all kinds of problems with Spark 1.5.1 -- does anyone > have a version that's working smoothly for them? > > On Fri, Nov 20, 2015 at 10:50 AM, Dean Wampler > wrote: > >> I didn't expect

Re: Spark Streaming - stream between 2 applications

2015-11-20 Thread Cody Koeninger
You're confused about which parts of your code are running on the driver vs the executor, which is why you're getting serialization errors. Read http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd On Fri, Nov 20, 2015 at 1:07 PM, Saiph Kapp
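
For reference, a minimal self-contained sketch of the pattern described in that guide: the (non-serializable) connection is created on the executors, once per partition, rather than on the driver. The Connection class here is a trivial stand-in for a real sink such as a Kafka producer or a JDBC connection:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object ForeachRddPattern {
      // Stand-in for a real sink (Kafka producer, JDBC connection, socket, ...).
      class Connection {
        def send(record: String): Unit = println(record)
        def close(): Unit = ()
      }

      def createConnection(): Connection = new Connection

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("ForeachRddPattern"), Seconds(5))
        val lines = ssc.socketTextStream("localhost", 9999)

        lines.foreachRDD { rdd =>
          rdd.foreachPartition { partition =>
            // Created here, on the executor, so nothing non-serializable is
            // shipped from the driver inside the closure.
            val connection = createConnection()
            partition.foreach(record => connection.send(record))
            connection.close()
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }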

Re: Hive error after update from 1.4.1 to 1.5.2

2015-11-20 Thread Bryan Jeffrey
The 1.5.2 Spark was compiled using the following options: mvn -Dhadoop.version=2.6.1 -Dscala-2.11 -DskipTests -Pyarn -Phive -Phive-thriftserver clean package Regards, Bryan Jeffrey On Fri, Nov 20, 2015 at 2:13 PM, Bryan Jeffrey wrote: > Hello. > > I'm seeing an error creating a Hive Context m

Hive error after update from 1.4.1 to 1.5.2

2015-11-20 Thread Bryan Jeffrey
Hello. I'm seeing an error creating a Hive Context moving from Spark 1.4.1 to 1.5.2. Has anyone seen this issue? I'm invoking the following: new HiveContext(sc) // sc is a Spark Context I am seeing the following error: SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding i

Re: Spark Streaming - stream between 2 applications

2015-11-20 Thread Saiph Kappa
I think my problem persists whether I use Kafka or sockets. Or am I wrong? How would you use Kafka here? On Fri, Nov 20, 2015 at 7:12 PM, Christian wrote: > Have you considered using Kafka? > > On Fri, Nov 20, 2015 at 6:48 AM Saiph Kappa wrote: > >> Hi, >> >> I have a basic spark streaming appl

Re: getting different results from same line of code repeated

2015-11-20 Thread Walrus theCat
I'm running into all kinds of problems with Spark 1.5.1 -- does anyone have a version that's working smoothly for them? On Fri, Nov 20, 2015 at 10:50 AM, Dean Wampler wrote: > I didn't expect that to fail. I would call it a bug for sure, since it's > practically useless if this method doesn't wo

Re: Drop multiple columns in the DataFrame API

2015-11-20 Thread Ted Yu
Created PR: https://github.com/apache/spark/pull/9862 On Fri, Nov 20, 2015 at 10:17 AM, BenFradet wrote: > Hi everyone, > > I was wondering if there is a better way to drop mutliple columns from a > dataframe or why there is no drop(cols: Column*) method in the dataframe > API. > > Indeed, I ten

Drop multiple columns in the DataFrame API

2015-11-20 Thread BenFradet
Hi everyone, I was wondering if there is a better way to drop multiple columns from a dataframe, or why there is no drop(cols: Column*) method in the dataframe API. Indeed, I tend to write code like this: val filteredDF = df.drop("colA") .drop("colB") .drop("colC") //etc which is a bit
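
Until such a varargs method exists, one workaround is to fold drop() over the column names; a small sketch that can be pasted into spark-shell:

    import org.apache.spark.sql.DataFrame

    // Drop an arbitrary list of columns by folding drop() over the names.
    def dropColumns(df: DataFrame, cols: String*): DataFrame =
      cols.foldLeft(df)((current, name) => current.drop(name))

    // usage: val filteredDF = dropColumns(df, "colA", "colB", "colC")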

Re: Spark Streaming - stream between 2 applications

2015-11-20 Thread Christian
Have you considered using Kafka? On Fri, Nov 20, 2015 at 6:48 AM Saiph Kappa wrote: > Hi, > > I have a basic spark streaming application like this: > > « > ... > > val ssc = new StreamingContext(sparkConf, Duration(batchMillis)) > val rawStreams = (1 to numStreams).map(_ => > ssc.rawSocketStrea
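
For readers wondering what the Kafka suggestion might look like on the consuming side, a minimal sketch using the direct stream API from the spark-streaming-kafka module (Spark 1.x); the broker list and topic are placeholders, and the producing application would publish to that topic instead of writing to a raw socket:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object KafkaBridgeConsumer {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("KafkaBridgeConsumer"), Seconds(2))

        // Placeholders: brokers of the shared Kafka cluster and the bridge topic.
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
        val topics = Set("app-to-app-topic")

        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, topics)

        stream.map(_._2).count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }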

Re: newbie: unable to use all my cores and memory

2015-11-20 Thread Andy Davidson
Hi Igor, Thanks. The reason I am using cluster mode is that the streaming app must run forever. I am using client mode for my pyspark work. Andy From: Igor Berman Date: Friday, November 20, 2015 at 6:22 AM To: Andrew Davidson Cc: "user @spark" Subject: Re: newbie: unable to use all my

how to use sc.hadoopConfiguration from pyspark

2015-11-20 Thread Tamas Szuromi
Hello, I just wanted to use sc._jsc.hadoopConfiguration().set('key','value') in pyspark 1.5.2 but I got a "set method does not exist" error. Does anyone know a workaround to set some HDFS-related properties like dfs.blocksize? Thanks in advance! Tamas

Data in one partition after reduceByKey

2015-11-20 Thread Patrick McGloin
Hi, I have a Spark application which contains the following segment: val reparitioned = rdd.repartition(16) val filtered: RDD[(MyKey, myData)] = MyUtils.filter(reparitioned, startDate, endDate) val mapped: RDD[(DateTime, myData)] = filtered.map(kv => (kv._1.processingTime, kv._2)) val reduced: RDD[(Da
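
One thing worth checking is how many distinct keys survive the map: if most records share the same processingTime (or the keys' hashCodes collide), reduceByKey will pile everything into one partition regardless of how many partitions the upstream RDD has. A small self-contained sketch that reproduces and diagnoses that situation with stand-in data:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object PartitionSkewCheck {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("PartitionSkewCheck"))

        // Stand-in for the (processingTime, myData) pairs from the pipeline above;
        // here every record shares the same key, the worst-case skew.
        val pairs = sc.parallelize(1 to 100000).map(i => ("2015-11-20T00:00", i.toLong))

        // An explicit partitioner makes the intent clear, but with few distinct
        // keys the reduced data still lands in very few partitions.
        val reduced = pairs.reduceByKey(new HashPartitioner(16), _ + _)

        reduced
          .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
          .collect()
          .foreach { case (idx, n) => println(s"partition $idx -> $n keys") }

        sc.stop()
      }
    }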

Re: Save GraphX to disk

2015-11-20 Thread Ashish Rawat
Hi Todd, Could you please provide an example of doing this? Mazerunner seems to be doing something similar with Neo4j, but it goes via HDFS and updates only the graph properties. Is there a direct way to do this with Neo4j or Titan? Regards, Ashish From: SLiZn Liu

Re: newbie: unable to use all my cores and memory

2015-11-20 Thread Igor Berman
You've asked for total cores to be 2, plus 1 for the driver (since you are running in cluster mode, it runs on one of the slaves). Change total cores to 3*2 and change the submit mode to client and you'll have full utilization. (BTW it's not advisable to use all cores of a slave... since there are OS processes and

Re: Configuring Log4J (Spark 1.5 on EMR 4.1)

2015-11-20 Thread Igor Berman
Try assembling log4j.xml or log4j.properties into your jar... you'll probably get what you want. However, pay attention that when you move to a multi-node cluster there will be differences. On 20 November 2015 at 05:10, Afshartous, Nick wrote: > > < log4j.properties file only exists on the master a

Re: How to control number of parquet files generated when using partitionBy

2015-11-20 Thread glennie
Turns out that calling repartition(numberOfParquetFilesPerPartition) just before write will create exactly numberOfParquetFilesPerPartition files in each folder. dataframe .repartition(10) .write .mode(SaveMode.Append) .partitionBy("year", "month", "date", "country", "predicate") .p

Spark Streaming - stream between 2 applications

2015-11-20 Thread Saiph Kappa
Hi, I have a basic spark streaming application like this: « ... val ssc = new StreamingContext(sparkConf, Duration(batchMillis)) val rawStreams = (1 to numStreams).map(_ => ssc.rawSocketStream[String](host, port, StorageLevel.MEMORY_ONLY_SER)).toArray val union = ssc.union(rawStreams) union.f

Re: RE: Error not found value sqlContext

2015-11-20 Thread satish chandra j
Hi All, I am getting this error while generating the executable Jar file itself in Eclipse, if the Spark application code has the "import sqlContext.implicits._" line in it. The Spark application code works fine if the above mentioned line does not exist, as I have tested by fetching data from an RDBMS by impl
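
For reference, import sqlContext.implicits._ only compiles when sqlContext is a stable identifier (a val) already in scope at the point of the import; spark-shell predefines it, but a standalone job built in Eclipse has to create it first. A minimal sketch for Spark 1.4:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object SqlContextImplicitsExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SqlContextImplicitsExample"))

        // Create the SQLContext before importing its implicits; the import
        // refers to this val, so it cannot appear earlier in the file.
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "value")
        df.show()

        sc.stop()
      }
    }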

Re: RE: Error not found value sqlContext

2015-11-20 Thread prosp4300
Looks like a classpath problem. If you can provide the command you used to run your application and the environment variable SPARK_HOME, it will help others to identify the root problem. On 20 November 2015 at 18:59, Satish wrote: Hi Michael, As my current Spark version is 1.4.0 than why it error out as "error:

How to control number of parquet files generated when using partitionBy

2015-11-20 Thread glennie
I have a DataFrame that I need to write to S3 according to a specific partitioning. The code looks like this: dataframe .write .mode(SaveMode.Append) .partitionBy("year", "month", "date", "country", "predicate") .parquet(outputPath) The partitionBy splits the data into a fairly large numb

RE: Error not found value sqlContext

2015-11-20 Thread Satish
Hi Michael, As my current Spark version is 1.4.0, then why does it error out with "error: not found: value sqlContext" when I have "import sqlContext.implicits._" in my Spark job? Regards, Satish Chandra -Original Message- From: "Michael Armbrust" Sent: 20-11-2015 01:36 To: "satish chandra j

spark-shell issue Job in illegal state & sparkcontext not serializable

2015-11-20 Thread Balachandar R.A.
Hello users, In one of my use cases, I need to launch a spark job from spark-shell. My input file is in HDFS and I am using NewHadoopRDD to construct an RDD out of this input file, as it uses a custom input format. val hConf = sc.hadoopConfiguration var job = new Job(hConf) FileInputForm
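
One way to sidestep constructing a mapreduce Job object in the shell (which may be the source of the illegal-state error) is to go through sc.newAPIHadoopFile and pass the Hadoop Configuration directly; a sketch that can be pasted into spark-shell, using TextInputFormat as a stand-in for the custom input format and a placeholder path:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Replace TextInputFormat and the key/value classes with the custom
    // InputFormat and its key/value types.
    val rdd = sc.newAPIHadoopFile(
      "hdfs:///input/data",
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      sc.hadoopConfiguration)

    println(rdd.count())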

RE: 回复: has any spark write orc document

2015-11-20 Thread Ewan Leith
Looking in the code https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala I don't think any of that advanced functionality is supported, sorry - there is a parameters option, but I don't think it's used for much. Ewan From: zhangjp
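
For the basic case, a minimal sketch of writing and reading ORC through the DataFrame API (it requires a HiveContext); the paths are placeholders and, as noted above, finer-grained ORC tuning options are not exposed this way:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.hive.HiveContext

    object OrcWriteExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("OrcWriteExample"))
        val sqlContext = new HiveContext(sc)      // ORC support lives in the Hive module
        import sqlContext.implicits._

        val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "value")

        // Plain ORC write and read-back.
        df.write.mode(SaveMode.Overwrite).format("orc").save("hdfs:///tmp/orc-example")
        val readBack = sqlContext.read.format("orc").load("hdfs:///tmp/orc-example")
        readBack.show()

        sc.stop()
      }
    }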

Question About Task Number In A spark Stage

2015-11-20 Thread Gerald-G
Hi: Recently we tried to submit our Spark apps in YARN-client mode and found that the task count in a stage is 2810. But in Spark standalone mode, the same apps need far fewer tasks. Through the debug info, we know that most of these 2810 tasks run on NULL data. How can I tune this? SUBMI