Model exports PMML (Random Forest)

2015-10-07 Thread Yasemin Kaya
Hi, I want to export my model to PMML, but there is no support for random forest yet; it is planned for version 1.6. Is it possible to produce my (random forest) model in PMML XML format manually? Thanks. Best, yasemin -- hiç ender hiç
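Pending official support, one starting point is to walk the trained forest's trees by hand. A minimal sketch, assuming the Spark 1.4/1.5 MLlib tree API; this emits only a PMML-like node skeleton, not a complete or valid PMML document (predicates, headers, and the MiningModel/Segmentation wrapper still have to be written by hand):

```scala
import org.apache.spark.mllib.tree.model.{Node, RandomForestModel}

// Walk one decision tree and print a PMML-like <Node> skeleton.
def dumpNode(node: Node, indent: String = ""): Unit =
  if (node.isLeaf) {
    println(s"""$indent<Node score="${node.predict.predict}"/>""")
  } else {
    val split = node.split.get
    println(s"""$indent<Node> <!-- feature ${split.feature} <= ${split.threshold} -->""")
    node.leftNode.foreach(dumpNode(_, indent + "  "))
    node.rightNode.foreach(dumpNode(_, indent + "  "))
    println(s"$indent</Node>")
  }

// Emit one segment per tree in the ensemble.
def dumpForest(model: RandomForestModel): Unit =
  model.trees.zipWithIndex.foreach { case (tree, i) =>
    println(s"<!-- tree/segment $i -->")
    dumpNode(tree.topNode)
  }
```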

Re: Does feature parity exist between Spark and PySpark

2015-10-07 Thread Sean Owen
These are true, but it's not because Spark is written in Scala; it's because it executes in the JVM. So, Scala/Java-based apps have an advantage in that they don't have to serialize data back and forth to a Python process, which also brings a new set of things that can go wrong. Python is also inhe

Re: does KafkaCluster can be public ?

2015-10-07 Thread Erwan ALLAIN
Thanks guys ! On Wed, Oct 7, 2015 at 1:41 AM, Cody Koeninger wrote: > Sure no prob. > > On Tue, Oct 6, 2015 at 6:35 PM, Tathagata Das wrote: > >> Given the interest, I am also inclining towards making it a public >> developer API. Maybe even experimental. Cody, mind submitting a patch? >> >> >>

spark multi tenancy

2015-10-07 Thread Dominik Fries
Hello Folks, We want to deploy several Spark projects and want to use a unique project user for each of them. Only the project user should start the Spark application and have the corresponding packages installed. Furthermore, a personal user who belongs to a specific project should start a s

ClassCastException while reading data from HDFS through Spark

2015-10-07 Thread Vinoth Sankar
I'm just reading data from HDFS through Spark. It throws *java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.BytesWritable* at line no. 6. I never used LongWritable in my code; I have no idea how the data ended up in that format. Note: I'm not using MapRedu

Re: Notification on Spark Streaming job failure

2015-10-07 Thread Steve Loughran
On 7 Oct 2015, at 06:28, Krzysztof Zarzycki wrote: Hi Vikram, So you give up using yarn-cluster mode of launching Spark jobs, is that right? AFAIK when using yarn-cluster mode, the launch process (spark-submit) monitors the job running on YARN, but if it is killed/die

Re: spark multi tenancy

2015-10-07 Thread Steve Loughran
> On 7 Oct 2015, at 09:26, Dominik Fries wrote: > > Hello Folks, > > We want to deploy several spark projects and want to use a unique project > user for each of them. Only the project user should start the spark > application and have the corresponding packages installed. > > Furthermore a p

Re: ClassCastException while reading data from HDFS through Spark

2015-10-07 Thread UMESH CHAUDHARY
As per the exception, it looks like there is a mismatch between the sequence file's actual value type and the one provided in your code. Change BytesWritable to *LongWritable* and try the execution again. -Umesh On Wed, Oct 7, 2015 at 2:41 PM, Vinoth Sankar wrote: > I'm just reading data fro
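A hedged sketch of the suggested fix, assuming the input is a Hadoop SequenceFile and that LongWritable is the type the file actually contains (the path and the value type here are illustrative):

```scala
import org.apache.hadoop.io.{BytesWritable, LongWritable}

// Declare the key/value classes the file actually contains; a mismatch
// here is what produces the ClassCastException at read time.
val rdd = sc.sequenceFile("hdfs:///path/to/data",
  classOf[LongWritable], classOf[BytesWritable])
```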

Re: GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-07 Thread Ophir Cohen
Which jar does WrappedInternalRow come from? It seems that I can't find it. BTW, what I'm trying to do now is to create a Scala array from the fields and then create a Row out of that array. The problem is that I get type mismatches... On Wed, Oct 7, 2015 at 8:03 AM, Hemant Bhanawat wrote: > An a

Re: spark multi tenancy

2015-10-07 Thread ayan guha
Can queues also be used to separate workloads? On 7 Oct 2015 20:34, "Steve Loughran" wrote: > > > On 7 Oct 2015, at 09:26, Dominik Fries > wrote: > > > > Hello Folks, > > > > We want to deploy several spark projects and want to use a unique project > > user for each of them. Only the project use

Re: spark multi tenancy

2015-10-07 Thread Dominik Fries
Currently we try to execute pyspark from the user CLI, in the context of the project user, but get this error (the cluster is kerberized): [@edgenode1 ~]$ pyspark --master yarn --num-executors 5 --proxy-user Python 2.7.5 (default, Jun 24 2015, 00:41:19) [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux

Re: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-07 Thread Gerard Maas
Thanks for the feedback. Cassandra does not seem to be the issue. The time for writing to Cassandra is in the same order of magnitude (see below) The code structure is roughly as follows: dstream.filter(pred).foreachRDD{rdd => val sparkT0 = currentTimeMs val metrics = rdd.mapPartitions{parti
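For reference, a sketch of the timing pattern being described, with the truncated snippet filled in under assumptions (dstream, pred, and the per-partition timing are the poster's; computeMetrics is a hypothetical stand-in for the aggregation):

```scala
dstream.filter(pred).foreachRDD { rdd =>
  val sparkT0 = System.currentTimeMillis()             // driver-side job start
  val metricsAndTimes = rdd.mapPartitions { partition =>
    val partitionT0 = System.currentTimeMillis()       // executor-side start
    val metrics = computeMetrics(partition)            // hypothetical aggregation
    Iterator((metrics, System.currentTimeMillis() - partitionT0))
  }.collect()
  val sparkT1 = System.currentTimeMillis()
  // (sparkT1 - sparkT0) vs. the per-partition times is the gap in question
}
```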

hiveContext sql number of tasks

2015-10-07 Thread patcharee
Hi, I do a SQL query on about 10,000 partitioned ORC files. Because of the partition schema the files cannot be merged any longer (to reduce the total number). From the command hiveContext.sql(sqlText), 10K tasks were created to handle each file. Is it possible to use fewer tasks? How to

Re: RDD of ImmutableList

2015-10-07 Thread Jakub Dubovsky
I did not realize that Scala's and Java's immutable collections use different APIs, which causes this. Thank you for the reminder. This makes some sense now... -- Original message -- From: Jonathan Coveney To: Jakub Dubovsky Date: 7. 10. 2015 1:29:34 Subject: Re: RDD of ImmutableLi

Re: GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-07 Thread Hemant Bhanawat
Oh, this is an internal class of our project and I had used it without realizing the source. Anyway, the idea is to wrap the InternalRow in a class that derives from Row. When you implement the functions of the trait 'Row', the type conversions from Row types to InternalRow types have to be done

Re: GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-07 Thread Ophir Cohen
Thanks! Can you check if you can provide an example of the conversion? On Wed, Oct 7, 2015 at 2:05 PM, Hemant Bhanawat wrote: > Oh, this is an internal class of our project and I had used it without > realizing the source. > > Anyway, the idea is to wrap the InternalRow in a class that derives fr

Re: GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-07 Thread Hemant Bhanawat
Will send you the code on your email id. On Wed, Oct 7, 2015 at 4:37 PM, Ophir Cohen wrote: > Thanks! > Can you check if you can provide example of the conversion? > > > On Wed, Oct 7, 2015 at 2:05 PM, Hemant Bhanawat > wrote: > >> Oh, this is an internal class of our project and I had used it

Re: question on make multiple external calls within each partition

2015-10-07 Thread Chen Song
Thanks TD and Ashish. On Mon, Oct 5, 2015 at 9:14 PM, Tathagata Das wrote: > You could create a threadpool on demand within the foreachPartitoin > function, then handoff the REST calls to that threadpool, get back the > futures and wait for them to finish. Should be pretty straightforward. Make

Re: RDD of ImmutableList

2015-10-07 Thread Sean Owen
I think Java's immutable collections are fine with respect to Kryo -- that's not the same as Guava. On Wed, Oct 7, 2015 at 11:56 AM, Jakub Dubovsky wrote: > I did not realize that Scala's and Java's immutable collections use > different APIs, which causes this. Thank you for the reminder. This makes
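If Guava's ImmutableList is the type that actually needs to survive Kryo, a sketch of one common workaround, assuming the third-party de.javakaffee:kryo-serializers library is on the classpath (Spark itself ships no Guava serializers):

```scala
import com.esotericsoftware.kryo.Kryo
import de.javakaffee.kryoserializers.guava.ImmutableListSerializer
import org.apache.spark.serializer.KryoRegistrator

// Register a serializer that understands Guava's ImmutableList internals.
class GuavaRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit =
    ImmutableListSerializer.registerSerializers(kryo)
}
// Enable with:
//   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
//   --conf spark.kryo.registrator=GuavaRegistrator
```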

What happens in the master or slave launch ?

2015-10-07 Thread Camelia Elena Ciolac
Hello, I have a question about two scenarios: 1) in the first scenario (when I'm connected to the target node) the master starts successfully. Its log contains: Spark Command: /usr/opt/java/jdk1.7.0_07/jre/bin/java -cp /home/camelia/spark-1.4.1-bin-hadoop2.6/sbin/../conf/:/home/camelia/s

Re: Notification on Spark Streaming job failure

2015-10-07 Thread Adrian Tanase
We’re deploying using YARN in cluster mode, to take advantage of automatic restart of long running streaming app. We’ve also done a POC on top of Mesos+Marathon, that’s always an option. For monitoring / alerting, we’re using a combination of: * Spark REST API queried from OpsView via nagio

RE: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-07 Thread Khandeshi, Ami
Tried multiple permutations of setting home… Still the same issue > Sys.setenv(SPARK_HOME="c:\\DevTools\\spark-1.5.1") > .libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"),.libPaths())) > library(SparkR) Attaching package: ‘SparkR’ The following objects are masked from ‘package:stats’: fi

Re: GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-07 Thread Ted Yu
Hemant: Can you post the code snippet to the mailing list - other people would be interested. On Wed, Oct 7, 2015 at 5:50 AM, Hemant Bhanawat wrote: > Will send you the code on your email id. > > On Wed, Oct 7, 2015 at 4:37 PM, Ophir Cohen wrote: > >> Thanks! >> Can you check if you can provide

RE: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-07 Thread Goodall, Mark (UK)
I would like to say that I have also had this issue, in two situations: one using Accumulo to store information, and also when running multiple streaming jobs within the same streaming context (e.g. multiple saves to HDFS). In my case the situation worsens when one of the jobs, which has a long s

Re: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-07 Thread Cody Koeninger
When you say that the largest difference is from metrics.collect, how are you measuring that? Wouldn't that be the difference between max(partitionT1) and sparkT1, not sparkT0 and sparkT1? As for further places to look, what's happening in the logs during that time? Are the number of messages per

Temp files are not removed when done (Mesos)

2015-10-07 Thread Alexei Bakanov
Hi, I'm running Spark 1.5.1 on Mesos in coarse-grained mode, with no dynamic allocation or shuffle service. I see that there are two types of temporary files under the /tmp folder associated with every executor: /tmp/spark- and /tmp/blockmgr-. When the job is finished, /tmp/spark- is gone, but the blockmgr directory

Re: hiveContext sql number of tasks

2015-10-07 Thread Deng Ching-Mallete
Hi, You can do coalesce(N), where N is the number of partitions you want it reduced to, after loading the data into an RDD. HTH, Deng On Wed, Oct 7, 2015 at 6:34 PM, patcharee wrote: > Hi, > > I do a sql query on about 10,000 partitioned orc files. Because of the > partition schema the files c
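A minimal sketch of the suggestion (N = 200 is an arbitrary example). Unlike repartition, coalesce merges existing partitions without a shuffle, so it is a cheap way to cut the task count:

```scala
// Reduce ~10,000 input partitions to 200 for downstream processing.
val df = hiveContext.sql(sqlText).coalesce(200)
```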

This post has NOT been accepted by the mailing list yet.

2015-10-07 Thread akhandeshi
I seem to see this for many of my posts... does anyone have a solution? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/This-post-has-NOT-been-accepted-by-the-mailing-list-yet-tp24969.html Sent from the Apache Spark User List mailing list archive at Nabble.co

spark performance non-linear response

2015-10-07 Thread Yadid Ayzenberg
Hi All, I'm using Spark 1.4.1 to analyze a largish data set (several gigabytes of data). The RDD is partitioned into 2048 partitions which are more or less equal and entirely cached in RAM. I evaluated the performance on several cluster sizes, and am witnessing a non-linear (power) performan

Graceful shutdown drops processing in Spark Streaming

2015-10-07 Thread Michal Čizmazia
After triggering the graceful shutdown on the following application, the application stops before the windowed stream reaches its slide duration. As a result, the data is not completely processed (i.e. saveToMyStorage is not called) before shutdown. According to the documentation, graceful shutdow
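A minimal sketch of the setup described, assuming a 15-minute window and slide over a shorter batch interval (conf, stream, and saveToMyStorage are illustrative stand-ins for the poster's code):

```scala
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val ssc = new StreamingContext(conf, Seconds(10))          // short batch interval
val windowed = stream.window(Minutes(15), Minutes(15))     // 15-minute slide duration
windowed.foreachRDD(rdd => saveToMyStorage(rdd))           // may be skipped on shutdown
ssc.start()
// ... later, when the shutdown is triggered:
ssc.stop(stopSparkContext = true, stopGracefully = true)
```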

Re: Temp files are not removed when done (Mesos)

2015-10-07 Thread Iulian Dragoș
It is indeed a bug. I believe the shutdown procedure in #7820 only kicks in when the external shuffle service is enabled (a pre-requisite of dynamic allocation). As a workaround you can use dynamic allocation (you can set spark.dynamicAllocation.maxExecutors and spark.dynamicAllocation.minExecutors

Re: How can I disable logging when running local[*]?

2015-10-07 Thread Alex Kozlov
Hmm, clearly the parameter is not passed to the program. This is likely an Activator issue. I wonder how you specify the other parameters, like driver memory, num cores, etc. Just out of curiosity, can you run a program: import org.apache.spark.SparkConf val out=new SparkConf(true).get("spa

Re: Temp files are not removed when done (Mesos)

2015-10-07 Thread Iulian Dragoș
https://issues.apache.org/jira/browse/SPARK-10975 On Wed, Oct 7, 2015 at 11:36 AM, Iulian Dragoș wrote: > It is indeed a bug. I believe the shutdown procedure in #7820 only kicks > in when the external shuffle service is enabled (a pre-requisite of dynamic > allocation). As a workaround you can

Re: spark performance non-linear response

2015-10-07 Thread Yadid Ayzenberg
Additional missing relevant information: I'm running a transformation; there are no shuffles occurring, and at the end I'm performing a lookup of 4 partitions on the driver. On 10/7/15 11:26 AM, Yadid Ayzenberg wrote: Hi All, I'm using Spark 1.4.1 to analyze a largish data set (several Gig

Re: spark performance non-linear response

2015-10-07 Thread Sean Owen
OK, next question then is: if this is wall-clock time for the whole process, then, I wonder if you are just measuring the time taken by the longest single task. I'd expect the time taken by the longest straggler task to follow a distribution like this. That is, how balanced are the partitions? Are

Re: spark performance non-linear response

2015-10-07 Thread Jonathan Coveney
I've noticed this as well and am curious if there is anything more people can say. My theory is that it is just communication overhead. If you only have a couple of gigabytes (a tiny dataset), then splitting that onto 50 nodes means you'll have a ton of tiny partitions all finishing very quickly, a

DataFrame with bean class

2015-10-07 Thread VJ
I have a bean class defined as follows: class result { private String name; public result() { }; public String getname () { return name; } public void setname (String s) { name = s; } } I then define DataFrame x = SqlContext.createDataFrame(myrdd, result.class); x.show() When I run this job, I am

Re: This post has NOT been accepted by the mailing list yet.

2015-10-07 Thread Richard Hillegas
Hi Akhandeshi, It may be that you are not seeing your own posts because you are sending from a gmail account. See for instance https://support.google.com/a/answer/1703601?hl=en Hope this helps, Rick Hillegas STSM, IBM Analytics, Platform - IBM USA akhandeshi wrote on 10/07/2015 08:10:32 AM:

Re: Does feature parity exist between Spark and PySpark

2015-10-07 Thread sethah
Regarding features, the general workflow for the Spark community when adding new features is to first add them in Scala (since Spark is written in Scala). Once this is done, a Jira ticket will be created requesting that the feature be added to the Python API (example - SPARK-9773

Optimal way to avoid processing null returns in Spark Scala

2015-10-07 Thread swetha
Hi, I have the following functions that I am using for my job in Scala. If you look at the getSessionId function, I am returning null sometimes. If I return null, the only way that I can avoid processing those records is by filtering out null records. I wanted to avoid having another pass for filtering
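One hedged way to avoid both the nulls and the extra filtering pass: have the function return Option and use flatMap, so invalid records are dropped in the same traversal. A sketch, assuming records is an RDD[String]; getSessionId's body here is an illustrative stand-in for the poster's logic:

```scala
// Returning Option instead of null makes the "no session" case explicit.
def getSessionId(record: String): Option[String] = {
  val idx = record.indexOf("sessionId=")                  // illustrative parsing
  if (idx >= 0) Some(record.substring(idx + "sessionId=".length)) else None
}

// flatMap unwraps Some(...) and drops None -- no second filter pass needed.
val sessions = records.flatMap(r => getSessionId(r).map(id => (id, r)))
```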

Re: Spark job workflow engine recommendations

2015-10-07 Thread Hien Luu
The spark job type was added recently - see this pull request https://github.com/azkaban/azkaban-plugins/pull/195. You can leverage the SLA feature to kill a job if it ran longer than expected. BTW, we just solved the scalability issue by supporting multiple executors. Within a week or two, the

Re: Does feature parity exist between Spark and PySpark

2015-10-07 Thread Michael Armbrust
> > At my company we use Avro heavily and it's not been fun when i've tried to > work with complex avro schemas and python. This may not be relevant to you > however...otherwise I found Python to be a great fit for Spark :) > Have you tried using https://github.com/databricks/spark-avro ? It shoul
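For reference, a sketch of reading Avro through the spark-avro package mentioned above, assuming the com.databricks:spark-avro artifact is on the classpath (the path is illustrative):

```scala
// Load an Avro file as a DataFrame via the external spark-avro data source.
val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("/path/to/events.avro")
df.printSchema()
```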

Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-07 Thread Joseph Bradley
Hi YiZhi Liu, The spark.ml classes are part of the higher-level "Pipelines" API, which works with DataFrames. When creating this API, we decided to separate it from the old API to avoid confusion. You can read more about it here: http://spark.apache.org/docs/latest/ml-guide.html For (3): We use

Re: Spark job workflow engine recommendations

2015-10-07 Thread Vikram Kone
Hien, I saw this pull request and from what I understand this is geared towards running spark jobs over hadoop. We are using spark over cassandra and not sure if this new jobtype supports that. I haven't seen any documentation in regards to how to use this spark job plugin, so that I can test it ou

Spark standalone hangup during shuffle flatMap or explode in cluster

2015-10-07 Thread Saif.A.Ellafi
When running stand-alone cluster mode job, the process hangs up randomly during a DataFrame flatMap or explode operation, in HiveContext: -->> df.flatMap(r => for (n <- 1 to r.getInt(ind)) yield r) This does not happen either with SQLContext in cluster, or Hive/SQL in local mode, where it works

Re: compatibility issue with Jersey2

2015-10-07 Thread Gary Ogden
What you suggested seems to have worked for unit tests. But now it throws this at run time on mesos with spark-submit: Exception in thread "main" java.lang.LinkageError: loader constraint violation: when resolving method "org.slf4j.impl.StaticLoggerBinder.getLoggerFactory()Lorg/slf4j/ILoggerFactor

Re: Spark job workflow engine recommendations

2015-10-07 Thread Nick Pentreath
We're also using Azkaban for scheduling, and we simply use spark-submit via shell scripts. It works fine. The auto-retry feature with a large number of retries (like 100 or 1000 perhaps) should take care of long-running jobs with restarts on failure. We haven't used it for streaming yet tho

Re: Spark standalone hangup during shuffle flatMap or explode in cluster

2015-10-07 Thread Sean Owen
-dev Is r.getInt(ind) very large in some cases? I think there's not quite enough info here. On Wed, Oct 7, 2015 at 6:23 PM, wrote: > When running stand-alone cluster mode job, the process hangs up randomly > during a DataFrame flatMap or explode operation, in HiveContext: > > -->> df.flatMap(r

RE: Spark standalone hangup during shuffle flatMap or explode in cluster

2015-10-07 Thread Saif.A.Ellafi
It can be large, yes. But that still does not resolve the question of why it works in a smaller environment, i.e. local[32], or in cluster mode when using SQLContext instead of HiveContext. The process, in general, is a RowNumber() HiveQL operation, which is why I need HiveContext. I have the feeli

Parquet file size

2015-10-07 Thread Younes Naguib
Hi, I'm reading a large tsv file, and creating parquet files using sparksql: insert overwrite table tbl partition(year, month, day) Select from tbl_tsv; This works nicely, but generates small parquet files (15MB). I want to generate larger files; any idea how to address this? Thanks,

Re: compatibility issue with Jersey2

2015-10-07 Thread Marcelo Vanzin
Seems like you might be running into https://issues.apache.org/jira/browse/SPARK-10910. I've been busy with other things but plan to take a look at that one when I find time... right now I don't really have a solution, other than making sure your application's jars do not include those classes the

Re: spark multi tenancy

2015-10-07 Thread Steve Loughran
On 7 Oct 2015, at 11:06, ayan guha wrote: Can queues also be used to separate workloads? Yes; that's standard practice. Different YARN queues can have different maximum memory & CPU, and you can even tag queues as "pre-emptible", so more important work can kill c

Re: Parquet file size

2015-10-07 Thread Cheng Lian
Why do you want larger files? Doesn't the resulting Parquet file contain all the data in the original TSV file? Cheng On 10/7/15 11:07 AM, Younes Naguib wrote: Hi, I’m reading a large tsv file, and creating parquet files using sparksql: insert overwrite table tbl partition(year, month, day)..

Spark Streaming: Doing operation in Receiver vs RDD

2015-10-07 Thread emiretsk
Hi, I have a Spark Streaming program that is consuming message from Kafka and has to decrypt and deserialize each message. I can implement it either as Kafka deserializer (that will run in a receiver or the new receiver-less Kafka consumer) or as RDD operations. What are the pros/cons of each?
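If done as a DStream/RDD operation rather than in a Kafka deserializer, a common shape is mapPartitions, so a non-serializable cipher is created once per partition instead of once per message. A sketch under assumptions (kafkaStream is a DStream of key/value byte arrays; createCipher, decrypt, and deserialize are hypothetical helpers):

```scala
val events = kafkaStream.mapPartitions { messages =>
  val cipher = createCipher()              // hypothetical: init once per partition
  messages.map { case (_, bytes) =>
    deserialize(decrypt(cipher, bytes))    // hypothetical decrypt + deserialize
  }
}
```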

Fwd: multiple count distinct in SQL/DataFrame?

2015-10-07 Thread Reynold Xin
Adding user list too. -- Forwarded message -- From: Reynold Xin Date: Tue, Oct 6, 2015 at 5:54 PM Subject: Re: multiple count distinct in SQL/DataFrame? To: "d...@spark.apache.org" To provide more context, if we do remove this feature, the following SQL query would throw an A

RE: Parquet file size

2015-10-07 Thread Younes Naguib
The original TSV file is 600GB and generated 40K files of 15-25MB. y From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: October-07-15 3:18 PM To: Younes Naguib; 'user@spark.apache.org' Subject: Re: Parquet file size Why do you want larger files? Doesn't the resulting Parquet file contain all th

Re: Parquet file size

2015-10-07 Thread Cheng Lian
The reason why so many small files are generated is probably the fact that you are inserting into a partitioned table with three partition columns. If you want large Parquet files, you may try to either avoid using a partitioned table, or use fewer partition columns (e.g., only year,

RE: Parquet file size

2015-10-07 Thread Younes Naguib
Well, I only have data for 2015-08, so in the end, only 31 partitions. What I'm looking for is some reasonably sized partitions. In any case, just the idea of controlling the output parquet file size or number would be nice. Younes Naguib Streaming Division Triton Digital | 1440 Ste-Catheri

Re: Asking about the trend of increasing latency, hbase spikes.

2015-10-07 Thread Ted Yu
This question should be directed to user@. Can you use a third-party site for the images? They didn't go through. On Wed, Oct 7, 2015 at 5:35 PM, UI-JIN LIM wrote: > Hi. This is Ui Jin Lim in Korea, LG CNS > > We have set up and are operating hbase 0.98.13 for our customer, an iptv > company. >

Re: Parquet file size

2015-10-07 Thread Deng Ching-Mallete
Hi, In our case, we're using the org.apache.hadoop.mapreduce.lib.input.FileInputFormat.SPLIT_MINSIZE to increase the size of the RDD partitions when loading text files, so it would generate larger parquet files. We just set it in the Hadoop conf of the SparkContext. You need to be careful though a
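A sketch of that setting, applied to the SparkContext's Hadoop conf before loading the TSV (256MB is an arbitrary example value):

```scala
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

// Raise the minimum split size so input partitions -- and hence the
// output Parquet files -- come out larger.
sc.hadoopConfiguration.set(FileInputFormat.SPLIT_MINSIZE,
  (256L * 1024 * 1024).toString)
```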

Re: Graceful shutdown drops processing in Spark Streaming

2015-10-07 Thread Tathagata Das
Aaah, interesting, you are doing a 15-minute slide duration. Yeah, internally the streaming scheduler waits for the last "batch" interval which has data to be processed, but if there is a sliding interval (i.e. 15 mins) that is higher than the batch interval, then that might not be run. This is indeed a

RE: Parquet file size

2015-10-07 Thread Younes Naguib
Thanks, I'll try that. Younes Naguib Streaming Division Triton Digital | 1440 Ste-Catherine W., Suite 1200 | Montreal, QC H3G 1R8 Tel.: +1 514 448 4037 x2688 | Tel.: +1 866 448 4037 x2688 | younes.nag...@tritondigital.com

Re: SparkSQL: First query execution is always slower than subsequent queries

2015-10-07 Thread Michael Armbrust
-dev +user > 1). Is that the reason why it's always slow in the first run? Or are there > any other reasons? Apparently it loads data to memory every time so it > shouldn't be something to do with disk reads, should it? You are probably seeing the effect of the JVM's JIT. The first run is executing

Re: How can I disable logging when running local[*]?

2015-10-07 Thread Dean Wampler
Try putting the following in the SBT build file, e.g., ./build.sbt: // Fork when running, so you can ... fork := true // ... add a JVM option to use when forking a JVM for 'run' javaOptions += "-Dlog4j.configuration=file:/" dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

Is coalesce smart while merging partitions?

2015-10-07 Thread Cesar Flores
It is my understanding that the default behavior of the coalesce function, when the user reduces the number of partitions, is to only merge them without executing a shuffle. My question is: is this merging smart? For example, does Spark try to merge the small partitions first, or is the selection of partitions t
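For context, a sketch of the two modes of coalesce (the target of 10 partitions is arbitrary); only the shuffle = true form actually rebalances data, which is why plain coalesce can leave uneven partitions:

```scala
val merged   = rdd.coalesce(10)                  // no shuffle: co-located partitions are unioned
val balanced = rdd.coalesce(10, shuffle = true)  // full shuffle: equivalent to repartition(10)
```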

Re: Graceful shutdown drops processing in Spark Streaming

2015-10-07 Thread Michal Čizmazia
Thanks! Done. https://issues.apache.org/jira/browse/SPARK-10995 On 7 October 2015 at 21:24, Tathagata Das wrote: > Aaah, interesting, you are doing 15 minute slide duration. Yeah, > internally the streaming scheduler waits for the last "batch" interval > which has data to be processed, but if t

Re: GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-07 Thread Hemant Bhanawat
Please find attached. On Wed, Oct 7, 2015 at 7:36 PM, Ted Yu wrote: > Hemant: > Can you post the code snippet to the mailing list - other people would be > interested. > > On Wed, Oct 7, 2015 at 5:50 AM, Hemant Bhanawat > wrote: > >> Will send you the code on your email id. >> >> On Wed, Oct

Default size of a datatype in SparkSQL

2015-10-07 Thread vivek bhaskar
I want to understand what the default size for a given datatype is used for. The following link mentions that it is for internal size estimation. https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DataType.html The above behavior is also reflected in code, where the default value seems to be used f

Running Spark in Yarn-client mode

2015-10-07 Thread Sushrut Ikhar
Hi, I am new to Spark and I have been trying to run Spark in yarn-client mode. I get this error in yarn logs : Error: Could not find or load main class org.apache.spark.executor.CoarseGrainedExecutorBackend Also, I keep getting these warnings: WARN YarnScheduler: Initial job has not accepted any

Re: Running Spark in Yarn-client mode

2015-10-07 Thread Jean-Baptiste Onofré
Hi Sushrut, which packaging of Spark do you use (spark-hadoop-x)? Do you have a working YARN cluster (with at least one worker)? Regards, JB On 10/08/2015 07:23 AM, Sushrut Ikhar wrote: Hi, I am new to Spark and I have been trying to run Spark in yarn-client mode. I get this error in yarn l

Re: Running Spark in Yarn-client mode

2015-10-07 Thread Sushrut Ikhar
Hey Jean, Thanks for the quick response. I am using Spark 1.4.1 pre-built with Hadoop 2.6. Yes, the YARN cluster has multiple running worker nodes. It would be a great help if you could tell me how to look for the executor logs. Regards, Sushrut Ikhar about.me/sushrutikhar