Re: Request for FP-Growth source code

2021-06-28 Thread Eduardus Hardika Sandy Atmaja
Yes, it is working now. Thank you very much. Best Regards, Eduardus Hardika Sandy Atmaja From: Russell Spitzer Sent: Monday, June 28, 2021 11:22 PM To: Eduardus Hardika Sandy Atmaja Cc: user Subject: Re: Request for FP-Growth source code Sorry wrong repository

Request for FP-Growth source code

2021-06-28 Thread Eduardus Hardika Sandy Atmaja
request. Best Regards, Eduardus Hardika Sandy Atmaja

[ML] Linear regression with SGD

2018-07-13 Thread sandy
problems with LinearRegressionSGD and saying that it is slower than L-BFGS but I am not sure what they mean. Shouldn’t SGD be better? Is there any plan to make those functions available again in the new DataFrame-based API? Thank you, Sandy -- Sent from: http://apache-spark-user-list.1001560

singular value decomposition in Spark ML

2016-08-04 Thread Sandy Ryza
Hi, Is SVD or PCA in Spark ML (i.e. spark.ml parity with the mllib RowMatrix.computeSVD API) slated for any upcoming release? Many thanks for any guidance! -Sandy

Re: Content based window operation on Time-series data

2015-12-17 Thread Sandy Ryza
Hi Arun, A Java API was actually recently added to the library. It will be available in the next release. -Sandy On Thu, Dec 10, 2015 at 12:16 AM, Arun Verma wrote: > Thank you for your reply. It is a Scala and Python library. Is similar > library exists for Java? > > On Wed, De

Re: PySpark Lost Executors

2015-11-19 Thread Sandy Ryza
Hi Ross, This is most likely occurring because YARN is killing containers for exceeding physical memory limits. You can make this less likely to happen by bumping spark.yarn.executor.memoryOverhead to something higher than 10% of your spark.executor.memory. -Sandy On Thu, Nov 19, 2015 at 8:14
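The fix described here can be applied at submit time. A minimal sketch, with illustrative values (the overhead must simply exceed the ~10% default of spark.executor.memory):

```shell
# Illustrative spark-submit flags (Spark 1.x property names).
# 1024 MB overhead for a 4 GB executor is an example, not a recommendation.
spark-submit \
  --master yarn \
  --executor-memory 4g \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  my_app.py
```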

Re: SequenceFile and object reuse

2015-11-18 Thread Sandy Ryza
record, because it avoids the unnecessary overhead of creating Java objects. As you've pointed out, this is at the expense of making the code more verbose when caching. -Sandy On Fri, Nov 13, 2015 at 10:29 AM, jeff saremi wrote: > So we tried reading a sequencefile in Spark and realized that

Re: Is the resources specified in configuration shared by all jobs?

2015-11-04 Thread Sandy Ryza
Hi Nisrina, The resources you specify are shared by all jobs that run inside the application. -Sandy On Wed, Nov 4, 2015 at 9:24 AM, Nisrina Luthfiyati < nisrina.luthfiy...@gmail.com> wrote: > Hi all, > > I'm running some spark jobs in java on top of YARN by submitting o

Re: Spark tunning increase number of active tasks

2015-10-31 Thread Sandy Ryza
/03/how-to-tune-your-apache-spark-jobs-part-2/ has a more detailed explanation of why this happens. -Sandy On Sat, Oct 31, 2015 at 4:29 AM, Jörn Franke wrote: > Maybe Hortonworks support can help you much better. > > Otherwise you may want to change the yarn scheduler configuration

Re: Spark 1.5 on CDH 5.4.0

2015-10-22 Thread Sandy Ryza
using the -Pyarn flag. -Sandy On Thu, Oct 22, 2015 at 9:04 AM, Deenar Toraskar wrote: > Hi I have got the prebuilt version of Spark 1.5 for Hadoop 2.6 ( > http://www.apache.org/dyn/closer.lua/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz) > working with CDH 5.4.0 in local mode on

Re: Custom Hadoop InputSplit, Spark partitions, spark executors/task and Yarn containers

2015-09-23 Thread Sandy Ryza
stage including the InputSplits gets submitted, Spark will try to request an appropriate number of executors. The memory in the YARN resource requests is --executor-memory + what's set for spark.yarn.executor.memoryOverhead, which defaults to 10% of --executor-memory. -Sandy On Wed, Sep 23, 2015
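The container sizing described above can be sketched as arithmetic; this assumes the Spark 1.x default of max(384 MB, 10% of executor memory) for the overhead:

```shell
# YARN container request = executor memory + memoryOverhead
# (default overhead: max(384 MB, 10% of --executor-memory)).
executor_mem_mb=4096
overhead_mb=$(( executor_mem_mb / 10 ))
if [ "$overhead_mb" -lt 384 ]; then overhead_mb=384; fi
container_mb=$(( executor_mem_mb + overhead_mb ))
echo "$container_mb"   # prints 4505 for a 4 GB executor
```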

Re: Spark on Yarn vs Standalone

2015-09-21 Thread Sandy Ryza
ty to give the executors some additional headroom above the heap space. -Sandy On Mon, Sep 21, 2015 at 5:43 PM, Saisai Shao wrote: > I think you need to increase the memory size of executor through command > arguments "--executor-memory", or configuration "s

Re: Spark on Yarn vs Standalone

2015-09-10 Thread Sandy Ryza
YARN will never kill processes for being unresponsive. It may kill processes for occupying more memory than it allows. To get around this, you can either bump spark.yarn.executor.memoryOverhead or turn off the memory checks entirely with yarn.nodemanager.pmem-check-enabled. -Sandy On Tue, Sep

Re: Driver OOM after upgrading to 1.5

2015-09-09 Thread Sandy Ryza
Java 7. FWIW I was just able to get it to work by increasing MaxPermSize to 256m. -Sandy On Wed, Sep 9, 2015 at 11:37 AM, Reynold Xin wrote: > Java 7 / 8? > > On Wed, Sep 9, 2015 at 10:10 AM, Sandy Ryza > wrote: > >> I just upgraded the spark-timeseries >> <htt

Driver OOM after upgrading to 1.5

2015-09-09 Thread Sandy Ryza
6064 2:163428 21112648 3: 12638 14459192 4: 12638 13455904 5: 105397642528 Not sure whether this is suspicious. Any ideas? -Sandy

Re: Spark on Yarn vs Standalone

2015-09-08 Thread Sandy Ryza
Those settings seem reasonable to me. Are you observing performance that's worse than you would expect? -Sandy On Mon, Sep 7, 2015 at 11:22 AM, Alexander Pivovarov wrote: > Hi Sandy > > Thank you for your reply > Currently we use r3.2xlarge boxes (vCPU: 8, Mem: 61 GiB) >

Re: Spark on Yarn vs Standalone

2015-09-07 Thread Sandy Ryza
Hi Alex, If they're both configured correctly, there's no reason that Spark Standalone should provide performance or memory improvement over Spark on YARN. -Sandy On Fri, Sep 4, 2015 at 1:24 PM, Alexander Pivovarov wrote: > Hi Everyone > > We are trying the latest aws emr-

Re: Spark Effects of Driver Memory, Executor Memory, Driver Memory Overhead and Executor Memory Overhead on success of job runs

2015-08-31 Thread Sandy Ryza
completes less quickly, have you checked to see whether YARN is killing any containers? It could be that the job completes more slowly because, without the memory overhead, YARN kills containers while it's running. So it needs to run some tasks multiple times. -Sandy On Sat, Aug 29, 2015 at 6:

Re: How to avoid executor time out on yarn spark while dealing with large shuffle skewed data?

2015-08-20 Thread Sandy Ryza
se task metrics. -Sandy On Thu, Aug 20, 2015 at 8:54 AM, Umesh Kacha wrote: > Hi where do I see GC time in UI? I have set spark.yarn.executor.memoryOverhead > as 3500 which seems to be good enough I believe. So you mean only GC could > be the reason behind timeout I checked Yarn logs I

Re: How to avoid executor time out on yarn spark while dealing with large shuffle skewed data?

2015-08-20 Thread Sandy Ryza
may be killing your executors for using too much off-heap space. You can see whether this is happening by looking in the Spark AM or YARN NodeManager logs. -Sandy On Thu, Aug 20, 2015 at 7:39 AM, Umesh Kacha wrote: > Hi thanks much for the response. Yes I tried default settings too 0.2 it >

Re: How to avoid executor time out on yarn spark while dealing with large shuffle skewed data?

2015-08-20 Thread Sandy Ryza
What version of Spark are you using? Have you set any shuffle configs? On Wed, Aug 19, 2015 at 11:46 AM, unk1102 wrote: > I have one Spark job which seems to run fine but after one hour or so > executor start getting lost because of time out something like the > following > error > > cluster.ya

Re: Executors on multiple nodes

2015-08-16 Thread Sandy Ryza
-allocation . -Sandy On Sat, Aug 15, 2015 at 6:40 AM, Mohit Anchlia wrote: > I am running on Yarn and do have a question on how spark runs executors on > different data nodes. Is that primarily decided based on number of > receivers? > > What do I need to do to ensure that mul

Re: Boosting spark.yarn.executor.memoryOverhead

2015-08-11 Thread Sandy Ryza
Hi Eric, This is likely because you are putting the parameter after the primary resource (latest_msmtdt_by_gridid_and_source.py), which makes it a parameter to your application instead of a parameter to Spark. -Sandy On Wed, Aug 12, 2015 at 4:40 AM, Eric Bless wrote: > Previously I
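The flag-ordering point above can be sketched as follows (overhead value illustrative):

```shell
# Wrong: flags after the primary resource become application arguments.
# spark-submit latest_msmtdt_by_gridid_and_source.py --conf spark.yarn.executor.memoryOverhead=1024

# Right: all Spark options go before the primary resource.
spark-submit \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  latest_msmtdt_by_gridid_and_source.py
```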

Re: Spark on YARN

2015-08-08 Thread Sandy Ryza
Hi Jem, Do they fail with any particular exception? Does YARN just never end up giving them resources? Does an application master start? If so, what are in its logs? If not, anything suspicious in the YARN ResourceManager logs? -Sandy On Fri, Aug 7, 2015 at 1:48 AM, Jem Tucker wrote: >

Re: [General Question] [Hadoop + Spark at scale] Spark Rack Awareness ?

2015-07-19 Thread Sandy Ryza
Hi Mike, Spark is rack-aware in its task scheduling. Currently Spark doesn't honor any locality preferences when scheduling executors, but this is being addressed in SPARK-4352, after which executor-scheduling will be rack-aware as well. -Sandy On Sat, Jul 18, 2015 at 6:25 PM, Mike Fra

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-17 Thread Sandy Ryza
Can you try setting the spark.yarn.jar property to make sure it points to the jar you're thinking of? -Sandy On Fri, Jul 17, 2015 at 11:32 AM, Arun Ahuja wrote: > Yes, it's a YARN cluster and using spark-submit to run. I have SPARK_HOME > set to the directory above and using

Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Sandy Ryza
Hi Jonathan, This is a problem that has come up for us as well, because we'd like dynamic allocation to be turned on by default in some setups, but not break existing users with these properties. I'm hoping to figure out a way to reconcile these by Spark 1.5. -Sandy On Wed, Jul 15,

Re: How to restrict disk space for spark caches on yarn?

2015-07-13 Thread Sandy Ryza
which happens to be in a local directory that YARN gives it. Based on its title, if YARN-882 were resolved, it would do nothing to limit the amount of on-disk cache space Spark could use. -Sandy On Mon, Jul 13, 2015 at 6:57 AM, Peter Rudenko wrote: > Hi Andrew, here's what i found. M

Re: Pyspark not working on yarn-cluster mode

2015-07-10 Thread Sandy Ryza
To add to this, conceptually, it makes no sense to launch something in yarn-cluster mode by creating a SparkContext on the client - the whole point of yarn-cluster mode is that the SparkContext runs on the cluster, not on the client. On Thu, Jul 9, 2015 at 2:35 PM, Marcelo Vanzin wrote: > You ca

Re: Remote spark-submit not working with YARN

2015-07-08 Thread Sandy Ryza
Strange. Does the application show up at all in the YARN web UI? Does application_1436314873375_0030 show up at all in the YARN ResourceManager logs? -Sandy On Wed, Jul 8, 2015 at 3:32 PM, Juan Gordon wrote: > Hello Sandy, > > Yes I'm sure that YARN has the enought resources, i

Re: Remote spark-submit not working with YARN

2015-07-08 Thread Sandy Ryza
Hi JG, One way this can occur is that YARN doesn't have enough resources to run your job. Have you verified that it does? Are you able to submit using the same command from a node on the cluster? -Sandy On Wed, Jul 8, 2015 at 3:19 PM, jegordon wrote: > I'm trying to submit a s

Re: Executors requested are way less than what i actually got

2015-06-26 Thread Sandy Ryza
The scheduler configurations are helpful as well, but not useful without the information outlined above. -Sandy On Fri, Jun 26, 2015 at 10:34 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > These are my YARN queue configurations > > Queue State:RUNNINGUsed Capacity:206.7%Absolute Used Capacity:3.1

Re: Executors requested are way less than what i actually got

2015-06-25 Thread Sandy Ryza
How many nodes do you have, how much space is allocated to each node for YARN, how big are the executors you're requesting, and what else is running on the cluster? On Thu, Jun 25, 2015 at 3:57 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > I run Spark App on Spark 1.3.1 over YARN. > > When i request --num-executor

Re: When to use underlying data management layer versus standalone Spark?

2015-06-24 Thread Sandy Ryza
stems need random access to your data, you'd want to consider a system like HBase or Cassandra, though these are likely to suffer a little bit on performance and incur higher operational overhead. -Sandy On Tue, Jun 23, 2015 at 11:21 PM, Sonal Goyal wrote: > When you deploy spark ove

Re: Spark launching without all of the requested YARN resources

2015-06-24 Thread Sandy Ryza
Hi Arun, You can achieve this by setting spark.scheduler.maxRegisteredResourcesWaitingTime to some really high number and spark.scheduler.minRegisteredResourcesRatio to 1.0. -Sandy On Wed, Jun 24, 2015 at 2:21 AM, Steve Loughran wrote: > > On 24 Jun 2015, at 05:55, canan chen
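A sketch of the two settings named above (wait time is in milliseconds in Spark 1.x; the one-hour value and app.jar name are illustrative):

```shell
# Block job start until all requested executors have registered.
spark-submit \
  --conf spark.scheduler.maxRegisteredResourcesWaitingTime=3600000 \
  --conf spark.scheduler.minRegisteredResourcesRatio=1.0 \
  app.jar
```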

Re: Velox Model Server

2015-06-20 Thread Sandy Ryza
Oops, that link was for Oryx 1. Here's the repo for Oryx 2: https://github.com/OryxProject/oryx On Sat, Jun 20, 2015 at 10:20 AM, Sandy Ryza wrote: > Hi Debasish, > > The Oryx project (https://github.com/cloudera/oryx), which is Apache 2 > licensed, contains a model server that

Re: Velox Model Server

2015-06-20 Thread Sandy Ryza
Hi Debasish, The Oryx project (https://github.com/cloudera/oryx), which is Apache 2 licensed, contains a model server that can serve models built with MLlib. -Sandy On Sat, Jun 20, 2015 at 8:00 AM, Charles Earl wrote: > Is velox NOT open source? > > > On Saturday, June 20, 2015,

Re: [SparkScore] Performance portal for Apache Spark

2015-06-17 Thread Sandy Ryza
This looks really awesome. On Tue, Jun 16, 2015 at 10:27 AM, Huang, Jie wrote: > Hi All > > We are happy to announce Performance portal for Apache Spark > http://01org.github.io/sparkscore/ ! > > The Performance Portal for Apache Spark provides performance data on the > Spark upsteam to the com

Re: deployment options for Spark and YARN w/ many app jar library dependencies

2015-06-17 Thread Sandy Ryza
Hi Matt, If you place your jars on HDFS in a public location, YARN will cache them on each node after the first download. You can also use the spark.executor.extraClassPath config to point to them. -Sandy On Wed, Jun 17, 2015 at 4:47 PM, Sweeney, Matt wrote: > Hi folks, > > I’m l
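A sketch of the HDFS-staging approach (paths and jar name are hypothetical); YARN localizes the jar once per node and serves it from its cache afterwards:

```shell
# Stage dependency jars once in a world-readable HDFS location.
hdfs dfs -mkdir -p /public/libs
hdfs dfs -put app-deps.jar /public/libs/

# Reference them by HDFS URI; YARN caches them on each node.
spark-submit \
  --master yarn \
  --jars hdfs:///public/libs/app-deps.jar \
  app.jar
```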

Re: Dynamic allocator requests -1 executors

2015-06-13 Thread Sandy Ryza
Hi Patrick, I'm noticing that you're using Spark 1.3.1. We fixed a bug in dynamic allocation in 1.4 that permitted requesting negative numbers of executors. Any chance you'd be able to try with the newer version and see if the problem persists? -Sandy On Fri, Jun 12, 2015 at 7

Re: Determining number of executors within RDD

2015-06-10 Thread Sandy Ryza
On YARN, there is no concept of a Spark Worker. Multiple executors will be run per node without any effort required by the user, as long as all the executors fit within each node's resource limits. -Sandy On Wed, Jun 10, 2015 at 3:24 PM, Evo Eftimov wrote: > Yes i think it is ONE wo

Re: How to run spark streaming application on YARN?

2015-06-04 Thread Sandy Ryza
That might work, but there might also be other steps that are required. -Sandy On Thu, Jun 4, 2015 at 11:13 AM, Saiph Kappa wrote: > Thanks! It is working fine now with spark-submit. Just out of curiosity, > how would you use org.apache.spark.deploy.yarn.Client? Adding that > spark_ya

Re: How to run spark streaming application on YARN?

2015-06-04 Thread Sandy Ryza
spark-submit is the recommended way of launching Spark applications on YARN, because it takes care of submitting the right jars as well as setting up the classpath and environment variables appropriately. -Sandy On Thu, Jun 4, 2015 at 10:30 AM, Saiph Kappa wrote: > No, I am not. I run it w

Re: How to run spark streaming application on YARN?

2015-06-04 Thread Sandy Ryza
Hi Saiph, Are you launching using spark-submit? -Sandy On Thu, Jun 4, 2015 at 10:20 AM, Saiph Kappa wrote: > Hi, > > I've been running my spark streaming application in standalone mode > without any worries. Now, I've been trying to run it on YARN (hadoop 2.7.0)

Re: data localisation in spark

2015-06-03 Thread Sandy Ryza
reducebyKey with parallelism = 10. If there are fewer slots to run tasks than tasks, the tasks will just be run serially. -Sandy On Tue, Jun 2, 2015 at 11:24 AM, Shushant Arora wrote: > So in spark is after acquiring executors from ClusterManeger, does tasks > are scheduled on executors

Re: data localisation in spark

2015-06-02 Thread Sandy Ryza
needs to be requested before Spark knows what tasks it will run. Although dynamic allocation improves that last part. -Sandy On Tue, Jun 2, 2015 at 9:55 AM, Shushant Arora wrote: > Is it possible in JavaSparkContext ? > > JavaSparkContext jsc = new JavaSparkContext(conf); >

Re: data localisation in spark

2015-05-31 Thread Sandy Ryza
Hi Shushant, Spark currently makes no effort to request executors based on data locality (although it does try to schedule tasks within executors based on data locality). We're working on adding this capability at SPARK-4352 <https://issues.apache.org/jira/browse/SPARK-4352>. -Sa

Re: yarn-cluster spark-submit process not dying

2015-05-28 Thread Sandy Ryza
Hi Corey, As of this PR https://github.com/apache/spark/pull/5297/files, this can be controlled with spark.yarn.submit.waitAppCompletion. -Sandy On Thu, May 28, 2015 at 11:48 AM, Corey Nolet wrote: > I am submitting jobs to my yarn cluster via the yarn-cluster mode and I'm > notic
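The property from that PR can be set at submit time; a sketch (requires Spark 1.4+, app name hypothetical):

```shell
# Return as soon as the YARN application is submitted,
# instead of polling its status until completion.
spark-submit \
  --master yarn-cluster \
  --conf spark.yarn.submit.waitAppCompletion=false \
  app.jar
```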

Re: number of executors

2015-05-18 Thread Sandy Ryza
Awesome! It's documented here: https://spark.apache.org/docs/latest/submitting-applications.html -Sandy On Mon, May 18, 2015 at 8:03 PM, xiaohe lan wrote: > Hi Sandy, > > Thanks for your information. Yes, spark-submit --master yarn > --num-executors 5 --executor-cores 4 &g

Re: number of executors

2015-05-18 Thread Sandy Ryza
*All On Mon, May 18, 2015 at 9:07 AM, Sandy Ryza wrote: > Hi Xiaohe, > > The all Spark options must go before the jar or they won't take effect. > > -Sandy > > On Sun, May 17, 2015 at 8:59 AM, xiaohe lan > wrote: > >> Sorry, them both are assigned task

Re: number of executors

2015-05-18 Thread Sandy Ryza
Hi Xiaohe, The all Spark options must go before the jar or they won't take effect. -Sandy On Sun, May 17, 2015 at 8:59 AM, xiaohe lan wrote: > Sorry, them both are assigned task actually. > > Aggregated Metrics by Executor > Executor IDAddressTask TimeTotal TasksFail

Re: Expert advise needed. (POC is at crossroads)

2015-04-30 Thread Sandy Ryza
-your-apache-spark-jobs-part-1/ http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ -Sandy On Thu, Apr 30, 2015 at 5:03 PM, java8964 wrote: > Really not expert here, but try the following ideas: > > 1) I assume you are using yarn, then this blog is very good

Re: Question about Memory Used and VCores Used

2015-04-29 Thread Sandy Ryza
/ -Sandy On Tue, Apr 28, 2015 at 7:12 PM, bit1...@163.com wrote: > Hi,guys, > I have the following computation with 3 workers: > spark-sql --master yarn --executor-memory 3g --executor-cores 2 > --driver-memory 1g -e 'select count(*) from table' > > The resources used are s

Re: Running beyond physical memory limits

2015-04-15 Thread Sandy Ryza
The setting to increase is spark.yarn.executor.memoryOverhead On Wed, Apr 15, 2015 at 6:35 AM, Brahma Reddy Battula < brahmareddy.batt...@huawei.com> wrote: > Hello Sean Owen, > > Thanks for your reply..I"ll increase overhead memory and check it.. > > > Bytheway ,Any difference between 1.1 and 1.

Re: Spark: Using "node-local" files within functions?

2015-04-14 Thread Sandy Ryza
me file, a better option would be to pass the file in with the --files option when you spark-submit, which will cache the file between executors on the same node. -Sandy On Tue, Apr 14, 2015 at 1:39 AM, Horsmann, Tobias < tobias.horsm...@uni-due.de> wrote: > Hi, > > I am trying to
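A sketch of the --files approach with the filename from this thread; YARN caches the file per node, and the job reads it via SparkFiles:

```shell
# Ship a small lookup file alongside the job.
spark-submit --master yarn-cluster --files badFullIPs.csv app.py

# Inside the job (PySpark):
#   from pyspark import SparkFiles
#   path = SparkFiles.get("badFullIPs.csv")
```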

Re: Rack locality

2015-04-13 Thread Sandy Ryza
Hi Riya, As far as I know, that is correct, unless Mesos fine-grained mode handles this in some mysterious way. -Sandy On Mon, Apr 13, 2015 at 2:09 PM, rcharaya wrote: > I want to use Rack locality feature of Apache Spark in my application. > > Is YARN the only resource manager which

Re: Spark Job Run Resource Estimation ?

2015-04-09 Thread Sandy Ryza
Hi Deepak, I'm going to shamelessly plug my blog post on tuning Spark: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ It talks about tuning executor size as well as how the number of tasks for a stage is calculated. -Sandy On Thu, Apr 9, 2015 at 9:21 AM,

Re: Strategy regarding maximum number of executor's failure for log running jobs/ spark streaming jobs

2015-04-06 Thread Sandy Ryza
pr 1, 2015 at 7:08 PM, twinkle sachdeva wrote: > Hi, > > Thanks Sandy. > > > Another way to look at this is that would we like to have our long running > application to die? > > So let's say, we create a window of around 10 batches, and we are using > incremen

Re: Data locality across jobs

2015-04-02 Thread Sandy Ryza
so the records corresponding to a particular partition at the end of the first job can end up split across multiple partitions in the second job. -Sandy On Wed, Apr 1, 2015 at 9:09 PM, kjsingh wrote: > Hi, > > We are running an hourly job using Spark 1.2 on Yarn. It saves an RDD of > Tuple2.

Re: Strategy regarding maximum number of executor's failure for log running jobs/ spark streaming jobs

2015-04-01 Thread Sandy Ryza
That's a good question, Twinkle. One solution could be to allow a maximum number of failures within any given time span. E.g. a max failures per hour property. -Sandy On Tue, Mar 31, 2015 at 11:52 PM, twinkle sachdeva < twinkle.sachd...@gmail.com> wrote: > Hi, > > In spark

Re: Cross-compatibility of YARN shuffle service

2015-03-26 Thread Sandy Ryza
Hi Matt, I'm not sure whether we have documented compatibility guidelines here. However, a strong goal is to keep the external shuffle service compatible so that many versions of Spark can run against the same shuffle service. -Sandy On Wed, Mar 25, 2015 at 6:44 PM, Matt Cheah wrote:

Re: What is best way to run spark job in "yarn-cluster" mode from java program(servlet container) and NOT using spark-submit command.

2015-03-26 Thread Sandy Ryza
Creating a SparkContext and setting master as yarn-cluster unfortunately will not work. SPARK-4924 added APIs for doing this in Spark, but won't be included until 1.4. -Sandy On Tue, Mar 17, 2015 at 3:19 AM, Akhil Das wrote: > Create SparkContext set master as yarn-cluster then run

Re: issue while submitting Spark Job as --master yarn-cluster

2015-03-25 Thread Sandy Ryza
Hi Sachin, It appears that the application master is failing. To figure out what's wrong you need to get the logs for the application master. -Sandy On Wed, Mar 25, 2015 at 7:05 AM, Sachin Singh wrote: > OS I am using Linux, > when I will run simply as master yarn, its r

Re: Is yarn-standalone mode deprecated?

2015-03-24 Thread Sandy Ryza
I checked and apparently it hasn't been released yet. It will be available in the upcoming CDH 5.4 release. -Sandy On Mon, Mar 23, 2015 at 1:32 PM, Nitin kak wrote: > I know there was an effort for this, do you know which version of Cloudera > distribution we could find that? > &g

Re: How to avoid being killed by YARN node manager ?

2015-03-24 Thread Sandy Ryza
Hi Yuichiro, The way to avoid this is to boost spark.yarn.executor.memoryOverhead until the executors have enough off-heap memory to avoid going over their limits. -Sandy On Tue, Mar 24, 2015 at 11:49 AM, Yuichiro Sakamoto wrote: > Hello. > > We use ALS(Collaborative filtering) of Sp

Re: Why doesn't the --conf parameter work in yarn-cluster mode (but works in yarn-client and local)?

2015-03-24 Thread Sandy Ryza
Ah, yes, I believe this is because only properties prefixed with "spark" get passed on. The purpose of the "--conf" option is to allow passing Spark properties to the SparkConf, not to add general key-value pairs to the JVM system properties. -Sandy On Tue, Mar 24, 2015 at

Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: "e04"

2015-03-24 Thread Sandy Ryza
Steve, that's correct, but the problem only shows up when different versions of the YARN jars are included on the classpath. -Sandy On Tue, Mar 24, 2015 at 6:29 AM, Steve Loughran wrote: > > > On 24 Mar 2015, at 02:10, Marcelo Vanzin wrote: > > > > This happens most

Re: Shuffle Spill Memory and Shuffle Spill Disk

2015-03-23 Thread Sandy Ryza
, and the on-disk version can be compressed as well. -Sandy On Mon, Mar 23, 2015 at 5:29 PM, Bijay Pathak wrote: > Hello, > > I am running TeraSort <https://github.com/ehiggs/spark-terasort> on > 100GB of data. The final metrics I am getting on Shuffle Spill are: > > Shuf

Re: Why doesn't the --conf parameter work in yarn-cluster mode (but works in yarn-client and local)?

2015-03-23 Thread Sandy Ryza
Hi Emre, The --conf property is meant to work with yarn-cluster mode. System.getProperty("key") isn't guaranteed, but new SparkConf().get("key") should. Does it not? -Sandy On Mon, Mar 23, 2015 at 8:39 AM, Emre Sevinc wrote: > Hello, > > According
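The prefix rule described above can be sketched as follows (the spark.myapp.mode key is hypothetical):

```shell
# Only keys with the "spark." prefix are propagated into the SparkConf;
# --conf is not a general mechanism for JVM system properties.
spark-submit \
  --master yarn-cluster \
  --conf spark.myapp.mode=batch \
  app.jar

# In the driver: new SparkConf().get("spark.myapp.mode") yields "batch",
# whereas System.getProperty is not guaranteed to see it in yarn-cluster mode.
```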

Re: Is yarn-standalone mode deprecated?

2015-03-23 Thread Sandy Ryza
The former is deprecated. However, the latter is functionally equivalent to it. Both launch an app in what is now called "yarn-cluster" mode. Oozie now also has a native Spark action, though I'm not familiar with the specifics. -Sandy On Mon, Mar 23, 2015 at 1:01 PM, Nitin kak

Re: Is yarn-standalone mode deprecated?

2015-03-23 Thread Sandy Ryza
The mode is not deprecated, but the name "yarn-standalone" is now deprecated. It's now referred to as "yarn-cluster". -Sandy On Mon, Mar 23, 2015 at 11:49 AM, nitinkak001 wrote: > Is yarn-standalone mode deprecated in Spark now. The reason I am asking is > becau

Re: No executors allocated on yarn with latest master branch

2015-03-09 Thread Sandy Ryza
> > On Sat, Feb 21, 2015 at 12:05 AM, Sandy Ryza > wrote: > >> Are you using the capacity scheduler or fifo scheduler without multi >> resource scheduling by any chance? >> >> On Thu, Feb 12, 2015 at 1:51 PM, Anders Arpteg >> wrote: >> >>>

Re: No executors allocated on yarn with latest master branch

2015-02-20 Thread Sandy Ryza
ontainermanager.ContainerManagerImpl: > Event EventType: FINISH_APPLICATION sent to absent application > application_1422406067005_0053 > > On Thu, Feb 12, 2015 at 10:38 PM, Sandy Ryza > wrote: > >> It seems unlikely to me that it would be a 2.2 issue, though not entire

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
That's all correct. -Sandy On Fri, Feb 20, 2015 at 1:23 PM, Kelvin Chu <2dot7kel...@gmail.com> wrote: > Hi Sandy, > > I appreciate your clear explanation. Let me try again. It's the best way > to confirm I understand. > > spark.executor.memory + spark.yarn.ex

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
spark.storage.memoryFraction (default 0.6) and spark.shuffle.memoryFraction (default 0.2), and the rest is for basic Spark bookkeeping and anything the user does inside UDFs. -Sandy On Fri, Feb 20, 2015 at 11:44 AM, Kelvin Chu <2dot7kel...@gmail.com> wrote: > Hi Sandy, > > I am also doing memory tunin

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
If that's the error you're hitting, the fix is to boost spark.yarn.executor.memoryOverhead, which will put some extra room in between the executor heap sizes and the amount of memory requested for them from YARN. -Sandy On Fri, Feb 20, 2015 at 9:40 AM, lbierman wrote: > A bit mo

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
Are you specifying the executor memory, cores, or number of executors anywhere? If not, you won't be taking advantage of the full resources on the cluster. -Sandy On Fri, Feb 20, 2015 at 2:41 AM, Sean Owen wrote: > None of this really points to the problem. These indicate that worker

Re: build spark for cdh5

2015-02-18 Thread Sandy Ryza
Hi Koert, You should be using "-Phadoop-2.3" instead of "-Phadoop2.3". -Sandy On Wed, Feb 18, 2015 at 10:51 AM, Koert Kuipers wrote: > does anyone have the right maven invocation for cdh5 with yarn? > i tried: > $ mvn -Phadoop2.3 -Dhadoop.version=2.5.0-cdh5
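The corrected invocation looks like this (the CDH version string is illustrative; the original message truncates it):

```shell
# Note the hyphen in the profile name: -Phadoop-2.3, not -Phadoop2.3.
mvn -Phadoop-2.3 -Pyarn -Dhadoop.version=2.5.0-cdh5.3.0 -DskipTests clean package
```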

Re: No executors allocated on yarn with latest master branch

2015-02-12 Thread Sandy Ryza
It seems unlikely to me that it would be a 2.2 issue, though not entirely impossible. Are you able to find any of the container logs? Is the NodeManager launching containers and reporting some exit code? -Sandy On Thu, Feb 12, 2015 at 1:21 PM, Anders Arpteg wrote: > No, not submitting f

Re: Why can't Spark find the classes in this Jar?

2015-02-12 Thread Sandy Ryza
What version of Java are you using? Core NLP dropped support for Java 7 in its 3.5.0 release. Also, the correct command line option is --jars, not --addJars. On Thu, Feb 12, 2015 at 12:03 PM, Deborah Siegel wrote: > Hi Abe, > I'm new to Spark as well, so someone else could answer better. A few

Re: No executors allocated on yarn with latest master branch

2015-02-12 Thread Sandy Ryza
at >> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:551) >> at >> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:155) >> at >> org.apache.spark.deploy.SparkSubmit$.submit(SparkS

feeding DataFrames into predictive algorithms

2015-02-11 Thread Sandy Ryza
bel column that I can feed into a regression. So far all the paths I've gone down have led me to internal APIs or convoluted casting in and out of RDD[Row] and DataFrame. Is there a simple way of accomplishing this? any assistance (lookin' at you Xiangrui) much appreciated, Sandy

Re: No executors allocated on yarn with latest master branch

2015-02-11 Thread Sandy Ryza
Hi Anders, I just tried this out and was able to successfully acquire executors. Any strange log messages or additional color you can provide on your setup? Does yarn-client mode work? -Sandy On Wed, Feb 11, 2015 at 1:28 PM, Anders Arpteg wrote: > Hi, > > Compiled the latest master

Re: Open file limit settings for Spark on Yarn job

2015-02-10 Thread Sandy Ryza
Hi Arun, The limit for the YARN user on the cluster nodes should be all that matters. What version of Spark are you using? If you can turn on sort-based shuffle it should solve this problem. -Sandy On Tue, Feb 10, 2015 at 1:16 PM, Arun Luthra wrote: > Hi, > > I'm running Spar

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-10 Thread Sandy Ryza
new StreamingContext(sparkConf, Seconds(bucketSecs)) > > val sc = new SparkContext() > > On Tue, Feb 10, 2015 at 1:02 PM, Sandy Ryza > wrote: > >> Is the SparkContext you're using the same one that the StreamingContext >> wraps? If not, I don't think using

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-10 Thread Sandy Ryza
Is the SparkContext you're using the same one that the StreamingContext wraps? If not, I don't think using two is supported. -Sandy On Tue, Feb 10, 2015 at 9:58 AM, Jon Gregg wrote: > I'm still getting an error. Here's my code, which works successfully when >

Re: Resource allocation in yarn-cluster mode

2015-02-10 Thread Sandy Ryza
when yarn.scheduler.maximum-allocation-mb is exceeded. The reason it doesn't just use a smaller amount of memory is because it could be surprising to the user to find out they're silently getting less memory than they requested. Also, I don't think YARN exposes this up front so Spark has no way t

Re: advice on diagnosing Spark stall for 1.5hr out of 3.5hr job?

2015-02-08 Thread Sandy Ryza
I wouldn't be concerned by those ResourceManager log messages. What would be concerning would be if the NodeManager reported that it was killing containers for exceeding resource limits. -Sandy On Wed, Feb 4, 2015 at 10:19 AM, Michael Albert wrote: > Greetings! > > Thanks t

Re: Spark impersonation

2015-02-07 Thread Sandy Ryza
https://issues.apache.org/jira/browse/SPARK-5493 currently tracks this. -Sandy On Mon, Feb 2, 2015 at 9:37 PM, Zhan Zhang wrote: > I think you can configure hadoop/hive to do impersonation. There is no > difference between secure or insecure hadoop cluster by using kinit. >

Re: getting error when submit spark with master as yarn

2015-02-07 Thread Sandy Ryza
--executor-memory and --driver-memory when you launch your Spark job. -Sandy On Sat, Feb 7, 2015 at 10:04 AM, sachin Singh wrote: > Hi, > when I am trying to execute my program as > spark-submit --master yarn --class com.mytestpack.analysis.SparkTest > sparktest-1.jar > >

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-06 Thread Sandy Ryza
ark.rdd.RDD[String]". > > Leaving it as an RDD and then constantly joining I think will be too slow > for a streaming job. > > On Thu, Feb 5, 2015 at 8:06 PM, Sandy Ryza > wrote: > >> Hi Jon, >> >> You'll need to put the file on HDFS (or whatever distribu

Re: Problems with GC and time to execute with different number of executors.

2015-02-06 Thread Sandy Ryza
:8020/tmp/sparkTest/ file22.bin > parameters > > This is what I executed with different values in num-executors and > executor-memory. > What do you think there are too many executors for those HDDs? Could > it be the reason because of each executor takes more time? > > 2015-02-06 9:36

Re: Problems with GC and time to execute with different number of executors.

2015-02-06 Thread Sandy Ryza
That's definitely surprising to me that you would be hitting a lot of GC for this scenario. Are you setting --executor-cores and --executor-memory? What are you setting them to? -Sandy On Thu, Feb 5, 2015 at 10:17 AM, Guillermo Ortiz wrote: > Any idea why if I use more containers I g

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-05 Thread Sandy Ryza
Hi Jon, You'll need to put the file on HDFS (or whatever distributed filesystem you're running on) and load it from there. -Sandy On Thu, Feb 5, 2015 at 3:18 PM, YaoPau wrote: > I have a file "badFullIPs.csv" of bad IP addresses used for filtering. In > yarn-client

Re: Problems with GC and time to execute with different number of executors.

2015-02-04 Thread Sandy Ryza
Hi Guillermo, What exactly do you mean by "each iteration"? Are you caching data in memory? -Sandy On Wed, Feb 4, 2015 at 5:02 AM, Guillermo Ortiz wrote: > I execute a job in Spark where I'm processing a file of 80Gb in HDFS. > I have 5 slaves: > (32cores /256Gb / 7p

Re: advice on diagnosing Spark stall for 1.5hr out of 3.5hr job?

2015-02-04 Thread Sandy Ryza
Also, do you see any lines in the YARN NodeManager logs where it says that it's killing a container? -Sandy On Wed, Feb 4, 2015 at 8:56 AM, Imran Rashid wrote: > Hi Michael, > > judging from the logs, it seems that those tasks are just working a really > long time. If you

Re: running 2 spark applications in parallel on yarn

2015-02-01 Thread Sandy Ryza
Hi Tomer, Are you able to look in your NodeManager logs to see if the NodeManagers are killing any executors for exceeding memory limits? If you observe this, you can solve the problem by bumping up spark.yarn.executor.memoryOverhead. -Sandy On Sun, Feb 1, 2015 at 5:28 AM, Tomer Benyamini

Re: Duplicate key when sorting BytesWritable with Kryo?

2015-01-30 Thread Sandy Ryza
Filed https://issues.apache.org/jira/browse/SPARK-5500 for this. -Sandy On Fri, Jan 30, 2015 at 11:59 AM, Aaron Davidson wrote: > Ah, this is in particular an issue due to sort-based shuffle (it was not > the case for hash-based shuffle, which would immediately serialize each > reco

Re: Duplicate key when sorting BytesWritable with Kryo?

2015-01-30 Thread Sandy Ryza
* If you plan to directly cache Hadoop writable objects, you should first copy them using * a `map` function. This should probably say "directly caching *or directly shuffling*". To sort directly from a sequence file, the records need to be cloned first. -Sandy On Fri, Ja

Re: HW imbalance

2015-01-30 Thread Sandy Ryza
ase memory, > the more jobs you can run. > > This is of course assuming you could over subscribe a node in terms of cpu > cores if you have memory available. > > YMMV > > HTH > -Mike > > On Jan 30, 2015, at 7:10 AM, Sandy Ryza wrote: > > My answer was based off t
