singular value decomposition in Spark ML

2016-08-04 Thread Sandy Ryza
Hi, Is SVD or PCA in Spark ML (i.e. spark.ml parity with the mllib RowMatrix.computeSVD API) slated for any upcoming release? Many thanks for any guidance! -Sandy

Re: Content based window operation on Time-series data

2015-12-17 Thread Sandy Ryza
Hi Arun, A Java API was actually recently added to the library. It will be available in the next release. -Sandy On Thu, Dec 10, 2015 at 12:16 AM, Arun Verma wrote: > Thank you for your reply. It is a Scala and Python library. Does a similar > library exist for Java? > > On Wed, Dec 9, 2015 at 1

Re: PySpark Lost Executors

2015-11-19 Thread Sandy Ryza
Hi Ross, This is most likely occurring because YARN is killing containers for exceeding physical memory limits. You can make this less likely to happen by bumping spark.yarn.executor.memoryOverhead to something higher than 10% of your spark.executor.memory. -Sandy On Thu, Nov 19, 2015 at 8:14 A
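A minimal sketch of that fix (sizes are illustrative placeholders, not tuned recommendations; my_job.py is a hypothetical script):

    # Default overhead is max(384 MB, 10% of executor memory); bump it to ~15-20%.
    spark-submit --master yarn \
      --executor-memory 8g \
      --conf spark.yarn.executor.memoryOverhead=1536 \
      my_job.py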

Re: SequenceFile and object reuse

2015-11-18 Thread Sandy Ryza
Hi Jeff, Many access patterns simply take the result of hadoopFile and use it to create some other object, and thus have no need for each input record to refer to a different object. In those cases, the current API is more performant than an alternative that would create an object for each record
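A sketch of the copy-on-read pattern this implies, assuming Text keys and values (the path is a placeholder):

    import org.apache.hadoop.io.Text
    // Hadoop's RecordReader reuses one Writable per record, so copy each
    // record into an immutable value before caching.
    val rdd = sc.sequenceFile[Text, Text]("hdfs:///path/to/data")
      .map { case (k, v) => (k.toString, v.toString) }
      .cache()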

Re: Is the resources specified in configuration shared by all jobs?

2015-11-04 Thread Sandy Ryza
Hi Nisrina, The resources you specify are shared by all jobs that run inside the application. -Sandy On Wed, Nov 4, 2015 at 9:24 AM, Nisrina Luthfiyati < nisrina.luthfiy...@gmail.com> wrote: > Hi all, > > I'm running some spark jobs in java on top of YARN by submitting one > application jar tha

Re: Spark tuning increase number of active tasks

2015-10-31 Thread Sandy Ryza
Hi Xiaochuan, The most likely cause of the "Lost container" issue is that YARN is killing containers for exceeding memory limits. If this is the case, you should be able to find instances of "exceeding memory limits" in the application logs. http://blog.cloudera.com/blog/2015/03/how-to-tune-your-

Re: Spark 1.5 on CDH 5.4.0

2015-10-22 Thread Sandy Ryza
Hi Deenar, The version of Spark you have may not be compiled with YARN support. If you inspect the contents of the assembly jar, does org.apache.spark.deploy.yarn.ExecutorLauncher exist? If not, you'll need to find a version that does have the YARN classes. You can also build your own using the
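One way to run the check suggested here, assuming the assembly jar is on the local filesystem (the jar name is illustrative):

    jar tf spark-assembly-1.5.0-hadoop2.6.0.jar \
      | grep org/apache/spark/deploy/yarn/ExecutorLauncher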

Re: Custom Hadoop InputSplit, Spark partitions, spark executors/task and Yarn containers

2015-09-23 Thread Sandy Ryza
Hi Anfernee, That's correct that each InputSplit will map to exactly one Spark partition. On YARN, each Spark executor maps to a single YARN container. Each executor can run multiple tasks over its lifetime, both in parallel and sequentially. If you enable dynamic allocation, after the stage includi

Re: Spark on Yarn vs Standalone

2015-09-21 Thread Sandy Ryza
0-0-28-96.ec2.internal --cores 8 --app-id >> application_1442869100946_0001 --user-class-path >> file:/mnt/yarn/usercache/hadoop/appcache/application_1442869100946_0001/container_1442869100946_0001_01_56/__app__.jar >> 1> >> /var/log/hadoop-yarn/containers/applicat

Re: Spark on Yarn vs Standalone

2015-09-10 Thread Sandy Ryza
executor might be unresponsive because > of GC or it might occupy more memory than Yarn allows) > > > > On Tue, Sep 8, 2015 at 3:02 PM, Sandy Ryza > wrote: > >> Those settings seem reasonable to me. >> >> Are you observing performance that's worse than you wo

Re: Driver OOM after upgrading to 1.5

2015-09-09 Thread Sandy Ryza
Java 7. FWIW I was just able to get it to work by increasing MaxPermSize to 256m. -Sandy On Wed, Sep 9, 2015 at 11:37 AM, Reynold Xin wrote: > Java 7 / 8? > > On Wed, Sep 9, 2015 at 10:10 AM, Sandy Ryza > wrote: > >> I just upgraded the spark-timeseries >> <htt

Driver OOM after upgrading to 1.5

2015-09-09 Thread Sandy Ryza
I just upgraded the spark-timeseries project to run on top of 1.5, and I'm noticing that tests are failing with OOMEs. I ran a jmap -histo on the process and discovered the top heap items to be: 1:163428 22236064 2:163428

Re: Spark on Yarn vs Standalone

2015-09-08 Thread Sandy Ryza
> > Does it look good for you? (we run single heavy job on cluster) > > Alex > > On Mon, Sep 7, 2015 at 11:03 AM, Sandy Ryza > wrote: > >> Hi Alex, >> >> If they're both configured correctly, there's no reason that Spark >> Standalone shoul

Re: Spark on Yarn vs Standalone

2015-09-07 Thread Sandy Ryza
Hi Alex, If they're both configured correctly, there's no reason that Spark Standalone should provide a performance or memory improvement over Spark on YARN. -Sandy On Fri, Sep 4, 2015 at 1:24 PM, Alexander Pivovarov wrote: > Hi Everyone > > We are trying the latest aws emr-4.0.0 and Spark and m

Re: Spark Effects of Driver Memory, Executor Memory, Driver Memory Overhead and Executor Memory Overhead on success of job runs

2015-08-31 Thread Sandy Ryza
Hi Timothy, For your first question, you would need to look in the logs and provide additional information about why your job is failing. The SparkContext shutting down could happen for a variety of reasons. In the situation where you give more memory, but less memory overhead, and the job compl

Re: How to avoid executor time out on yarn spark while dealing with large shuffle skewed data?

2015-08-20 Thread Sandy Ryza
did not see any GC error > there. Please guide. Thanks much. > > On Thu, Aug 20, 2015 at 8:14 PM, Sandy Ryza > wrote: > >> Moving this back onto user@ >> >> Regarding GC, can you look in the web UI and see whether the "GC time" >> metric dominates th

Re: How to avoid executor time out on yarn spark while dealing with large shuffle skewed data?

2015-08-20 Thread Sandy Ryza
r first executor > gets lost things are messing. > On Aug 20, 2015 7:59 PM, "Sandy Ryza" wrote: > >> What sounds most likely is that you're hitting heavy garbage collection. >> Did you hit issues when the shuffle memory fraction was at its default of >>

Re: How to avoid executor time out on yarn spark while dealing with large shuffle skewed data?

2015-08-20 Thread Sandy Ryza
What version of Spark are you using? Have you set any shuffle configs? On Wed, Aug 19, 2015 at 11:46 AM, unk1102 wrote: > I have one Spark job which seems to run fine but after one hour or so > executors start getting lost because of a timeout, something like the > following > error > > cluster.ya

Re: Executors on multiple nodes

2015-08-16 Thread Sandy Ryza
Hi Mohit, It depends on whether dynamic allocation is turned on. If not, the number of executors is specified by the user with the --num-executors option. If dynamic allocation is turned on, refer to the doc for details: https://spark.apache.org/docs/1.4.0/job-scheduling.html#dynamic-resource-al
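The two modes, as a hedged sketch (app jar and counts are placeholders):

    # Static allocation: the executor count is fixed at submit time.
    spark-submit --master yarn --num-executors 10 my_app.jar

    # Dynamic allocation: executors are requested and released as load changes.
    spark-submit --master yarn \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      my_app.jar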

Re: Boosting spark.yarn.executor.memoryOverhead

2015-08-11 Thread Sandy Ryza
Hi Eric, This is likely because you are putting the parameter after the primary resource (latest_msmtdt_by_gridid_and_source.py), which makes it a parameter to your application instead of a parameter to Spark. -Sandy On Wed, Aug 12, 2015 at 4:40 AM, Eric Bless wrote: > Previously I was getting
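The ordering issue, sketched with the script name from this thread (the overhead value is illustrative):

    # Wrong: anything after the primary resource becomes an application argument.
    spark-submit latest_msmtdt_by_gridid_and_source.py \
      --conf spark.yarn.executor.memoryOverhead=1024

    # Right: Spark options go before the primary resource.
    spark-submit --conf spark.yarn.executor.memoryOverhead=1024 \
      latest_msmtdt_by_gridid_and_source.py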

Re: Spark on YARN

2015-08-08 Thread Sandy Ryza
Hi Jem, Do they fail with any particular exception? Does YARN just never end up giving them resources? Does an application master start? If so, what is in its logs? If not, anything suspicious in the YARN ResourceManager logs? -Sandy On Fri, Aug 7, 2015 at 1:48 AM, Jem Tucker wrote: > Hi,

Re: [General Question] [Hadoop + Spark at scale] Spark Rack Awareness ?

2015-07-19 Thread Sandy Ryza
Hi Mike, Spark is rack-aware in its task scheduling. Currently Spark doesn't honor any locality preferences when scheduling executors, but this is being addressed in SPARK-4352, after which executor-scheduling will be rack-aware as well. -Sandy On Sat, Jul 18, 2015 at 6:25 PM, Mike Frampton wr

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-17 Thread Sandy Ryza
Can you try setting the spark.yarn.jar property to make sure it points to the jar you're thinking of? -Sandy On Fri, Jul 17, 2015 at 11:32 AM, Arun Ahuja wrote: > Yes, it's a YARN cluster and using spark-submit to run. I have SPARK_HOME > set to the directory above and using the spark-submit s

Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Sandy Ryza
Hi Jonathan, This is a problem that has come up for us as well, because we'd like dynamic allocation to be turned on by default in some setups, but not break existing users with these properties. I'm hoping to figure out a way to reconcile these by Spark 1.5. -Sandy On Wed, Jul 15, 2015 at 3:18

Re: How to restrict disk space for spark caches on yarn?

2015-07-13 Thread Sandy Ryza
To clear one thing up: the space taken up by data that Spark caches on disk is not related to YARN's "local resource" / "application cache" concept. The latter is a way that YARN provides for distributing bits to worker nodes. The former is just usage of disk by Spark, which happens to be in a loc
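For reference, the Spark-side disk usage in question comes from persistence levels like this (a minimal sketch; the path is a placeholder):

    import org.apache.spark.storage.StorageLevel
    // Blocks that don't fit in memory spill to the executor's local dirs
    // (on YARN, under yarn.nodemanager.local-dirs) as ordinary Spark data,
    // not as YARN "local resources".
    val cached = sc.textFile("hdfs:///logs").persist(StorageLevel.MEMORY_AND_DISK)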

Re: Pyspark not working on yarn-cluster mode

2015-07-10 Thread Sandy Ryza
To add to this, conceptually, it makes no sense to launch something in yarn-cluster mode by creating a SparkContext on the client - the whole point of yarn-cluster mode is that the SparkContext runs on the cluster, not on the client. On Thu, Jul 9, 2015 at 2:35 PM, Marcelo Vanzin wrote: > You ca

Re: Remote spark-submit not working with YARN

2015-07-08 Thread Sandy Ryza
checked it in the WEB > UI page of my cluster > > Also, i'm able to submit the same script in any of the nodes of the > cluster. > > That's why i don't understand whats happening. > > Thanks > > JG > > On Wed, Jul 8, 2015 at 5:26 PM, Sandy Ryz

Re: Remote spark-submit not working with YARN

2015-07-08 Thread Sandy Ryza
Hi JG, One way this can occur is that YARN doesn't have enough resources to run your job. Have you verified that it does? Are you able to submit using the same command from a node on the cluster? -Sandy On Wed, Jul 8, 2015 at 3:19 PM, jegordon wrote: > I'm trying to submit a spark job from a

Re: Executors requested are way less than what i actually got

2015-06-26 Thread Sandy Ryza
output=/user/dvasthimal/epdatasets/viewItem buffersize=128 > maxbuffersize=1068 maxResultSize=200G > > > > > On Thu, Jun 25, 2015 at 4:52 PM, Sandy Ryza > wrote: > >> How many nodes do you have, how much space is allocated to each node for >> YARN, how big are the exe

Re: Executors requested are way less than what i actually got

2015-06-25 Thread Sandy Ryza
How many nodes do you have, how much space is allocated to each node for YARN, how big are the executors you're requesting, and what else is running on the cluster? On Thu, Jun 25, 2015 at 3:57 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > I run Spark App on Spark 1.3.1 over YARN. > > When i request --num-executor

Re: When to use underlying data management layer versus standalone Spark?

2015-06-24 Thread Sandy Ryza
Hi Michael, Spark itself is an execution engine, not a storage system. While it has facilities for caching data in memory, think about these the way you would think about a process on a single machine leveraging memory - the source data needs to be stored somewhere, and you need to be able to acc

Re: Spark launching without all of the requested YARN resources

2015-06-24 Thread Sandy Ryza
Hi Arun, You can achieve this by setting spark.scheduler.maxRegisteredResourcesWaitingTime to some really high number and spark.scheduler.minRegisteredResourcesRatio to 1.0. -Sandy On Wed, Jun 24, 2015 at 2:21 AM, Steve Loughran wrote: > > On 24 Jun 2015, at 05:55, canan chen wrote: > > Why
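A sketch of those two settings together (the wait time is an arbitrary large value; older Spark versions take milliseconds rather than a time string; the jar is a placeholder):

    spark-submit --master yarn \
      --conf spark.scheduler.maxRegisteredResourcesWaitingTime=3600s \
      --conf spark.scheduler.minRegisteredResourcesRatio=1.0 \
      my_app.jar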

Re: Velox Model Server

2015-06-20 Thread Sandy Ryza
Oops, that link was for Oryx 1. Here's the repo for Oryx 2: https://github.com/OryxProject/oryx On Sat, Jun 20, 2015 at 10:20 AM, Sandy Ryza wrote: > Hi Debasish, > > The Oryx project (https://github.com/cloudera/oryx), which is Apache 2 > licensed, contains a model server that

Re: Velox Model Server

2015-06-20 Thread Sandy Ryza
Hi Debasish, The Oryx project (https://github.com/cloudera/oryx), which is Apache 2 licensed, contains a model server that can serve models built with MLlib. -Sandy On Sat, Jun 20, 2015 at 8:00 AM, Charles Earl wrote: > Is velox NOT open source? > > > On Saturday, June 20, 2015, Debasish Das

Re: [SparkScore] Performance portal for Apache Spark

2015-06-17 Thread Sandy Ryza
This looks really awesome. On Tue, Jun 16, 2015 at 10:27 AM, Huang, Jie wrote: > Hi All > > We are happy to announce Performance portal for Apache Spark > http://01org.github.io/sparkscore/ ! > > The Performance Portal for Apache Spark provides performance data on the > Spark upstream to the com

Re: deployment options for Spark and YARN w/ many app jar library dependencies

2015-06-17 Thread Sandy Ryza
Hi Matt, If you place your jars on HDFS in a public location, YARN will cache them on each node after the first download. You can also use the spark.executor.extraClassPath config to point to them. -Sandy On Wed, Jun 17, 2015 at 4:47 PM, Sweeney, Matt wrote: > Hi folks, > > I’m looking to d
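A sketch of the HDFS-caching approach under those assumptions (paths and jar names are placeholders); spark.executor.extraClassPath is the alternative mentioned above:

    # YARN downloads public HDFS resources once per node and caches them.
    hdfs dfs -put mydep.jar /shared/spark-libs/
    spark-submit --master yarn \
      --jars hdfs:///shared/spark-libs/mydep.jar \
      my_app.jar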

Re: Dynamic allocator requests -1 executors

2015-06-13 Thread Sandy Ryza
Hi Patrick, I'm noticing that you're using Spark 1.3.1. We fixed a bug in dynamic allocation in 1.4 that permitted requesting negative numbers of executors. Any chance you'd be able to try with the newer version and see if the problem persists? -Sandy On Fri, Jun 12, 2015 at 7:42 PM, Patrick Wo

Re: Determining number of executors within RDD

2015-06-10 Thread Sandy Ryza
On YARN, there is no concept of a Spark Worker. Multiple executors will be run per node without any effort required by the user, as long as all the executors fit within each node's resource limits. -Sandy On Wed, Jun 10, 2015 at 3:24 PM, Evo Eftimov wrote: > Yes i think it is ONE worker ONE e

Re: How to run spark streaming application on YARN?

2015-06-04 Thread Sandy Ryza
>>> No, I am not. I run it with sbt «sbt "run-main Branchmark"». I thought >>> it was the same thing since I am passing all the configurations through the >>> application code. Is that the problem? >>> >>> On Thu, Jun 4, 2015 at 6:26 PM, Sandy Ryza >

Re: How to run spark streaming application on YARN?

2015-06-04 Thread Sandy Ryza
with sbt «sbt "run-main Branchmark"». I thought it > was the same thing since I am passing all the configurations through the > application code. Is that the problem? > > On Thu, Jun 4, 2015 at 6:26 PM, Sandy Ryza > wrote: > >> Hi Saiph, >> >> Are you la

Re: How to run spark streaming application on YARN?

2015-06-04 Thread Sandy Ryza
Hi Saiph, Are you launching using spark-submit? -Sandy On Thu, Jun 4, 2015 at 10:20 AM, Saiph Kappa wrote: > Hi, > > I've been running my spark streaming application in standalone mode > without any worries. Now, I've been trying to run it on YARN (hadoop 2.7.0) > but I am having some problems

Re: data localisation in spark

2015-06-03 Thread Sandy Ryza
> > does at converting DAG to stages it calculates executors required and then > acquire executors/worker nodes? > > On Tue, Jun 2, 2015 at 11:06 PM, Sandy Ryza > wrote: > >> It is not possible with JavaSparkContext either. The API mentioned below >> cu

Re: data localisation in spark

2015-06-02 Thread Sandy Ryza
It is not possible with JavaSparkContext either. The API mentioned below currently does not have any effect (we should document this). The primary difference between MR and Spark here is that MR runs each task in its own YARN container, while Spark runs multiple tasks within an executor, which ne

Re: data localisation in spark

2015-05-31 Thread Sandy Ryza
Hi Shushant, Spark currently makes no effort to request executors based on data locality (although it does try to schedule tasks within executors based on data locality). We're working on adding this capability at SPARK-4352. -Sandy On Sun, May

Re: yarn-cluster spark-submit process not dying

2015-05-28 Thread Sandy Ryza
Hi Corey, As of this PR https://github.com/apache/spark/pull/5297/files, this can be controlled with spark.yarn.submit.waitAppCompletion. -Sandy On Thu, May 28, 2015 at 11:48 AM, Corey Nolet wrote: > I am submitting jobs to my yarn cluster via the yarn-cluster mode and I'm > noticing the jvm t
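As a sketch, the property that PR introduced (the jar name is a placeholder):

    # Return from spark-submit once the app is accepted by YARN,
    # instead of polling until it completes.
    spark-submit --master yarn-cluster \
      --conf spark.yarn.submit.waitAppCompletion=false \
      my_app.jar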

Re: number of executors

2015-05-18 Thread Sandy Ryza
> target/scala-2.10/simple-project_2.10-1.0.jar --class scala.SimpleApp is > working awesomely. Is there any documentation pointing to this? > > Thanks, > Xiaohe > > On Tue, May 19, 2015 at 12:07 AM, Sandy Ryza > wrote: > >> Hi Xiaohe, >> >> The all

Re: number of executors

2015-05-18 Thread Sandy Ryza
*All On Mon, May 18, 2015 at 9:07 AM, Sandy Ryza wrote: > Hi Xiaohe, > > The all Spark options must go before the jar or they won't take effect. > > -Sandy > > On Sun, May 17, 2015 at 8:59 AM, xiaohe lan > wrote: > >> Sorry, them both are assigned task

Re: number of executors

2015-05-18 Thread Sandy Ryza
Hi Xiaohe, The all Spark options must go before the jar or they won't take effect. -Sandy On Sun, May 17, 2015 at 8:59 AM, xiaohe lan wrote: > Sorry, them both are assigned task actually. > > Aggregated Metrics by Executor > Executor IDAddressTask TimeTotal TasksFailed TasksSucceeded TasksInpu

Re: Expert advise needed. (POC is at crossroads)

2015-04-30 Thread Sandy Ryza
Hi Deepak, I wrote a couple posts with a bunch of different information about how to tune Spark jobs. The second one might be helpful with how to think about tuning the number of partitions and resources. What kind of OOMEs are you hitting? http://blog.cloudera.com/blog/2015/03/how-to-tune-your

Re: Question about Memory Used and VCores Used

2015-04-29 Thread Sandy Ryza
Hi, Good question. The extra memory comes from spark.yarn.executor.memoryOverhead, the space used for the application master, and the way YARN rounds requests up. This explains it in a little more detail: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ -Sandy
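A worked example of that arithmetic, assuming the default overhead of max(384 MB, 10% of executor memory) and a scheduler that rounds requests up to 1024 MB increments:

    spark.executor.memory                   4096 MB
    + spark.yarn.executor.memoryOverhead  +  410 MB   (max(384, 10% of 4096))
    = requested                            4506 MB
    rounded up to a multiple of yarn.scheduler.minimum-allocation-mb (1024)
    = container granted                    5120 MB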

Re: Running beyond physical memory limits

2015-04-15 Thread Sandy Ryza
The setting to increase is spark.yarn.executor.memoryOverhead. On Wed, Apr 15, 2015 at 6:35 AM, Brahma Reddy Battula < brahmareddy.batt...@huawei.com> wrote: > Hello Sean Owen, > > Thanks for your reply. I'll increase overhead memory and check it. > > > By the way, any difference between 1.1 and 1.

Re: Spark: Using "node-local" files within functions?

2015-04-14 Thread Sandy Ryza
Hi Tobias, It should be possible to get an InputStream from an HDFS file. However, if your libraries only work directly on files, then maybe that wouldn't work? If that's the case and different tasks need different files, your way is probably the best way. If all tasks need the same file, a bett
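A minimal sketch of reading an HDFS file as a stream (the path is a placeholder):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    // FSDataInputStream extends java.io.InputStream
    val in = fs.open(new Path("hdfs:///data/model.bin"))
    try {
      // hand `in` to any library that accepts an InputStream
    } finally {
      in.close()
    }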

Re: Rack locality

2015-04-13 Thread Sandy Ryza
Hi Riya, As far as I know, that is correct, unless Mesos fine-grained mode handles this in some mysterious way. -Sandy On Mon, Apr 13, 2015 at 2:09 PM, rcharaya wrote: > I want to use Rack locality feature of Apache Spark in my application. > > Is YARN the only resource manager which supports

Re: Spark Job Run Resource Estimation ?

2015-04-09 Thread Sandy Ryza
Hi Deepak, I'm going to shamelessly plug my blog post on tuning Spark: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ It talks about tuning executor size as well as how the number of tasks for a stage is calculated. -Sandy On Thu, Apr 9, 2015 at 9:21 AM, ÐΞ€ρ@Ҝ

Re: Strategy regarding maximum number of executor failures for long running jobs / spark streaming jobs

2015-04-06 Thread Sandy Ryza
for some x time? > > So, if the application is not able to have minimum n number of executors > within x period of time, then we should fail the application. > > Adding time factor here, will allow some window for spark to get more > executors allocated if some of them fails. >

Re: Data locality across jobs

2015-04-02 Thread Sandy Ryza
This isn't currently a capability that Spark has, though it has definitely been discussed: https://issues.apache.org/jira/browse/SPARK-1061. The primary obstacle at this point is that Hadoop's FileInputFormat doesn't guarantee that each file corresponds to a single split, so the records correspond

Re: Strategy regarding maximum number of executor failures for long running jobs / spark streaming jobs

2015-04-01 Thread Sandy Ryza
That's a good question, Twinkle. One solution could be to allow a maximum number of failures within any given time span. E.g. a max failures per hour property. -Sandy On Tue, Mar 31, 2015 at 11:52 PM, twinkle sachdeva < twinkle.sachd...@gmail.com> wrote: > Hi, > > In spark over YARN, there is

Re: Cross-compatibility of YARN shuffle service

2015-03-26 Thread Sandy Ryza
Hi Matt, I'm not sure whether we have documented compatibility guidelines here. However, a strong goal is to keep the external shuffle service compatible so that many versions of Spark can run against the same shuffle service. -Sandy On Wed, Mar 25, 2015 at 6:44 PM, Matt Cheah wrote: > Hi ever

Re: What is best way to run spark job in "yarn-cluster" mode from java program(servlet container) and NOT using spark-submit command.

2015-03-26 Thread Sandy Ryza
Creating a SparkContext and setting master as yarn-cluster unfortunately will not work. SPARK-4924 added APIs for doing this in Spark, but won't be included until 1.4. -Sandy On Tue, Mar 17, 2015 at 3:19 AM, Akhil Das wrote: > Create SparkContext set master as yarn-cluster then run it as a sta

Re: issue while submitting Spark Job as --master yarn-cluster

2015-03-25 Thread Sandy Ryza
Hi Sachin, It appears that the application master is failing. To figure out what's wrong you need to get the logs for the application master. -Sandy On Wed, Mar 25, 2015 at 7:05 AM, Sachin Singh wrote: > OS I am using Linux, > when I will run simply as master yarn, its running fine, > > Regar

Re: Is yarn-standalone mode deprecated?

2015-03-24 Thread Sandy Ryza
> On Mon, Mar 23, 2015 at 1:13 PM, Sandy Ryza > wrote: > >> The former is deprecated. However, the latter is functionally equivalent >> to it. Both launch an app in what is now called "yarn-cluster" mode. >> >> Oozie now also has a native Spark action,

Re: How to avoid being killed by YARN node manager ?

2015-03-24 Thread Sandy Ryza
Hi Yuichiro, The way to avoid this is to boost spark.yarn.executor.memoryOverhead until the executors have enough off-heap memory to avoid going over their limits. -Sandy On Tue, Mar 24, 2015 at 11:49 AM, Yuichiro Sakamoto wrote: > Hello. > > We use ALS(Collaborative filtering) of Spark MLlib

Re: Why doesn't the --conf parameter work in yarn-cluster mode (but works in yarn-client and local)?

2015-03-24 Thread Sandy Ryza
This is why I was passing it via > --conf and retrieving System.getProperty("key") (which worked locally and > in yarn-client mode but not in yarn-cluster mode). I'm surprised that I > can't use it on the cluster while I can during local development and > testing.

Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: "e04"

2015-03-24 Thread Sandy Ryza
Steve, that's correct, but the problem only shows up when different versions of the YARN jars are included on the classpath. -Sandy On Tue, Mar 24, 2015 at 6:29 AM, Steve Loughran wrote: > > > On 24 Mar 2015, at 02:10, Marcelo Vanzin wrote: > > > > This happens most probably because the Spark

Re: Shuffle Spill Memory and Shuffle Spill Disk

2015-03-23 Thread Sandy Ryza
Hi Bijay, The Shuffle Spill (Disk) is the total number of bytes written to disk by records spilled during the shuffle. The Shuffle Spill (Memory) is the amount of space the spilled records occupied in memory before they were spilled. These differ because the serialized format is more compact, an

Re: Why doesn't the --conf parameter work in yarn-cluster mode (but works in yarn-client and local)?

2015-03-23 Thread Sandy Ryza
Hi Emre, The --conf property is meant to work with yarn-cluster mode. System.getProperty("key") isn't guaranteed, but new SparkConf().get("key") should. Does it not? -Sandy On Mon, Mar 23, 2015 at 8:39 AM, Emre Sevinc wrote: > Hello, > > According to Spark Documentation at > https://spark.apa
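The pattern, sketched with a hypothetical key (note that --conf keys must start with "spark." to be picked up):

    // Submitted as: spark-submit --master yarn-cluster \
    //   --conf spark.myapp.threshold=0.5 app.jar
    import org.apache.spark.SparkConf
    // System.getProperty is not guaranteed on the cluster; SparkConf is.
    val threshold = new SparkConf().get("spark.myapp.threshold")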

Re: Is yarn-standalone mode deprecated?

2015-03-23 Thread Sandy Ryza
yarn-cluster \ > --num-executors 3 \ > --driver-memory 4g \ > --executor-memory 2g \ > --executor-cores 1 \ > --queue thequeue \ > lib/spark-examples*.jar > > > I didn't see example of ./bin/spark-class in 1.2.0 documentation, so am > wondering if that is depreca

Re: Is yarn-standalone mode deprecated?

2015-03-23 Thread Sandy Ryza
The mode is not deprecated, but the name "yarn-standalone" is now deprecated. It's now referred to as "yarn-cluster". -Sandy On Mon, Mar 23, 2015 at 11:49 AM, nitinkak001 wrote: > Is yarn-standalone mode deprecated in Spark now. The reason I am asking is > because while I can find it in 0.9.0

Re: No executors allocated on yarn with latest master branch

2015-03-09 Thread Sandy Ryza
> > On Sat, Feb 21, 2015 at 12:05 AM, Sandy Ryza > wrote: > >> Are you using the capacity scheduler or fifo scheduler without multi >> resource scheduling by any chance? >> >> On Thu, Feb 12, 2015 at 1:51 PM, Anders Arpteg >> wrote: >> >>>

Re: No executors allocated on yarn with latest master branch

2015-02-20 Thread Sandy Ryza
ontainermanager.ContainerManagerImpl: > Event EventType: FINISH_APPLICATION sent to absent application > application_1422406067005_0053 > > On Thu, Feb 12, 2015 at 10:38 PM, Sandy Ryza > wrote: > >> It seems unlikely to me that it would be a 2.2 issue, though not entire

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
> > Right? Again, thanks! > > Kelvin > > On Fri, Feb 20, 2015 at 11:50 AM, Sandy Ryza > wrote: > >> Hi Kelvin, >> >> spark.executor.memory controls the size of the executor heaps. >> >> spark.yarn.executor.memoryOverhead is the amount of m

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
Thanks. > > Kelvin > > On Fri, Feb 20, 2015 at 9:45 AM, Sandy Ryza > wrote: > >> If that's the error you're hitting, the fix is to boost >> spark.yarn.executor.memoryOverhead, which will put some extra room in >> between the executor heap sizes and the a

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
If that's the error you're hitting, the fix is to boost spark.yarn.executor.memoryOverhead, which will put some extra room in between the executor heap sizes and the amount of memory requested for them from YARN. -Sandy On Fri, Feb 20, 2015 at 9:40 AM, lbierman wrote: > A bit more context on th

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
Are you specifying the executor memory, cores, or number of executors anywhere? If not, you won't be taking advantage of the full resources on the cluster. -Sandy On Fri, Feb 20, 2015 at 2:41 AM, Sean Owen wrote: > None of this really points to the problem. These indicate that workers > died b

Re: build spark for cdh5

2015-02-18 Thread Sandy Ryza
Hi Koert, You should be using "-Phadoop-2.3" instead of "-Phadoop2.3". -Sandy On Wed, Feb 18, 2015 at 10:51 AM, Koert Kuipers wrote: > does anyone have the right maven invocation for cdh5 with yarn? > i tried: > $ mvn -Phadoop2.3 -Dhadoop.version=2.5.0-cdh5.2.3 -Pyarn -DskipTests clean > packa

Re: No executors allocated on yarn with latest master branch

2015-02-12 Thread Sandy Ryza
>>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:551) >>> at >>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:155) >>> at >>> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:178) >>>

Re: Why can't Spark find the classes in this Jar?

2015-02-12 Thread Sandy Ryza
What version of Java are you using? Core NLP dropped support for Java 7 in its 3.5.0 release. Also, the correct command line option is --jars, not --addJars. On Thu, Feb 12, 2015 at 12:03 PM, Deborah Siegel wrote: > Hi Abe, > I'm new to Spark as well, so someone else could answer better. A few

Re: No executors allocated on yarn with latest master branch

2015-02-12 Thread Sandy Ryza
at >> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:551) >> at >> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:155) >> at >> org.apache.spark.deploy.SparkSubmit$.submit(SparkS

feeding DataFrames into predictive algorithms

2015-02-11 Thread Sandy Ryza
Hey All, I've been playing around with the new DataFrame and ML pipelines APIs and am having trouble accomplishing what seems like should be a fairly basic task. I have a DataFrame where each column is a Double. I'd like to turn this into a DataFrame with a features column and a label column tha

Re: No executors allocated on yarn with latest master branch

2015-02-11 Thread Sandy Ryza
Hi Anders, I just tried this out and was able to successfully acquire executors. Any strange log messages or additional color you can provide on your setup? Does yarn-client mode work? -Sandy On Wed, Feb 11, 2015 at 1:28 PM, Anders Arpteg wrote: > Hi, > > Compiled the latest master of Spark y

Re: Open file limit settings for Spark on Yarn job

2015-02-10 Thread Sandy Ryza
Hi Arun, The limit for the YARN user on the cluster nodes should be all that matters. What version of Spark are you using? If you can turn on sort-based shuffle it should solve this problem. -Sandy On Tue, Feb 10, 2015 at 1:16 PM, Arun Luthra wrote: > Hi, > > I'm running Spark on Yarn from a
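A sketch of the suggested switch (sort-based shuffle is available from Spark 1.1 and the default from 1.2; the jar name is a placeholder):

    # Sort-based shuffle opens far fewer files than hash-based shuffle.
    spark-submit --master yarn \
      --conf spark.shuffle.manager=sort \
      my_app.jar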

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-10 Thread Sandy Ryza
new StreamingContext(sparkConf, Seconds(bucketSecs)) > > val sc = new SparkContext() > > On Tue, Feb 10, 2015 at 1:02 PM, Sandy Ryza > wrote: > >> Is the SparkContext you're using the same one that the StreamingContext >> wraps? If not, I don't think using

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-10 Thread Sandy Ryza
WARN YarnClusterScheduler: Initial job has not accepted > any resources; check your cluster UI to ensure that workers are registered > and have sufficient memory 15/02/10 12:09:21 WARN YarnClusterScheduler: Initial job has not accepted > any resources; check your cluster UI t

Re: Resource allocation in yarn-cluster mode

2015-02-10 Thread Sandy Ryza
Hi Zsolt, spark.executor.memory, spark.executor.cores, and spark.executor.instances are only honored when launching through spark-submit. Marcelo is working on a Spark launcher (SPARK-4924) that will enable using these programmatically. That's correct that the error comes up when yarn.scheduler.

Re: advice on diagnosing Spark stall for 1.5hr out of 3.5hr job?

2015-02-08 Thread Sandy Ryza
adoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue > (ResourceManager Event Processor): Reserved container > application=application_1422834185427_0088 resource= > queue=default: capacity=1.0, absoluteCapacity=1.0, > usedResources=usedCapacity=0.982, > absoluteUsedCapacity=0.9822222, numApps=1, numContainers=26 &

Re: Spark impersonation

2015-02-07 Thread Sandy Ryza
https://issues.apache.org/jira/browse/SPARK-5493 currently tracks this. -Sandy On Mon, Feb 2, 2015 at 9:37 PM, Zhan Zhang wrote: > I think you can configure hadoop/hive to do impersonation. There is no > difference between secure or insecure hadoop cluster by using kinit. > > Thanks. > > Zh

Re: getting error when submit spark with master as yarn

2015-02-07 Thread Sandy Ryza
Hi Sachin, In your YARN configuration, either yarn.nodemanager.resource.memory-mb is 1024 on your nodes or yarn.scheduler.maximum-allocation-mb is set to 1024. If you have more than 1024 MB on each node, you should bump these properties. Otherwise, you should request fewer resources by setting --
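The two YARN properties in question live in yarn-site.xml; a sketch with illustrative values:

    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>8192</value>  <!-- memory each NodeManager offers to containers -->
    </property>
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>8192</value>  <!-- largest single container YARN will grant -->
    </property>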

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-06 Thread Sandy Ryza
spark.rdd.RDD[String]". > > Leaving it as an RDD and then constantly joining I think will be too slow > for a streaming job. > > On Thu, Feb 5, 2015 at 8:06 PM, Sandy Ryza > wrote: > >> Hi Jon, >> >> You'll need to put the file on HDFS (or whatever distribu

Re: Problems with GC and time to execute with different number of executors.

2015-02-06 Thread Sandy Ryza
:8020/tmp/sparkTest/ file22.bin > parameters > > This is what I executed with different values in num-executors and > executor-memory. > Do you think there are too many executors for those HDDs? Could > that be the reason each executor takes more time? > > 2015-02-06 9:36

Re: Problems with GC and time to execute with different number of executors.

2015-02-06 Thread Sandy Ryza
ng.saveAsTextFile(url + "/output/" + System.currentTimeMillis()+ > "/"); > > The parse function just takes an array of bytes and applies some > transformations like: > [0..3] an integer, [4...20] a String, [21..27] another String and so on. > >

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-05 Thread Sandy Ryza
Hi Jon, You'll need to put the file on HDFS (or whatever distributed filesystem you're running on) and load it from there. -Sandy On Thu, Feb 5, 2015 at 3:18 PM, YaoPau wrote: > I have a file "badFullIPs.csv" of bad IP addresses used for filtering. In > yarn-client mode, I simply read it off
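A sketch of that pattern, reusing the filename from the thread (the events stream and its ip field are hypothetical):

    // Read the small reference file from HDFS on the driver, then broadcast it.
    val badIPs = sc.textFile("hdfs:///ref/badFullIPs.csv").collect().toSet
    val badIPsBc = sc.broadcast(badIPs)
    // `events` is a hypothetical RDD/DStream of records with an `ip` field.
    val filtered = events.filter(e => !badIPsBc.value.contains(e.ip))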

Re: Problems with GC and time to execute with different number of executors.

2015-02-04 Thread Sandy Ryza
Hi Guillermo, What exactly do you mean by "each iteration"? Are you caching data in memory? -Sandy On Wed, Feb 4, 2015 at 5:02 AM, Guillermo Ortiz wrote: > I execute a job in Spark where I'm processing a file of 80Gb in HDFS. > I have 5 slaves: > (32cores /256Gb / 7physical disks) x 5 > > I h

Re: advice on diagnosing Spark stall for 1.5hr out of 3.5hr job?

2015-02-04 Thread Sandy Ryza
Also, do you see any lines in the YARN NodeManager logs where it says that it's killing a container? -Sandy On Wed, Feb 4, 2015 at 8:56 AM, Imran Rashid wrote: > Hi Michael, > > judging from the logs, it seems that those tasks are just working a really > long time. If you have long running tas

Re: running 2 spark applications in parallel on yarn

2015-02-01 Thread Sandy Ryza
Hi Tomer, Are you able to look in your NodeManager logs to see if the NodeManagers are killing any executors for exceeding memory limits? If you observe this, you can solve the problem by bumping up spark.yarn.executor.memoryOverhead. -Sandy On Sun, Feb 1, 2015 at 5:28 AM, Tomer Benyamini wrot

Re: Duplicate key when sorting BytesWritable with Kryo?

2015-01-30 Thread Sandy Ryza
rd rather than holding many in memory at once). The documentation > should be updated. > > On Fri, Jan 30, 2015 at 11:27 AM, Sandy Ryza > wrote: > >> Hi Andrew, >> >> Here's a note from the doc for sequenceFile: >> >> * '''Note:&

Re: Duplicate key when sorting BytesWritable with Kryo?

2015-01-30 Thread Sandy Ryza
Hi Andrew, Here's a note from the doc for sequenceFile: * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each * record, directly caching the returned RDD will create many references to the same object. * If you plan to directly cache Hadoop writab

Re: HW imbalance

2015-01-30 Thread Sandy Ryza
ase memory, > the more jobs you can run. > > This is of course assuming you could over subscribe a node in terms of cpu > cores if you have memory available. > > YMMV > > HTH > -Mike > > On Jan 30, 2015, at 7:10 AM, Sandy Ryza wrote: > > My answer was based off t

Re: HW imbalance

2015-01-29 Thread Sandy Ryza
I understood the question raised by the OP, it's more about a > heterogeneous cluster than Spark. > > -Mike > > On Jan 26, 2015, at 5:02 PM, Sandy Ryza wrote: > > Hi Antony, > > Unfortunately, all executors for any single Spark application must have > the same

Re: RDD caching, memory & network input

2015-01-28 Thread Sandy Ryza
Hi Fanilo, How many cores are you using per executor? Are you aware that you can combat the "container is running beyond physical memory limits" error by bumping the spark.yarn.executor.memoryOverhead property? Also, are you caching the parsed version or the text? -Sandy On Wed, Jan 28, 2015 a

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-01-27 Thread Sandy Ryza
Hi Antony, If you look in the YARN NodeManager logs, do you see that it's killing the executors? Or are they crashing for a different reason? -Sandy On Tue, Jan 27, 2015 at 12:43 PM, Antony Mayi wrote: > Hi, > > I am using spark.yarn.executor.memoryOverhead=8192 yet getting executors > crashe
