Re: No event log in /tmp/spark-events

2016-03-08 Thread Andrew Or
Hi Patrick, I think he means just write `/tmp/sparkserverlog` instead of `file:/tmp/sparkserverlog`. However, I think both should work. What mode are you running in, client mode (the default) or cluster mode? If the latter, your driver will be run on the cluster, and so your event logs won't be on

Re: Using dynamic allocation and shuffle service in Standalone Mode

2016-03-08 Thread Andrew Or
Hi Yuval, if you start the Workers with `spark.shuffle.service.enabled = true` then the workers will each start a shuffle service automatically. No need to start the shuffle services yourself separately. -Andrew 2016-03-08 11:21 GMT-08:00 Silvio Fiorito : > There’s a script to start it up under
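
For illustration, a minimal application-side sketch of the combination this reply describes (app name, master URL, and values are assumptions; each Worker also needs `spark.shuffle.service.enabled=true` in its own configuration when it starts):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: standalone dynamic allocation; the Workers host the shuffle service.
    val conf = new SparkConf()
      .setAppName("dynamic-allocation-demo")          // hypothetical
      .setMaster("spark://master-host:7077")          // hypothetical master URL
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.enabled", "true")
    val sc = new SparkContext(conf)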

Re: automatically unpersist RDDs which are not used for 24 hours?

2016-01-13 Thread Andrew Or
Hi Alex, Yes, you can set `spark.cleaner.ttl`: http://spark.apache.org/docs/1.6.0/configuration.html, but I would not recommend it! We are actually removing this property in Spark 2.0 because it has caused problems for many users in the past. In particular, if you accidentally use a variable that
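
For contrast, a minimal sketch of the explicit alternative to the TTL cleaner (run in a spark-shell, where `sc` already exists; the data is made up):

    // Cache, use, then release storage deterministically instead of via spark.cleaner.ttl.
    val cached = sc.parallelize(1 to 1000000).cache()
    cached.count()       // materializes the cached blocks
    // ... reuse `cached` as long as it is needed ...
    cached.unpersist()   // frees the blocks explicitly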

Re: Read Accumulator value while running

2016-01-13 Thread Andrew Or
Hi Kira, As you suspected, accumulator values are only updated after the task completes. We do send accumulator updates from the executors to the driver on periodic heartbeats, but these only concern internal accumulators, not the ones created by the user. In short, I'm afraid there is not curren

Re: Can't submit job to stand alone cluster

2015-12-30 Thread Andrew Or
ved > > Sent from my iPhone > > On Dec 29, 2015, at 2:43 PM, Annabel Melongo < > melongo_anna...@yahoo.com> wrote: > > Thanks Andrew for this awesome explanation > > > On Tuesday, December 29, 2015 5:30 PM, Andrew Or < > and...@databricks.c

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
application depends on, you should specify them through the > --jars flag using comma as a delimiter (e.g. --jars jar1,jar2). > > That can't be true; this is only the case when Spark runs on top of YARN. > Please correct me, if I'm wrong. > > Thanks >

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_running_spark_apps.html> > > > > On Tuesday, December 29, 2015 2:42 PM, Andrew Or > wrote: > > > The confusion here is the expression "standalone cluster mo

Re: Opening Dynamic Scaling Executors on Yarn

2015-12-29 Thread Andrew Or
> > External shuffle service is backward compatible, so if you deployed 1.6 > shuffle service on NM, it could serve both 1.5 and 1.6 Spark applications. Actually, it just happens to be backward compatible because we didn't change the shuffle file formats. This may not necessarily be the case movi

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
thread in ApplicationMaster; use --jars > option with a globally visible path to said jar > 3. Yarn Client-mode: client and driver run on the same machine. driver > is *NOT* a thread in ApplicationMaster; use --packages to submit a jar > > > On Tuesday, December 29, 2015 1:54 PM,

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
Hi Greg, It's actually intentional for standalone cluster mode to not upload jars. One of the reasons why YARN takes at least 10 seconds before running any simple application is because there's a lot of random overhead (e.g. putting jars in HDFS). If this missing functionality is not documented so

Re: which aws instance type for shuffle performance

2015-12-18 Thread Andrew Or
Hi Rastan, Unless you're using off-heap memory or starting multiple executors per machine, I would recommend the r3.2xlarge option, since you don't actually want gigantic heaps (100GB is more than enough). I've personally run Spark on a very large scale with r3.8xlarge instances, but I've been usi

Re: imposed dynamic resource allocation

2015-12-18 Thread Andrew Or
Hi Antony, The configuration to enable dynamic allocation is per-application. If you only wish to enable this for one of your applications, just set `spark.dynamicAllocation.enabled` to true for that application only. The way it works under the hood is that the application will start sending requests

Re: Yarn application ID for Spark job on Yarn

2015-12-18 Thread Andrew Or
Hi Roy, I believe Spark just gets its application ID from YARN, so you can just do `sc.applicationId`. -Andrew 2015-12-18 0:14 GMT-08:00 Deepak Sharma : > I have never tried this but there is yarn client api's that you can use in > your spark program to get the application id. > Here is the lin
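
For illustration, in a spark-shell (where `sc` is predefined):

    // On YARN this returns the YARN application ID.
    val appId: String = sc.applicationId   // e.g. "application_1450419037614_0001" (illustrative)
    println(appId)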

Re: Limit of application submission to cluster

2015-12-18 Thread Andrew Or
Hi Saif, have you verified that the cluster has enough resources for all 4 programs? -Andrew 2015-12-18 5:52 GMT-08:00 : > Hello everyone, > > I am testing some parallel program submission to a stand alone cluster. > Everything works alright, the problem is, for some reason, I can’t submit > mor

Re: Spark job submission REST API

2015-12-10 Thread Andrew Or
Hello, The hidden API was implemented for use internally and there are no plans to make it public at this point. It was originally introduced to provide backward compatibility in submission protocol across multiple versions of Spark. A full-fledged stable REST API for submitting applications would

Re: Warning: Master endpoint spark://ip:7077 was not a REST server. Falling back to legacy submission gateway instead.

2015-12-10 Thread Andrew Or
Hi Andy, You must be running in cluster mode. The Spark Master accepts client mode submissions on port 7077 and cluster mode submissions on port 6066. This is because standalone cluster mode uses a REST API to submit applications by default. If you submit to port 6066 instead the warning should go

Re: create a table for csv files

2015-11-19 Thread Andrew Or
There's not an easy way. The closest thing you can do is: import org.apache.spark.sql.functions._ val df = ... df.withColumn("id", monotonicallyIncreasingId()) -Andrew 2015-11-19 8:23 GMT-08:00 xiaohe lan : > Hi, > > I have some csv file in HDFS with headers like col1, col2, col3, I want to >
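
A minimal sketch of the suggestion, assuming a spark-shell with `sqlContext` available (column names and rows are made up):

    import org.apache.spark.sql.functions._

    // Add a unique (monotonically increasing, not consecutive) id column.
    val df = sqlContext.createDataFrame(Seq(("a", 1), ("b", 2), ("c", 3)))
      .toDF("col1", "col2")
    val withId = df.withColumn("id", monotonicallyIncreasingId())
    withId.show()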

Re: Spark 1.5.1 Dynamic Resource Allocation

2015-11-09 Thread Andrew Or
Hi Tom, I believe a workaround is to set `spark.dynamicAllocation.initialExecutors` to 0. As others have mentioned, from Spark 1.5.2 onwards this should no longer be necessary. -Andrew 2015-11-09 8:19 GMT-08:00 Jonathan Kelly : > Tom, > > You might be hitting https://issues.apache.org/jira/brow

Re: [Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2015-10-05 Thread Andrew Or
Hi all, Both the history server and the shuffle service are backward compatible, but not forward compatible. This means as long as you have the latest version of history server / shuffle service running in your cluster then you're fine (you don't need multiple of them). That said, an old shuffle

Re: Why are executors on slave never used?

2015-09-21 Thread Andrew Or
Hi Joshua, What cluster manager are you using, standalone or YARN? (Note that standalone here does not mean local mode). If standalone, you need to do `setMaster("spark://[CLUSTER_URL]:7077")`, where CLUSTER_URL is the machine that started the standalone Master. If YARN, you need to do `setMaster
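
For illustration, the standalone variant might look like this (host name and app name hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cluster-demo")
      .setMaster("spark://master-host:7077")   // the machine that started the Master; use "yarn-client" for YARN in Spark 1.x
    val sc = new SparkContext(conf)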

Re: Spark ec2 lunch problem

2015-08-24 Thread Andrew Or
Hey Garry, Have you verified that your particular VPC and subnet are open to the world? In particular, have you verified the route table attached to your VPC / subnet contains an internet gateway open to the public? I've run into this issue myself recently and that was the problem for me. -Andre

Re: DAG related query

2015-08-20 Thread Andrew Or
Hi Bahubali, Once RDDs are created, they are immutable (in most cases). In your case you end up with 3 RDDs: (1) the original rdd1 that reads from the text file (2) rdd2, that applies a map function on (1), and (3) the new rdd1 that applies a map function on (2) There's no cycle because you have
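
A sketch of the lineage described (the transformations and path are illustrative):

    // (1) the original rdd1, (2) rdd2 derived from it, (3) a brand-new RDD that
    // merely reuses the variable name -- so the DAG stays acyclic.
    var rdd1 = sc.textFile("hdfs:///path/to/input.txt")   // hypothetical path
    val rdd2 = rdd1.map(_.toLowerCase)
    rdd1 = rdd2.map(_.trim)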

Re: Why use spark.history.fs.logDirectory instead of spark.eventLog.dir

2015-08-19 Thread Andrew Or
Hi Canan, The event log dir is a per-application setting whereas the history server is an independent service that serves history UIs from many applications. If you use history server locally then the `spark.history.fs.logDirectory` will happen to point to `spark.eventLog.dir`, but the use case it

Re: how do I execute a job on a single worker node in standalone mode

2015-08-19 Thread Andrew Or
rything on 1 node, it looks like it's not grabbing the extra nodes. > > On Wed, Aug 19, 2015 at 8:43 AM, Axel Dahl wrote: > >> That worked great, thanks Andrew. >> >> On Tue, Aug 18, 2015 at 1:39 PM, Andrew Or wrote: >> >>> Hi Axel, >>> >

Re: Difference between Sort based and Hash based shuffle

2015-08-19 Thread Andrew Or
> So the reason why key value pairs with same keys are always found in a > single bucket in Hash based shuffle but not in Sort is because in > sort-shuffle each mapper writes a single partitioned file, and it is up to > the reducer to fetch correct partitions from the files ? > > On

Re: dse spark-submit multiple jars issue

2015-08-18 Thread Andrew Or
Hi Satish, The problem is that `--jars` accepts a comma-delimited list of jars! E.g. spark-submit ... --jars lib1.jar,lib2.jar,lib3.jar main.jar where main.jar is your main application jar (the one that starts a SparkContext), and lib*.jar refer to additional libraries that your main application

Re: Difference between Sort based and Hash based shuffle

2015-08-18 Thread Andrew Or
Hi Muhammad, On a high level, in hash-based shuffle each mapper M writes R shuffle files, one for each reducer where R is the number of reduce partitions. This results in M * R shuffle files. Since it is not uncommon for M and R to be O(1000), this quickly becomes expensive. An optimization with h
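
The file-count arithmetic as a small worked example (numbers illustrative):

    // Hash-based shuffle: each of M mappers writes one file per reducer.
    val (m, r) = (1000, 1000)
    val hashShuffleFiles = m * r   // 1,000,000 files
    // Sort-based shuffle: one partitioned data file (plus an index file) per mapper,
    // so the count grows with M rather than M * R.
    val sortShuffleFiles = m       // 1,000 data files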

Re: Programmatically create SparkContext on YARN

2015-08-18 Thread Andrew Or
Hi Andreas, I believe the distinction is not between standalone and YARN mode, but between client and cluster mode. In client mode, your Spark submit JVM runs your driver code. In cluster mode, one of the workers (or NodeManagers if you're using YARN) in the cluster runs your driver code. In the

Re: how do I execute a job on a single worker node in standalone mode

2015-08-18 Thread Andrew Or
Hi Axel, You can try setting `spark.deploy.spreadOut` to false (through your conf/spark-defaults.conf file). What this does is essentially try to schedule as many cores on one worker as possible before spilling over to other workers. Note that you *must* restart the cluster through the sbin script
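
A sketch of the setting, borrowing the spark-env.sh convention quoted in the spark.deploy.spreadOut thread elsewhere in this list (values assumed):

    # conf/spark-env.sh on the machine that starts the standalone Master
    export SPARK_MASTER_OPTS="-Dspark.deploy.spreadOut=false"
    # then restart through the sbin scripts so the Master picks it up:
    # sbin/stop-all.sh && sbin/start-all.sh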

Re: Why standalone mode don't allow to set num-executor ?

2015-08-18 Thread Andrew Or
Hi Canan, This is mainly for legacy reasons. The default behavior in standalone mode is that the application grabs all available resources in the cluster. This effectively means we want one executor per worker, where each executor grabs all the available cores and memory on that worker. In this

Re: TestSQLContext compilation error when run SparkPi in Intellij ?

2015-08-15 Thread Andrew Or
Hi Canan, TestSQLContext is no longer a singleton but now a class. It was never meant to be a fully public API, but if you wish to use it you can just instantiate a new one: val sqlContext = new TestSQLContext or just create a new SQLContext from a SparkContext. -Andrew 2015-08-15 20:33 GMT-07:0
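
For illustration, the second suggestion (a plain SQLContext built from a SparkContext; master and app name hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("sql-test"))
    val sqlContext = new SQLContext(sc)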

Re: Spark master driver UI: How to keep it after process finished?

2015-08-08 Thread Andrew Or
Hi Saif, You need to run your application with `spark.eventLog.enabled` set to true. Then if you are using standalone mode, you can view the Master UI at port 8080. Otherwise, you may start a history server through `sbin/start-history-server.sh`, which by default starts the history UI at port 1808
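
A minimal sketch of the configuration described (the log directory is hypothetical and must exist before the app starts):

    val conf = new org.apache.spark.SparkConf()
      .setAppName("logged-app")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///spark-events")   // assumed location
    // Afterwards: the standalone Master UI (port 8080) can rebuild the finished UI,
    // or run sbin/start-history-server.sh (history UI on port 18080 by default).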

Re: No event logs in yarn-cluster mode

2015-08-01 Thread Andrew Or
Hi Akmal, It might be on HDFS, since you provided the path /opt/spark/spark-events to `spark.eventLog.dir` without a scheme (e.g. file:). -Andrew 2015-08-01 9:25 GMT-07:00 Akmal Abbasov : > Hi, I am trying to configure a history server for application. > When I running locally(./run-example SparkPi), the event logs a

Re: spark.executor.memory and spark.driver.memory have no effect in yarn-cluster mode (1.4.x)?

2015-07-22 Thread Andrew Or
Hi Michael, In general, driver related properties should not be set through the SparkConf. This is because by the time the SparkConf is created, we have already started the driver JVM, so it's too late to change the memory, class paths and other properties. In cluster mode, executor related prope

Re: How to share a Map among RDDS?

2015-07-22 Thread Andrew Or
Cheers, > Dan > > > > 2015-07-22 2:20 GMT-05:00 Andrew Or : > >> Hi Dan, >> >> If the map is small enough, you can just broadcast it, can't you? It >> doesn't have to be an RDD. Here's an example of broadcasting an array and >> using it o

Re: spark.deploy.spreadOut core allocation

2015-07-22 Thread Andrew Or
22 11:49 GMT-07:00 Andrew Or : > Hi Srikanth, > > It does look like a bug. Did you set `spark.executor.cores` in your > application by any chance? > > -Andrew > > 2015-07-22 8:05 GMT-07:00 Srikanth : > >> Hello, >> >> I've set spark.deploy.spreadOut

Re: spark.deploy.spreadOut core allocation

2015-07-22 Thread Andrew Or
Hi Srikanth, It does look like a bug. Did you set `spark.executor.cores` in your application by any chance? -Andrew 2015-07-22 8:05 GMT-07:00 Srikanth : > Hello, > > I've set spark.deploy.spreadOut=false in spark-env.sh. > >> export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4 >> -Dspark.de

Re: Which memory fraction is Spark using to compute RDDs that are not going to be persisted

2015-07-22 Thread Andrew Or
Hi, It would be whatever's left in the JVM. This is not explicitly controlled by a fraction like storage or shuffle. However, the computation usually doesn't need to use that much space. In my experience it's almost always the caching or the aggregation during shuffles that's the most memory inten

Re: How to share a Map among RDDS?

2015-07-22 Thread Andrew Or
Hi Dan, If the map is small enough, you can just broadcast it, can't you? It doesn't have to be an RDD. Here's an example of broadcasting an array and using it on the executors: https://github.com/apache/spark/blob/c03299a18b4e076cabb4b7833a1e7632c5c0dabe/examples/src/main/scala/org/apache/spark/e
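
A compact sketch of that pattern, assuming a spark-shell `sc` (the map contents are made up):

    // Broadcast a small read-only Map once; every task reads it via .value.
    val lookup = Map("a" -> 1, "b" -> 2, "c" -> 3)
    val bc = sc.broadcast(lookup)
    val total = sc.parallelize(Seq("a", "b", "a", "c"))
      .map(k => bc.value.getOrElse(k, 0))
      .reduce(_ + _)   // 1 + 2 + 1 + 3 = 7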

Re: Spark spark.shuffle.memoryFraction has no affect

2015-07-22 Thread Andrew Or
Hi, The setting of 0.2 / 0.6 looks reasonable to me. Since you are not using caching at all, have you tried trying something more extreme, like 0.1 / 0.9? Since disabling spark.shuffle.spill didn't cause an OOM this setting should be fine. Also, one thing you could do is to verify the shuffle byte

Re: The auxService:spark_shuffle does not exist

2015-07-21 Thread Andrew Or
Hi Andrew, Based on your driver logs, it seems the issue is that the shuffle service is actually not running on the NodeManagers, but your application is trying to provide a "spark_shuffle" secret anyway. One way to verify whether the shuffle service is actually started is to look at the NodeManag

Re: The auxService:spark_shuffle does not exist

2015-07-17 Thread Andrew Or
Hi all, Did you forget to restart the node managers after editing yarn-site.xml by any chance? -Andrew 2015-07-17 8:32 GMT-07:00 Andrew Lee : > I have encountered the same problem after following the document. > > Here's my spark-defaults.conf > > spark.shuffle.service.enabled true > spark.dyna

Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Andrew Or
Yeah, we could make it log a warning instead. 2015-07-15 14:29 GMT-07:00 Kelly, Jonathan : > Thanks! Is there an existing JIRA I should watch? > > > ~ Jonathan > > From: Sandy Ryza > Date: Wednesday, July 15, 2015 at 2:27 PM > To: Jonathan Kelly > Cc: "user@spark.apache.org" > Subject: R

Re: How to restrict disk space for spark caches on yarn?

2015-07-10 Thread Andrew Or
Hi Peter, AFAIK Spark assumes infinite disk space, so there isn't really a way to limit how much space it uses. Unfortunately I'm not aware of a simpler workaround than to simply provision your cluster with more disk space. By the way, are you sure that it's disk space that exceeded the limit, but

Re: Starting Spark-Application without explicit submission to cluster?

2015-07-10 Thread Andrew Or
Hi Jan, Most SparkContext constructors are there for legacy reasons. The point of going through spark-submit is to set up all the classpaths, system properties, and resolve URIs properly *with respect to the deployment mode*. For instance, jars are distributed differently between YARN cluster mode

Re: spark-submit

2015-07-10 Thread Andrew Or
Hi Ashutosh, I believe the class is org.apache.spark.examples.graphx.Analytics? If you're running page rank on live journal you could just use org.apache.spark.examples.graphx.LiveJournalPageRank. -Andrew 2015-07-10 3:42 GMT-07:00 AshutoshRaghuvanshi < ashutosh.raghuvans...@gmail.com>: > when

Re: Spark serialization in closure

2015-07-09 Thread Andrew Or
Hi Chen, I believe the issue is that `object foo` is a member of `object testing`, so the only way to access `object foo` is to first pull `object testing` into the closure, then access a pointer to get to `object foo`. There are two workarounds that I'm aware of: (1) Move `object foo` outside of
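
A sketch of the kind of workaround involved, with the nesting from the thread (all names hypothetical):

    object testing {
      object foo { val value = 42 }

      def run(sc: org.apache.spark.SparkContext): Array[Int] = {
        val v = foo.value                  // copy into a local val: the closure below
        sc.parallelize(1 to 10)            // captures only `v`, not `object testing`
          .map(_ + v)
          .collect()
      }
    }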

Re: Disable heartbeat messages in REPL

2015-07-08 Thread Andrew Or
Hi Lincoln, I've noticed this myself. I believe it's a new issue that only affects local mode. I've filed a JIRA to track it: https://issues.apache.org/jira/browse/SPARK-8911 2015-07-08 14:20 GMT-07:00 Lincoln Atkinson : > Brilliant! Thanks. > > > > *From:* Feynman Liang [mailto:fli...@databrick

Re: Submitting Spark Applications using Spark Submit

2015-06-22 Thread Andrew Or
amazonaws.com/10.165.103.16:7077 > <http://ec2-XXX.compute-1.amazonaws.com/10.165.103.16:7077> I don’t > understand where it gets the 10.165.103.16 from. I > never specify that in the master url command line par

Re: PySpark on YARN "port out of range"

2015-06-22 Thread Andrew Or
is actually outputting (in hopes > that yields a clue)? > > On Jun 19, 2015, at 6:47 PM, Andrew Or wrote: > > Hm, one thing to see is whether the same port appears many times (1315905645). > The way pyspark works today is that the JVM reads the port from the stdout > of the

Re: Submitting Spark Applications using Spark Submit

2015-06-19 Thread Andrew Or
Thanks, > Raghav > > > On Friday, June 19, 2015, Andrew Or wrote: > >> Hi Raghav, >> >> If you want to make changes to Spark and run your application with it, >> you may follow these steps. >> >> 1. git clone g...@github.com:apache/spark >

Re: Submitting Spark Applications using Spark Submit

2015-06-19 Thread Andrew Or
Hi Raghav, If you want to make changes to Spark and run your application with it, you may follow these steps. 1. git clone g...@github.com:apache/spark 2. cd spark; build/mvn clean package -DskipTests [...] 3. make local changes 4. build/mvn package -DskipTests [...] (no need to clean again here)

Re: What files/folders/jars spark-submit script depend on ?

2015-06-19 Thread Andrew Or
Hi Elkhan, Spark submit depends on several things: the launcher jar (1.3.0+ only), the spark-core jar, and the spark-yarn jar (in your case). Why do you want to put it in HDFS though? AFAIK you can't execute scripts directly from HDFS; you need to copy them to a local file system first. I don't se

Re: Abount Jobs UI in yarn-client mode

2015-06-19 Thread Andrew Or
Did you make sure that the YARN IP is not an internal address? If it still doesn't work then it seems like an issue on the YARN side... 2015-06-19 8:48 GMT-07:00 Sea <261810...@qq.com>: > Hi, all: > I run spark on yarn, I want to see the Jobs UI http://ip:4040/, > but it redirect to http:// > ${

Re: Spark on Yarn - How to configure

2015-06-19 Thread Andrew Or
Hi Ashish, For Spark on YARN, you actually only need the Spark files on one machine - the submission client. This machine could even live outside of the cluster. Then all you need to do is point YARN_CONF_DIR to the directory containing your hadoop configuration files (e.g. yarn-site.xml) on that

Re: PySpark on YARN "port out of range"

2015-06-19 Thread Andrew Or
Hm, one thing to see is whether the same port appears many times (1315905645). The way pyspark works today is that the JVM reads the port from the stdout of the python process. If there is some interference in output from the python side (e.g. any print statements, exception messages), then the Jav

Re: Dynamic allocator requests -1 executors

2015-06-13 Thread Andrew Or
Hi Patrick, The fix you need is SPARK-6954: https://github.com/apache/spark/pull/5704. If possible, you may cherry-pick the following commit into your Spark deployment and it should resolve the issue: https://github.com/apache/spark/commit/98ac39d2f5828fbdad8c9a4e563ad1169e3b9948 Note that this

Re: [Spark 1.4.0]How to set driver's system property using spark-submit options?

2015-06-12 Thread Andrew Or
Hi Peng, Setting properties through --conf should still work in Spark 1.4. From the warning it looks like the config you are trying to set does not start with the prefix "spark.". What is the config that you are trying to set? -Andrew 2015-06-12 11:17 GMT-07:00 Peng Cheng : > In Spark <1.3.x, t
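
For illustration, a driver system property routed through a `spark.`-prefixed key (class, jar, and property are hypothetical):

    bin/spark-submit \
      --class com.example.Main \
      --conf spark.driver.extraJavaOptions=-Dmy.prop=someValue \
      my-app.jar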

Re: java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2015-06-01 Thread Andrew Or
Hi Deepak, This is a notorious bug that is being tracked at https://issues.apache.org/jira/browse/SPARK-4105. We have fixed one source of this bug (it turns out Snappy had a bug in buffer reuse that caused data corruption). There are other known sources that are being addressed in outstanding patc

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Andrew Or
Hi all, As the author of the dynamic allocation feature I can offer a few insights here. Gerard's explanation was both correct and concise: dynamic allocation is not intended to be used in Spark streaming at the moment (1.4 or before). This is because of two things: (1) Number of receivers is ne

Re: Which version to use for shuffle service if I'm going to run multiple versions of Spark

2015-02-11 Thread Andrew Or
Hi Jianshi, For YARN, there may be an issue with how a recent patch changes the accessibility of the shuffle files by the external shuffle service: https://issues.apache.org/jira/browse/SPARK-5655. It is likely that you will hit this with 1.2.1, actually. For this reason I would have to recommen

Re: Aggregate order semantics when spilling

2015-01-20 Thread Andrew Or
Hi Justin, I believe the intended semantics of groupByKey or cogroup is that the ordering *within a key* is not preserved if you spill. In fact, the test cases for the ExternalAppendOnlyMap only assert that the Set representation of the results is as expected (see this line

Re: spark-submit --py-files remote: "Only local additional python files are supported"

2015-01-20 Thread Andrew Or
Hi Vladimir, Yes, as the error messages suggests, PySpark currently only supports local files. This does not mean it only runs in local mode, however; you can still run PySpark on any cluster manager (though only in client mode). All this means is that your python files must be on your local file

Re: PySpark Client

2015-01-20 Thread Andrew Or
Hi Chris, Short answer is no, not yet. Longer answer is that PySpark only supports client mode, which means your driver runs on the same machine as your submission client. By corollary this means your submission client must currently depend on all of Spark and its dependencies. There is a patch t

Re: Failing jobs runs twice

2015-01-13 Thread Andrew Or
Hi Anders, are you using YARN by any chance? 2015-01-13 0:32 GMT-08:00 Anders Arpteg : > Since starting using Spark 1.2, I've experienced an annoying issue with > failing apps that gets executed twice. I'm not talking about tasks inside a > job, that should be executed multiple times before faili

Re: Elastic allocation(spark.dynamicAllocation.enabled) results in task never being executed.

2015-01-07 Thread Andrew Or
Did you end up getting it working? By the way this might be a nicer view of the docs: https://github.com/apache/spark/blob/60e2d9e2902b132b14191c9791c71e8f0d42ce9d/docs/job-scheduling.md We will update the latest Spark docs to include this shortly. -Andrew 2015-01-04 4:44 GMT-08:00 Tsuyoshi Ozawa

Re: How to increase parallelism in Yarn

2014-12-18 Thread Andrew Or
Hi Suman, I'll assume that you are using spark submit to run your application. You can pass the --num-executors flag to ask for more containers. If you want to allocate more memory for each executor, you may also pass in the --executor-memory flag (this accepts a string in the format 1g, 512m etc.
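
A sketch of those flags together (sizes, class, and jar are hypothetical):

    bin/spark-submit \
      --master yarn-client \
      --num-executors 10 \
      --executor-memory 4g \
      --executor-cores 4 \
      --class com.example.App \
      app.jar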

Re: Standalone Spark program

2014-12-18 Thread Andrew Or
Hey Akshat, What is the class that is not found, is it a Spark class or classes that you define in your own application? If the latter, then Akhil's solution should work (alternatively you can also pass the jar through the --jars command line option in spark-submit). If it's a Spark class, howeve

Re: spark-submit on YARN is slow

2014-12-05 Thread Andrew Or
>>>> in 55s but on YARN, the query was still running 30min later. Would the hard >>>> coded sleeps potentially be in play here? >>>> On Fri, Dec 5, 2014 at 11:23 Sandy Ryza >>>> wrote: >>>> >>>>> Hi Tobias, >>>>>

Re: Increasing the number of retry in case of job failure

2014-12-05 Thread Andrew Or
Increasing max failures is a way to do it, but it's probably a better idea to keep your tasks from failing in the first place. Are your tasks failing with exceptions from Spark or your application code? If from Spark, what is the stack trace? There might be a legitimate Spark bug such that even inc

Re: Any ideas why a few tasks would stall

2014-12-05 Thread Andrew Or
Hi Steve et al., It is possible that there's just a lot of skew in your data, in which case repartitioning is a good idea. Depending on how large your input data is and how much skew you have, you may want to repartition to a larger number of partitions. By the way you can just call rdd.repartitio
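
For illustration (input path and partition count are assumptions to tune per workload):

    // Spread skewed input over more partitions before the expensive stage.
    val rdd = sc.textFile("hdfs:///path/to/input")
    val evened = rdd.repartition(400)
    evened.count()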

Re: Issue in executing Spark Application from Eclipse

2014-12-05 Thread Andrew Or
Hey Stuti, Did you start your standalone Master and Workers? You can do this through sbin/start-all.sh (see http://spark.apache.org/docs/latest/spark-standalone.html). Otherwise, I would recommend launching your application from the command line through bin/spark-submit. I am not sure if we offici

Re: Monitoring Spark

2014-12-05 Thread Andrew Or
If you're only interested in a particular instant, a simpler way is to check the executors page on the Spark UI: http://spark.apache.org/docs/latest/monitoring.html. By default each executor runs one task per core, so you can see how many tasks are being run at a given time and this translates dire

Re: spark-submit on YARN is slow

2014-12-05 Thread Andrew Or
Hey Tobias, As you suspect, the reason why it's slow is because the resource manager in YARN takes a while to grant resources. This is because YARN needs to first set up the application master container, and then this AM needs to request more containers for Spark executors. I think this accounts f

Re: Unable to run applications on clusters on EC2

2014-12-05 Thread Andrew Or
Hey, the default port is 7077. Not sure if you actually meant to put 7070. As a rule of thumb, you can go to the Master web UI and copy and paste the URL at the top left corner. That almost always works unless your cluster has a weird proxy set up. 2014-12-04 14:26 GMT-08:00 Xingwei Yang : > I th

Re: Spark streaming for v1.1.1 - unable to start application

2014-12-05 Thread Andrew Or
Hey Sourav, are you able to run a simple shuffle in a spark-shell? 2014-12-05 1:20 GMT-08:00 Shao, Saisai : > Hi, > > > > I don’t think it’s a problem of Spark Streaming, seeing for call stack, > it’s the problem when BlockManager starting to initializing itself. Would > you mind checking your c

Re: Problem creating EC2 cluster using spark-ec2

2014-12-03 Thread Andrew Or
This should be fixed now. Thanks for bringing this to our attention. 2014-12-03 13:31 GMT-08:00 Andrew Or : > Yeah this is currently broken for 1.1.1. I will submit a fix later today. > > 2014-12-02 17:17 GMT-08:00 Shivaram Venkataraman < > shiva...@eecs.berkeley.edu>: > >

Re: Problem creating EC2 cluster using spark-ec2

2014-12-03 Thread Andrew Or
Yeah this is currently broken for 1.1.1. I will submit a fix later today. 2014-12-02 17:17 GMT-08:00 Shivaram Venkataraman : > +Andrew > > Actually I think this is because we haven't uploaded the Spark binaries to > cloudfront / pushed the change to mesos/spark-ec2. > > Andrew, can you take care

Re: Announcing Spark 1.1.1!

2014-12-03 Thread Andrew Or
Kuntsman, Big Data Engineer > http://www.totango.com > > On Tue, Dec 2, 2014 at 11:36 PM, Andrew Or wrote: > >> I am happy to announce the availability of Spark 1.1.1! This is a >> maintenance release with many bug fixes, most of which are concentrated in >> the

Announcing Spark 1.1.1!

2014-12-02 Thread Andrew Or
I am happy to announce the availability of Spark 1.1.1! This is a maintenance release with many bug fixes, most of which are concentrated in the core. This list includes various fixes to sort-based shuffle, memory leak, and spilling issues. Contributions from this release came from 55 developers.

Re: Spark 1.1.1 released but not available on maven repositories

2014-11-28 Thread Andrew Or
Hi Luis, There seems to be a delay in the 1.1.1 artifacts being pushed to our apache mirrors. We are working with the infra people to get them up as soon as possible. Unfortunately, due to the national holiday weekend in the US this may take a little longer than expected, however. For now you may

Re: Elastic allocation(spark.dynamicAllocation.enabled) results in task never being executed.

2014-11-14 Thread Andrew Or
Hey Egor, Have you checked the AM logs? My guess is that it threw an exception or something such that no executors (not even the initial set) have registered with your driver. You may already know this, but you can go to the http://:8088 page and click into the application to access this. Alternat

Re: No module named pyspark - latest built

2014-11-14 Thread Andrew Or
-08:00 jamborta : > it was built with 1.6 (tried 1.7, too) > > On Thu, Nov 13, 2014 at 2:52 AM, Andrew Or-2 [via Apache Spark User > List] wrote: > > > Hey Jamborta, > > > > What

Re: No module named pyspark - latest built

2014-11-12 Thread Andrew Or
Hey Jamborta, What java version did you build the jar with? 2014-11-12 16:48 GMT-08:00 jamborta : > I have figured out that building the fat jar with sbt does not seem to > included the pyspark scripts using the following command: > > sbt/sbt -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phi

Re: Yarn-Client Python

2014-10-28 Thread Andrew Or
Hey TJ, It appears that your ApplicationMaster thinks it's on the same node as your driver. Are you setting "spark.driver.host" by any chance? Can you post the value of this config here? (You can access it through the SparkUI) 2014-10-28 12:50 GMT-07:00 TJ Klein : > Hi there, > > I am trying to

Re: Spark 1.0.0 on yarn cluster problem

2014-10-23 Thread Andrew Or
Did you `export` the environment variables? Also, are you running in client mode or cluster mode? If it still doesn't work you can try to set these through the spark-submit command lines --num-executors, --executor-cores, and --executor-memory. 2014-10-23 19:25 GMT-07:00 firemonk9 : > Hi, > >

Re: Shuffle issues in the current master

2014-10-23 Thread Andrew Or
To add to Aaron's response, `spark.shuffle.consolidateFiles` only applies to hash-based shuffle, so you shouldn't have to set it for sort-based shuffle. And yes, since you changed neither `spark.shuffle.compress` nor `spark.shuffle.spill.compress` you can't possibly have run into what #2890 fixes.

Re: Setting only master heap

2014-10-23 Thread Andrew Or
Yeah, as Sameer commented, there is unfortunately not an equivalent `SPARK_MASTER_MEMORY` that you can set. You can work around this by starting the master and the slaves separately with different settings of SPARK_DAEMON_MEMORY each time. AFAIK there haven't been any major changes in the standalo

Re: how to submit multiple jar files when using spark-submit script in shell?

2014-10-17 Thread Andrew Or
Hm, it works for me. Are you sure you have provided the right jars? What happens if you pass in the `--verbose` flag? 2014-10-16 23:51 GMT-07:00 eric wong : > Hi, > > i using the comma separated style for submit multiple jar files in the > follow shell but it does not work: > > bin/spark-submit -

Re: executors not created yarn-cluster mode

2014-10-08 Thread Andrew Or
Hi Jamborta, It could be that your executors are requesting too much memory. I'm not sure why it works in client mode but not in cluster mode, however. Have you checked the RM logs for messages that complain about container memory requested being too high? How much memory is each of your container

Re: Running Spark cluster on local machine, cannot connect to master error

2014-10-08 Thread Andrew Or
Hi Russell and Theodore, This usually means your Master / Workers / client machine are running different versions of Spark. On a local machine, you may want to restart your master and workers (sbin/stop-all.sh, then sbin/start-all.sh). On a real cluster, you want to make sure that every node (incl

Re: Spark on YARN driver memory allocation bug?

2014-10-08 Thread Andrew Or
Hi Greg, It does seem like a bug. What is the particular exception message that you see? Andrew 2014-10-08 12:12 GMT-07:00 Greg Hill : > So, I think this is a bug, but I wanted to get some feedback before I > reported it as such. On Spark on YARN, 1.1.0, if you specify the > --driver-memory v

Re: anyone else seeing something like https://issues.apache.org/jira/browse/SPARK-3637

2014-10-07 Thread Andrew Or
Hi Steve, what Spark version are you running? 2014-10-07 14:45 GMT-07:00 Steve Lewis : > java.lang.NullPointerException > at java.nio.ByteBuffer.wrap(ByteBuffer.java:392) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58) > at org.apache.spark.scheduler.Task.run(Task.scala:54

Re: Is there a way to provide individual property to each Spark executor?

2014-10-02 Thread Andrew Or
Hi Vladimir, This is not currently supported, but users have asked for it in the past. I have filed an issue for it here: https://issues.apache.org/jira/browse/SPARK-3767 so we can track its progress. Andrew 2014-10-02 5:25 GMT-07:00 Vladimir Tretyakov < vladimir.tretya...@sematext.com>: > Hi,

Re: weird YARN errors on new Spark on Yarn cluster

2014-10-02 Thread Andrew Or
asted in the original email, btw. > > Greg > > From: Andrew Or > Date: Thursday, October 2, 2014 12:24 PM > To: Greg > Cc: "user@spark.apache.org" > Subject: Re: weird YARN errors on new Spark on Yarn cluster > > Hi Greg, > > Have you looked

Re: weird YARN errors on new Spark on Yarn cluster

2014-10-02 Thread Andrew Or
Hi Greg, Have you looked at the AM container logs? (You may already know this, but) you can get these through the RM web UI or through: yarn logs -applicationId If an AM throws an exception then the executors may not be started properly. -Andrew 2014-10-02 9:47 GMT-07:00 Greg Hill : > I h

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Andrew Or
Hi Tamas, Yes, Marcelo is right. The reason why it doesn't make sense to set "spark.driver.memory" in your SparkConf is because your application code, by definition, *is* the driver. This means by the time you get to the code that initializes your SparkConf, your driver JVM has already started wit
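
A sketch of setting it at launch instead, before the driver JVM exists (values and script name hypothetical):

    # driver properties must be fixed before the driver starts, so pass them
    # to spark-submit rather than setting them in SparkConf inside the script:
    bin/spark-submit --driver-memory 4g my_script.py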

Re: SPARK UI - Details post job processiong

2014-09-25 Thread Andrew Or
Hi Harsha, You can turn on `spark.eventLog.enabled` as documented here: http://spark.apache.org/docs/latest/monitoring.html. Then, if you are running standalone mode, you can access the finished SparkUI through the Master UI. Otherwise, you can start a HistoryServer to display finished UIs. -Andr

Re: clarification for some spark on yarn configuration options

2014-09-23 Thread Andrew Or
defaults.conf than the environment variables, and looking at the > code you modified, I don't see any place it's picking up > spark.driver.memory either. Is that a separate bug? > > Greg > > > From: Andrew Or > Date: Monday, September 22, 2014 8:11 PM > To:

Re: clarification for some spark on yarn configuration options

2014-09-22 Thread Andrew Or
fine. >> >> Greg >> >> From: Nishkam Ravi >> Date: Monday, September 22, 2014 3:30 PM >> To: Greg >> Cc: Andrew Or , "user@spark.apache.org" < >> user@spark.apache.org> >> >> Subject: Re: clarification for
