Re: Quick one... AWS SDK version?

2017-10-08 Thread Jonathan Kelly
's ur usecase? are you writing to S3? you could use > Spark to do that, e.g. using hadoop package > org.apache.hadoop:hadoop-aws:2.7.1 ..that will download the aws client > which is in line with hadoop 2.7.1? > > hth > marco > > On Fri, Oct 6, 2017 at 10:58 PM, Jonathan Kel
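Marco's suggestion, sketched as a shell command. The package coordinates come from the thread; the sketch only prints the command (running it requires a Spark installation), and the 2.7.1 version should be adjusted to match your cluster's Hadoop release.

```shell
# hadoop-aws pulls in the matching AWS SDK client as a transitive dependency,
# keeping it in line with the named Hadoop version (2.7.1 in the thread).
HADOOP_AWS_PKG="org.apache.hadoop:hadoop-aws:2.7.1"
# Printed rather than executed so the sketch is side-effect free:
echo "spark-shell --packages ${HADOOP_AWS_PKG}"
```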

Re: Quick one... AWS SDK version?

2017-10-06 Thread Jonathan Kelly
Note: EMR builds Hadoop, Spark, et al, from source against specific versions of certain packages like the AWS Java SDK, httpclient/core, Jackson, etc., sometimes requiring some patches in these applications in order to work with versions of these dependencies that differ from what the applications

Re: RDD blocks on Spark Driver

2017-02-28 Thread Jonathan Kelly
Prithish, It would be helpful for you to share the spark-submit command you are running. ~ Jonathan On Sun, Feb 26, 2017 at 8:29 AM Prithish wrote: > Thanks for the responses, I am running this on Amazon EMR which runs the > Yarn cluster manager. > > On Sat, Feb 25, 2017 at 4:45 PM, liangyhg..
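A hypothetical example of the kind of spark-submit command Jonathan is asking Prithish to share. The application jar, class name, and resource values are placeholders, not anything from the thread; only the YARN master/deploy-mode flags are standard spark-submit usage.

```shell
# Placeholder spark-submit invocation for a Spark app on YARN (as on EMR).
# Printed rather than executed, since it needs a live cluster and a real jar.
SUBMIT_CMD="spark-submit --master yarn --deploy-mode client \
  --num-executors 4 --executor-memory 2g \
  --class com.example.MyApp myapp.jar"
echo "$SUBMIT_CMD"
```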

Re: Custom log4j.properties on AWS EMR

2017-02-28 Thread Jonathan Kelly
Prithish, I saw you posted this on SO, so I responded there just now. See http://stackoverflow.com/questions/42452622/custom-log4j-properties-on-aws-emr/42516161#42516161 In short, an hdfs:// path can't be used to configure log4j because log4j knows nothing about hdfs. Instead, since you are usin
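A minimal sketch of the alternative hinted at here: since log4j cannot read an hdfs:// path, override log4j settings through EMR's configuration API so they land in the local log4j.properties on every node. The "spark-log4j" classification is EMR's documented mechanism; the specific property and value are examples only.

```shell
# Write an EMR configuration-classification JSON that overrides log4j locally
# on each node, instead of pointing log4j at an hdfs:// path it cannot read.
cat > log4j-config.json <<'EOF'
[
  {
    "Classification": "spark-log4j",
    "Properties": {
      "log4j.rootCategory": "WARN, console"
    }
  }
]
EOF
# Supplied at cluster creation (command echoed here, not executed):
echo "aws emr create-cluster ... --configurations file://log4j-config.json"
```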

Re: [Error:] viewing Web UI on EMR cluster

2016-09-13 Thread Jonathan Kelly
d: connect failed: Connection refused >>> channel 5: open failed: connect failed: Connection refused >>> channel 22: open failed: connect failed: Connection refused >>> channel 23: open failed: connect failed: Connection refused >>>

Re: [Error:] viewing Web UI on EMR cluster

2016-09-12 Thread Jonathan Kelly
I would not recommend opening port 50070 on your cluster, as that would give the entire world access to your data on HDFS. Instead, you should follow the instructions found here to create a secure tunnel to the cluster, through which you can proxy requests to the UIs using a browser plugin like Fox
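The secure tunnel Jonathan describes, sketched as a shell command. The key path and master DNS name are placeholders, and the local port (8157) is an arbitrary choice; the flags are standard OpenSSH dynamic port forwarding.

```shell
# Placeholder endpoint -- substitute your EMR master's public DNS name.
MASTER_DNS="ec2-203-0-113-10.compute-1.amazonaws.com"
# -N: no remote command; -D 8157: local SOCKS proxy for browser plugins
# like FoxyProxy. Printed rather than executed (needs a live cluster).
TUNNEL_CMD="ssh -i ~/mykey.pem -N -D 8157 hadoop@${MASTER_DNS}"
echo "$TUNNEL_CMD"
```

Once the tunnel is up, a SOCKS proxy plugin pointed at localhost:8157 can reach the cluster UIs without opening ports like 50070 to the world.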

Re: Unsubscribe - 3rd time

2016-06-29 Thread Jonathan Kelly
If at first you don't succeed, try, try again. But please don't. :) See the "unsubscribe" link here: http://spark.apache.org/community.html I'm not sure I've ever come across an email list that allows you to unsubscribe by responding to the list with "unsubscribe". At least, all of the Apache one

Re: Logging trait in Spark 2.0

2016-06-24 Thread Jonathan Kelly
Ted, how is that thread related to Paolo's question? On Fri, Jun 24, 2016 at 1:50 PM Ted Yu wrote: > See this related thread: > > > http://search-hadoop.com/m/q3RTtEor1vYWbsW&subj=RE+Configuring+Log4J+Spark+1+5+on+EMR+4+1+ > > On Fri, Jun 24, 2016 at 6:07 AM, Paolo Patierno > wrote: > >> Hi, >>

Re: Spark 2.0 on YARN - Files in config archive not ending up on executor classpath

2016-06-20 Thread Jonathan Kelly
g it, in case anyone else has > time to look at it before I do. > > On Mon, Jun 20, 2016 at 1:20 PM, Jonathan Kelly > wrote: > > Thanks for the confirmation! Shall I cut a JIRA issue? > > > > On Mon, Jun 20, 2016 at 10:42 AM Marcelo Vanzin > wrote: > >> > &

Re: Spark 2.0 on YARN - Files in config archive not ending up on executor classpath

2016-06-20 Thread Jonathan Kelly
6 at 7:04 AM, Jonathan Kelly > wrote: > > Does anybody have any thoughts on this? > > > > On Fri, Jun 17, 2016 at 6:36 PM Jonathan Kelly > > wrote: > >> > >> I'm trying to debug a problem in Spark 2.0.0-SNAPSHOT (commit > >> bdf5fe4143e5a

Re: Spark 2.0 on YARN - Files in config archive not ending up on executor classpath

2016-06-20 Thread Jonathan Kelly
Does anybody have any thoughts on this? On Fri, Jun 17, 2016 at 6:36 PM Jonathan Kelly wrote: > I'm trying to debug a problem in Spark 2.0.0-SNAPSHOT > (commit bdf5fe4143e5a1a393d97d0030e76d35791ee248) where Spark's > log4j.properties is not getting picked up in the exec

Re: Running Spark in local mode

2016-06-19 Thread Jonathan Kelly
Mich, what Jacek is saying is not that you implied that YARN relies on two masters. He's just clarifying that yarn-client and yarn-cluster modes are really both using the same (type of) master (simply "yarn"). In fact, if you specify "--master yarn-client" or "--master yarn-cluster", spark-submit w
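The translation Jonathan alludes to, spelled out: the old combined master strings are shorthand that spark-submit rewrites into a plain "yarn" master plus a deploy mode. Shown as printed strings only.

```shell
# The deprecated master strings and their equivalents on the "yarn" master.
MODE_CLIENT="--master yarn --deploy-mode client"
MODE_CLUSTER="--master yarn --deploy-mode cluster"
echo "spark-submit --master yarn-client   is shorthand for: $MODE_CLIENT"
echo "spark-submit --master yarn-cluster  is shorthand for: $MODE_CLUSTER"
```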

Spark 2.0 on YARN - Files in config archive not ending up on executor classpath

2016-06-17 Thread Jonathan Kelly
I'm trying to debug a problem in Spark 2.0.0-SNAPSHOT (commit bdf5fe4143e5a1a393d97d0030e76d35791ee248) where Spark's log4j.properties is not getting picked up in the executor classpath (and driver classpath for yarn-cluster mode), so Hadoop's log4j.properties file is taking precedence in the YARN

Re: Configure Spark Resource on AWS CLI Not Working

2016-03-01 Thread Jonathan Kelly
Weiwei, Please see this documentation for configuring Spark and other apps on EMR 4.x: http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-configure-apps.html This documentation about what has changed between 3.x and 4.x should also be helpful: http://docs.aws.amazon.com/ElasticMap

Re: scikit learn on EMR PySpark

2016-03-01 Thread Jonathan Kelly
Hi, Myles, We do not install scikit-learn or spark-sklearn on EMR clusters by default, but you may install them yourself by just doing "sudo pip install scikit-learn spark-sklearn" (either by ssh'ing to the master instance and running this manually, or by running it as an EMR Step). ~ Jonathan O
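The install command from the reply, echoed here rather than executed (it requires root on the EMR master, reached via ssh or run as an EMR Step).

```shell
# Installs scikit-learn and spark-sklearn on an EMR master node.
INSTALL_CMD="sudo pip install scikit-learn spark-sklearn"
echo "$INSTALL_CMD"
```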

Re: Spark-avro issue in 1.5.2

2016-02-24 Thread Jonathan Kelly
This error is likely due to EMR including some Hadoop lib dirs in spark.{driver,executor}.extraClassPath. (Hadoop bundles an older version of Avro than what Spark uses, so you are probably getting bitten by this Avro mismatch.) We determined that these Hadoop dirs are not actually necessary to inc

Re: Error :Type mismatch error when passing hdfs file path to spark-csv load method

2016-02-21 Thread Jonathan Kelly
On the line preceding the one that the compiler is complaining about (which doesn't actually have a problem in itself), you declare df as "df"+fileName, making it a string. Then you try to assign a DataFrame to df, but it's already a string. I don't quite understand your intent with that previous l

Re: Memory issues on spark

2016-02-17 Thread Jonathan Kelly
(I'm not 100% sure, but...) I think the SPARK_EXECUTOR_* environment variables are intended to be used with Spark Standalone. Even if not, I'd recommend setting the corresponding properties in spark-defaults.conf rather than in spark-env.sh. For example, you may use the following Configuration obj
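A sketch of the kind of Configuration object Jonathan means: set the executor properties in spark-defaults.conf through EMR's "spark-defaults" classification rather than via SPARK_EXECUTOR_* variables in spark-env.sh. The property values here are placeholders.

```shell
# EMR configuration object mapping the SPARK_EXECUTOR_* env vars to their
# spark-defaults.conf equivalents (example values only).
cat > spark-config.json <<'EOF'
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.memory": "4g",
      "spark.executor.cores": "2"
    }
  }
]
EOF
# Supplied at cluster creation (command echoed here, not executed):
echo "aws emr create-cluster ... --configurations file://spark-config.json"
```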

Re: AM creation in yarn-client mode

2016-02-09 Thread Jonathan Kelly
In yarn-client mode, the driver is separate from the AM. The AM is created in YARN, and YARN controls where it goes (though you can somewhat control it using YARN node labels--I just learned earlier today in a different thread on this list that this can be controlled by spark.yarn.am.labelExpressio

Re: Dataframe, Spark SQL - Drops First 8 Characters of String on Amazon EMR

2016-01-28 Thread Jonathan Kelly
Just FYI, Spark 1.6 was released on emr-4.3.0 a couple days ago: https://aws.amazon.com/blogs/aws/emr-4-3-0-new-updated-applications-command-line-export/ On Thu, Jan 28, 2016 at 7:30 PM Andrew Zurn wrote: > Hey Daniel, > > Thanks for the response. > > After playing around for a bit, it looks like

Re: Terminating Spark Steps in AWS

2016-01-26 Thread Jonathan Kelly
Daniel, The "hadoop job -list" command is a deprecated form of "mapred job -list", which is only for Hadoop MapReduce jobs. For Spark jobs, which run on YARN, you instead want "yarn application -list". Hope this helps, Jonathan (from the EMR team) On Tue, Jan 26, 2016 at 10:05 AM Daniel Imberman
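The YARN equivalents of the deprecated "hadoop job" commands, printed as a reminder rather than executed (they need a running YARN cluster; the application id is a placeholder).

```shell
# List running YARN applications (Spark jobs included), then kill one by id.
LIST_CMD="yarn application -list"
KILL_CMD="yarn application -kill <application-id>"
echo "$LIST_CMD"
echo "$KILL_CMD"
```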

Re: Read from AWS s3 with out having to hard-code sensitive keys

2016-01-11 Thread Jonathan Kelly
Yes, IAM roles are actually required now for EMR. If you use Spark on EMR (vs. just EC2), you get S3 configuration for free (it goes by the name EMRFS), and it will use your IAM role for communicating with S3. Here is the corresponding documentation: http://docs.aws.amazon.com/ElasticMapReduce/late

Re: Discover SparkUI port for spark streaming job running in cluster mode

2015-12-14 Thread Jonathan Kelly
mat depends on the scheduler implementation. >* (i.e. >* in case of local spark app something like 'local-1433865536131' >* in case of YARN something like 'application_1433865536131_34483' >* ) >*/ > def applicationId: String = _applicationId >

Re: Discover SparkUI port for spark streaming job running in cluster mode

2015-12-14 Thread Jonathan Kelly
Are you running Spark on YARN? If so, you can get to the Spark UI via the YARN ResourceManager. Each running Spark application will have a link on the YARN ResourceManager labeled "ApplicationMaster". If you click that, it will take you to the Spark UI, even if it is running on a slave node in the

Re: spark-ec2 vs. EMR

2015-12-04 Thread Jonathan Kelly
-subscribed too, so hopefully this works... On Wednesday, December 2, 2015, Jonathan Kelly wrote: > EMR is currently running a private preview of an upcoming feature allowing > EMR clusters to be launched in VPC private subnets. This will allow you to > launch a cluster in a subnet wit

Re: spark-ec2 vs. EMR

2015-12-02 Thread Jonathan Kelly
EMR is currently running a private preview of an upcoming feature allowing EMR clusters to be launched in VPC private subnets. This will allow you to launch a cluster in a subnet without an Internet Gateway attached. Please contact jonfr...@amazon.com if you would like more information. ~ Jonathan

Re: Spark Tasks on second node never return in Yarn when I have more than 1 task node

2015-11-19 Thread Jonathan Kelly
I don't know if this actually has anything to do with why your job is hanging, but since you are using EMR you should probably not set those fs.s3 properties but rather let it use EMRFS, EMR's optimized Hadoop FileSystem implementation for interacting with S3. One benefit is that it will automatica

Re: Configuring Log4J (Spark 1.5 on EMR 4.1)

2015-11-19 Thread Jonathan Kelly
This file only exists on the master and not the slave nodes, so you are probably running into https://issues.apache.org/jira/browse/SPARK-11105, which has already been fixed in the not-yet-released Spark 1.6.0. EMR will upgrade to Spark 1.6.0 once it is released. ~ Jonathan On Thu, Nov 19, 2015 a

Re: spark-submit stuck and no output in console

2015-11-16 Thread Jonathan Kelly
He means for you to use jstack to obtain a stacktrace of all of the threads. Or are you saying that the Java process never even starts? On Mon, Nov 16, 2015 at 7:48 AM, Kayode Odeyemi wrote: > Spark 1.5.1 > > The fact is that there's no stack trace. No output from that command at > all to the co
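How the jstack suggestion plays out in practice, assuming a JDK is installed: jps lists JVM process ids and jstack dumps every thread's stack. Printed rather than executed, since both need a live JVM to inspect.

```shell
# Find the stuck SparkSubmit JVM, then dump its thread stacks.
JPS_CMD="jps -lm"
JSTACK_CMD="jstack <pid>"
echo "$JPS_CMD      # lists JVM PIDs with main class and arguments"
echo "$JSTACK_CMD   # dumps all thread stacks; look for where it is blocked"
```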

Re: Spark 1.5.1 Dynamic Resource Allocation

2015-11-09 Thread Jonathan Kelly
Tom, You might be hitting https://issues.apache.org/jira/browse/SPARK-10790, which was introduced in Spark 1.5.0 and fixed in 1.5.2. Spark 1.5.2 just passed release candidate voting, so it should be tagged, released and announced soon. If you are able to build from source yourself and run with tha

Re: Spark EC2 script on Large clusters

2015-11-05 Thread Jonathan Kelly
Christian, Is there anything preventing you from using EMR, which will manage your cluster for you? Creating large clusters would take mins on EMR instead of hours. Also, EMR supports growing your cluster easily and recently added support for shrinking your cluster gracefully (even while jobs are

Re: Why is the Spark Web GUI failing with JavaScript "Uncaught SyntaxError"?

2015-10-14 Thread Jonathan Kelly
> public IP; replacing IPs brings me to the same Spark GUI. > > Joshua > > On Tue, Oct 13, 2015 at 6:23 PM, Jonathan Kelly > wrote: >> Joshua, >> >> Since Spark is configured to run on YARN in EMR, instead

Re: Why is the Spark Web GUI failing with JavaScript "Uncaught SyntaxError"?

2015-10-13 Thread Jonathan Kelly
Joshua, Since Spark is configured to run on YARN in EMR, instead of viewing the Spark application UI at port 4040, you should instead start from the YARN ResourceManager (on port 8088), then click on the ApplicationMaster link for the Spark application you are interested in. This will take you to

Re: Spark 1.5.0 on YARN dynamicAllocation - Initial job has not accepted any resources

2015-09-24 Thread Jonathan Kelly
I cut https://issues.apache.org/jira/browse/SPARK-10790 for this issue. On Wed, Sep 23, 2015 at 8:38 PM, Jonathan Kelly wrote: > AHA! I figured it out, but it required some tedious remote debugging of > the Spark ApplicationMaster. (But now I understand the Spark codebase a > little be

Re: Spark 1.5.0 on YARN dynamicAllocation - Initial job has not accepted any resources

2015-09-23 Thread Jonathan Kelly
equest any executors and will just hang indefinitely. I can't seem to find a JIRA for this, so shall I file one, or has anybody else seen anything like this? ~ Jonathan On Wed, Sep 23, 2015 at 7:08 PM, Jonathan Kelly wrote: > Another update that doesn't make much sense: > > Th

Re: Spark 1.5.0 on YARN dynamicAllocation - Initial job has not accepted any resources

2015-09-23 Thread Jonathan Kelly
ll do work. ~ Jonathan On Wed, Sep 23, 2015 at 6:22 PM, Jonathan Kelly wrote: > Thanks for the quick response! > > spark-shell is indeed using yarn-client. I forgot to mention that I also > have "spark.master yarn-client" in my spark-defaults.conf file too. > > T

Re: Spark 1.5.0 on YARN dynamicAllocation - Initial job has not accepted any resources

2015-09-23 Thread Jonathan Kelly
pool is the spark shell being put into? (You can see this through the > YARN UI under scheduler) > > Are you certain you're starting spark-shell up on YARN? By default it uses > a local spark executor, so if it "just works" then it's because it's not > using d

Spark 1.5.0 on YARN dynamicAllocation - Initial job has not accepted any resources

2015-09-23 Thread Jonathan Kelly
I'm running into a problem with YARN dynamicAllocation on Spark 1.5.0 after using it successfully on an identically configured cluster with Spark 1.4.1. I'm getting the dreaded warning "YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers a