PySpark + virtualenv: Using a different python path on the driver and on the executors

2017-02-25 Thread Tomer Benyamini
Hello, I'm trying to run pyspark using the following setup: - spark 1.6.1 standalone cluster on ec2 - virtualenv installed on master - app is run using the following command: export PYSPARK_DRIVER_PYTHON=/path_to_virtualenv/bin/python export PYSPARK_PYTHON=/usr/bin/python /root/spark/bin/spark-
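A minimal sketch of the setup described above (paths, master URL and the app script are placeholders; the exact spark-submit command in the original message is truncated):

    # Python used by the driver (from the virtualenv on the master)
    export PYSPARK_DRIVER_PYTHON=/path_to_virtualenv/bin/python
    # Python used by the executors (system python on the slaves)
    export PYSPARK_PYTHON=/usr/bin/python
    /root/spark/bin/spark-submit --master spark://<master-host>:7077 my_app.py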

Driver zombie process (standalone cluster)

2016-06-29 Thread Tomer Benyamini
Hi, I'm trying to run spark applications on a standalone cluster running on top of AWS. Since my slaves are spot instances, in some cases they are killed and lost when the spot price exceeds my bid. When apps are running during such an event, sometimes the spark application dies - and the driver process just h

question about resource allocation on the spark standalone cluster

2015-06-30 Thread Tomer Benyamini
Hello spark-users, I would like to use the spark standalone cluster for multiple tenants, running several apps at the same time. The issue is that, when submitting an app to the spark standalone cluster, you cannot pass "--num-executors" as on yarn, only "--total-executor-cores". *This may cause sta
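On the standalone cluster an app is capped by total cores rather than by executor count, so per-app caps are roughly what is available; a hedged sketch (values and names are illustrative, not taken from the thread):

    # Cap a single app so that concurrent apps are not starved.
    # --total-executor-cores (spark.cores.max) bounds the cores this app may take;
    # --executor-memory bounds memory per executor. Values below are placeholders.
    /root/spark/bin/spark-submit \
      --master spark://<master-host>:7077 \
      --total-executor-cores 8 \
      --executor-memory 4g \
      --class com.example.MyApp my-app.jar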

running 2 spark applications in parallel on yarn

2015-02-01 Thread Tomer Benyamini
Hi all, I'm running spark 1.2.0 on a 20-node YARN EMR cluster. I've noticed that whenever I'm running a heavy computation job in parallel with other running jobs, I get these kinds of exceptions: * [task-result-getter-2] INFO org.apache.spark.scheduler.TaskSetManager- Lost task 820.0 in stage

Re: Spark UI port issue when deploying Spark driver on YARN in yarn-cluster mode on EMR

2014-12-23 Thread Tomer Benyamini
On YARN, spark does not manage the cluster; YARN does. Usually the cluster manager UI is under http://<master-hostname>:9026/cluster. I believe that it chooses the port for the spark driver UI randomly, but an easy way of accessing it is by clicking on the "Application Master" link under the "Tracking UI" colum

Re: custom spark app name in yarn-cluster mode

2014-12-15 Thread Tomer Benyamini
name there. I believe giving it with the --name property to spark-submit > should work. > > -Sandy > > On Thu, Dec 11, 2014 at 10:28 AM, Tomer Benyamini > wrote: >> On Thu, Dec 11, 2014 at 8:27 PM, Tomer Benyamini >> wrote:

Re: custom spark app name in yarn-cluster mode

2014-12-11 Thread Tomer Benyamini
On Thu, Dec 11, 2014 at 8:27 PM, Tomer Benyamini wrote: > Hi, > > I'm trying to set a custom spark app name when running a java spark app in > yarn-cluster mode. > > SparkConf sparkConf = new SparkConf(); > > sparkConf.setMaster(System.getProperty("spark.ma

custom spark app name in yarn-cluster mode

2014-12-11 Thread Tomer Benyamini
Hi, I'm trying to set a custom spark app name when running a java spark app in yarn-cluster mode. SparkConf sparkConf = new SparkConf(); sparkConf.setMaster(System.getProperty("spark.master")); sparkConf.setAppName("myCustomName"); sparkConf.set("spark.logConf", "true"); JavaSparkContext
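For context, a self-contained sketch of the code quoted above (class name and property values are placeholders). In yarn-cluster mode the application is registered with YARN before this code runs, so setAppName() in the driver does not rename the app in the ResourceManager UI; as the reply in this thread suggests, the name should instead be passed to spark-submit with --name (or spark.app.name):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class MyApp {
        public static void main(String[] args) {
            // Placeholder configuration mirroring the snippet above.
            SparkConf sparkConf = new SparkConf();
            sparkConf.setMaster(System.getProperty("spark.master"));
            // Effective in local/standalone client mode; in yarn-cluster mode the
            // YARN-visible name is fixed at submit time, so use --name instead.
            sparkConf.setAppName("myCustomName");
            sparkConf.set("spark.logConf", "true");
            JavaSparkContext sc = new JavaSparkContext(sparkConf);
            // ... job body ...
            sc.stop();
        }
    }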

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-29 Thread Tomer Benyamini
wever, you can change the FS being used like so (prior to the first >> usage): >> sc.hadoopConfiguration.set("fs.s3n.impl", >> "org.apache.hadoop.fs.s3native.NativeS3FileSystem") >> >> On Wed, Nov 26, 2014 at 1:47 AM, Tomer Benyamini >> wrote:
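A Java rendering of the suggestion quoted above, i.e. overriding the filesystem implementation bound to the s3n:// scheme before its first use (bucket and path are placeholders):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class S3ImplExample {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("s3-impl"));
            // Choose the implementation for the s3n:// scheme before the first read.
            sc.hadoopConfiguration().set("fs.s3n.impl",
                    "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
            JavaRDD<String> lines = sc.textFile("s3n://mybucket/some/path");
            System.out.println(lines.count());
            sc.stop();
        }
    }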

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-26 Thread Tomer Benyamini
Thanks Lalit; Setting the access + secret keys in the configuration works even when calling sc.textFile. Is there a way to select which hadoop s3 native filesystem implementation would be used at runtime using the hadoop configuration? Thanks, Tomer On Wed, Nov 26, 2014 at 11:08 AM, lalit1303 wr

S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-26 Thread Tomer Benyamini
Hello, I'm building a spark app that needs to read large amounts of log files from s3. I'm doing so in the code by constructing the file list and passing it to the context as follows: val myRDD = sc.textFile("s3n://mybucket/file1, s3n://mybucket/file2, ... , s3n://mybucket/fileN") When running
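For reference, a minimal Java sketch of the pattern described above - building a list of paths and handing sc.textFile a comma-separated string (paths are placeholders):

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class MultiPathRead {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("multi-path"));
            // sc.textFile accepts a comma-separated list of paths.
            List<String> paths = Arrays.asList(
                    "s3n://mybucket/file1", "s3n://mybucket/file2", "s3n://mybucket/fileN");
            JavaRDD<String> logs = sc.textFile(String.join(",", paths));
            System.out.println(logs.count());
            sc.stop();
        }
    }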

Rdd of Rdds

2014-10-22 Thread Tomer Benyamini
Hello, I would like to parallelize my work on multiple RDDs I have. I wanted to know if spark can support a "foreach" on an RDD of RDDs. Here's a java example: public static void main(String[] args) { SparkConf sparkConf = new SparkConf().setAppName("testapp"); sparkConf.setM
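Nested RDDs are not supported - an RDD cannot be created or referenced inside another RDD's transformations - so the usual alternative (a sketch, not taken from this thread) is to keep an ordinary driver-side collection of RDDs and loop over it, or union them into one RDD:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ManyRdds {
        public static void main(String[] args) {
            SparkConf sparkConf = new SparkConf().setAppName("testapp").setMaster("local");
            JavaSparkContext sc = new JavaSparkContext(sparkConf);

            // Keep a driver-side list of RDDs instead of an RDD of RDDs.
            // Input paths are placeholders.
            List<JavaRDD<String>> rdds = new ArrayList<>();
            for (String path : new String[]{"input1.txt", "input2.txt"}) {
                rdds.add(sc.textFile(path));
            }
            for (JavaRDD<String> rdd : rdds) {
                System.out.println(rdd.count());   // each action runs as a separate job
            }
            sc.stop();
        }
    }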

Spark-jobserver for java apps

2014-10-20 Thread Tomer Benyamini
Hi, I'm working on the problem of remotely submitting apps to the spark master. I'm trying to use the spark-jobserver project (https://github.com/ooyala/spark-jobserver) for that purpose. For scala apps, things look to be working smoothly, but for java apps, I have an issue with implementing t

Fwd: Cannot read from s3 using "sc.textFile"

2014-10-07 Thread Tomer Benyamini
Hello, I'm trying to read from s3 using a simple spark java app: - SparkConf sparkConf = new SparkConf().setAppName("TestApp"); sparkConf.setMaster("local"); JavaSparkContext sc = new JavaSparkContext(sparkConf); sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "XX");

Cannot read from s3 using "sc.textFile"

2014-10-07 Thread Tomer Benyamini
Hello, I'm trying to read from s3 using a simple spark java app: - SparkConf sparkConf = new SparkConf().setAppName("TestApp"); sparkConf.setMaster("local"); JavaSparkContext sc = new JavaSparkContext(sparkConf); sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "XX");
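A self-contained sketch of this kind of app, shown here with the credential properties for the s3n:// scheme (the snippet above uses the fs.s3.* variants for s3:// paths); keys, bucket and path are placeholders:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class S3ReadTest {
        public static void main(String[] args) {
            SparkConf sparkConf = new SparkConf().setAppName("TestApp").setMaster("local");
            JavaSparkContext sc = new JavaSparkContext(sparkConf);

            // Set the key pair for the scheme actually used in the path
            // (fs.s3n.* for s3n://, fs.s3.* for s3://). Values are placeholders.
            sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "XXXXXXXX");
            sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "XXXXXXXX");

            JavaRDD<String> lines = sc.textFile("s3n://mybucket/some/file.txt");
            System.out.println(lines.count());
            sc.stop();
        }
    }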

Re: MultipleTextOutputFormat with new hadoop API

2014-10-01 Thread Tomer Benyamini
Yes exactly.. so I guess this is still an open request. Any workaround? On Wed, Oct 1, 2014 at 6:04 PM, Nicholas Chammas wrote: > Are you trying to do something along the lines of what's described here? > https://issues.apache.org/jira/browse/SPARK-3533 > > On Wed, Oct 1, 2014 a
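SPARK-3533 tracks multi-path output for the new API; a commonly cited workaround (sketched here, not taken from this thread) is to stay on the old mapred API: subclass MultipleTextOutputFormat and write with saveAsHadoopFile instead of saveAsNewAPIHadoopFile:

    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    // Old ("mapred") API output format that routes each record to a directory
    // named after its key. MultipleTextOutputFormat only exists for the old API,
    // so it is used with saveAsHadoopFile rather than saveAsNewAPIHadoopFile.
    public class KeyBasedOutput extends MultipleTextOutputFormat<String, String> {
        @Override
        protected String generateFileNameForKeyValue(String key, String value, String name) {
            return key + "/" + name;   // e.g. <outputDir>/<key>/part-00000
        }
    }

    // Usage (placeholder path):
    //   outRdd.saveAsHadoopFile("/tmp", String.class, String.class, KeyBasedOutput.class);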

MultipleTextOutputFormat with new hadoop API

2014-10-01 Thread Tomer Benyamini
Hi, I'm trying to write my JavaPairRDD using saveAsNewAPIHadoopFile with MultipleTextOutputFormat: outRdd.saveAsNewAPIHadoopFile("/tmp", String.class, String.class, MultipleTextOutputFormat.class); but I'm getting this compilation error: Bound mismatch: The generic method saveAsNewAPIHadoopFil

Upgrading a standalone cluster on ec2 from 1.0.2 to 1.1.0

2014-09-15 Thread Tomer Benyamini
Hi, I would like to upgrade a standalone cluster to 1.1.0. What's the best way to do it? Should I just replace the existing /root/spark folder with the uncompressed folder from http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-cdh4.tgz ? What about hdfs and other installations? I have spark 1.

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Tomer Benyamini
ning with > datanode process > > -- > Ye Xianjin > Sent with Sparrow > > On Monday, September 8, 2014 at 11:13 PM, Tomer Benyamini wrote: > > Still no luck, even when running stop-all.sh followed by start-all.sh. > > On Mon, Sep 8, 2014 at 5:57 PM, Nicholas Chammas

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Tomer Benyamini
On Mon, Sep 8, 2014 at 3:28 AM, Tomer Benyamini wrote: >> >> ~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2; >> >> I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and >> ~/ephemeral-hdfs/sbin/start-dfs.sh, but still getting the same err

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Tomer Benyamini
org.apache.hadoop.tools.DistCp.main(DistCp.java:374) Any idea? Thanks! Tomer On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen wrote: > If I recall, you should be able to start Hadoop MapReduce using > ~/ephemeral-hdfs/sbin/start-mapred.sh. > > On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini wrote:

Re: distcp on ec2 standalone spark cluster

2014-09-07 Thread Tomer Benyamini
Do you have a mapreduce > cluster on your hdfs? > And from the error message, it seems that you didn't specify your jobtracker > address. > > -- > Ye Xianjin > Sent with Sparrow > > On Sunday, September 7, 2014 at 9:42 PM, Tomer Benyamini wrote: > > Hi, > >

distcp on ec2 standalone spark cluster

2014-09-07 Thread Tomer Benyamini
Hi, I would like to copy log files from s3 to the cluster's ephemeral-hdfs. I tried to use distcp, but I guess mapred is not running on the cluster - I'm getting the exception below. Is there a way to activate it, or is there a spark alternative to distcp? Thanks, Tomer mapreduce.Cluster (Clust
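distcp runs as a MapReduce job, so the job tracker shipped with the ephemeral-hdfs install has to be started first; a hedged sketch along the lines of the replies above (script locations vary between spark-ec2 / Hadoop versions, and bucket, paths and keys are placeholders):

    # Start the MapReduce daemons for the ephemeral HDFS (the replies above
    # mention both bin/ and sbin/ depending on the spark-ec2 version).
    ~/ephemeral-hdfs/sbin/start-mapred.sh   # or ~/ephemeral-hdfs/bin/start-mapred.sh

    # Then copy from S3 into the cluster's ephemeral HDFS.
    ~/ephemeral-hdfs/bin/hadoop distcp \
      -Dfs.s3n.awsAccessKeyId=XXXX -Dfs.s3n.awsSecretAccessKey=XXXX \
      s3n://mybucket/logs/ hdfs:///data/logs/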

Re: Adding quota to the ephemeral hdfs on a standalone spark cluster on ec2

2014-09-07 Thread Tomer Benyamini
Thanks! I found the hdfs ui via this port - http://[master-ip]:50070/. It shows only 1 hdfs datanode, although I have 4 slaves in my cluster. Any idea why? On Sun, Sep 7, 2014 at 4:29 PM, Ognen Duzlevski wrote: > > On 9/7/2014 7:27 AM, Tomer Benyamini wrote: >> >> 2. What shoul

Adding quota to the ephemeral hdfs on a standalone spark cluster on ec2

2014-09-07 Thread Tomer Benyamini
Hi, I would like to make sure I'm not exceeding the quota on the local cluster's hdfs. I have a couple of questions: 1. How do I know the quota? Here's the output of hadoop fs -count -q, which essentially does not tell me a lot: [root@ip-172-31-7-49 ~]$ hadoop fs -count -q / 2147483647 21474
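For reference, the columns of hadoop fs -count -q are QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME; the 2147483647 shown on / is the default namespace quota. A sketch of checking and setting quotas (directory and sizes are placeholders):

    # Show quotas and usage (column order as described above)
    hadoop fs -count -q /
    # Limit the number of names (files + directories) under a directory
    hadoop dfsadmin -setQuota 100000 /data/logs
    # Limit raw disk space, replication included; size suffixes like g/t are accepted
    hadoop dfsadmin -setSpaceQuota 500g /data/logs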