Preserving conf files when restarting ec2 cluster

2014-09-12 Thread jerryye
Hi, I'm using --use-existing-master to launch a previously stopped ec2 cluster with spark-ec2. However, my configuration files are overwritten once the cluster is set up. What's the best way of preserving existing configuration files in spark/conf? Alternatively, what I'm trying to do is set SPARK

Re: Serialize input path

2014-09-05 Thread jerryye
Thanks for the response, Sean. As a correction, the code I provided actually ended up working. I tried to reduce my code down but I was being overzealous, and running count actually works. The minimal code that triggers the problem is this: val userProfiles = lines.map(line => {parse(line)}).map(js

Serialize input path

2014-09-04 Thread jerryye
Hi, I have a quick serialization issue. I'm trying to read a date range of input files and I'm getting a serialization error when the input path is built by an object that generates a date range. Specifically, my code uses DateTimeFormat in the Joda time package, which is not serializable. How do I get
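
One common way around this kind of error is to build the list of input paths as plain strings on the driver, so that no formatter object is ever captured by a closure that Spark has to serialize. The sketch below is not the poster's code (the original message is truncated). It swaps in java.time for Joda, and the helper name `DateRangePaths`, the `yyyy/MM/dd` layout, and the bucket path in the usage note are all assumptions made for illustration.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object DateRangePaths {
  // Hypothetical helper: returns one path per day from start to end, inclusive.
  // The formatter lives only inside this method, which runs on the driver,
  // so it never becomes part of a serialized task closure.
  def paths(base: String, start: LocalDate, end: LocalDate): Seq[String] = {
    val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd") // assumed directory layout
    Iterator.iterate(start)(_.plusDays(1))
      .takeWhile(d => !d.isAfter(end))
      .map(d => s"$base/${fmt.format(d)}")
      .toList
  }
}
```

Because the result is an ordinary list of strings, it can be joined with commas and handed to `sc.textFile`, for example `sc.textFile(DateRangePaths.paths("s3n://bucket/logs", start, end).mkString(","))`, where the bucket name is a placeholder.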

Re: saveAsTextFile makes no progress without caching RDD

2014-09-02 Thread jerryye
As an update, I'm still getting the same issue. I ended up doing a coalesce instead of a cache to get around the memory issue, but saveAsTextFile still won't proceed without the coalesce or cache first.

minPartitions ignored for bz2?

2014-08-27 Thread jerryye
Hi, I'm running on the master branch and I noticed that textFile ignores minPartitions for bz2 files. Is anyone else seeing the same thing? I tried varying minPartitions for a bz2 file and rdd.partitions.size was always 1, whereas doing it for a non-bz2 file worked. Not sure if this matters or not

saveAsTextFile makes no progress without caching RDD

2014-08-21 Thread jerryye
Hi, I'm running on branch-1.1 and trying to do a simple transformation to a relatively small dataset of 64GB and saveAsTextFile essentially hangs and tasks are stuck in running mode with the following code: // Stalls with tasks running for over an hour with no tasks finishing. Smallest partition i

Re: Job aborted due to stage failure: TID x failed for unknown reasons

2014-08-14 Thread jerryye
Bump. Same problem here.

Re: How to debug: Runs locally but not on cluster

2014-08-13 Thread jerryye
I've isolated this to a memory issue but I don't know what parameter I need to tweak. If I sample my samples RDD with 35% of the data, everything runs to completion, with 35%, it fails. In standalone mode, I can run on the full RDD without any problems. // works val samples = sc.textFile("s3n://ge

Re: Python + Spark unable to connect to S3 bucket .... "Invalid hostname in URI"

2014-08-13 Thread jerryye
Using s3n:// worked for me.

How to debug: Runs locally but not on cluster

2014-08-13 Thread jerryye
Hi all, I have an issue where I'm able to run my code in standalone mode but not on my cluster. I've isolated it to a few things but am at a loss as to how to debug this. Below is the code. Any suggestions would be much appreciated. Thanks! 1) RDD size is causing the problem. The code below as is fai

Serialization with com.twitter.chill.MeatLocker

2014-08-11 Thread jerryye
Hi, I've been trying to use com.twitter.chill.MeatLocker to serialize a third-party class. So far I'm having no luck and I'm still getting the dreaded Task not Serializable error for org.ahocorasick.trie.Trie. Am I doing something obviously wrong? Below is my test code that is failing: import co
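
A different technique that is often used for this kind of error, shown here instead of MeatLocker, is to ship only the serializable inputs and rebuild the non-serializable object lazily on the receiving side with `@transient lazy val`. The sketch below is self-contained and uses plain Java serialization to stand in for what happens when a task is shipped. `Matcher` is a made-up placeholder for a third-party class such as the Trie mentioned above, and `MatcherHolder` and `RoundTrip` are hypothetical names; none of this is taken from the truncated test code in the post.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Placeholder for a third-party class that does not implement Serializable.
class Matcher(val keywords: Seq[String]) {
  def matches(text: String): Boolean = keywords.exists(k => text.contains(k))
}

// Serializable holder: only the keyword list is written out.
// The matcher field is transient, so it is skipped during serialization
// and rebuilt on first access after deserialization.
class MatcherHolder(keywords: List[String]) extends Serializable {
  @transient lazy val matcher: Matcher = new Matcher(keywords)
}

object RoundTrip {
  // Serializes a value to bytes and reads it back, as a stand-in for
  // sending a closure's captured state to another JVM.
  def roundTrip[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

The trade-off is that the object is constructed once per deserialized copy rather than once on the driver, which is usually acceptable when construction from the serializable inputs is cheap.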