Hi,
I'm using --use-existing-master to launch a previously stopped EC2 cluster
with spark-ec2. However, my configuration files are overwritten once the
cluster is set up. What's the best way of preserving existing configuration
files in spark/conf?
Alternatively, what I'm trying to do is set SPARK
Thanks for the response, Sean.
As a correction: the code I provided actually ended up working. I was
overzealous when reducing my code down, and running count actually works.
The minimal code that triggers the problem is this:
val userProfiles = lines.map(line => {parse(line)}).map(js
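For context, a hedged reconstruction of what the full pipeline might look
like (the snippet above is truncated after ".map(js", so the JSON library,
field name, path, and final action below are all assumptions; parse is taken
to be json4s's parser):

import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Placeholder input; the real path is not shown in the snippet above.
val lines = sc.textFile("hdfs:///path/to/profiles")

// Parse each line into a JValue, then extract a field from the parsed JSON.
val userProfiles = lines
  .map(line => parse(line))
  .map(js => (js \ "userId").values.toString)  // "userId" is a made-up field

userProfiles.count()  // placeholder action; the snippet cuts off before it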
Hi,
I have a quick serialization issue. I'm trying to read a date range of input
files, and I'm getting a serialization error because the input path is
generated by an object that produces a date range. Specifically, my code uses
DateTimeFormat from the Joda-Time package, which is not serializable. How do I
get
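A common way around this (a sketch, assuming the formatter is only needed
inside the transformation) is to construct the Joda formatter on the
executors instead of capturing it in the closure, either per partition or
behind a @transient lazy val:

import org.joda.time.format.DateTimeFormat

// Option 1: build the formatter per partition so it never leaves the
// executor. The pattern and the 'lines' RDD are placeholders.
val dates = lines.mapPartitions { iter =>
  val fmt = DateTimeFormat.forPattern("yyyy-MM-dd")
  iter.map(line => fmt.parseDateTime(line.take(10)))
}

// Option 2: keep it in an object; each JVM initializes its own copy lazily.
object DateFormats {
  @transient lazy val fmt = DateTimeFormat.forPattern("yyyy-MM-dd")
}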
As an update: I'm still getting the same issue. I ended up doing a coalesce
instead of a cache to get around the memory issue, but saveAsTextFile still
won't proceed without the coalesce or cache first.
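For reference, the workaround boils down to something like this (a sketch;
the transformation, partition count, and output path are placeholders):

// Coalescing (or caching) before the save is what lets the job proceed.
val result = input.map(transform)        // placeholder transformation

result.coalesce(1000)                    // placeholder partition count
  .saveAsTextFile("s3n://bucket/output") // placeholder path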
Hi,
I'm running on the master branch and I noticed that textFile ignores
minPartitions for bz2 files. Is anyone else seeing the same thing? I tried
varying minPartitions for a bz2 file and rdd.partitions.size was always 1,
whereas doing the same for a non-bz2 file worked.
Not sure if this matters or not
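To illustrate (a minimal sketch; the paths and partition count are
placeholders):

// bz2 input reportedly comes back as a single partition no matter what
// minPartitions is set to, while plain text respects it.
val bz2 = sc.textFile("hdfs:///data/input.bz2", minPartitions = 8)
println(bz2.partitions.size)   // observed: always 1 for bz2

// Workaround until the input format can split bz2: force a reshuffle.
val split = bz2.repartition(8)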
Hi,
I'm running on branch-1.1 and trying to do a simple transformation on a
relatively small dataset of 64GB; saveAsTextFile essentially hangs, with
tasks stuck in running mode, given the following code:
// Stalls with tasks running for over an hour with no tasks finishing.
Smallest partition i
bump. same problem here.
I've isolated this to a memory issue, but I don't know which parameter I need
to tweak. If I sample my samples RDD down to 35% of the data, everything runs
to completion; with a larger fraction it fails. In standalone mode, I can run
on the full RDD without any problems.
// works
val samples = sc.textFile("s3n://ge
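A hedged sketch of this isolate-by-sampling approach (the bucket path above
is truncated, so it and the downstream action are placeholders):

// The original line is cut off after "s3n://ge", so the path is made up.
val samples = sc.textFile("s3n://bucket/path")

// works: run the job on 35% of the data
val subset = samples.sample(withReplacement = false, fraction = 0.35)
subset.count()   // placeholder action

// fails on the cluster: the same job over a larger fraction of the RDD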
Using s3n:// worked for me.
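In code form (a sketch; the bucket and credentials are placeholders, and the
keys can also live in core-site.xml or the URI itself):

// Read from S3 using the s3n:// scheme rather than s3://.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
val data = sc.textFile("s3n://your-bucket/path/to/data")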
Hi all,
I have an issue where I'm able to run my code in standalone mode but not on
my cluster. I've isolated it to a few things but am at a loss as to how to
debug this. Below is the code. Any suggestions would be much appreciated.
Thanks!
1) RDD size is causing the problem. The code below as is fai
Hi,
I've been trying to use com.twitter.chill.MeatLocker to serialize a
third-party class. So far I'm having no luck, and I'm still getting the
dreaded Task not serializable error for org.ahocorasick.trie.Trie. Am I
doing something obviously wrong?
Below is my test code, which is failing:
import co
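For comparison, a sketch of how MeatLocker is typically used for this (the
Trie construction and input RDD are assumptions, since the test code above is
cut off):

import com.twitter.chill.MeatLocker
import org.ahocorasick.trie.Trie

// Wrap the non-serializable Trie; MeatLocker serializes it with Kryo and
// hands back a deserialized copy via .get on the executor.
val trie = Trie.builder().addKeyword("spark").build() // assumed construction
val locker = MeatLocker(trie)

val hits = lines.map { line =>   // 'lines' is a placeholder RDD[String]
  locker.get.parseText(line).size
}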