Serializer or Out-of-Memory issues?

2014-06-30 Thread Sguj
I'm trying to perform operations on a large RDD that ends up being about 1.3 GB in memory once loaded. It's cached in memory during the first operation, but when another task that uses the RDD begins, I get an error saying the RDD was lost: 14/06/30 09:48:17 INFO TaskSetManage
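
A minimal sketch of one way to keep a cached RDD from being recomputed when executors run low on memory, assuming a PySpark setup like the one described (paths and names below are illustrative, not from the original post):

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="cache-example")              # hypothetical app name
    rdd = sc.textFile("hdfs://localhost:9000/some/input")   # illustrative path

    # MEMORY_AND_DISK spills partitions to disk instead of dropping them,
    # so a later job can reuse them without recomputing the whole lineage.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)

    total = rdd.count()    # first action materializes and caches the RDD
    sample = rdd.take(5)   # second action reuses the cached partitions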

Re: Spark 1.0.0 java.lang.outOfMemoryError: Java Heap Space

2014-06-18 Thread Sguj
I got rid of most of my heap errors by increasing the number of partitions of my RDDs by 8-16x. I found on the tuning page that heap space errors can be caused by a hash table generated during the shuffle functions, so by splitting up how
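
A rough PySpark sketch of that kind of change (the partition counts and input path are made up for illustration): giving the shuffle more partitions keeps each task's in-memory hash table smaller.

    # read with more input partitions than the default
    rdd = sc.textFile("hdfs://localhost:9000/some/input", minPartitions=128)

    # either repartition explicitly before the shuffle...
    rdd = rdd.repartition(256)

    # ...or pass numPartitions to the shuffle operation itself
    counts = rdd.map(lambda line: (line, 1)) \
                .reduceByKey(lambda a, b: a + b, numPartitions=256)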

Re: Spark 1.0.0 java.lang.outOfMemoryError: Java Heap Space

2014-06-17 Thread Sguj
Am I trying to reduce it to the minimum number of partitions, or increase the number of partitions with that change?

Re: wholeTextFiles not working with HDFS

2014-06-17 Thread Sguj
I can write one if you'll point me to where I need to write it.

Spark 1.0.0 java.lang.outOfMemoryError: Java Heap Space

2014-06-17 Thread Sguj
I've been trying to figure out how to increase the heap space for my Spark environment in 1.0.0, and everything I've found tells me either to export something in the Java opts, which is deprecated in 1.0.0, or to increase spark.executor.memory, which is already at 6G. I'm only trying to process about 40
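
For context, a sketch of the 1.0.0-era way to set those sizes through Spark properties rather than the deprecated environment variables (the values are illustrative, not a recommendation):

    # conf/spark-defaults.conf, picked up by spark-submit:
    #   spark.executor.memory   6g
    #   spark.driver.memory     4g

    # or programmatically, before the context is created:
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("heap-space-example")       # hypothetical name
            .set("spark.executor.memory", "6g"))    # per-executor JVM heap
    sc = SparkContext(conf=conf)

Note that when everything runs in a single local JVM, it is the driver's heap (spark-submit --driver-memory) rather than spark.executor.memory that usually matters.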

Re: wholeTextFiles not working with HDFS

2014-06-17 Thread Sguj
I didn't fix the issue so much as work around it. I was running my cluster locally, so using HDFS was just a preference. The code worked with the local file system, so that's what I'm using until I can get some help.

Re: wholeTextFiles not working with HDFS

2014-06-13 Thread Sguj
My exception stack looks about the same.

java.io.FileNotFoundException: File /user/me/target/capacity-scheduler.xml does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFil

wholeTextFiles not working with HDFS

2014-06-12 Thread Sguj
I'm trying to get a list of every filename in a directory from HDFS using PySpark, and the only function that seems like it would return the filenames is wholeTextFiles. My code for just trying to collect that data is this: files = sc.wholeTextFiles("hdfs://localhost:port/users/me
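
A minimal sketch of that approach (the URI and port below are placeholders, not the original values): wholeTextFiles returns (filename, content) pairs, so the filenames are just the keys of the resulting RDD.

    # wholeTextFiles yields one (path, content) pair per file in the directory
    files = sc.wholeTextFiles("hdfs://localhost:9000/users/me")  # placeholder URI

    # keep only the paths; the file contents themselves are never collected
    filenames = files.keys().collect()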