wholeTextFiles not working with HDFS

2014-06-12 Thread Sguj
I'm trying to get a list of every filename in an HDFS directory using PySpark, and the only function that seems to return the filenames is wholeTextFiles. My code for collecting that data is just this:

   files = sc.wholeTextFiles("hdfs://localhost:port/users/me/target")
   files = files.collect()

These lines return the error "java.io.FileNotFoundException: File
/user/me/target/capacity-scheduler.xml does not exist", which makes it seem
like the HDFS information isn't being used by the wholeTextFiles function.

Those lines work if I use them on a local filesystem directory, and the
textFile() function works on the HDFS directory I'm trying to use
wholeTextFiles() on.

I need a way to either fix this or find an alternative method of reading the
filenames from a directory in HDFS.
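
In case it's useful to anyone else hitting this, here's a minimal sketch of one alternative I'm considering: going through the Hadoop FileSystem API via PySpark's JVM gateway instead of wholeTextFiles. The sc._jvm and sc._jsc attributes are PySpark internals rather than public API, and the hdfs://localhost:port URI is the same placeholder as above, so treat this as an illustration rather than a confirmed fix:

   # Sketch: list the filenames in an HDFS directory through the Hadoop
   # FileSystem API, using PySpark's (internal) JVM gateway.
   # Assumes an existing SparkContext `sc`; the URI is a placeholder.
   Path = sc._jvm.org.apache.hadoop.fs.Path

   path = Path("hdfs://localhost:port/users/me/target")
   fs = path.getFileSystem(sc._jsc.hadoopConfiguration())
   statuses = fs.listStatus(path)
   filenames = [statuses[i].getPath().getName() for i in range(len(statuses))]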



Re: wholeTextFiles not working with HDFS

2014-06-13 Thread Sguj
My exception stack looks about the same.

java.io.FileNotFoundException: File /user/me/target/capacity-scheduler.xml does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
    at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
    at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
    at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
    at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:173)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1094)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:717)

I'm using Hadoop 1.2.1, and everything else I've tried in Spark with that
version has worked, so I doubt it's a version error.
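
One hedged experiment that might be worth trying: point the context's Hadoop configuration at HDFS explicitly before building the RDD, since the RawLocalFileSystem frames in the trace suggest the job is falling back to the local filesystem. Whether wholeTextFiles actually picks up this configuration in 1.0.0 is exactly what's in question here, so this is only a sketch (the URI is still a placeholder, and fs.default.name is the Hadoop 1.x key):

   # Experiment only: force the default filesystem to HDFS on the
   # context's Hadoop configuration (Hadoop 1.x key shown), then retry.
   sc._jsc.hadoopConfiguration().set("fs.default.name", "hdfs://localhost:port")
   files = sc.wholeTextFiles("hdfs://localhost:port/users/me/target")
   print(files.keys().collect())  # the keys are the filenames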



Re: wholeTextFiles not working with HDFS

2014-06-17 Thread Sguj
I didn't fix the issue so much as work around it. I was running my cluster
locally, so using HDFS was just a preference. The code worked with the local
file system, so that's what I'm using until I can get some help.



Spark 1.0.0 java.lang.OutOfMemoryError: Java Heap Space

2014-06-17 Thread Sguj
I've been trying to figure out how to increase the heap space for my Spark
environment in 1.0.0, and everything I've found tells me either to export
something in the Java opts, which is deprecated in 1.0.0, or to increase
spark.executor.memory, which is already at 6G. I'm only trying to process about
400-500 MB of text, but I get this error whenever I try to collect:

14/06/17 11:00:21 INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to sp...@salinger.ornl.gov:50251
14/06/17 11:00:21 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 165 bytes
14/06/17 11:00:35 INFO BlockManagerInfo: Added taskresult_14 in memory on salinger.ornl.gov:50253 (size: 123.7 MB, free: 465.1 MB)
14/06/17 11:00:35 INFO BlockManagerInfo: Added taskresult_13 in memory on salinger.ornl.gov:50253 (size: 127.7 MB, free: 337.4 MB)
14/06/17 11:00:36 ERROR Utils: Uncaught exception in thread Result resolver thread-2
java.lang.OutOfMemoryError: Java heap space
    at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:39)
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
    at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:94)
    at org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176)
    at org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63)
    at org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109)
    at org.apache.spark.storage.BlockManagerWorker$.syncGetBlock(BlockManagerWorker.scala:128)
    at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:489)
    at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:487)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:487)
    at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:481)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:53)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:695)

Any idea how to fix heap space errors in 1.0.0?
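
For reference, a minimal sketch of how memory can be set programmatically in 1.0.0 through SparkConf instead of the deprecated environment variables (the values below are placeholders). Since the trace above is in the driver-side result fetching (TaskResultGetter), driver memory is probably at least as relevant as executor memory, and it generally has to be set before the driver JVM starts, e.g. with --driver-memory on spark-submit, rather than in code:

   # Sketch with placeholder values: configure memory via SparkConf
   # rather than the deprecated environment variables.
   from pyspark import SparkConf, SparkContext

   conf = (SparkConf()
           .setAppName("heap-space-example")
           .set("spark.executor.memory", "6g"))
   sc = SparkContext(conf=conf)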



Re: wholeTextFiles not working with HDFS

2014-06-17 Thread Sguj
I can write one if you'll point me to where I need to write it.



Re: Spark 1.0.0 java.lang.OutOfMemoryError: Java Heap Space

2014-06-17 Thread Sguj
Am I trying to reduce it to the minimum number of partitions, or increase the
number of partitions with that change?



Re: Spark 1.0.0 java.lang.OutOfMemoryError: Java Heap Space

2014-06-18 Thread Sguj
I got rid of most of my heap errors by increasing the number of partitions of
my RDDs by 8-16x. I found on the Spark tuning page that heap space errors can
be caused by the hash tables generated during the shuffle functions, so by
splitting the data handled by each shuffle task across more partitions, I was
able to get rid of the errors. Thanks for putting me on the right path with
the partitions.
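
For anyone following along, a minimal sketch of the two usual ways to do that in PySpark, assuming an existing pair RDD named pairs (the name and the partition count are placeholders, not from my actual job):

   # Option 1 (placeholder names/values): repartition before the shuffle.
   pairs = pairs.repartition(256)
   counts = pairs.reduceByKey(lambda a, b: a + b)

   # Option 2: pass the partition count to the shuffle operation itself.
   counts = pairs.reduceByKey(lambda a, b: a + b, numPartitions=256)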



Serializer or Out-of-Memory issues?

2014-06-30 Thread Sguj
I'm trying to perform operations on a large RDD that ends up being about 1.3
GB in memory when loaded. It's cached in memory during the first operation,
but when another task that uses the RDD begins, I get these errors saying the
tasks and then the executor were lost:

14/06/30 09:48:17 INFO TaskSetManager: Serialized task 1.0:4 as 8245 bytes in 0 ms
14/06/30 09:48:17 WARN TaskSetManager: Lost TID 15611 (task 1.0:3)
14/06/30 09:48:17 WARN TaskSetManager: Loss was due to org.apache.spark.api.python.PythonException
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/me/Desktop/spark-1.0.0/python/pyspark/worker.py", line 73, in main
    command = pickleSer._read_with_length(infile)
  File "/Users/me/Desktop/spark-1.0.0/python/pyspark/serializers.py", line 142, in _read_with_length
    length = read_int(stream)
  File "/Users/me/Desktop/spark-1.0.0/python/pyspark/serializers.py", line 337, in read_int
    raise EOFError
EOFError

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115)
    at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:145)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
14/06/30 09:48:18 INFO AppClient$ClientActor: Executor updated: app-20140630090515-/0 is now FAILED (Command exited with code 52)
14/06/30 09:48:18 INFO SparkDeploySchedulerBackend: Executor app-20140630090515-/0 removed: Command exited with code 52
14/06/30 09:48:18 INFO SparkDeploySchedulerBackend: Executor 0 disconnected, so removing it
14/06/30 09:48:18 ERROR TaskSchedulerImpl: Lost executor 0 on localhost: OutOfMemoryError
14/06/30 09:48:18 INFO TaskSetManager: Re-queueing tasks for 0 from TaskSet 1.0
14/06/30 09:48:18 WARN TaskSetManager: Lost TID 15610 (task 1.0:2)
14/06/30 09:48:18 WARN TaskSetManager: Lost TID 15609 (task 1.0:1)
14/06/30 09:48:18 WARN TaskSetManager: Lost TID 15612 (task 1.0:4)
14/06/30 09:48:18 WARN TaskSetManager: Lost TID 15608 (task 1.0:0)


The operation it fails on is a reduceByKey(), and the RDD going into it is
split into several thousand partitions (I'm doing term weighting that
initially requires a separate partition for each document). The executor has
6 GB of memory, so I'm not sure whether it's actually a memory error, as the
line five from the end of the log suggests. The serializer error portion is
what's really confusing me, and I can't find references to this particular
error with Spark anywhere.

Does anyone have a clue as to what the actual error might be here, and what
a possible solution would be?
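
In case it helps with debugging, here's a hedged sketch of the two things I plan to try first, assuming the executor really is running out of memory (the RDD and variable names are placeholders, not my actual code): persisting the cached RDD with a storage level that can spill to disk, and giving reduceByKey an explicit partition count so each reduce task holds less in memory at once.

   # Sketch with placeholder names: let the cache spill to disk and bound
   # how much data each reduce task has to hold at once.
   from pyspark import StorageLevel

   docs = docs.persist(StorageLevel.MEMORY_AND_DISK)
   weights = docs.reduceByKey(lambda a, b: a + b, numPartitions=2048)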


