Hi all,

Hitting a mysterious error loading large text files, specific to PySpark
0.9.0.

In PySpark 0.8.1, this works:

data = sc.textFile("path/to/myfile")
data.count()

But in 0.9.0, it stalls. The logs show tasks completing up to:

14/03/17 16:54:24 INFO TaskSetManager: Finished TID 4 in 1699 ms on X.X.X.X
(progress: 15/537)
14/03/17 16:54:24 INFO DAGScheduler: Completed ResultTask(5, 4)

And then this repeats indefinitely:

14/03/17 16:54:24 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5,
runningTasks: 144
14/03/17 16:54:25 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5,
runningTasks: 144

It always stalls at the same place. There's nothing in stderr on the workers,
but in stdout there are several of these messages:

INFO PythonRDD: stdin writer to Python finished early

So perhaps the real error is being suppressed, as in
https://spark-project.atlassian.net/browse/SPARK-1025

The data is just rows of space-separated numbers, ~20GB total, with 300k rows
and ~50k characters per row. I'm running on a private cluster with 10 nodes
(100GB RAM / 16 cores each), Python 2.7.6.

I doubt the data is corrupted, as it loads fine in Scala under both 0.8.1 and
0.9.0, and in PySpark under 0.8.1. Happy to post the file, but it should repro
for anything with these dimensions. It *might* be specific to long strings: I
don't see it with fewer characters per row (10k), though I also don't see it
with many fewer rows and the same number of characters per row.
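
For anyone trying to reproduce without the original file, here's a rough
sketch for generating a synthetic file with roughly these dimensions (the
path and exact counts are just illustrative, not my actual data):

# Writes ~300k rows of space-separated numbers, roughly 50k characters per
# row, just to approximate the dimensions described above.
import random

n_rows = 300000
n_cols = 8000  # ~6 chars per number incl. the space -> ~48k chars per row

with open("/tmp/synthetic_rows.txt", "w") as f:
    for _ in range(n_rows):
        f.write(" ".join("%.3f" % random.random() for _ in range(n_cols)))
        f.write("\n")

Then load it the same way as above with sc.textFile(...) and count().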

Happy to try and provide more info / help debug!

-- Jeremy


