Hi all,

I'm hitting a mysterious error loading large text files, specific to PySpark 0.9.0.
In PySpark 0.8.1, this works:

data = sc.textFile("path/to/myfile")
data.count()

But in 0.9.0, it stalls. The logs show tasks completing up to:

14/03/17 16:54:24 INFO TaskSetManager: Finished TID 4 in 1699 ms on X.X.X.X (progress: 15/537)
14/03/17 16:54:24 INFO DAGScheduler: Completed ResultTask(5, 4)

and then this repeats indefinitely:

14/03/17 16:54:24 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5, runningTasks: 144
14/03/17 16:54:25 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5, runningTasks: 144

It always stalls at the same place. There's nothing in stderr on the workers, but stdout contains several of these messages:

INFO PythonRDD: stdin writer to Python finished early

So perhaps the real error is being suppressed, as in https://spark-project.atlassian.net/browse/SPARK-1025

The data is just rows of space-separated numbers, ~20 GB in total, with 300k rows and 50k characters per row. I'm running on a private cluster with 10 nodes, 100 GB / 16 cores each, Python 2.7.6.

I doubt the data is corrupted, as it loads fine in Scala in both 0.8.1 and 0.9.0, and in PySpark 0.8.1. Happy to post the file, but it should repro for anything with these dimensions (see the generation sketch below). It *might* be specific to long lines: I don't see the stall with fewer characters per row (10k), but I also don't see it with many fewer rows at the same number of characters per row, so it may depend on both dimensions.

Happy to try and provide more info / help debug!

-- Jeremy
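
In case it helps with reproducing, here is a minimal sketch of how a file with roughly these dimensions could be generated. The filename, value range, and token width are arbitrary illustrative choices, not properties of the actual data:

    # Sketch: write ~300k rows of space-separated numbers, ~50k characters per row.
    # "repro.txt", the value range, and the token width are assumptions; adjust as needed.
    import random

    n_rows = 300000
    tokens_per_row = 7000   # ~7 chars per token ("0.1234 ") gives ~50k chars/row

    with open("repro.txt", "w") as f:
        for _ in range(n_rows):
            row = " ".join("%.4f" % random.random() for _ in range(tokens_per_row))
            f.write(row + "\n")

That produces roughly 15 GB; bumping n_rows or tokens_per_row gets it closer to 20 GB.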