Once again I am trying to read a directory tree of binary files. My directory tree has a root dir ROOTDIR and subdirs where the files are located, i.e.
ROOTDIR/1 ROOTDIR/2 ... ROOTDIR/100

A total of 1 million files split into 100 subdirs.

Using binaryFiles on the whole tree requires too much memory on the driver.

I've also tried building an RDD with binaryFiles for each subdir, then ++ (union) those together and rdd.saveAsObjectFile("outputDir") (roughly sketched at the end of this message). That instead requires a lot of memory on the executors!

I've also tried saving object files for each subdirectory separately and then merging them (also sketched at the end). But that fails without an exception: the driver just says it lost the connection with an executor, and the executor says it got a request to terminate from the driver!

On top of that, when the job runs, it fails after some (or a lot of) time has passed. The only thing I can see in the logs is that the driver lost the connection with an executor. The executor log seems to indicate that the driver asked for a shutdown. No exception or error message anywhere.

What is the proper way to use binaryFiles with this number of files?

I've tried the same approach with sc.wholeTextFiles. I don't hit the memory issue there, but it still fails.

Thanks
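
P.S. For reference, the per-subdir union attempt looks roughly like this. It is just a sketch of what I mean, not the exact job; the app name, number range and output path are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object UnionBinaryFiles {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("binaryFiles-union"))

        // One RDD of (path, PortableDataStream) per subdirectory ROOTDIR/1 .. ROOTDIR/100,
        // then union them all into a single RDD.
        val perDir = (1 to 100).map(i => sc.binaryFiles(s"ROOTDIR/$i"))
        val all    = perDir.reduce(_ ++ _)

        // Writing out the whole union in one go is where the executors need a lot of memory.
        all.saveAsObjectFile("outputDir")

        sc.stop()
      }
    }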
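
And the two-phase attempt (save each subdir to its own object file first, then merge) is roughly this; "objDir" and "mergedOutputDir" are placeholder names:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.input.PortableDataStream

    object SaveThenMerge {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("binaryFiles-merge"))

        // Phase 1: write one object file per subdirectory.
        for (i <- 1 to 100) {
          sc.binaryFiles(s"ROOTDIR/$i").saveAsObjectFile(s"objDir/$i")
        }

        // Phase 2: read the per-directory object files back and merge them.
        // This merge step is where the driver reports losing an executor and the
        // executor log shows a shutdown request, with no exception anywhere.
        val merged = (1 to 100)
          .map(i => sc.objectFile[(String, PortableDataStream)](s"objDir/$i"))
          .reduce(_ ++ _)
        merged.saveAsObjectFile("mergedOutputDir")

        sc.stop()
      }
    }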