Once again I am trying to read a directory tree of binary files with binaryFiles.

My directory tree has a root directory, ROOTDIR, and subdirectories where the
files are located, i.e.

ROOTDIR/1
ROOTDIR/2
ROOTDIR/..
ROOTDIR/100

A total of 1 million files split across 100 subdirectories.

Calling binaryFiles on the whole tree requires too much memory on the driver.
I've also tried building one RDD with binaryFiles per subdirectory, combining
them with ++, and calling rdd.saveAsObjectFile("outputDir"); that requires a
lot of memory in the executors instead. I've also tried saving an object file
for each subdirectory separately and then merging them, but that fails without
any exception: the driver just reports that it lost the connection to an
executor, and the executor reports that it received a request from the driver
to terminate.
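
Roughly what that looks like, as a minimal sketch (the app name, ROOTDIR,
outputDir and the fixed range 1 to 100 are just placeholders matching the
layout above, not the exact code I run):

  import org.apache.spark.{SparkConf, SparkContext}

  object ReadBinaryTree {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("binaryFiles-1M-files"))

      // Attempt 1: read the whole tree at once -- too much memory on the driver
      // val all = sc.binaryFiles("ROOTDIR")

      // Attempt 2: one RDD per subdirectory, combined with ++ and saved as one object file
      val perDir = (1 to 100).map(i => sc.binaryFiles(s"ROOTDIR/$i"))
      val merged = perDir.reduce(_ ++ _)
      merged.saveAsObjectFile("outputDir") // executors run out of memory here

      // Attempt 3: save an object file per subdirectory, then merge them later
      // perDir.zipWithIndex.foreach { case (rdd, i) => rdd.saveAsObjectFile(s"outputDir/$i") }

      sc.stop()
    }
  }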

On top of that, after the job has been running for some (sometimes a lot of)
time, it fails. The only thing I can see in the logs is that the driver lost
the connection to an executor, and the executor log seems to indicate that the
driver asked it to shut down. There is no exception or error message.

What is the proper way to use binaryFiles with this number of files?

I've tried the same approach with sc.wholeTextFiles. I don't hit the memory
issue, but it still fails.
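
The wholeTextFiles variant is the same union-per-subdirectory pattern, e.g.
(again just a sketch, with outputDirText as a placeholder path):

  // Same pattern as above, with wholeTextFiles instead of binaryFiles
  val perDirText = (1 to 100).map(i => sc.wholeTextFiles(s"ROOTDIR/$i"))
  perDirText.reduce(_ ++ _).saveAsObjectFile("outputDirText")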

Thanks 


