With binaryRecords, it loads the file record by record into an RDD. With binaryFiles, it provides an input stream, so it is up to you whether to load everything into memory. However, the documentation does not suggest using this function for large files.
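A minimal sketch of the two calls, assuming 16-byte fixed-length records and hypothetical HDFS paths (adjust both to your data):

import org.apache.spark.{SparkConf, SparkContext}

object BinaryReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("binary-read"))

    // binaryRecords: each element is an Array[Byte] of exactly recordLength bytes,
    // read in parallel by the executors.
    val recordLength = 16 // assumed record size
    val records = sc.binaryRecords("hdfs:///data/big.bin", recordLength)

    // binaryFiles: each element is (path, PortableDataStream); the stream is lazy,
    // so you decide how much of each file to pull into memory.
    val files = sc.binaryFiles("hdfs:///data/parts/*")
    val sizes = files.mapValues(stream => stream.toArray().length) // toArray() reads the whole file

    println(s"record count = ${records.count()}, file count = ${sizes.count()}")
    sc.stop()
  }
}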
From: Vijayasarathy Kannan [mailto:kvi...@vt.edu]
Sent: Wednesday, May 06, 2015 2:23 PM
To: Ulanov, Alexander
Cc: user@spark.apache.org
Subject: Re: Reading large files

Thanks. In both cases, does the driver need to have enough memory to contain the entire file? How do both these functions behave when, for example, the binary file is 4 GB and the available driver memory is less?

On Wed, May 6, 2015 at 1:54 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

SparkContext has two methods for reading binary files: binaryFiles (reads multiple binary files into an RDD) and binaryRecords (reads separate records of a single binary file into an RDD). For example, I have a big binary file split into logical parts, so I can use "binaryFiles". The possible problem is that the order of records between parts is not preserved, so I have to do sortBy afterwards.

Alexander

From: Vijayasarathy Kannan [mailto:kvi...@vt.edu]
Sent: Wednesday, May 06, 2015 10:38 AM
To: user@spark.apache.org
Subject: Reading large files

Hi,

Is there a way to read a large file in a parallel/distributed way? I have a single large binary file which I currently read on the driver program and then distribute to the executors (using groupBy(), etc.). I want to know if there's a way to make the executors each read a specific/unique portion of the file, or create RDDs of multiple portions of the file and finally union them.

Thanks.
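For reference, a rough sketch of the "split into logical parts + sortBy" approach described above. The path pattern, the part-number parsing, and the parseRecords helper are assumptions for illustration only:

import org.apache.spark.{SparkConf, SparkContext}

object OrderedPartsSketch {
  // Hypothetical decoder that turns one part's bytes into records.
  def parseRecords(bytes: Array[Byte]): Seq[Array[Byte]] =
    bytes.grouped(16).toSeq // assumed 16-byte records

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ordered-parts"))

    // Each element is (path, PortableDataStream); parts are read by the executors.
    val parts = sc.binaryFiles("hdfs:///data/file-parts/part-*")

    // The order of parts in the RDD is not guaranteed, so key each record by
    // (part number, index within part) and sort afterwards.
    val ordered = parts
      .flatMap { case (path, stream) =>
        val partNo = path.split("part-").last.toInt // assumed file naming scheme
        parseRecords(stream.toArray()).zipWithIndex.map {
          case (rec, i) => ((partNo, i), rec)
        }
      }
      .sortByKey()
      .values

    println(s"total records = ${ordered.count()}")
    sc.stop()
  }
}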