With binaryRecords, Spark reads the file record by record into an RDD, so the 
driver does not need to hold the whole file. With binaryFiles, each file is 
exposed as an input stream (PortableDataStream), so it is up to you whether you 
load everything into memory. That said, the documentation does not recommend 
binaryFiles for large files.
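
A minimal sketch of both calls, assuming an existing SparkContext sc; the paths 
and the record length are illustrative:

// binaryRecords: each fixed-length record becomes one Array[Byte] element,
// read partition by partition on the executors.
val records = sc.binaryRecords("hdfs:///data/records.bin", recordLength = 1024)

// binaryFiles: each file becomes a (path, PortableDataStream) pair; the stream
// is lazy, so you choose how much of each file to pull into executor memory.
val files = sc.binaryFiles("hdfs:///data/parts/*")
val contents = files.mapValues(_.toArray())  // materializes each whole file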

From: Vijayasarathy Kannan [mailto:kvi...@vt.edu]
Sent: Wednesday, May 06, 2015 2:23 PM
To: Ulanov, Alexander
Cc: user@spark.apache.org
Subject: Re: Reading large files

Thanks.

In both cases, does the driver need enough memory to hold the entire file? How 
do these two functions behave when, for example, the binary file is 4 GB and the 
available driver memory is less than that?

On Wed, May 6, 2015 at 1:54 PM, Ulanov, Alexander 
<alexander.ula...@hp.com> wrote:
SparkContext has two methods for reading binary files: binaryFiles (reads 
multiple binary files into an RDD of (path, stream) pairs) and binaryRecords 
(reads fixed-length records of a single binary file into an RDD). For example, I 
have a big binary file split into logical parts, so I can use "binaryFiles". The 
possible problem is that the order of records between the parts is not 
preserved, so I have to do a sortBy afterwards.
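
For example, something along these lines (the part-file paths, record length, 
and ordering key are assumptions on my side):

// Read all parts; each element is (path, PortableDataStream).
val parts = sc.binaryFiles("hdfs:///data/bigfile-parts/*")
val recordLength = 1024  // illustrative

// Split each part into fixed-length records on the executors, keyed by
// (source path, index within the part) so the original order can be restored.
val keyed = parts.flatMap { case (path, stream) =>
  stream.toArray().grouped(recordLength).zipWithIndex.map {
    case (record, i) => ((path, i), record)
  }
}

// binaryFiles does not preserve order between parts, hence the sort.
val ordered = keyed.sortByKey().values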

Alexander

From: Vijayasarathy Kannan [mailto:kvi...@vt.edu]
Sent: Wednesday, May 06, 2015 10:38 AM
To: user@spark.apache.org
Subject: Reading large files

Hi,

Is there a way to read a large file in a parallel/distributed way? I have a 
single large binary file that I currently read on the driver program and then 
distribute to the executors (using groupBy(), etc.). I want to know whether 
there is a way to make the executors each read a specific/unique portion of the 
file, or to create RDDs from multiple portions of the file and finally union them.
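
For reference, this is roughly what I do today (paths and chunk size are 
placeholders), which is why driver memory becomes the bottleneck:

import java.nio.file.{Files, Paths}

// Current approach: the whole file is read on the driver first...
val bytes = Files.readAllBytes(Paths.get("/data/bigfile.bin"))
val chunkSize = 1 << 20  // 1 MB chunks, arbitrary
val chunks = bytes.grouped(chunkSize).toVector

// ...and only then distributed to the executors.
val rdd = sc.parallelize(chunks)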

Thanks.
