SparkContext has two methods for reading binary files: binaryFiles (reads multiple binary files into an RDD, one record per file) and binaryRecords (reads fixed-length binary records from a flat file into an RDD). For example, I have a big binary file split into logical parts, so I can use “binaryFiles”. The possible problem is that the order of records across the parts is not preserved, so I have to do a sortBy afterwards.
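
A minimal sketch of both (the paths, part naming, and record length are made-up placeholders; it assumes the part files are named so that lexicographic order matches logical order):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("binary-read"))

// binaryFiles: each matching file becomes one (path, PortableDataStream) pair
val parts = sc.binaryFiles("hdfs:///data/bigfile-parts/*")

// Order across parts is not preserved, so sort by file name
// (assumes names encode the position: part-00000, part-00001, ...)
val ordered = parts.sortBy { case (path, _) => path }

// binaryRecords: fixed-length records from a flat binary file
val recordLength = 512  // hypothetical record size in bytes
val records = sc.binaryRecords("hdfs:///data/bigfile.bin", recordLength)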

Alexander

From: Vijayasarathy Kannan [mailto:kvi...@vt.edu]
Sent: Wednesday, May 06, 2015 10:38 AM
To: user@spark.apache.org
Subject: Reading large files

Hi,

Is there a way to read a large file in a parallel/distributed way? I have a
single large binary file that I currently read on the driver program and then
distribute to the executors (using groupBy(), etc.). I want to know if there's
a way to make each executor read a specific/unique portion of the file, or to
create RDDs from multiple portions of the file and finally union them.
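
For context, what I do today looks roughly like this (the path and chunk size are placeholders, and the groupBy() details are omitted):

import java.nio.file.{Files, Paths}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("driver-side-read"))

// Read the whole file on the driver -- this is the step I want to avoid
val bytes = Files.readAllBytes(Paths.get("/data/bigfile.bin"))

// Split into fixed-size chunks and ship them to the executors
val chunkSize = 1 << 20  // 1 MiB, a placeholder
val chunks = bytes.grouped(chunkSize).toSeq
val rdd = sc.parallelize(chunks)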

Thanks.
