SparkContext has two methods for reading binary files: binaryFiles (reads a directory of binary files into an RDD of (filename, stream) pairs, one record per file) and binaryRecords (reads a flat binary file into an RDD of fixed-length byte records). For example, I have a big binary file split into logical parts, so I can use binaryFiles. One possible problem is that the order of records across parts is not preserved, so I have to do a sortBy afterwards.
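A minimal sketch of the binaryFiles approach, assuming Scala and a hypothetical directory /data/parts whose part files encode their logical index in the file name (e.g. part-00042.bin):

import org.apache.spark.{SparkConf, SparkContext}

object ReadBinaryParts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ReadBinaryParts"))

    // binaryFiles returns an RDD of (path, PortableDataStream) pairs,
    // one element per file under the directory.
    val parts = sc.binaryFiles("/data/parts")

    // The order of files in the RDD is not guaranteed, so restore the
    // logical order by sorting on the part index encoded in the file
    // name, e.g. ".../part-00042.bin" -> 42 (hypothetical naming scheme).
    val ordered = parts
      .map { case (path, stream) => (path, stream.toArray()) } // read bytes
      .sortBy { case (path, _) =>
        "\\d+".r.findFirstIn(path.split("/").last).map(_.toInt).getOrElse(0)
      }

    ordered.take(1).foreach { case (path, bytes) =>
      println(s"$path: ${bytes.length} bytes")
    }
    sc.stop()
  }
}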
Alexander

From: Vijayasarathy Kannan [mailto:kvi...@vt.edu]
Sent: Wednesday, May 06, 2015 10:38 AM
To: user@spark.apache.org
Subject: Reading large files

Hi,

Is there a way to read a large file in a parallel/distributed way? I have a single large binary file which I currently read on the driver program and then distribute to the executors (using groupBy(), etc.). I want to know if there is a way to have each executor read a specific/unique portion of the file, or to create RDDs from multiple portions of the file and finally union them.

Thanks.
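And if the single big file consists of fixed-length records, binaryRecords distributes the read across the executors by itself, so nothing needs to be read on the driver. A minimal sketch, assuming a hypothetical 16-byte record size and path /data/big.bin:

import org.apache.spark.{SparkConf, SparkContext}

object ReadFixedRecords {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ReadFixedRecords"))

    // binaryRecords splits the file into constant-length byte records;
    // each partition is read directly by the executor that owns it.
    val recordLength = 16 // hypothetical record size in bytes
    val records = sc.binaryRecords("/data/big.bin", recordLength)

    println(s"record count: ${records.count()}")
    sc.stop()
  }
}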