Spark can use any Hadoop InputFormat you specify, and the number of input splits = the number of RDD partitions. You may want to take a deeper look at SparkContext.newAPIHadoopRDD to load your data.
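Something along these lines should work (a minimal sketch only, assuming your existing MR job already has a custom InputFormat; BinaryRecordInputFormat, BusinessObjectWritable and BusinessObject below are hypothetical stand-ins for your own classes):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LoadBinaryFile {
      public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("binary-loader");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // Standard Hadoop configuration; FileInputFormat reads the input path
        // from "mapreduce.input.fileinputformat.inputdir".
        Configuration hadoopConf = new Configuration();
        hadoopConf.set("mapreduce.input.fileinputformat.inputdir",
                       "hdfs:///data/binary-input");

        // Reuse the custom InputFormat from the MR job (hypothetical names here).
        // Each split produced by the InputFormat becomes one RDD partition,
        // just like each split becomes one map task in MR.
        JavaPairRDD<NullWritable, BusinessObjectWritable> raw =
            sc.newAPIHadoopRDD(hadoopConf,
                               BinaryRecordInputFormat.class,
                               NullWritable.class,
                               BusinessObjectWritable.class);

        // Convert the Writables into plain business objects before any caching,
        // since Hadoop RecordReaders may reuse Writable instances.
        JavaRDD<BusinessObject> objects = raw.map(pair -> pair._2().get());

        System.out.println("partitions = " + objects.partitions().size());
        sc.stop();
      }
    }

From there you can process objects with normal Spark transformations instead of the follow-on MR jobs.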
On Sat, May 9, 2015 at 4:48 PM, tog <guillaume.all...@gmail.com> wrote:
> Hi
>
> I have an application that currently runs using MR. It starts by
> extracting information from a proprietary binary file that is copied to
> HDFS. The application creates business objects from the information
> extracted from the binary files. Later those objects are used for further
> processing, again with MR jobs.
>
> I am planning to move towards Spark and I clearly see that I could use
> JavaRDD<businessObjects> for parallel processing. However, it is not yet
> obvious what the process could be to generate this RDD from my binary
> file in parallel.
>
> Today I get parallelism from the split assigned to each of the map tasks
> of a job. Can I mimic such a thing using Spark? All the examples I have
> seen so far use text files, for which I guess the partitions are based
> on a given number of contiguous lines.
>
> Any help or pointer would be appreciated
>
> Cheers
> Guillaume
>
>
> --
> PGP KeyID: 2048R/EA31CFC9 subkeys.pgp.net

--
Best Regards,
Ayan Guha