Thanks for the answer, but this seems to apply to files that have a key-value structure, which mine does not. My file is a generic binary file encoding data from sensors over time. I am simply looking to recreate objects by assigning splits (i.e. contiguous chunks of bytes) to each computing element (the maps?); the outcome of this "filter" would be an RDD of "observations" detected and recreated from that file.
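For what it's worth, if the records turn out to be fixed-length, here is a minimal sketch of what I have in mind using JavaSparkContext.binaryRecords, which spreads whole records over the partitions as byte[]. The 64-byte record length, the field layout and the Observation class are just placeholders for my real format:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.Serializable;
import java.nio.ByteBuffer;

public class SensorObservations {

    // Placeholder business object rebuilt from each chunk of bytes.
    public static class Observation implements Serializable {
        public final long timestamp;
        public final double value;
        public Observation(long timestamp, double value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("binary-sensor-import");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Assumed fixed record length in bytes; each RDD element is one raw record,
        // and records are spread across partitions so the map below runs in parallel.
        int recordLength = 64;
        JavaRDD<byte[]> rawRecords =
            sc.binaryRecords("hdfs:///data/sensors.bin", recordLength);

        // Decode each raw record into an Observation (placeholder field layout).
        JavaRDD<Observation> observations = rawRecords.map(bytes -> {
            ByteBuffer buf = ByteBuffer.wrap(bytes);
            long timestamp = buf.getLong();
            double value = buf.getDouble();
            return new Observation(timestamp, value);
        });

        System.out.println("Recreated " + observations.count() + " observations");
        sc.stop();
    }
}

If the records are not fixed-length I guess I would need something closer to what you suggested, i.e. a custom InputFormat.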
On Saturday, May 9, 2015, ayan guha <guha.a...@gmail.com> wrote:
> Spark uses any InputFormat you specify, and the number of splits = the number
> of RDD partitions. You may want to take a deeper look at
> SparkContext.newAPIHadoopRDD to load your data.
>
>
>
> On Sat, May 9, 2015 at 4:48 PM, tog <guillaume.all...@gmail.com> wrote:
>
>> Hi
>>
>> I have an application that currently runs using MR. It starts by
>> extracting information from a proprietary binary file that is copied to
>> HDFS and creating business objects from the information extracted from
>> the binary file. Later those objects are used for further processing,
>> again with MR jobs.
>>
>> I am planning to move towards Spark, and I clearly see that I could use
>> JavaRDD<businessObjects> for parallel processing. However, it is not yet
>> obvious what the process could be to generate this RDD from my binary
>> file in parallel.
>>
>> Today I get parallelism from the split assigned to each of the map tasks
>> of a job. Can I mimic such a thing using Spark? All the examples I have
>> seen so far use text files, for which I guess the partitions are based
>> on a given number of contiguous lines.
>>
>> Any help or pointer would be appreciated.
>>
>> Cheers
>> Guillaume
>>
>>
>> --
>> PGP KeyID: 2048R/EA31CFC9 subkeys.pgp.net
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>

--
PGP KeyID: 2048R/EA31CFC9 subkeys.pgp.net
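PS: for the archive, this is roughly the kind of sketch I would try for the newAPIHadoopRDD route you suggested, assuming fixed-length records and Hadoop's stock FixedLengthInputFormat. The path and the 64-byte record length are placeholders, and variable-length records would instead need a custom FileInputFormat:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HadoopInputFormatImport {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("newAPIHadoopFile-import"));

        // Assumed fixed record length; FixedLengthInputFormat cuts the file into
        // whole records and yields one (file offset, record bytes) pair per record.
        Configuration hadoopConf = new Configuration();
        FixedLengthInputFormat.setRecordLength(hadoopConf, 64);

        JavaPairRDD<LongWritable, BytesWritable> records = sc.newAPIHadoopFile(
            "hdfs:///data/sensors.bin",
            FixedLengthInputFormat.class,
            LongWritable.class,
            BytesWritable.class,
            hadoopConf);

        // copyBytes() materialises each record, since Hadoop RDDs reuse Writable
        // instances between records.
        JavaRDD<byte[]> rawRecords = records.map(pair -> pair._2().copyBytes());

        System.out.println("records: " + rawRecords.count());
        sc.stop();
    }
}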