Spark can use any Hadoop InputFormat you specify, and the number of input splits = the number of RDD partitions. You may want to take a deeper look at SparkContext.newAPIHadoopRDD to load your data.
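Something along these lines should work (a minimal sketch only, assuming your existing MR job already has a custom InputFormat; BinaryRecordInputFormat, BusinessObjectWritable and BusinessObject below are hypothetical stand-ins for your own classes):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LoadBinaryFile {
      public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("binary-loader");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // Standard Hadoop configuration; FileInputFormat reads the input path
        // from "mapreduce.input.fileinputformat.inputdir".
        Configuration hadoopConf = new Configuration();
        hadoopConf.set("mapreduce.input.fileinputformat.inputdir",
                       "hdfs:///data/binary-input");

        // Reuse the custom InputFormat from the MR job (hypothetical names here).
        // Each split produced by the InputFormat becomes one RDD partition,
        // just like each split becomes one map task in MR.
        JavaPairRDD<NullWritable, BusinessObjectWritable> raw =
            sc.newAPIHadoopRDD(hadoopConf,
                               BinaryRecordInputFormat.class,
                               NullWritable.class,
                               BusinessObjectWritable.class);

        // Convert the Writables into plain business objects before any caching,
        // since Hadoop RecordReaders may reuse Writable instances.
        JavaRDD<BusinessObject> objects = raw.map(pair -> pair._2().get());

        System.out.println("partitions = " + objects.partitions().size());
        sc.stop();
      }
    }

From there you can process objects with normal Spark transformations instead of the follow-on MR jobs.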
On Sat, May 9, 2015 at 4:48 PM, tog <guillaume.all...@gmail.com> wrote:
> Hi
>
> I have an application that currently runs using MR. It starts by
> extracting information from a proprietary binary file that is copied to
> HDFS. The application creates business objects from the information
> extracted from the binary files. Later those objects are used for further
> processing, again with MR jobs.
>
> I am planning to move towards Spark and I clearly see that I could use
> JavaRDD<businessObjects> for parallel processing. However, it is not yet
> obvious what the process could be to generate this RDD from my binary
> file in parallel.
>
> Today I get parallelism from the split assigned to each of the map tasks
> of a job. Can I mimic such a thing using Spark? All the examples I have
> seen so far use text files, for which I guess the partitions are based
> on a given number of contiguous lines.
>
> Any help or pointer would be appreciated
>
> Cheers
> Guillaume
>
>
> --
> PGP KeyID: 2048R/EA31CFC9 subkeys.pgp.net

--
Best Regards,
Ayan Guha