Hi Jamal,

If what you want is to process lots of files in parallel, the best
approach is probably to load all the file names into an array and
parallelize that with sc.parallelize(). Each task then takes a path as
input and can process the file however it wants.
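
For example, here's a minimal sketch of that approach. The "audio"
module and the "duration" attribute are just placeholders taken from
your example below, and this assumes every worker can actually reach
the files (shared filesystem, NFS mount, etc.):

    from pyspark import SparkContext

    sc = SparkContext(appName="mp3-processing")

    # Only the (small) list of path strings is shipped to the cluster;
    # the file contents are read by the tasks themselves.
    paths = ["/data/music/a.mp3", "/data/music/b.mp3"]  # or glob.glob(...)

    def process(path):
        import audio                  # placeholder library from your example
        song = audio.read_mp3(path)   # each task reads one file locally
        return (path, song.duration)  # "duration" stands in for whatever
                                      # your library actually exposes

    results = sc.parallelize(paths).map(process).collect()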

Or you could write the file list to a file, then use sc.textFile() to
read it (one path per line); the rest is pretty much the same as
above.
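
A sketch of that variant, reusing the process() function from above
(the paths.txt location is again a placeholder):

    # Spark splits the list of paths across tasks like any text input;
    # pass a minPartitions hint if the list file is small but the
    # per-file work is heavy, so the work spreads across the cluster.
    paths_rdd = sc.textFile("hdfs:///data/paths.txt", 100)
    results = paths_rdd.map(process).collect()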

It will probably be hard to process each individual file in parallel,
unless mp3 and jpg files can be split into multiple blocks that can be
processed separately. In that case, you'd need a custom (Hadoop) input
format that is able to calculate the splits. But it doesn't sound like
that's what you want.



On Mon, Jun 2, 2014 at 5:02 PM, jamal sasha <jamalsha...@gmail.com> wrote:
> Hi,
>   How does one process data sources other than text?
> Let's say I have millions of mp3 (or jpeg) files and I want to use Spark to
> process them.
> How does one go about it?
>
>
> I have never been able to figure this out.
> Let's say I have this library in Python which works as follows:
>
> import audio
>
> song = audio.read_mp3(filename)
>
> Most of the methods are attached to song, or there is another
> function that takes a "song" object as input.
>
> Maybe the above is just rambling, but how do I use Spark to process (say)
> audio files?
> Thanks



-- 
Marcelo
