Hi Jamal,

If what you want is to process lots of files in parallel, the best approach is probably to load all the file names into an array and parallelize that. Each task will then take a path as input and can process it however it wants.
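Here's a minimal PySpark sketch of that approach. It assumes the hypothetical "audio" library from your message, that every worker can read the paths (a shared filesystem or identical local copies on each node), and made-up path names:

    from pyspark import SparkContext

    sc = SparkContext(appName="process-mp3s")

    # Made-up example: build the list of input paths on the driver.
    # In practice you might use os.listdir() or glob.glob() instead.
    paths = ["/data/songs/%06d.mp3" % i for i in range(1000000)]

    def process(path):
        import audio                   # the hypothetical library from your message
        song = audio.read_mp3(path)    # runs on whichever worker gets this path
        # Return whatever you extract from 'song'; if the song object
        # itself isn't picklable, return plain features instead.
        return (path, song)

    # Distribute the path list across many tasks and process each file.
    results = sc.parallelize(paths, 1000).map(process)
    results.saveAsTextFile("hdfs:///data/song-results")

Using a high number of slices (1000 here) keeps the cluster busy even if some files take much longer to process than others.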
Or you could write the file list to a file, one path per line, and open it with sc.textFile(); the rest is pretty much the same as above (see the sketch at the end of this message).

It will probably be hard to process each individual file in parallel, unless mp3 and jpg files can be split into multiple blocks that can be processed separately. In that case, you'd need a custom Hadoop InputFormat that can compute the splits. But it doesn't sound like that's what you want.

On Mon, Jun 2, 2014 at 5:02 PM, jamal sasha <jamalsha...@gmail.com> wrote:
> Hi,
> How does one process data sources other than text? Let's say I have
> millions of mp3 (or jpeg) files and I want to use Spark to process them.
> How does one go about it?
>
> I have never been able to figure this out. Let's say I have this library
> in Python which works like the following:
>
>   import audio
>   song = audio.read_mp3(filename)
>
> Then most of the methods are attached to "song", or maybe there is another
> function which takes the "song" type as input.
>
> Maybe the above is just rambling, but how do I use Spark to process (say)
> audio files?
> Thanks

--
Marcelo
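P.S. For completeness, here is roughly what the sc.textFile() variant might look like, reusing the 'process' function from the sketch above; "file_list.txt" and the output path are made-up names:

    # Build the RDD of paths from a text file (one path per line) instead
    # of a driver-side list; everything downstream stays the same.
    paths = sc.textFile("hdfs:///data/file_list.txt")
    results = paths.map(process)
    results.saveAsTextFile("hdfs:///data/results-from-list")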