The idea is simple. If you want to run something on a collection of files, do (in pseudo-python):
def processSingleFile(path):
    # Your code to process a file
    ...

files = ["file1", "file2"]
sc.parallelize(files).foreach(processSingleFile)

On Mon, Jun 2, 2014 at 5:16 PM, jamal sasha <jamalsha...@gmail.com> wrote:
> Hi Marcelo,
> Thanks for the response.
> I am not sure I understand. Can you elaborate a bit?
> So, for example, let's take a look at this example:
> http://pythonvision.org/basic-tutorial
>
> import mahotas
> dna = mahotas.imread('dna.jpeg')
> dnaf = ndimage.gaussian_filter(dna, 8)
>
> But instead of one dna.jpeg, let's say I have millions of dna.jpeg files,
> and I want to run the above logic on all of them.
> How should I go about this?
> Thanks
>
> On Mon, Jun 2, 2014 at 5:09 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>>
>> Hi Jamal,
>>
>> If what you want is to process lots of files in parallel, the best
>> approach is probably to load all file names into an array and
>> parallelize that. Then each task will take a path as input and can
>> process it however it wants.
>>
>> Or you could write the file list to a file, and then use sc.textFile()
>> to open it (assuming one path per line), and the rest is pretty much
>> the same as above.
>>
>> It will probably be hard to process each individual file in parallel,
>> unless mp3 and jpg files can be split into multiple blocks that can be
>> processed separately. In that case, you'd need a custom (Hadoop) input
>> format that is able to calculate the splits. But it doesn't sound like
>> that's what you want.
>>
>> On Mon, Jun 2, 2014 at 5:02 PM, jamal sasha <jamalsha...@gmail.com> wrote:
>> > Hi,
>> > How does one process data sources other than text?
>> > Let's say I have millions of mp3 (or jpeg) files and I want to use
>> > spark to process them.
>> > How does one go about it?
>> >
>> >
>> > I have never been able to figure this out.
>> > Let's say I have this library in python which works like the following:
>> >
>> > import audio
>> >
>> > song = audio.read_mp3(filename)
>> >
>> > Then most of the methods are attached to song, or maybe there is
>> > another function which takes the "song" type as an input.
>> >
>> > Maybe the above is just rambling... but how do I use spark to process
>> > (say) audio files.
>> > Thanks
>>
>> --
>> Marcelo

--
Marcelo
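[Editor's note] Putting the two halves of the thread together, a minimal sketch of the file-list approach applied to the mahotas example might look like the following. This assumes a live SparkContext `sc`, that mahotas and scipy are installed on every worker node, and illustrative file names — none of these come from the original messages:

```python
def process_single_file(path):
    """Run the per-image logic from the thread on one file.

    Imports happen inside the function so that each Spark worker
    resolves them locally (assumes mahotas and scipy are installed
    on every worker node).
    """
    import mahotas
    from scipy import ndimage

    dna = mahotas.imread(path)
    dnaf = ndimage.gaussian_filter(dna, 8)
    # Persist the result somewhere the driver can find it later;
    # the output naming here is purely illustrative.
    mahotas.imsave(path + ".filtered.jpeg", dnaf.astype("uint8"))

# Driver side: build the list of paths and fan it out. Paths are
# illustrative; in practice you might read them from a manifest
# file with sc.textFile(), as Marcelo suggests above.
files = ["dna-%04d.jpeg" % i for i in range(3)]

# Requires an active SparkContext `sc`; shown but not executed here:
# sc.parallelize(files, len(files)).foreach(process_single_file)
```

One detail worth noting: each file must live on storage that every worker can reach (HDFS, NFS, S3, etc.), since the task processing a given path may run on any node in the cluster.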