The idea is simple. If you want to run something on a collection of
files, do (in pseudo-Python, where sc is your SparkContext):

def processSingleFile(path):
    # Your code to process a single file goes here.
    pass

files = ["file1", "file2"]
sc.parallelize(files).foreach(processSingleFile)
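The same "parallelize a list of paths" pattern can be sketched locally without a cluster; here concurrent.futures stands in for Spark's scheduling, and the per-file body is just a placeholder so the sketch is self-contained:

```python
# Local sketch of the "parallelize a list of file names" pattern.
# concurrent.futures plays the role of Spark here: each path is handed
# to processSingleFile independently, as foreach() would do per task.
from concurrent.futures import ThreadPoolExecutor

def processSingleFile(path):
    # Placeholder for the real per-file work (e.g. mahotas.imread
    # plus a filter); returning the path length keeps it runnable.
    return len(path)

files = ["file1", "file2"]

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(processSingleFile, files))

print(results)  # [5, 5]
```

With Spark the only change is that the file list is distributed across the cluster and each task opens its own path.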


On Mon, Jun 2, 2014 at 5:16 PM, jamal sasha <jamalsha...@gmail.com> wrote:
> Hi Marcelo,
>   Thanks for the response.
> I am not sure I understand. Can you elaborate a bit?
> For example, let's take a look at this tutorial:
> http://pythonvision.org/basic-tutorial
>
> import mahotas
> from scipy import ndimage
>
> dna = mahotas.imread('dna.jpeg')
> dnaf = ndimage.gaussian_filter(dna, 8)
>
> But instead of a single dna.jpeg, let's say I have millions of such files
> and I want to run the above logic on all of them.
> How should I go about this?
> Thanks
>
> On Mon, Jun 2, 2014 at 5:09 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>>
>> Hi Jamal,
>>
>> If what you want is to process lots of files in parallel, the best
>> approach is probably to load all file names into an array and
>> parallelize that. Then each task will take a path as input and can
>> process it however it wants.
>>
>> Or you could write the file list to a file and then use sc.textFile()
>> to read it (assuming one path per line); the rest is pretty much the
>> same as above.
>>
>> It will probably be hard to process each individual file in parallel,
>> unless mp3 and jpg files can be split into multiple blocks that can be
>> processed separately. In that case, you'd need a custom (Hadoop) input
>> format that is able to calculate the splits. But it doesn't sound like
>> that's what you want.
>>
>>
>>
>> On Mon, Jun 2, 2014 at 5:02 PM, jamal sasha <jamalsha...@gmail.com> wrote:
>> > Hi,
>> >   How does one process data sources other than text?
>> > Let's say I have millions of mp3 (or jpeg) files and I want to use
>> > spark to process them?
>> > How does one go about it?
>> >
>> >
>> > I have never been able to figure this out.
>> > Let's say I have this library in python which works as follows:
>> >
>> > import audio
>> >
>> > song = audio.read_mp3(filename)
>> >
>> > Then most of the methods are attached to song, or maybe there is
>> > another function which takes the "song" type as an input.
>> >
>> > Maybe the above is just rambling, but how do I use Spark to process
>> > (say) audio files.
>> > Thanks
>>
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo
