Thanks. Let me go through it.

On Mon, Jun 2, 2014 at 5:15 PM, Philip Ogren <philip.og...@oracle.com>
wrote:

> I asked a question related to Marcelo's answer a few months ago. The
> discussion there may be useful:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/RDD-URI-td1054.html
>
>
>
> On 06/02/2014 06:09 PM, Marcelo Vanzin wrote:
>
>> Hi Jamal,
>>
>> If what you want is to process lots of files in parallel, the best
>> approach is probably to load all file names into an array and
>> parallelize that. Then each task will take a path as input and can
>> process it however it wants.
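>>
>> A minimal PySpark sketch of that first approach (process_file is a
>> hypothetical handler and the paths are made up, not part of any API):
>>
>> from pyspark import SparkContext
>>
>> sc = SparkContext(appName="ProcessFiles")
>>
>> def process_file(path):
>>     # Hypothetical per-file logic: each task receives one path and
>>     # opens the file itself. The path must be readable from every
>>     # worker node (e.g. a shared filesystem mount).
>>     with open(path, "rb") as f:
>>         data = f.read()
>>     return (path, len(data))
>>
>> paths = ["/data/a.mp3", "/data/b.mp3"]  # made-up example paths
>> results = sc.parallelize(paths).map(process_file).collect()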
>>
>> Or you could write the file list to a file, and then use sc.textFile()
>> to open it (assuming one path per line), and the rest is pretty much
>> the same as above.
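>>
>> A sketch of that variant, assuming a made-up list file with one path
>> per line:
>>
>> # Each line of file_list.txt becomes one record, i.e. one path.
>> paths = sc.textFile("hdfs:///tmp/file_list.txt")
>> results = paths.map(process_file).collect()  # same hypothetical handler as above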
>>
>> It will probably be hard to process each individual file in parallel,
>> unless mp3 and jpg files can be split into multiple blocks that can be
>> processed separately. In that case, you'd need a custom (Hadoop) input
>> format that is able to calculate the splits. But it doesn't sound like
>> that's what you want.
>>
>>
>>
>> On Mon, Jun 2, 2014 at 5:02 PM, jamal sasha <jamalsha...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>    How does one process data sources other than text?
>>> Let's say I have millions of mp3 (or jpeg) files and I want to use Spark
>>> to process them. How does one go about it?
>>>
>>>
>>> I have never been able to figure this out.
>>> Let's say I have a library in Python that works as follows:
>>>
>>> import audio  # hypothetical mp3-processing library
>>>
>>> song = audio.read_mp3(filename)  # filename: path to an mp3 file
>>>
>>> Then most of the methods are attached to song, or perhaps there is another
>>> function that takes the "song" type as input.
>>>
>>> Maybe the above is just rambling, but how do I use Spark to process (say)
>>> audio files?
>>> Thanks
>>>
