Thanks. Let me go through it.
On Mon, Jun 2, 2014 at 5:15 PM, Philip Ogren <philip.og...@oracle.com> wrote:
> I asked a question related to Marcelo's answer a few months ago. The
> discussion there may be useful:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/RDD-URI-td1054.html
>
>
> On 06/02/2014 06:09 PM, Marcelo Vanzin wrote:
>
>> Hi Jamal,
>>
>> If what you want is to process lots of files in parallel, the best
>> approach is probably to load all the file names into an array and
>> parallelize that. Then each task will take a path as input and can
>> process it however it wants.
>>
>> Or you could write the file list to a file and then use sc.textFile()
>> to open it (assuming one path per line); the rest is pretty much the
>> same as above.
>>
>> It will probably be hard to process each individual file in parallel,
>> unless mp3 and jpg files can be split into multiple blocks that can be
>> processed separately. In that case, you'd need a custom (Hadoop) input
>> format that is able to calculate the splits. But it doesn't sound like
>> that's what you want.
>>
>> On Mon, Jun 2, 2014 at 5:02 PM, jamal sasha <jamalsha...@gmail.com> wrote:
>>
>>> Hi,
>>> How does one process data sources other than text?
>>> Let's say I have millions of mp3 (or jpeg) files and I want to use
>>> Spark to process them. How does one go about it?
>>>
>>> I have never been able to figure this out. Let's say I have a library
>>> in Python which works like the following:
>>>
>>> import audio
>>>
>>> song = audio.read_mp3(filename)
>>>
>>> Then most of the methods are attached to song, or maybe there is
>>> another function which takes the "song" type as input.
>>>
>>> Maybe the above is just rambling, but how do I use Spark to process
>>> (say) audio files?
>>> Thanks
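
A minimal PySpark sketch of Marcelo's first suggestion (parallelize the
list of paths, let each task open its own file). The `audio` module is
the hypothetical mp3 library from the question; `song.duration`, the
glob pattern, and the app name are made up for illustration. The paths
have to be readable from every worker (a shared or distributed
filesystem), since each task opens its file locally.

import glob

from pyspark import SparkContext

import audio  # hypothetical mp3 library from the question


def process_song(path):
    # Runs on a worker: open one mp3 and reduce it to something small
    # and serializable (don't ship the raw audio back to the driver).
    song = audio.read_mp3(path)
    return (path, song.duration)  # `duration` is an assumed attribute


sc = SparkContext(appName="ProcessMp3s")

# Build the list of paths on the driver; the glob pattern is made up.
paths = glob.glob("/data/mp3s/*.mp3")

# Parallelize the paths themselves, so each task receives a path rather
# than file contents. The second argument sets the number of partitions,
# i.e. how finely the work is spread across the cluster.
results = sc.parallelize(paths, 100).map(process_song).collect()

for path, duration in results:
    print(path, duration)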
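And the sc.textFile() variant: put one path per line in a text file
(the file name below is made up) and read that; downstream it is the
same map over paths, reusing sc and process_song from the sketch above.
One caveat: a small list file may come back in very few partitions, so
the minPartitions argument is worth setting explicitly.

# Same idea, but the path list lives in a text file, one path per line.
# sc.textFile() yields an RDD of lines, i.e. an RDD of paths.
paths_rdd = sc.textFile("/data/mp3_file_list.txt", minPartitions=100)

results = (paths_rdd
           .map(lambda line: line.strip())  # drop stray whitespace
           .filter(lambda path: path)       # skip blank lines
           .map(process_song)               # same worker function as above
           .collect())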