Processing audio/video/images

2014-06-02 Thread jamal sasha
Hi, how does one process data sources other than text? Let's say I have millions of mp3 (or jpeg) files and I want to use Spark to process them. How does one go about it? I have never been able to figure this out. Let's say I have this library in Python which works like the following: import audi

Re: Processing audio/video/images

2014-06-02 Thread jamal sasha
to calculate the splits. But it doesn't sound like
> that's what you want.
>
> On Mon, Jun 2, 2014 at 5:02 PM, jamal sasha wrote:
> > Hi,
> > How does one process data sources other than text?
> > Let's say I have millions of mp3 (or jpeg) files and

Re: Processing audio/video/images

2014-06-02 Thread jamal sasha
h individual file in parallel,
>> unless mp3 and jpg files can be split into multiple blocks that can be
>> processed separately. In that case, you'd need a custom (Hadoop) input
>> format that is able to calculate the splits. But it doesn't sound like
>> that's wha

Re: Processing audio/video/images

2014-06-02 Thread jamal sasha
file
>
> files = [ "file1", "file2" ]
> sc.parallelize(files).foreach(processSingleFile)
>
> On Mon, Jun 2, 2014 at 5:16 PM, jamal sasha wrote:
> > Hi Marcelo,
> > Thanks for the response.
> > I am not sure I understand. Can you elaborat

Re: Processing audio/video/images

2014-06-19 Thread jamal sasha
ar foo this is text2

rdd_files = sc.parallelize(files).foreach(read_file)

Now, what I am hoping to get from this is the lines (probably unordered). But rdd_files.take(2) doesn't return anything (the take method is not defined on this). How do I do this?

On Mon, Jun 2, 2014 at 5:29 PM, jamal sasha wrote
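
A possible explanation for the take() problem in this message: in PySpark, foreach() is an action that runs a function for its side effects and returns None, so there is no RDD to call take() on. A transformation such as flatMap() returns a new RDD instead. The sketch below is only an illustration under assumptions: the body of read_file is a guess at what the poster's function does, lines_rdd is a hypothetical helper name, and sc is assumed to be an existing SparkContext whose workers can all open the listed paths.

```python
def read_file(path):
    """Return the lines of one text file, without trailing newlines.

    Assumption: this mirrors what the poster's read_file does; the
    original function body is not shown in the thread.
    """
    with open(path) as handle:
        return [line.rstrip("\n") for line in handle]


def lines_rdd(sc, files):
    """Build an RDD of lines from a list of file paths.

    `sc` is assumed to be an existing pyspark SparkContext.
    foreach() returns None, so its result has no take() method;
    flatMap() returns a new RDD holding every line of every file.
    """
    return sc.parallelize(files).flatMap(read_file)
```

With a live SparkContext, lines_rdd(sc, files).take(2) would then return two lines, in no guaranteed order across files. Using map() instead of flatMap() would give one list of lines per file rather than one element per line.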

Computing cosine similarity using pyspark

2014-05-22 Thread jamal sasha
Hi, I have a bunch of vectors like [0.1234, -0.231, 0.23131] and so on, and I want to compute cosine similarity and Pearson correlation using PySpark. How do I do this? Any ideas? Thanks
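
One way to approach this question, sketched under assumptions rather than taken from any reply in the thread: both measures can be written as plain Python functions and then mapped over an RDD of vectors. The function names below and the similarities_to helper are hypothetical, sc is assumed to be an existing SparkContext, and the vectors are assumed to be equal-length, non-zero lists of floats.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors.

    Raises ZeroDivisionError if either vector has zero length.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation, computed as the cosine similarity of the
    mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_similarity(centered_a, centered_b)


def similarities_to(sc, vectors, query):
    """Pair each vector with its cosine similarity to `query`.

    `sc` is assumed to be an existing pyspark SparkContext; the result
    is an RDD of (vector, similarity) tuples.
    """
    return sc.parallelize(vectors).map(
        lambda v: (v, cosine_similarity(v, query))
    )
```

For all-pairs similarity, the same functions could be applied to the pairs produced by the RDD's cartesian() method, though that grows quadratically with the number of vectors.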