Hi,
How does one process data sources other than text?
Let's say I have millions of mp3 (or jpeg) files and I want to use Spark to
process them. How does one go about it?
I have never been able to figure this out.
Let's say I have a Python library which works like the following:
import audi
> ... each individual file in parallel, unless mp3 and jpg files can be split
> into multiple blocks that can be processed separately. In that case, you'd
> need a custom (Hadoop) input format that is able to calculate the splits.
> But it doesn't sound like that's what you want.
>
> On Mon, Jun 2, 2014 at 5:02 PM, jamal sasha wrote:
> > Hi,
> > How does one process data sources other than text?
> > Let's say I have millions of mp3 (or jpeg) files and I want to use Spark
> > to process them?
> ... file, e.g.:
>
> files = [ "file1", "file2" ]
> sc.parallelize(files).foreach(processSingleFile)
>
> On Mon, Jun 2, 2014 at 5:16 PM, jamal sasha wrote:
> > Hi Marcelo,
> > Thanks for the response.
> > I am not sure I understand. Can you elaborate?
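To make the parallelize-plus-per-file-function suggestion above concrete, here is a minimal self-contained sketch, assuming the files sit on a path every worker can read. extract_features and the example paths are hypothetical stand-ins for whatever audio/image library call is actually needed:

from pyspark import SparkContext
import os

sc = SparkContext(appName="process-binary-files")

def extract_features(path):
    # Hypothetical per-file work: just returns the file size here,
    # but this is where the mp3/jpeg library call would go.
    return (path, os.path.getsize(path))

paths = ["/data/audio/file1.mp3", "/data/audio/file2.mp3"]

# Parallelize the list of paths and run the per-file function on the workers.
# Using map (rather than foreach) keeps the results as an RDD.
features = sc.parallelize(paths).map(extract_features)
print(features.collect())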
Say the files contain lines like:
...ar foo
this is text2
and I try:
rdd_files = sc.parallelize(files).foreach(read_file)
Now, what I am hoping to get from this is the lines (probably unordered).
But rdd_files.take(2) doesn't return anything (the take method is not defined
on this).
How do I do this?
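For what it's worth, take() fails above because foreach() is an action that returns nothing; the per-file results have to come back through a transformation such as map or flatMap. A rough sketch, assuming read_file is a helper that returns the lines of one file readable from every worker:

def read_file(path):
    # Hypothetical helper: return the lines of a single file as a list
    with open(path) as f:
        return f.read().splitlines()

files = ["file1", "file2"]

# flatMap keeps one RDD element per line; foreach would just return None
lines = sc.parallelize(files).flatMap(read_file)
print(lines.take(2))   # two lines, possibly in any order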
On Mon, Jun 2, 2014 at 5:29 PM, jamal sasha wrote:
Hi,
I have a bunch of vectors like
[0.1234, -0.231, 0.23131]
and so on, and I want to compute cosine similarity and Pearson correlation
using PySpark.
How do I do this? Any ideas?
Thanks
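One rough way to do this with core PySpark plus NumPy. The sample vectors and the all-pairs cartesian approach below are illustrative assumptions (for correlation matrices, newer Spark releases also ship pyspark.mllib.stat.Statistics.corr):

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="vector-similarity")

# Toy stand-ins for the vectors in the question, keyed by an id
vectors = sc.parallelize([
    (0, np.array([0.1234, -0.231, 0.23131])),
    (1, np.array([0.4, 0.1, -0.2])),
    (2, np.array([-0.05, 0.3, 0.11])),
])

def cosine(u, v):
    # Cosine similarity of two dense vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson(u, v):
    # Pearson correlation of two dense vectors
    return float(np.corrcoef(u, v)[0, 1])

# All distinct pairs via a cartesian product (fine for small data; for
# millions of vectors you would need something smarter than all-pairs)
pairs = vectors.cartesian(vectors).filter(lambda p: p[0][0] < p[1][0])
sims = pairs.map(lambda p: ((p[0][0], p[1][0]),
                            cosine(p[0][1], p[1][1]),
                            pearson(p[0][1], p[1][1])))
print(sims.collect())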