So... here is my experimental code to get a feel for it:
def read_file(filename):
    with open(filename) as f:
        lines = [line for line in f]
    return lines
files = ["/somepath/.../test1.txt", "/somepath/.../test2.txt"]
test1.txt contains:
foo bar
this is test1

test2.txt contains:
bar foo
this is text2
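For reference, here is a self-contained, runnable version of the experiment above; since the real /somepath/... locations are placeholders, it creates the two test files in a temporary directory (the temp-file setup is mine, not part of the original):

```python
import os
import tempfile

def read_file(filename):
    # Return all lines of the file as a list
    with open(filename) as f:
        return [line for line in f]

# Stand-ins for the real paths, created locally so the sketch runs anywhere
tmpdir = tempfile.mkdtemp()
contents = {
    "test1.txt": "foo bar\nthis is test1\n",
    "test2.txt": "bar foo\nthis is text2\n",
}
files = []
for name, text in contents.items():
    path = os.path.join(tmpdir, name)
    with open(path, "w") as f:
        f.write(text)
    files.append(path)

all_lines = [read_file(p) for p in files]
print(all_lines[0])  # lines of test1.txt, newlines included
```

On a cluster, the list comprehension on the last step is what sc.parallelize(files) would distribute.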
Phoofff.. (Mind blown)...
Thank you sir.
This is awesome
On Mon, Jun 2, 2014 at 5:23 PM, Marcelo Vanzin wrote:
The idea is simple. If you want to run something on a collection of
files, do (in pseudo-python):
def processSingleFile(path):
    # Your code to process a file
    ...

files = ["file1", "file2"]
sc.parallelize(files).foreach(processSingleFile)
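A fleshed-out local sketch of the same pattern; a thread pool stands in for sc.parallelize(...).foreach(...), and the word-count body and throwaway files are my assumptions, not part of Marcelo's example:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def processSingleFile(path):
    # Example per-file work (assumption): count the words in the file
    with open(path) as f:
        return sum(len(line.split()) for line in f)

# Create two throwaway files so the sketch is runnable
tmpdir = tempfile.mkdtemp()
files = []
for i, text in enumerate(["foo bar\n", "one two three\n"]):
    p = os.path.join(tmpdir, f"file{i + 1}.txt")
    with open(p, "w") as f:
        f.write(text)
    files.append(p)

# Local stand-in for sc.parallelize(files).foreach(processSingleFile);
# on a real cluster each task would receive one path
with ThreadPoolExecutor() as pool:
    counts = list(pool.map(processSingleFile, files))
print(counts)  # [2, 3]
```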
On Mon, Jun 2, 2014 at 5:16 PM, jamal sasha wrote:
Hi Marcelo,
Thanks for the response.
I am not sure I understand; can you elaborate a bit?
For example, let's take a look at this tutorial:
http://pythonvision.org/basic-tutorial
import mahotas
from scipy import ndimage  # needed for gaussian_filter below

dna = mahotas.imread('dna.jpeg')
dnaf = ndimage.gaussian_filter(dna, 8)
But instead of just dna.jpeg, let's say I have millions of such images?
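Following Marcelo's pattern, the per-image work can be wrapped in a function that takes only a path. Here is a minimal stdlib-only sketch; reading raw bytes stands in for mahotas.imread plus the Gaussian filter (which need the image libraries installed), and the fake JPEG file is made up for illustration:

```python
import os
import tempfile

def process_image(path):
    # Stand-in for real image work (e.g. mahotas.imread + ndimage filter):
    # read the raw bytes and return (filename, size in bytes)
    with open(path, "rb") as f:
        data = f.read()
    return (os.path.basename(path), len(data))

# Throwaway "image" file so the sketch runs anywhere
tmpdir = tempfile.mkdtemp()
img = os.path.join(tmpdir, "dna.jpeg")
with open(img, "wb") as f:
    f.write(b"\xff\xd8\xff" + b"\x00" * 100)  # fake JPEG magic + padding

result = process_image(img)
print(result)  # ('dna.jpeg', 103)
```

With millions of images, process_image is exactly the kind of function you would hand to sc.parallelize(paths).foreach(...).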
Thanks. Let me go thru it.
On Mon, Jun 2, 2014 at 5:15 PM, Philip Ogren wrote:
I asked a question related to Marcelo's answer a few months ago. The
discussion there may be useful:
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-URI-td1054.html
On 06/02/2014 06:09 PM, Marcelo Vanzin wrote:
Hi Jamal,
If what you want is to process lots of files in parallel, the best
approach is probably to load all file names into an array and
parallelize that. Then each task will take a path as input and can
process it however it wants.
Or you could write the file list to a file, and then use sc.textFile to read it, treating each line as a path.
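That second suggestion can be sketched locally like this; reading the list file line by line stands in for sc.textFile, and the per-file line count, file names, and temp paths are my assumptions:

```python
import os
import tempfile

def process_path(path):
    # Per-file work (assumption): return the file's line count
    with open(path) as f:
        return sum(1 for _ in f)

# Two data files plus a list file naming them
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(2):
    p = os.path.join(tmpdir, f"data{i}.txt")
    with open(p, "w") as f:
        f.write("a\nb\n" * (i + 1))  # 2 lines, then 4 lines
    paths.append(p)

list_file = os.path.join(tmpdir, "files.txt")
with open(list_file, "w") as f:
    f.write("\n".join(paths) + "\n")

# Local stand-in for sc.textFile(list_file).map(process_path);
# each line of the list file is one path to process
with open(list_file) as f:
    counts = [process_path(line.strip()) for line in f]
print(counts)  # [2, 4]
```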
Hi,
How does one process data sources other than text?
Let's say I have millions of mp3 (or jpeg) files and I want to use Spark to
process them. How does one go about it?
I have never been able to figure this out.
Let's say I have a library in Python which works as follows:
import audi