Re: Hadoop streaming like feature for Spark

2014-03-20 Thread Ewen Cheslack-Postava
Take a look at RDD.pipe(). You could also accomplish the same thing using RDD.mapPartitions, to which you pass a function that processes the iterator for each partition rather than processing each element individually. This lets you start up only as many processes as there are partitions, pipe the …
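As a rough pyspark sketch of both approaches (the command, path, and per-partition logic below are just illustrative stand-ins):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "PipeExample")
    lines = sc.textFile("hdfs:///data/input.txt")  # illustrative path

    # RDD.pipe(): forks one external process per partition and streams the
    # partition's elements through its stdin/stdout, Hadoop-streaming style.
    piped = lines.pipe("wc -l")

    # RDD.mapPartitions(): the function receives the whole partition as an
    # iterator, so any per-process setup happens once per partition.
    def per_partition(iterator):
        for line in iterator:
            yield line.upper()

    mapped = lines.mapPartitions(per_partition)
    print(piped.collect())
    print(mapped.take(5))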

Re: Running Spark on a single machine

2014-03-16 Thread Ewen Cheslack-Postava
Those pages include instructions for running locally: "Note that all of the sample programs take a parameter specifying the cluster URL to connect to. This can be a URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing."
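For instance, a minimal sketch of passing a local master URL when constructing the context (the app name and thread count are arbitrary):

    from pyspark import SparkContext

    # "local" runs with one thread; "local[4]" runs with four threads,
    # all on the single machine.
    sc = SparkContext("local[4]", "SingleMachineApp")
    print(sc.parallelize(range(100)).sum())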

Re: new user question on using scala collections inside RDDs

2014-03-14 Thread Ewen Cheslack-Postava
Code in a transformation (i.e. inside the function passed to RDD.map() or RDD.filter()) will run on the workers, not the driver, and it will run in parallel. In Spark, the driver actually doesn't do much -- it just builds up a description of the computation to be performed and then sends it off to the …
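A small sketch of that split (names are arbitrary): the lambdas below are shipped to and run on the workers, while the driver only records the lineage until an action is called.

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "WhereCodeRuns")
    numbers = sc.parallelize(range(10))

    # These functions run on the workers, in parallel, not on the driver.
    evens_squared = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

    # Only this action brings results back to the driver.
    print(evens_squared.collect())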

Re: Are all transformations lazy?

2014-03-11 Thread Ewen Cheslack-Postava
…computation transformations. Maybe this is due to my ignorance of how Scala works, but when I look at the code, I see that the function is applied to the elements of the RDD when I call distinct -- or is it not applied immediately? How does the returned RDD 'keep track of the operation'?

Re: Are all transformations lazy?

2014-03-11 Thread Ewen Cheslack-Postava
You should probably be asking the opposite question: why do you think it *should* be applied immediately? Since the driver program hasn't requested any data back (distinct generates a new RDD; it doesn't return any data), there's no need to actually compute anything yet. As the documentation d…
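A quick sketch of the point (names are arbitrary): distinct() only records the operation; nothing runs until an action like count() asks for data.

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "LazyExample")
    numbers = sc.parallelize([1, 2, 2, 3, 3, 3])

    # No computation happens here; the returned RDD just keeps track of
    # the distinct() operation in its lineage.
    deduped = numbers.distinct()

    # The count() action is what actually triggers the computation.
    print(deduped.count())  # 3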

Re: specify output format using pyspark

2014-02-26 Thread Ewen Cheslack-Postava
You need to convert it to the format you want yourself. The output you're seeing is just the automatic conversion of your data by unicode(). -Ewen  Chengi Liu February 26, 2014 at 9:43 AM  Hi, how do we save data to HDFS using pyspark in the "right" format? I use: counts = cou…
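A self-contained sketch of what "convert it yourself" might look like (the counts RDD contents, the tab-separated format, and the output path are just stand-ins):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "SaveFormatExample")
    counts = sc.parallelize([("spark", 3), ("hadoop", 1)])  # stand-in pair RDD

    # Format each (key, value) pair explicitly instead of relying on the
    # default unicode() rendering of the tuple.
    formatted = counts.map(lambda kv: "%s\t%d" % (kv[0], kv[1]))
    formatted.saveAsTextFile("hdfs:///output/word_counts")  # illustrative path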

Re: Dealing with headers in csv file pyspark

2014-02-26 Thread Ewen Cheslack-Postava
You must be parsing each line of the file at some point anyway, so adding a step to filter out the header should work fine. It'll get executed at the same time as your parsing/conversion to ints, so there's no significant overhead aside from the check itself. For standalone programs, there's a …
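One common way to sketch that filtering step in pyspark (the path and parsing are illustrative, and this assumes the header is the first line of the file):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "CsvHeaderExample")
    lines = sc.textFile("hdfs:///data/input.csv")  # illustrative path
    header = lines.first()                         # assumes the header is line one

    # The header check runs alongside the parsing, so the extra cost is
    # just the comparison itself.
    rows = (lines.filter(lambda line: line != header)
                 .map(lambda line: [int(x) for x in line.split(",")]))
    print(rows.take(3))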

Re: Filter on Date by comparing

2014-02-24 Thread Ewen Cheslack-Postava
Or use RDD.filterWith to create whatever you need out of serializable parts, so the construction only runs once per partition.  Andrew Ash February 24, 2014 at 7:17 PM  This is because Joda's DateTimeFormatter is not serializable (doesn't implement the empty Serializable interface) ht…
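filterWith is part of the Scala RDD API; as a rough illustration of the same once-per-partition idea in pyspark, mapPartitions can build any non-serializable pieces inside the partition function (the CSV layout, date format, and cutoff below are all assumptions):

    import datetime
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "DateFilterExample")
    lines = sc.textFile("hdfs:///data/events.csv")  # illustrative path

    def keep_recent(iterator):
        # Construct the parsing state once per partition instead of shipping
        # a non-serializable object from the driver or rebuilding it per record.
        cutoff = datetime.datetime(2014, 1, 1)  # arbitrary cutoff
        for line in iterator:
            # Assumes the date is the first comma-separated field, ISO formatted.
            if datetime.datetime.strptime(line.split(",")[0], "%Y-%m-%d") >= cutoff:
                yield line

    recent = lines.mapPartitions(keep_recent)
    print(recent.take(5))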