Re: Pyspark DataFrame TypeError

2015-09-08 Thread Prabeesh K.
Thanks for the reply. After a rebuild it now looks good. On 8 September 2015 at 22:38, Davies Liu wrote: > I tried with Python 2.7/3.4 and Spark 1.4.1/1.5-RC3, they all work as > expected: > > ``` > >>> from pyspark.mllib.linalg import Vectors > >>> df = sqlContext.createDataFrame([(1.0, Vectors.d

Re: groupByKey() and keys with many values

2015-09-08 Thread Reynold Xin
On Tue, Sep 8, 2015 at 6:51 AM, Antonio Piccolboni wrote: > As far as the DB writes go, remember Spark can retry a computation, so your > writes have to be idempotent (see this thread, in > which Reynold is a bit optimistic about f
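A minimal PySpark sketch of the idempotence point being discussed, with sqlite3 standing in for a real database (table name and path are illustrative): because Spark may re-run a task, each write is keyed and upserted, so a retry rewrites the same rows rather than duplicating them.

```
# Idempotent writes: Spark may re-run a task after a failure, so each write is
# keyed and uses an upsert, making a retry overwrite rather than duplicate.
import sqlite3
from pyspark import SparkContext

sc = SparkContext("local[2]", "idempotent-writes-sketch")

totals = (sc.parallelize([("user1", 10), ("user2", 7), ("user1", 3)])
            .reduceByKey(lambda a, b: a + b))

def write_partition(rows):
    # sqlite3 stands in for a real store; INSERT OR REPLACE is the upsert.
    conn = sqlite3.connect("/tmp/totals.db", timeout=30)
    conn.execute("CREATE TABLE IF NOT EXISTS totals (k TEXT PRIMARY KEY, v INT)")
    conn.executemany("INSERT OR REPLACE INTO totals (k, v) VALUES (?, ?)", rows)
    conn.commit()
    conn.close()

totals.foreachPartition(write_partition)
```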

Re: Deserializing JSON into Scala objects in Java code

2015-09-08 Thread Marcelo Vanzin
Hi Kevin, This code works fine for me (output is "List(1, 2)"): import org.apache.spark.status.api.v1.RDDPartitionInfo; import com.fasterxml.jackson.databind.ObjectMapper; import com.fasterxml.jackson.module.scala.DefaultScalaModule; class jackson { public static void main(String[] args) throws

Re: Deserializing JSON into Scala objects in Java code

2015-09-08 Thread Kevin Chen
Hi Marcelo, Thanks for the quick response. I understand that I can just write my own Java classes (I will use that as a fallback option), but in order to avoid code duplication and further possible changes, I was hoping there would be a way to use the Spark API classes directly, since it seems th

Re: Deserializing JSON into Scala objects in Java code

2015-09-08 Thread Marcelo Vanzin
Hi Kevin, How did you try to use the Scala module? Spark has this code when setting up the ObjectMapper used to generate the output: mapper.registerModule(com.fasterxml.jackson.module.scala.DefaultScalaModule). As for supporting direct serialization to Java objects, I don't think that was the g

Deserializing JSON into Scala objects in Java code

2015-09-08 Thread Kevin Chen
Hello Spark Devs, I am trying to use the new Spark API json endpoints at /api/v1/[path] (added in SPARK-3454). In order to minimize maintenance on our end, I would like to use Retrofit/Jackson to parse the json directly into the Scala classes in org/apache/spark/status/api/v1/api.scala (Applica

Re: Fast Iteration while developing

2015-09-08 Thread Michael Armbrust
+1 to Reynold's suggestion. This is probably the fastest way to iterate. Another option for more ad-hoc debugging is `sbt/sbt sparkShell`, which is similar to bin/spark-shell but doesn't require you to rebuild the assembly jar. On Mon, Sep 7, 2015 at 9:03 PM, Reynold Xin wrote: > I usually write

Re: Pyspark DataFrame TypeError

2015-09-08 Thread Davies Liu
I tried with Python 2.7/3.4 and Spark 1.4.1/1.5-RC3, they all work as expected: ``` >>> from pyspark.mllib.linalg import Vectors >>> df = sqlContext.createDataFrame([(1.0, Vectors.dense([1.0])), (0.0, >>> Vectors.sparse(1, [], []))], ["label", "featuers"]) >>> df.show() +-+-+ |label|
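A self-contained reconstruction of the session quoted above (Spark 1.4/1.5-era APIs; local mode is assumed and the column is spelled "features" here):

```
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.linalg import Vectors

sc = SparkContext("local[2]", "dataframe-typeerror-check")
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame(
    [(1.0, Vectors.dense([1.0])),
     (0.0, Vectors.sparse(1, [], []))],
    ["label", "features"])

df.show()          # one dense and one sparse vector, with their labels
df.printSchema()   # the vector column is a VectorUDT
```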

Re: groupByKey() and keys with many values

2015-09-08 Thread Antonio Piccolboni
You may also consider selecting the distinct keys and fetching from the database first, then joining on key with the values. This is in case Sean's approach is not viable -- that is, if you need to have the DB data before the first reduce call. By not revealing your problem, you are forcing us to make guesses, which are
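A sketch of the "fetch the distinct keys first, then join" idea, with an in-memory dict standing in for the database query:

```
from pyspark import SparkContext

sc = SparkContext("local[2]", "distinct-keys-then-join")

data = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])

def fetch_from_db(keys):
    # Stand-in for something like "SELECT k, v FROM t WHERE k IN (...)".
    fake_table = {"a": "row-a", "b": "row-b", "c": "row-c"}
    return [(k, fake_table.get(k)) for k in keys]

# Pull only the distinct keys, look them up once, then join the result
# back onto the full (key, value) RDD before any reduce step needs it.
db_rows = (data.keys()
               .distinct()
               .mapPartitions(lambda ks: fetch_from_db(list(ks))))

joined = data.join(db_rows)   # (key, (original_value, db_value))
print(joined.collect())
```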

Question on DAGScheduler.getMissingParentStages()

2015-09-08 Thread Madhusudanan Kandasamy
Hi, I'm new to Spark and trying to understand the DAGScheduler code flow. As per my understanding, it looks like getMissingParentStages() is doing a redundant job of re-calculating stage dependencies. When the first stage is created, all of its dependent/parent stages would be recursively calculated and

Re: Detecting configuration problems

2015-09-08 Thread Madhu
Thanks Akhil! I suspect the root cause of the shuffle OOM I was seeing (and probably one that many users might see) is that individual partitions on the reduce side don't fit in memory. As a guideline, I was thinking of something like "be sure that your largest partitions occupy no more than 1% of
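As an illustration of keeping reduce-side partitions small, the partition count can be raised explicitly at the shuffle; the numbers below are arbitrary and only sketch the idea.

```
from pyspark import SparkContext

sc = SparkContext("local[2]", "partition-sizing-sketch")

pairs = sc.parallelize(range(100000)).map(lambda i: (i % 1000, 1))

# If the default parallelism leaves reduce-side partitions too large to fit in
# memory, pass an explicit partition count to the shuffle operation.
counts = pairs.reduceByKey(lambda a, b: a + b, numPartitions=200)

print(counts.getNumPartitions())   # 200
```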

Re: Code generation for GPU

2015-09-08 Thread Steve Loughran
On 7 Sep 2015, at 20:44, lonikar wrote: 2. If the vectorization is difficult or a major effort, I am not sure how I am going to implement even a glimpse of the changes I would like to. I think I will have to be satisfied with only a partial effort. Batching rows defeats the

Re: Detecting configuration problems

2015-09-08 Thread Akhil Das
I found an old JIRA referring to the same issue. https://issues.apache.org/jira/browse/SPARK-5421 Thanks Best Regards On Sun, Sep 6, 2015 at 8:53 PM, Madhu wrote: > I'm not sure if this has been discussed already; if so, please point me to > the thread and/or related JIRA. > > I have been running with a

Re: groupByKey() and keys with many values

2015-09-08 Thread Sean Owen
I think groupByKey is intended for cases where you do want the values in memory; for one-pass use cases, it's more efficient to use reduceByKey, or aggregateByKey if lower-level operations are needed. For your case, you probably want to do your reduceByKey, then perform the expensive per-key lookup
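A toy illustration of the difference being described, using a simple sum and a mean; the expensive per-key work would then run on the already-reduced RDD.

```
from pyspark import SparkContext

sc = SparkContext("local[2]", "groupbykey-vs-reducebykey")

pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3), ("a", 4)])

# groupByKey materializes every value for a key on the reduce side.
grouped = pairs.groupByKey().mapValues(lambda vs: sum(vs))

# reduceByKey combines values map-side, so only partial sums are shuffled.
reduced = pairs.reduceByKey(lambda a, b: a + b)

# aggregateByKey when the result type differs from the value type,
# e.g. (sum, count) pairs for computing a mean.
sums_counts = pairs.aggregateByKey(
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]))

print(reduced.collect())
print(sums_counts.mapValues(lambda t: t[0] / float(t[1])).collect())
```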

adding jars to the classpath with the relative path to spark home

2015-09-08 Thread Niranda Perera
Hi, is it possible to add jars to the Spark executor/driver classpath with the relative path of the jar (relative to the Spark home)? I need to set the following settings in the Spark conf: spark.driver.extraClassPath and spark.executor.extraClassPath. The reason why I need to use the relative pa
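One workaround that side-steps the relative-path question is to expand SPARK_HOME to an absolute path at deploy time and write the two properties into conf/spark-defaults.conf. The jar path below is hypothetical, and whether Spark itself resolves a relative value for these keys is exactly what is being asked here.

```
import os

# Expand SPARK_HOME at deploy time so the properties always carry absolute
# paths; the jar location is a made-up example.
spark_home = os.environ.get("SPARK_HOME", "/opt/spark")
extra_jar = os.path.join(spark_home, "extra-libs", "my-lib.jar")

properties = [
    "spark.driver.extraClassPath   {}".format(extra_jar),
    "spark.executor.extraClassPath {}".format(extra_jar),
]

# Append to spark-defaults.conf rather than setting these in SparkConf:
# the driver JVM is already running by the time application code executes.
with open(os.path.join(spark_home, "conf", "spark-defaults.conf"), "a") as f:
    f.write("\n".join(properties) + "\n")
```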