Re: Why a program would receive null from send message of mapReduceTriplets

2015-02-12 Thread James
I am trying to run the data in spark-shell mode to find out whether there is something wrong in the code or the data, as I could only reproduce the error on a 50B-edge graph. 2015-02-13 14:50 GMT+08:00 Reynold Xin: > Then maybe you actually had a null in your vertex attribute? > > > On Thu, Feb 12, 2015

Re: Why a program would receive null from send message of mapReduceTriplets

2015-02-12 Thread Reynold Xin
Then maybe you actually had a null in your vertex attribute? On Thu, Feb 12, 2015 at 10:47 PM, James wrote: > I changed the mapReduceTriplets() func to aggregateMessages(), but it > still failed. > > > 2015-02-13 6:52 GMT+08:00 Reynold Xin : > >> Can you use the new aggregateNeighbors method? I

Re: Why a program would receive null from send message of mapReduceTriplets

2015-02-12 Thread James
I changed the mapReduceTriplets() func to aggregateMessages(), but it still failed. 2015-02-13 6:52 GMT+08:00 Reynold Xin: > Can you use the new aggregateNeighbors method? I suspect the null is > coming from "automatic join elimination", which inspects the bytecode to see if > you need the src or ds

RE: Using CUDA within Spark / boosting linear algebra

2015-02-12 Thread Ulanov, Alexander
Just to summarize this thread: I was finally able to run all the performance comparisons that we discussed. It turns out that BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas. Below is the link to the spreadsheet with full
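Benchmarks like these usually boil down to timing a dense matrix multiply against each backend. Below is a minimal sketch of such a micro-benchmark using Breeze, which delegates to whichever netlib-java backend (f2jblas, OpenBLAS, MKL, ...) is on the classpath; the matrix size, iteration count, and object name are illustrative assumptions, not the harness behind the spreadsheet above.

    import breeze.linalg.DenseMatrix

    // Rough dgemm-style micro-benchmark; the actual backend used depends on
    // which netlib-java implementation is found on the classpath at runtime.
    object GemmBench {
      def main(args: Array[String]): Unit = {
        val n = 2048
        val a = DenseMatrix.rand(n, n)
        val b = DenseMatrix.rand(n, n)

        // Warm up the JVM and the native library before timing.
        a * b

        val iters = 5
        val start = System.nanoTime()
        var i = 0
        while (i < iters) { a * b; i += 1 }
        val secs = (System.nanoTime() - start) / 1e9 / iters

        // A dense multiply is roughly 2 * n^3 floating point operations.
        println(f"avg ${secs}%.3f s per multiply, ${2.0 * n * n * n / secs / 1e9}%.1f GFLOPS")
      }
    }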

Re: renaming SchemaRDD -> DataFrame

2015-02-12 Thread vha14
Matei wrote (Jan 26, 2015; 5:31pm): "The intent of Spark SQL though is to be more than a SQL server -- it's meant to be a library for manipulating structured data." I think this is an important but nuanced point. There are engineers who for various reasons associate the term "SQL" with business an

Re: Why a program would receive null from send message of mapReduceTriplets

2015-02-12 Thread Reynold Xin
Can you use the new aggregateNeighbors method? I suspect the null is coming from "automatic join elimination", which inspects the bytecode to see if you need the src or dst vertex data. Occasionally it can fail to detect that. In the new aggregateNeighbors API, the caller needs to explicitly specify that,
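In the Spark 1.2 API the explicit variant is graph.aggregateMessages, also mentioned earlier in this thread; it takes a TripletFields argument instead of relying on bytecode inspection. A minimal sketch on a made-up graph might look like this:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx._

    object AggregateMessagesSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("agg-msg-sketch").setMaster("local[2]"))

        // Hypothetical tiny graph: vertex attribute is a name, edge attribute is a weight.
        val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
        val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0)))
        val graph    = Graph(vertices, edges)

        // Count in-degrees: only dst receives a message and no vertex attribute is read,
        // so we declare TripletFields.None rather than relying on bytecode inspection.
        val inDegrees: VertexRDD[Int] = graph.aggregateMessages[Int](
          ctx => ctx.sendToDst(1),   // sendMsg
          _ + _,                     // mergeMsg
          TripletFields.None         // which triplet fields sendMsg actually reads
        )

        inDegrees.collect().foreach(println)
        sc.stop()
      }
    }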

Re: Spark SQL value proposition in batch pipelines

2015-02-12 Thread vha14
This is super helpful, thanks Evan and Reynold! -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-value-proposition-in-batch-pipelines-tp10607p10610.html

Re: Spark SQL value proposition in batch pipelines

2015-02-12 Thread Reynold Xin
Evan articulated it well. On Thu, Feb 12, 2015 at 9:29 AM, Evan R. Sparks wrote: > Well, you can always join as many RDDs as you want by chaining them > together, e.g. a.join(b).join(c)... - I probably wouldn't join thousands of > RDDs in this way but 10 is probably doable. > > That said - Spar

Re: Spark SQL value proposition in batch pipelines

2015-02-12 Thread Evan R. Sparks
Well, you can always join as many RDDs as you want by chaining them together, e.g. a.join(b).join(c)...; I probably wouldn't join thousands of RDDs this way, but 10 is probably doable. That said, Spark SQL has an optimizer under the covers that can make clever decisions, e.g. pushing the predica
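To make the contrast concrete, here is a hedged sketch of both styles on made-up data, using the Spark 1.2-era SQLContext/SchemaRDD API; the table names and schemas are invented for illustration, not code from this thread.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._
    import org.apache.spark.sql.SQLContext

    // Hypothetical records.
    case class User(id: Long, name: String)
    case class Order(userId: Long, amount: Double)
    case class Payment(userId: Long, status: String)

    object JoinStyles {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("join-styles").setMaster("local[2]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD

        val users    = sc.parallelize(Seq(User(1, "a"), User(2, "b")))
        val orders   = sc.parallelize(Seq(Order(1, 10.0), Order(2, 20.0)))
        val payments = sc.parallelize(Seq(Payment(1, "ok"), Payment(2, "failed")))

        // Fluent style: chain pair-RDD joins and filter by hand.
        val chained = users.map(u => (u.id, u.name))
          .join(orders.map(o => (o.userId, o.amount)))
          .join(payments.map(p => (p.userId, p.status)))
          .filter { case (_, (_, status)) => status == "ok" }

        // Declarative style: let the Spark SQL optimizer plan the joins
        // and push the status predicate down.
        users.registerTempTable("users")
        orders.registerTempTable("orders")
        payments.registerTempTable("payments")
        val viaSql = sqlContext.sql(
          """SELECT u.name, o.amount
            |FROM users u JOIN orders o ON u.id = o.userId
            |             JOIN payments p ON u.id = p.userId
            |WHERE p.status = 'ok'""".stripMargin)

        println(chained.collect().toSeq)
        viaSql.collect().foreach(println)
        sc.stop()
      }
    }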

Spark SQL value proposition in batch pipelines

2015-02-12 Thread vha14
My team is building a batch data processing pipeline using the Spark API and is trying to understand whether Spark SQL can help us. Below is what we have found so far: - SQL's declarative style may be more readable in some cases (e.g. joining more than two RDDs), although some devs prefer the fluent style rega

Why a program would receive null from send message of mapReduceTriplets

2015-02-12 Thread James
Hello. When I run the code on a much bigger graph, I get a NullPointerException. I found that this is because the sendMessage() function receives a triplet whose edge.srcAttr or edge.dstAttr is null. I wonder why this happens, as I am sure every vertex has an attr. Any replies are appr
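A minimal sketch of the kind of sendMessage function involved, with a guard added to report which side of the triplet arrived null; the vertex and edge attribute types, the message logic, and the logging choice are assumptions, not the original code. It can be pasted into spark-shell:

    import org.apache.spark.graphx._

    // Shape of a sendMessage function for mapReduceTriplets in Spark 1.2,
    // with a guard that reports which side of the triplet is null.
    def sendMessage(edge: EdgeTriplet[String, Double]): Iterator[(VertexId, String)] = {
      if (edge.srcAttr == null || edge.dstAttr == null) {
        // Log instead of throwing, to locate the offending edge rather than crash the job.
        println(s"null attr on edge ${edge.srcId} -> ${edge.dstId}: " +
          s"srcAttr=${edge.srcAttr}, dstAttr=${edge.dstAttr}")
        Iterator.empty
      } else {
        Iterator((edge.dstId, edge.srcAttr))
      }
    }

    // Usage, assuming a graph: Graph[String, Double] already exists:
    // val messages = graph.mapReduceTriplets[String](sendMessage, _ + _)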

Re: How to track issues that must wait for Spark 2.x in JIRA?

2015-02-12 Thread Sean Owen
Let me start with a version "2+" tag and at least write in the description that it's only for issues that are clearly to be fixed, but must wait until 2.x. On Thu, Feb 12, 2015 at 8:54 AM, Patrick Wendell wrote: > Yeah my preferred is also having a more open ended "2+" for issues > that are clear

Re: How to track issues that must wait for Spark 2.x in JIRA?

2015-02-12 Thread Patrick Wendell
Yeah, my preference is also to have a more open-ended "2+" version for issues that are clearly desirable but blocked by compatibility concerns. What I would really want to avoid is major feature proposals sitting around in our JIRA tagged under some 2.X version. IMO JIRA isn't the place for thoughts abou

Re: How to track issues that must wait for Spark 2.x in JIRA?

2015-02-12 Thread Reynold Xin
It seems to me that having a version that is "2+" is good for that? Once we move to 2.0, we can retag those that are not going to be fixed in 2.0 as 2.0.1 or 2.1.0. On Thu, Feb 12, 2015 at 12:42 AM, Sean Owen wrote: > Patrick and I were chatting about how to handle several issues which > clearly need

How to track issues that must wait for Spark 2.x in JIRA?

2015-02-12 Thread Sean Owen
Patrick and I were chatting about how to handle several issues which clearly need a fix, and are easy, but can't be implemented until the next major release like Spark 2.x, since they would change APIs. Examples: https://issues.apache.org/jira/browse/SPARK-3266 https://issues.apache.org/jira/browse/SPA

Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-02-12 Thread fightf...@163.com
Hi Patrick, Really glad to get your reply. Yes, we are doing group-by operations in our work. We know that this is common for growTable when processing large data sets. The question actually is: do we have any chance to modify the initialCapacity ourselves, specifically for our

Re: driver fail-over in Spark streaming 1.2.0

2015-02-12 Thread Patrick Wendell
It will create and connect to new executors. The executors are mostly stateless, so the program can resume with new executors. On Wed, Feb 11, 2015 at 11:24 PM, lin wrote: > Hi, all > > In Spark Streaming 1.2.0, when the driver fails and a new driver starts > with the most updated check-pointed d
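For reference, the usual driver-recovery pattern in Spark Streaming 1.2 is checkpoint-based, via StreamingContext.getOrCreate; below is a minimal sketch with a hypothetical socket source and checkpoint directory, not the poster's actual job.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    object CheckpointRecovery {
      // Hypothetical checkpoint directory; in production this would live on HDFS or similar.
      val checkpointDir = "/tmp/streaming-checkpoint"

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("checkpoint-recovery").setMaster("local[2]")
        val ssc  = new StreamingContext(conf, Seconds(5))
        ssc.checkpoint(checkpointDir)

        // Hypothetical pipeline: count words arriving on a socket.
        val lines = ssc.socketTextStream("localhost", 9999)
        lines.flatMap(_.split(" ")).map((_, 1L)).reduceByKey(_ + _).print()
        ssc
      }

      def main(args: Array[String]): Unit = {
        // On restart, the new driver rebuilds the context from checkpointed data and
        // requests fresh executors; createContext only runs when no checkpoint exists.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }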

Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-02-12 Thread Patrick Wendell
The map will start with a capacity of 64, but will grow to accommodate new data. Are you using the groupBy operator in Spark or are you using Spark SQL's group by? This usually happens if you are grouping or aggregating in a way that doesn't sufficiently condense the data created from each input pa
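As an illustration of condensing the data before the shuffle, a hedged sketch on made-up data: replacing a groupByKey-then-aggregate pattern with reduceByKey lets the map-side combine keep one running value per key instead of buffering every record.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    object CondenseBeforeShuffle {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("condense").setMaster("local[2]"))

        // Hypothetical (key, value) data.
        val records = sc.parallelize(1 to 1000000).map(i => (i % 1000, 1L))

        // groupByKey materializes every value per key before aggregating:
        val viaGroup = records.groupByKey().mapValues(_.sum)

        // reduceByKey combines map-side, so the shuffle (and the per-partition
        // map structures) only hold one running sum per key:
        val viaReduce = records.reduceByKey(_ + _)

        println(viaGroup.count() + " " + viaReduce.count())
        sc.stop()
      }
    }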