I am trying to run the data in spark-shell mode to find out whether there is
something wrong in the code or the data, as I could only reproduce the error
on a 50-billion-edge graph.
2015-02-13 14:50 GMT+08:00 Reynold Xin :
> Then maybe you actually had a null in your vertex attribute?
>
>
> On Thu, Feb 12, 2015
Then maybe you actually had a null in your vertex attribute?
On Thu, Feb 12, 2015 at 10:47 PM, James wrote:
> I changed the mapReduceTriplets() func to aggregateMessages(), but it
> still failed.
>
>
> 2015-02-13 6:52 GMT+08:00 Reynold Xin :
>
>> Can you use the new aggregateNeighbors method? I
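A quick way to check Reynold's suspicion from spark-shell, sketched under the
assumption that the graph in question is bound to a value named graph and that
its vertex attribute is a reference type:

  // Count vertices whose attribute is null.
  val nullVertices = graph.vertices.filter { case (_, attr) => attr == null }.count()
  println(s"vertices with null attr: $nullVertices")

  // Check the triplets view too, since that is what sendMessage() actually sees.
  val nullTriplets = graph.triplets
    .filter(t => t.srcAttr == null || t.dstAttr == null)
    .count()
  println(s"triplets with a null endpoint attr: $nullTriplets")

If either count is non-zero, the NullPointerException seen in sendMessage() is
explained.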
I changed the mapReduceTriplets() func to aggregateMessages(), but it still
failed.
2015-02-13 6:52 GMT+08:00 Reynold Xin :
> Can you use the new aggregateNeighbors method? I suspect the null is
> coming from "automatic join elimination", which detects bytecode to see if
> you need the src or dst vertex data.
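For reference, the shape of that conversion is roughly the following. This is a
minimal sketch, not the failing job's code: the Int message type and the
send/merge logic are placeholders, and graph is assumed to be in scope.

  import org.apache.spark.graphx._

  // Old API:
  // val msgs = graph.mapReduceTriplets[Int](
  //   triplet => Iterator((triplet.dstId, 1)),
  //   (a, b) => a + b)

  // New API: sendMsg receives an EdgeContext and pushes messages explicitly.
  val msgs: VertexRDD[Int] = graph.aggregateMessages[Int](
    ctx => ctx.sendToDst(1),   // runs once per edge
    (a, b) => a + b)           // merges messages arriving at the same vertex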
Just to summarize this thread, I was finally able to make all performance
comparisons that we discussed. It turns out that:
BIDMat-cublas>>BIDMat
MKL==netlib-mkl==netlib-openblas-compiled>netlib-openblas-yum-repo==netlib-cublas>netlib-blas>f2jblas
Below is the link to the spreadsheet with full
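For anyone trying to reproduce the netlib-* rows: the calling code is identical
in every case, and only the native library that netlib-java resolves at runtime
differs. A rough sketch of the kind of call being timed (the sizes here are
tiny and purely illustrative, not the ones used for the spreadsheet):

  import com.github.fommil.netlib.BLAS

  val n = 4
  val a = Array.fill(n * n)(scala.util.Random.nextDouble())  // column-major n x n
  val b = Array.fill(n * n)(scala.util.Random.nextDouble())
  val c = new Array[Double](n * n)

  // Typically prints F2jBLAS, NativeRefBLAS or NativeSystemBLAS depending on
  // which native libraries are installed, which is where the spread above
  // comes from.
  val blas = BLAS.getInstance()
  println(s"BLAS implementation: ${blas.getClass.getName}")

  // C := 1.0 * A * B + 0.0 * C
  blas.dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)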
Matei wrote (Jan 26, 2015; 5:31pm): "The intent of Spark SQL though is to be
more than a SQL server -- it's meant to be a library for manipulating
structured data."
I think this is an important but nuanced point. There are engineers who for
various reasons associate the term "SQL" with business an
Can you use the new aggregateNeighbors method? I suspect the null is coming
from "automatic join elimination", which detects bytecode to see if you
need the src or dst vertex data. Occasionally it can fail to detect. In the
new aggregateNeighbors API, the caller needs to explicitly specify that,
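For anyone following along: in the 1.2 release the method that takes this
explicit hint is aggregateMessages, whose third argument (tripletFields) tells
GraphX which vertex attributes sendMsg reads. A minimal sketch, assuming a
Graph[Double, _] bound to a value named graph; the message logic is a
placeholder:

  import org.apache.spark.graphx._

  // TripletFields.Src declares that sendMsg only reads the source attribute,
  // so GraphX does not have to infer that from bytecode (the "automatic join
  // elimination" described above) and ships only the data that is needed.
  val sums: VertexRDD[Double] = graph.aggregateMessages[Double](
    ctx => ctx.sendToDst(ctx.srcAttr),
    _ + _,
    TripletFields.Src)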
This is super helpful, thanks Evan and Reynold!
--
View this message in context:
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-value-proposition-in-batch-pipelines-tp10607p10610.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Evan articulated it well.
On Thu, Feb 12, 2015 at 9:29 AM, Evan R. Sparks wrote:
> Well, you can always join as many RDDs as you want by chaining them
> together, e.g. a.join(b).join(c)... - I probably wouldn't join thousands of
> RDDs in this way but 10 is probably doable.
>
> That said - SparkSQL has an optimizer under the covers that can make clever
> decisions, e.g. pushing the predicate down.
Well, you can always join as many RDDs as you want by chaining them
together, e.g. a.join(b).join(c)... - I probably wouldn't join thousands of
RDDs in this way but 10 is probably doable.
That said - SparkSQL has an optimizer under the covers that can make clever
decisions, e.g. pushing the predicate down.
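To make the chaining concrete, a toy sketch of the a.join(b).join(c) pattern
(made-up keys and values; sc is the SparkContext provided by spark-shell):

  import org.apache.spark.SparkContext._  // pair-RDD functions (implicit pre-1.3)

  val a = sc.parallelize(Seq((1, "a1"), (2, "a2")))
  val b = sc.parallelize(Seq((1, "b1"), (2, "b2")))
  val c = sc.parallelize(Seq((1, "c1"), (2, "c2")))

  // Each extra join nests another tuple: (k, ((aVal, bVal), cVal)).
  val joined = a.join(b).join(c)
    .map { case (k, ((aVal, bVal), cVal)) => (k, aVal, bVal, cVal) }

  joined.collect().foreach(println)  // (1,a1,b1,c1) and (2,a2,b2,c2)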
My team is building a batch data processing pipeline using the Spark API and
trying to understand whether Spark SQL can help us. Below is what we have
found so far:
- SQL's declarative style may be more readable in some cases (e.g. joining
more than two RDDs), although some devs prefer the fluent style regardless
(see the sketch below).
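A sketch of what the declarative style looks like for a multi-way join with the
SQLContext API of that era (Spark 1.2). The case classes, table names, and
query below are made-up examples, not taken from the pipeline described above:

  import org.apache.spark.sql.SQLContext

  case class User(id: Int, name: String)
  case class Order(userId: Int, orderId: Int)
  case class Payment(orderId: Int, amount: Double)

  val sqlContext = new SQLContext(sc)
  import sqlContext.createSchemaRDD  // implicit RDD[case class] -> SchemaRDD

  sc.parallelize(Seq(User(1, "ann"))).registerTempTable("users")
  sc.parallelize(Seq(Order(1, 10))).registerTempTable("orders")
  sc.parallelize(Seq(Payment(10, 9.99))).registerTempTable("payments")

  // The three-way join is written declaratively; the optimizer picks the plan.
  val result = sqlContext.sql("""
    SELECT u.name, p.amount
    FROM users u
    JOIN orders o ON u.id = o.userId
    JOIN payments p ON o.orderId = p.orderId""")

  result.collect().foreach(println)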
Hello,
When I run the code on a much bigger graph, I get a NullPointerException.
I found that this is because the sendMessage() function receives a triplet
whose edge.srcAttr or edge.dstAttr is null. I wonder why this happens, as I
am sure every vertex has an attribute.
Any reply is appreciated.
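One guess at how this can happen even when every vertex in the input has an
attribute: if any edge refers to a vertex id that is missing from the vertex
RDD, Graph() fills that vertex in with defaultVertexAttr, which is null unless
one is supplied. A self-contained toy sketch (sc from spark-shell) that
reproduces the symptom and a workaround:

  import org.apache.spark.graphx._

  // Vertex 3L appears in an edge but not in the vertex RDD.
  val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b")))
  val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))

  // The default vertex attribute is null, so the triplet touching 3L has
  // dstAttr == null.
  val g = Graph(vertices, edges)
  g.triplets.filter(t => t.srcAttr == null || t.dstAttr == null).count()  // 1

  // Supplying a non-null default (or filtering such edges out beforehand)
  // avoids the null attributes.
  val g2 = Graph(vertices, edges, defaultVertexAttr = "missing")
  g2.triplets.filter(t => t.srcAttr == null || t.dstAttr == null).count() // 0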
Let me start with a version "2+" tag and at least write in the
description that it's only for issues that clearly should be fixed
but must wait until 2.x.
On Thu, Feb 12, 2015 at 8:54 AM, Patrick Wendell wrote:
> Yeah my preferred is also having a more open ended "2+" for issues
> that are clearly desirable but blocked by compatibility concerns.
Yeah my preferred is also having a more open ended "2+" for issues
that are clearly desirable but blocked by compatibility concerns.
What I would really want to avoid is major feature proposals sitting
around in our JIRA and tagged under some 2.X version. IMO JIRA isn't
the place for thoughts about
It seems to me having a version that is 2+ is good for that? Once we move
to 2.0, we can retag those that are not going to be fixed in 2.0 as 2.0.1
or 2.1.0.
On Thu, Feb 12, 2015 at 12:42 AM, Sean Owen wrote:
> Patrick and I were chatting about how to handle several issues which
> clearly need
Patrick and I were chatting about how to handle several issues which
clearly need a fix, and are easy, but can't be implemented until a
future major release like Spark 2.x, since they would change APIs.
Examples:
https://issues.apache.org/jira/browse/SPARK-3266
https://issues.apache.org/jira/browse/SPA
Hi Patrick,
Really glad to get your reply.
Yes, we are doing group-by operations in our work. We understand that growTable
is common when processing large data sets.
The question actually becomes: is there any way for us to modify the
initialCapacity ourselves, specifically for our
It will create and connect to new executors. The executors are mostly
stateless, so the program can resume with new executors.
On Wed, Feb 11, 2015 at 11:24 PM, lin wrote:
> Hi, all
>
> In Spark Streaming 1.2.0, when the driver fails and a new driver starts
> with the most recent check-pointed data
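For reference, a sketch of the checkpoint-based recovery setup that exercises
this path; the checkpoint directory, batch interval, and stream logic below are
placeholders:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"  // placeholder path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("recovery-example")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // Define the DStream graph here, e.g.:
    // ssc.socketTextStream("localhost", 9999).count().print()
    ssc
  }

  // On a fresh start this calls createContext(); after a driver failure it
  // rebuilds the context and DStream graph from the checkpoint, and the
  // restarted driver then acquires new executors as described above.
  val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
  ssc.start()
  ssc.awaitTermination()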
The map will start with a capacity of 64, but will grow to accommodate
new data. Are you using the groupBy operator in Spark or are you using
Spark SQL's group by? This usually happens if you are grouping or
aggregating in a way that doesn't sufficiently condense the data
created from each input partition.
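To illustrate the "condense the data" point with the RDD API: groupByKey does
no map-side combining, so every record flows into the per-task aggregation
maps, while reduceByKey (or aggregateByKey) combines values within each input
partition first and keeps those maps small. A toy sketch, not the original job
(sc from spark-shell):

  import org.apache.spark.SparkContext._  // pair-RDD functions (implicit pre-1.3)

  val pairs = sc.parallelize(1 to 1000000).map(i => (i % 1000, 1))

  // groupByKey: all one million records are shuffled and buffered before
  // anything is combined.
  val grouped = pairs.groupByKey().mapValues(_.sum)

  // reduceByKey: values are combined per input partition first, so far fewer
  // entries ever accumulate and far less data is shuffled.
  val reduced = pairs.reduceByKey(_ + _)

  println(reduced.count())  // 1000 keys either way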