This is because of the RDD's lazy evaluation! Unless you force a
transformed (mapped/filtered/etc.) RDD to give you back some data (like
RDD.count) or output the data (like RDD.saveAsTextFile()), Spark will not
do anything.
So after the eventData.map(...), if you do take(10) and then print the
res
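A minimal Scala sketch of that point (eventData here stands in for any RDD):
// map only records the transformation; no job runs at this line
val mapped = eventData.map(e => e.toString)
// take(10) is an action, so this is what actually triggers the computation
mapped.take(10).foreach(println)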
Hi all,
The vote has passed with 7 "+1" votes (4 binding) and 0 "-1" votes:
+1:
Xiangrui Meng*
Matei Zaharia*
DB Tsai
Reynold Xin*
Patrick Wendell*
Andrew Or
Sean McNamara
I'm closing this vote and going to package v0.9.2 today. Thanks
everyone for voting!
Best,
Xiangrui
On Fri, Jul 18, 2014 a
Shivaram,
You should take a look at this patch which adds support for naming
accumulators - this is likely to get merged in soon. I actually
started this patch by supporting named TaskMetrics similar to what you
have there, but then I realized there is too much semantic overlap
with accumulators,
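For illustration only, a rough Scala sketch of how a named accumulator might be used once that patch lands (the name string and the ERROR-counting logic are made up, and rdd is assumed to be an RDD[String]):
// the display name would appear alongside the accumulator's value in the web UI
val errorCount = sc.accumulator(0, "error count")
rdd.foreach { line => if (line.contains("ERROR")) errorCount += 1 }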
Hi Christopher
Thanks for your reply. I'll try and address your points -- please let me
know if I missed anything.
Regarding clarifying the problem statement, let me try and do that with a
real-world example. I have a method that I want to measure the performance
of, which has the following signa
From reading Neil's first e-mail, I think the motivation is to get some
metrics in ADAM? -- I've run into a similar use-case with having
user-defined metrics in long-running tasks and I think a nice way to solve
this would be to have user-defined TaskMetrics.
To state my problem more clearly, l
I tried to map SparkFlumeEvents to RDDs of Strings as below, but that map and
call are never executed. I might be doing this the wrong way. Any help
would be appreciated.
flumeStream.foreach(new Function<JavaRDD<SparkFlumeEvent>, Void>() {
  @Override
  public Void call(JavaRDD<SparkFlumeEvent> eventsData)
Hi Reynold
Thanks for your reply.
Accumulators are, of course, stored in the Accumulators object as
thread-local variables. However, the Accumulators object isn't public, so
when a Task is executing there's no way to get the set of accumulators for
the current thread -- accumulators still have to
I created https://issues.apache.org/jira/browse/SPARK-2620 to track this.
Maybe useful to know: this is a regression in Spark 1.0.0. I tested the
same sample code on 0.9.1 and it worked (we have several jobs using case
classes as aggregation keys, so it had better work).
-kr, Gerard.
On Tue, Jul 22,
Hi All,
I am getting events from flume using following line.
JavaDStream<SparkFlumeEvent> flumeStream = FlumeUtils.createStream(ssc, host,
port);
Each event is a delimited record. I'd like to use some of the transformation
functions, like map and reduce, on this. Do I need to convert the
JavaDStream to JavaDSt
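For what it's worth, a sketch in the Scala API (the Java API is analogous); it assumes host and port are defined, a UTF-8 payload, and a heap-backed body buffer:
import org.apache.spark.streaming.flume._
val flumeStream = FlumeUtils.createStream(ssc, host, port)
// pull the String payload out of each SparkFlumeEvent
val lines = flumeStream.map(e => new String(e.event.getBody.array(), "UTF-8"))
// an output operation is still required for anything to execute
lines.print()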
Just to narrow this down, it looks like the issue is in 'reduceByKey'
and derivatives like 'distinct'.
groupByKey() seems to work:
sc.parallelize(ps).map(x=> (x.name,1)).groupByKey().collect
res: Array[(String, Iterable[Int])] = Array((charly,ArrayBuffer(1)),
(abe,ArrayBuffer(1)), (bob,ArrayBuf
Using a case class as a key doesn't seem to work properly. [Spark 1.0.0]
A minimal example:
case class P(name:String)
val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
sc.parallelize(ps).map(x=> (x,1)).reduceByKey((x,y) => x+y).collect
[Spark shell local mode] res : Array[(P, Int)] = Ar
- Original Message -
> It could make sense to add a skipHeader argument to SparkContext.textFile?
I also looked into this. I don't think it's feasible given the limits of the
InputFormat and RecordReader interfaces. RecordReader can't (I think) *ever*
know which split it's attached
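The usual workaround is to drop the first record of the first partition after reading; a Scala sketch, where path is a placeholder:
val noHeader = sc.textFile(path).mapPartitionsWithIndex {
  // assumes the header is the first record of the first partition
  (idx, iter) => if (idx == 0) iter.drop(1) else iter
}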
(I'm resending this mail since it seems that it was not sent. Sorry if this
was already sent.)
Hi,
A couple of months ago, I made a pull request to fix
https://issues.apache.org/jira/browse/SPARK-1825.
My pull request is here: https://github.com/apache/spark/pull/899
But that pull request