Re: Transforming flume events using Spark transformation functions

2014-07-22 Thread Tathagata Das
This is because of the RDD's lazy evaluation! Unless you force a transformed (mapped/filtered/etc.) RDD to give you back some data (like RDD.count) or output the data (like RDD.saveAsTextFile()), Spark will not do anything. So after the eventData.map(...), if you do take(10) and then print the res
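
A minimal Scala sketch of the lazy-evaluation point above (the paths and names are placeholders, not from the thread):

    val lines = sc.textFile("hdfs:///some/input")      // nothing runs yet
    val lengths = lines.map(_.length)                  // map is lazy: still nothing runs
    println(lengths.take(10).mkString(", "))           // take() is an action, so it triggers a job
    lengths.saveAsTextFile("hdfs:///some/output")      // output operations also trigger a job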

Re: [VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-22 Thread Xiangrui Meng
Hi all, The vote has passed with 7 "+1" votes (4 binding) and 0 "-1" vote: +1: Xiangrui Meng* Matei Zaharia* DB Tsai Reynold Xin* Patrick Wendell* Andrew Or Sean McNamara I'm closing this vote and going to package v0.9.2 today. Thanks everyone for voting! Best, Xiangrui On Fri, Jul 18, 2014 a

Re: "Dynamic variables" in Spark

2014-07-22 Thread Patrick Wendell
Shivaram, You should take a look at this patch which adds support for naming accumulators - this is likely to get merged in soon. I actually started this patch by supporting named TaskMetrics similar to what you have there, but then I realized there is too much semantic overlap with accumulators,
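
A hedged sketch of the named-accumulator pattern being discussed; the two-argument accumulator(initialValue, name) overload is the API the referenced patch proposes, so treat the exact signature as an assumption:

    // Named accumulator used as a lightweight user-defined metric
    // (signature assumed from the patch under review, not yet in a release).
    val recordsSeen = sc.accumulator(0L, "records seen")
    sc.textFile("hdfs:///some/input").foreach { line =>
      recordsSeen += 1L               // updated on the executors as tasks run
    }
    println(recordsSeen.value)        // read back on the driver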

Re: "Dynamic variables" in Spark

2014-07-22 Thread Neil Ferguson
Hi Christopher Thanks for your reply. I'll try and address your points -- please let me know if I missed anything. Regarding clarifying the problem statement, let me try and do that with a real-world example. I have a method that I want to measure the performance of, which has the following signa

Re: "Dynamic variables" in Spark

2014-07-22 Thread Shivaram Venkataraman
From reading Neil's first e-mail, I think the motivation is to get some metrics in ADAM? -- I've run into a similar use-case with having user-defined metrics in long-running tasks and I think a nice way to solve this would be to have user-defined TaskMetrics. To state my problem more clearly, l

RE: Transforming flume events using Spark transformation functions

2014-07-22 Thread Sundaram, Muthu X.
I tried to map SparkFlumeEvents to RDDs of Strings like below, but that map and call are not executed at all. I might be doing this in a wrong way. Any help would be appreciated. flumeStream.foreach(new Function<JavaRDD<SparkFlumeEvent>, Void>() { @Override public Void call(JavaRDD<SparkFlumeEvent> eventsData)
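
A Scala sketch of the same pipeline with an output action added, which is the fix Tathagata describes above; decoding the record via event.getBody is an assumption about the delimited payload:

    flumeStream.foreachRDD { events =>
      // The map alone is lazy; the take() below forces it to run.
      val records = events.map(e => new String(e.event.getBody.array(), "UTF-8"))
      records.take(10).foreach(println)
    }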

Re: "Dynamic variables" in Spark

2014-07-22 Thread Neil Ferguson
Hi Reynold Thanks for your reply. Accumulators are, of course, stored in the Accumulators object as thread-local variables. However, the Accumulators object isn't public, so when a Task is executing there's no way to get the set of accumulators for the current thread -- accumulators still have to

Re: Using case classes as keys does not seem to work.

2014-07-22 Thread Gerard Maas
I created https://issues.apache.org/jira/browse/SPARK-2620 to track this. Maybe useful to know, this is a regression on Spark 1.0.0. I tested the same sample code on 0.9.1 and it worked (we have several jobs using case classes as key aggregators, so it better does) -kr, Gerard. On Tue, Jul 22,

Transforming flume events using Spark transformation functions

2014-07-22 Thread Sundaram, Muthu X.
Hi All, I am getting events from Flume using the following line. JavaDStream<SparkFlumeEvent> flumeStream = FlumeUtils.createStream(ssc, host, port); Each event is a delimited record. I'd like to use some of the transformation functions like map and reduce on this. Do I need to convert the JavaDStream to JavaDSt
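
A Scala sketch of one way to answer the question (the thread uses the Java API; this is the Scala equivalent, and the body-decoding details are an assumption about the delimited records):

    import org.apache.spark.streaming.flume.FlumeUtils

    val flumeStream = FlumeUtils.createStream(ssc, host, port)
    val lines  = flumeStream.map(e => new String(e.event.getBody.array(), "UTF-8"))
    val fields = lines.map(_.split(","))   // ordinary map/reduce-style transformations apply from here
    fields.print()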

Re: Using case classes as keys does not seem to work.

2014-07-22 Thread Gerard Maas
Just to narrow down the issue, it looks like the problem is in 'reduceByKey' and derivatives like 'distinct'. groupByKey() seems to work: sc.parallelize(ps).map(x => (x.name, 1)).groupByKey().collect res: Array[(String, Iterable[Int])] = Array((charly,ArrayBuffer(1)), (abe,ArrayBuffer(1)), (bob,ArrayBuf

Using case classes as keys does not seem to work.

2014-07-22 Thread Gerard Maas
Using a case class as a key doesn't seem to work properly. [Spark 1.0.0] A minimal example: case class P(name:String) val ps = Array(P("alice"), P("bob"), P("charly"), P("bob")) sc.parallelize(ps).map(x=> (x,1)).reduceByKey((x,y) => x+y).collect [Spark shell local mode] res : Array[(P, Int)] = Ar

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-22 Thread Erik Erlandson
- Original Message - > It could make sense to add a skipHeader argument to SparkContext.textFile? I also looked into this. I don't think it's feasible given the limits of the InputFormat and RecordReader interfaces. RecordReader can't (I think) *ever* know which split it's attached
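
A hedged workaround sketch (not this thread's conclusion): since only the RDD layer, not the RecordReader, knows which split comes first, a header can be dropped by skipping the first record of partition 0:

    val withHeader = sc.textFile("hdfs:///some/file.csv")
    val noHeader = withHeader.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.drop(1) else iter   // assumes the header lives in the first partition
    }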

Suggestion for SPARK-1825

2014-07-22 Thread innowireless TaeYun Kim
(I'm resending this mail since it seems that it was not sent. Sorry if this was already sent.) Hi, A couple of months ago, I made a pull request to fix https://issues.apache.org/jira/browse/SPARK-1825. My pull request is here: https://github.com/apache/spark/pull/899 But that pull request