Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-20 Thread Henry Saputra
Signature and hash for source look good. No external executable package with source - good. Compiled with git and maven - good. Ran examples and sample programs locally and standalone - good. +1 - Henry On Tue, May 20, 2014 at 1:13 PM, Tathagata Das wrote: > Please vote on releasing the following

Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-20 Thread Matei Zaharia
+1 Tested it on both Windows and Mac OS X, with both Scala and Python. Confirmed that the issues in the previous RC were fixed. Matei On May 20, 2014, at 5:28 PM, Marcelo Vanzin wrote: > +1 (non-binding) > > I have: > - checked signatures and checksums of the files > - built the code from th

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-20 Thread Xiangrui Meng
Talked with Sandy and DB offline. I think the best solution is sending the secondary jars to the distributed cache of all containers rather than just the master, and setting the classpath to include the Spark jar, the primary app jar, and the secondary jars before the executor starts. In this way, the user only needs to s
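If the change lands as described, the user-facing side would presumably remain a single flag (a hedged sketch; the jar names, paths, and main class below are made-up placeholders, and --jars is the existing spark-submit syntax for secondary jars):

```shell
# Primary app jar plus comma-separated secondary jars; under the
# proposed fix the secondary jars would reach every container's
# distributed cache, not just the master's.
spark-submit \
  --class com.example.MyApp \
  --master yarn-cluster \
  --jars deps/secondary1.jar,deps/secondary2.jar \
  myapp.jar
```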

Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-20 Thread Sandy Ryza
+1 On Tue, May 20, 2014 at 5:26 PM, Andrew Or wrote: > +1 > > > 2014-05-20 13:13 GMT-07:00 Tathagata Das : > > > Please vote on releasing the following candidate as Apache Spark version > > 1.0.0! > > > > This has a few bug fixes on top of rc9: > > SPARK-1875: https://github.com/apache/spark/pu

Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-20 Thread Marcelo Vanzin
+1 (non-binding) I have: - checked signatures and checksums of the files - built the code from the git repo using both sbt and mvn (against hadoop 2.3.0) - ran a few simple jobs in local, yarn-client and yarn-cluster mode Haven't explicitly tested any of the recent fixes, streaming, or SQL. On

Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-20 Thread Andrew Or
+1 2014-05-20 13:13 GMT-07:00 Tathagata Das : > Please vote on releasing the following candidate as Apache Spark version > 1.0.0! > > This has a few bug fixes on top of rc9: > SPARK-1875: https://github.com/apache/spark/pull/824 > SPARK-1876: https://github.com/apache/spark/pull/819 > SPARK-1878

Re: Scala examples for Spark do not work as written in documentation

2014-05-20 Thread Andy Konwinski
I fixed the bug, but I kept the parameter "i" instead of "_" since that (1) keeps it more parallel to the Python and Java versions, which also use functions with a named variable, and (2) doesn't require readers to know this particular use of the "_" syntax in Scala. Thanks for catching this, Glenn.
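The two styles the fix chooses between look like this on a plain Scala collection (a minimal sketch; the collection and the doubling operation are made up for illustration):

```scala
object UnderscoreDemo {
  def main(args: Array[String]): Unit = {
    val nums = Seq(1, 2, 3, 4)

    // Named-parameter style, parallel to the Python and Java examples.
    val doubledNamed = nums.map(i => i * 2)

    // Scala's placeholder syntax; same result, but assumes the reader
    // already knows this use of "_".
    val doubledUnderscore = nums.map(_ * 2)

    assert(doubledNamed == doubledUnderscore)
    println(doubledNamed)  // List(2, 4, 6, 8)
  }
}
```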

[VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-20 Thread Tathagata Das
Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has a few bug fixes on top of rc9: SPARK-1875: https://github.com/apache/spark/pull/824 SPARK-1876: https://github.com/apache/spark/pull/819 SPARK-1878: https://github.com/apache/spark/pull/822 SPARK-1879: https:/

Re: BUG: graph.triplets does not return proper values

2014-05-20 Thread GlennStrycker
For some reason it does not appear when I hit "tab" in Spark shell, but when I put everything together in one line, it DOES WORK! orig_graph.edges.map(_.copy()).cartesian(orig_graph.edges.map(_.copy())).flatMap( A => Seq(if (A._1.srcId == A._2.dstId) Edge(A._2.srcId,A._1.dstId,1) else if (A._1.dst
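The cartesian-plus-flatMap shape in that one-liner can be mimicked on plain Scala collections (a hedged sketch; the Edge case class and sample edges are made-up stand-ins for GraphX's types, and only the one branch of the truncated condition is shown; on RDDs the same shape uses rdd.cartesian):

```scala
object CartesianEdges {
  // Stand-in for org.apache.spark.graphx.Edge, attribute simplified to Int.
  case class Edge(srcId: Long, dstId: Long, attr: Int)

  def main(args: Array[String]): Unit = {
    val edges = Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1))

    // Local equivalent of edges.cartesian(edges): every ordered pair.
    val pairs = for (a <- edges; b <- edges) yield (a, b)

    // Emit a connecting edge whenever the first edge's source matches
    // the second edge's destination.
    val joined = pairs.flatMap { case (a, b) =>
      if (a.srcId == b.dstId) Seq(Edge(b.srcId, a.dstId, 1)) else Seq.empty
    }
    println(joined)  // List(Edge(1,3,1)) -- the derived 1 -> 3 edge
  }
}
```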

Re: BUG: graph.triplets does not return proper values

2014-05-20 Thread Sean Owen
http://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions It becomes automagically available when your RDD contains pairs. On Tue, May 20, 2014 at 9:00 PM, GlennStrycker wrote: > I don't seem to have this function in my Spark installation for this object, > or

Re: BUG: graph.triplets does not return proper values

2014-05-20 Thread Mark Hamstra
That's all very old functionality in Spark terms, so it shouldn't have anything to do with your installation being out-of-date. There is also no need to cast as long as the relevant implicit conversions are in scope: import org.apache.spark.SparkContext._ On Tue, May 20, 2014 at 1:00 PM, GlennSt

Re: BUG: graph.triplets does not return proper values

2014-05-20 Thread GlennStrycker
I don't seem to have this function in my Spark installation for this object, or the classes MappedRDD, FlatMappedRDD, EdgeRDD, VertexRDD, or Graph. Which class should have the reduceByKey function, and how do I cast my current RDD as this class? Perhaps this is still due to my Spark installation

Re: BUG: graph.triplets does not return proper values

2014-05-20 Thread Reynold Xin
You are probably looking for reduceByKey in that case. "reduce" just reduces everything in the collection into a single element. On Tue, May 20, 2014 at 12:16 PM, GlennStrycker wrote: > Wait a minute... doesn't a reduce function return 1 element PER key pair? > For example, word-count mapreduce
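The distinction shows up even on plain Scala collections (a hedged mimic: groupBy plus a per-group reduce stands in for reduceByKey, which on Spark requires an RDD of pairs; the word-count data is made up):

```scala
object ReduceDemo {
  def main(args: Array[String]): Unit = {
    val counts = Seq(("a", 1), ("b", 1), ("a", 1))

    // reduce collapses the whole collection into a single element.
    val total = counts.map(_._2).reduce(_ + _)
    println(total)  // 3

    // reduceByKey yields one element per key; mimicked locally here.
    val perKey = counts.groupBy(_._1).map { case (k, vs) =>
      (k, vs.map(_._2).reduce(_ + _))
    }
    println(perKey("a"))  // 2
    println(perKey("b"))  // 1
  }
}
```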

Re: BUG: graph.triplets does not return proper values

2014-05-20 Thread GlennStrycker
Wait a minute... doesn't a reduce function return 1 element per key pair? For example, word-count mapreduce functions return a {word, count} element for every unique word. Is this supposed to be a 1-element RDD object? The .reduce functions for MappedRDD and FlatMappedRDD are both of the form

Re: Sorting partitions in Java

2014-05-20 Thread Madhu
Sean, No, I don't want to sort the whole RDD, sortByKey seems to be good enough for that. Right now, I think the code I have will work for me, but I can imagine conditions where it will run out of memory. I'm not completely sure if SPARK-983

Re: Sorting partitions in Java

2014-05-20 Thread Sean Owen
On Tue, May 20, 2014 at 6:10 PM, Madhu wrote: > What you suggest looks like an in-memory sort, which is fine if each partition is > small enough to fit in memory. Is it true that rdd.sortByKey(...) requires > partitions to fit in memory? I wasn't sure if there was some magic behind > the scenes that su

Re: Sorting partitions in Java

2014-05-20 Thread Andrew Ash
Voted :) https://issues.apache.org/jira/browse/SPARK-983 On Tue, May 20, 2014 at 10:21 AM, Sandy Ryza wrote: > There is: SPARK-545 > > > On Tue, May 20, 2014 at 10:16 AM, Andrew Ash wrote: > > > Sandy, is there a Jira ticket for that? > > > > > > On Tue, May 20, 2014 at 10:12 AM, Sandy Ryza >

Re: Sorting partitions in Java

2014-05-20 Thread Sandy Ryza
There is: SPARK-545 On Tue, May 20, 2014 at 10:16 AM, Andrew Ash wrote: > Sandy, is there a Jira ticket for that? > > > On Tue, May 20, 2014 at 10:12 AM, Sandy Ryza >wrote: > > > sortByKey currently requires partitions to fit in memory, but there are > > plans to add external sort > > > > > >

Re: Sorting partitions in Java

2014-05-20 Thread Andrew Ash
Sandy, is there a Jira ticket for that? On Tue, May 20, 2014 at 10:12 AM, Sandy Ryza wrote: > sortByKey currently requires partitions to fit in memory, but there are > plans to add external sort > > > On Tue, May 20, 2014 at 10:10 AM, Madhu wrote: > > > Thanks Sean, I had seen that post you men

Re: Sorting partitions in Java

2014-05-20 Thread Sandy Ryza
sortByKey currently requires partitions to fit in memory, but there are plans to add external sort On Tue, May 20, 2014 at 10:10 AM, Madhu wrote: > Thanks Sean, I had seen that post you mentioned. > > What you suggest looks like an in-memory sort, which is fine if each partition > is > small enough

Re: Sorting partitions in Java

2014-05-20 Thread Madhu
Thanks Sean, I had seen that post you mentioned. What you suggest looks like an in-memory sort, which is fine if each partition is small enough to fit in memory. Is it true that rdd.sortByKey(...) requires partitions to fit in memory? I wasn't sure if there was some magic behind the scenes that support

Re: Sorting partitions in Java

2014-05-20 Thread Sean Owen
It's an Iterator in both Java and Scala. In both cases you need to copy the stream of values into something List-like to sort it. An Iterable would not change that (not sure the API can promise many iterations anyway). If you just want the equivalent of "toArray", you can use a utility method in C
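A minimal sketch of what Sean describes: the partition Iterator can only be traversed once, so it has to be copied into a List-like structure before sorting (plain Scala Iterator here, not a Spark partition; the tuple data is made up):

```scala
object IterSort {
  def main(args: Array[String]): Unit = {
    val iter: Iterator[(String, Int)] =
      Iterator(("b", 2), ("a", 3), ("c", 1))

    // Drain the single-pass Iterator into a List, then sort the copy.
    val sorted = iter.toList.sortBy(_._2)

    println(sorted)  // List((c,1), (b,2), (a,3))
  }
}
```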

Re: BUG: graph.triplets does not return proper values

2014-05-20 Thread GlennStrycker
Oh... ha, good point. Sorry, I'm new to mapreduce programming and forgot about that... I'll have to adjust my reduce function to output a vector/RDD as the element to return. Thanks for reminding me of this! -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabbl

Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-20 Thread Tom Graves
I assume we will have an rc10 to fix the issues Matei found? Tom On Sunday, May 18, 2014 9:08 PM, Patrick Wendell wrote: Hey Matei - the issue you found is not related to security. This patch a few days ago broke builds for Hadoop 1 with YARN support enabled. The patch directly altered the

Sorting partitions in Java

2014-05-20 Thread Madhu
I'm trying to sort data in each partition of an RDD. I was able to do it successfully in Scala like this: val sorted = rdd.mapPartitions(iter => { iter.toArray.sortWith((x, y) => x._2.compare(y._2) < 0).iterator }, preservesPartitioning = true) I used the same technique as in OrderedRDDFunction
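The function passed to mapPartitions there can be exercised on its own with a plain Iterator (a hedged standalone version of the same drain-sort-rewrap shape; the key/value types and sample data are made up):

```scala
object PartitionSort {
  // Same shape as the function in the mapPartitions call above: drain
  // the iterator into an array, sort by the second tuple element, and
  // hand back a fresh iterator.
  def sortPartition(iter: Iterator[(Int, String)]): Iterator[(Int, String)] =
    iter.toArray.sortWith((x, y) => x._2.compare(y._2) < 0).iterator

  def main(args: Array[String]): Unit = {
    val part = Iterator((1, "banana"), (2, "apple"), (3, "cherry"))
    println(sortPartition(part).toList)
    // List((2,apple), (1,banana), (3,cherry))
  }
}
```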