Thanks Daniel. I can understand that the keys will not be in sorted order but what I am trying to understanding is whether the functions are passed values in sorted order in a given partition.
For example: sc.parallelize(1 to 8).map(i => (1, i)).sortBy(t => t._2).foldByKey(0)((a, b) => b).collect res0: Array[(Int, Int)] = Array((1,8)) The fold always given me last value as 8 which suggests values preserve sorting earlier defined in stage in DAG? On Wed Nov 19 2014 at 18:10:11 Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > Akhil, I think Aniket uses the word "persisted" in a different way than > what you mean. I.e. not in the RDD.persist() way. Aniket asks if running > combineByKey on a sorted RDD will result in a sorted RDD. (I.e. the sorting > is preserved.) > > I think the answer is no. combineByKey uses AppendOnlyMap, which is a > hashmap. That will shuffle your keys. You can quickly verify it in > spark-shell: > > scala> sc.parallelize(7 to 8).map(_ -> 1).reduceByKey(_ + _).collect > res0: Array[(Int, Int)] = Array((8,1), (7,1)) > > (The initial size of the AppendOnlyMap seems to be 8, so 8 is the first > number that demonstrates this.) > > On Wed, Nov 19, 2014 at 9:05 AM, Akhil Das <ak...@sigmoidanalytics.com> > wrote: > >> If something is persisted you can easily see them under the Storage tab >> in the web ui. >> >> Thanks >> Best Regards >> >> On Tue, Nov 18, 2014 at 7:26 PM, Aniket Bhatnagar < >> aniket.bhatna...@gmail.com> wrote: >> >>> I am trying to figure out if sorting is persisted after applying Pair >>> RDD transformations and I am not able to decisively tell after reading the >>> documentation. >>> >>> For example: >>> val numbers = .. // RDD of numbers >>> val pairedNumbers = numbers.map(number => (number % 100, number)) >>> val sortedPairedNumbers = pairedNumbers.sortBy(pairedNumber => >>> pairedNumber._2) // Sort by values in the pair >>> val aggregates = sortedPairedNumbers.combineByKey(..) >>> >>> In this example, will the combine functions see values in sorted order? >>> What if I had done groupByKey and then combineByKey? What transformations >>> can unsort an already sorted data? >>> >> >> >