Re: Secondary Sort using Apache Spark 1.6

2017-03-29 Thread Pariksheet Barapatre
…eOf[DeviceKey] | k.serialNum.hashCode() % numPartitions | } | } defined class DeviceKeyPartitioner scala> t.repartitionAndSortWithinPartitions(new DeviceKeyPartitioner(2)) res0: org.apache.spark.rdd.RDD[(DeviceKey, Int)] = Shuffl…

Re: Secondary Sort using Apache Spark 1.6

2017-03-29 Thread Yong Zhang
scala> t.repartitionAndSortWithinPartitions(new DeviceKeyPartitioner(2)) res0: org.apache.spark.rdd.RDD[(DeviceKey, Int)] = ShuffledRDD[1] at repartitionAndSortWithinPartitions at <console>:30 Yong From: Pariksheet Barapatre Sent: Wednesday, March 29, 2017 9:02 AM To: user…

Secondary Sort using Apache Spark 1.6

2017-03-29 Thread Pariksheet Barapatre
Hi, <http://stackoverflow.com/questions/43038682/secondary-sort-using-apache-spark-1-6#> I am referring to the web link http://codingjunkie.net/spark-secondary-sort/ to implement secondary sort in my Spark job. I have defined my key case class as case class DeviceKey(serialNum: String, eve…
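
A minimal, self-contained reconstruction of the pattern this thread discusses, written for the Spark shell (sc in scope). The eventTime field of DeviceKey and the sample data are assumptions; the original message is truncated before the full key definition.

import org.apache.spark.Partitioner

// The second key field and its type are assumed; the message above is
// cut off after serialNum.
case class DeviceKey(serialNum: String, eventTime: Long)

object DeviceKey {
  // Sort by device first, then by event time within a device.
  implicit val ordering: Ordering[DeviceKey] =
    Ordering.by(k => (k.serialNum, k.eventTime))
}

// Partition on serialNum alone so all events of one device land in the
// same partition; the shuffle then sorts them by the full key.
class DeviceKeyPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[DeviceKey]
    math.abs(k.serialNum.hashCode) % numPartitions
  }
}

// Assumed sample data, shaped like the RDD[(DeviceKey, Int)] in the replies.
val t = sc.parallelize(Seq(
  (DeviceKey("sn1", 20L), 1),
  (DeviceKey("sn1", 10L), 2),
  (DeviceKey("sn2", 15L), 3)
))

// Produces the ShuffledRDD shown in the replies above: whole devices per
// partition, events time-ordered within each device.
val sorted = t.repartitionAndSortWithinPartitions(new DeviceKeyPartitioner(2))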

GroupedDataset flatMapGroups with sorting (aka secondary sort redux)

2016-02-12 Thread Koert Kuipers
Is there a way to leverage the shuffle in Dataset/GroupedDataset so that the Iterator[V] in flatMapGroups has a well-defined ordering? It is hard for me to see many good use cases for flatMapGroups and mapGroups if you do not have sorting. Since Spark has a sort-based shuffle, not exposing this would be…
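
The ordering Koert asks about is not guaranteed by flatMapGroups, so the portable fallback is to buffer and sort each group in memory, trading away the streaming behaviour. A minimal sketch against the later Dataset API (groupByKey; in 1.6 the GroupedDataset came from groupBy on Dataset), with the Event type and sample data assumed:

import org.apache.spark.sql.SparkSession

// Hypothetical record type for illustration.
case class Event(userId: String, ts: Long, value: Double)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val events = Seq(
  Event("a", 3L, 1.0), Event("a", 1L, 2.0), Event("b", 2L, 3.0)
).toDS()

// flatMapGroups makes no promise about the order of the Iterator[V],
// so the group is materialized and sorted in memory, which is exactly
// the cost a shuffle-time sort would avoid.
val earliest = events
  .groupByKey(_.userId)
  .flatMapGroups { (user, it) =>
    it.toSeq.sortBy(_.ts).take(1).map(e => (user, e.ts))
  }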

Re: What is the optimal approach to do Secondary Sort in Spark?

2015-08-11 Thread Kevin Jung
You should create the key as a tuple type. In your case, RDD[((id, timeStamp), value)] is the proper way to do it. Kevin --- Original Message --- Sender: swetha Date: 2015-08-12 09:37 (GMT+09:00) Title: What is the optimal approach to do Secondary Sort in Spark? Hi, What is the optimal…
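
Kevin's tuple-key suggestion, sketched for the Spark shell with assumed types and data: partition on the id alone so each id's rows share a partition, and let the built-in tuple Ordering sort by (id, timeStamp) during the shuffle.

import org.apache.spark.Partitioner

// Route rows by id only; the timeStamp half of the key exists purely
// to be sorted on.
class IdPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case (id: String, _) => math.abs(id.hashCode) % numPartitions
  }
}

// Assumed sample data; the thread shows none.
val data = sc.parallelize(Seq(
  (("id1", 20L), "late"),
  (("id1", 10L), "early"),
  (("id2", 5L), "only")
))

// Tuple keys already carry a lexicographic Ordering, so no custom
// Ordering is needed: rows sort by id, then timeStamp.
val sorted = data.repartitionAndSortWithinPartitions(new IdPartitioner(4))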

What is the optimal approach to do Secondary Sort in Spark?

2015-08-11 Thread swetha
Hi, What is the optimal approach to do secondary sort in Spark? I have to first sort by an Id in the key and further sort by a timeStamp that is present in the value. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-optimal…

Re: spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Koert Kuipers
…values again. Basically this is what the original Hadoop reduce operation did so well: it allowed sorting of values (using secondary sort), and it processed all values per key in a streaming fashion. The library spark-sorted aims to bring these kinds of oper…

Re: spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Burak Yavuz
…reduce-side, even when they do not fit in memory. Examples are algorithms that need to process the values in order, or algorithms that need to emit all values again. Basically this is what the original Hadoop reduce operation did so well: it allowed sorting of values (using secondary sor…

spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Koert Kuipers
…Basically this is what the original Hadoop reduce operation did so well: it allowed sorting of values (using secondary sort), and it processed all values per key in a streaming fashion. The library spark-sorted aims to bring these kinds of operations back to Spark, by providing a way to process values…
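
The thread does not show spark-sorted's own API, so what follows is only the underlying pattern sketched in plain Spark: pull the value into the key so the shuffle sorts it, then walk each partition one key-run at a time, folding values in order with constant memory per key (a running sum stands in for any order-sensitive fold).

import org.apache.spark.Partitioner

// Partition on the logical key only; the value rides along in the key
// purely so the shuffle sorts it.
class FirstPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case (k: String, _) => math.abs(k.hashCode) % numPartitions
  }
}

val pairs = sc.parallelize(Seq(("k1", 3), ("k1", 1), ("k2", 2))) // assumed data
  .map { case (k, v) => ((k, v), ()) }

val folded = pairs
  .repartitionAndSortWithinPartitions(new FirstPartitioner(4))
  .mapPartitions { it =>
    // Within a partition one key's entries arrive consecutively and in
    // order, so each run is consumed without being buffered.
    val buf = it.buffered
    new Iterator[(String, Int)] {
      def hasNext: Boolean = buf.hasNext
      def next(): (String, Int) = {
        val key = buf.head._1._1
        var acc = 0
        while (buf.hasNext && buf.head._1._1 == key) {
          acc += buf.next()._1._2 // streaming, order-respecting fold
        }
        (key, acc)
      }
    }
  }

This keeps per-key memory constant, which is the Hadoop-reduce property the message describes.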

Re: secondary sort

2014-09-22 Thread Koert Kuipers
…> now that Spark has a sort-based shuffle, can we expect a secondary sort soon? There are some use cases where getting a sorted iterator of values per key is helpful.

Re: secondary sort

2014-09-22 Thread Daniil Osipov
Adding an issue in JIRA would help keep track of the feature request: https://issues.apache.org/jira/browse/SPARK On Sat, Sep 20, 2014 at 7:39 AM, Koert Kuipers wrote: > now that Spark has a sort-based shuffle, can we expect a secondary sort soon? There are some use cases where get…

secondary sort

2014-09-20 Thread Koert Kuipers
Now that Spark has a sort-based shuffle, can we expect a secondary sort soon? There are some use cases where getting a sorted iterator of values per key is helpful.