scala> class DeviceKeyPartitioner(partitions: Int) extends org.apache.spark.Partitioner {
     |   override def numPartitions: Int = partitions
     |   override def getPartition(key: Any): Int = {
     |     val k = key.asInstanceOf[DeviceKey]
     |     k.serialNum.hashCode() % numPartitions
     |   }
     | }
defined class DeviceKeyPartitioner

scala> t.repartitionAndSortWithinPartitions(new DeviceKeyPartitioner(2))
res0: org.apache.spark.rdd.RDD[(DeviceKey, Int)] = ShuffledRDD[1] at repartitionAndSortWithinPartitions at <console>:30
Yong
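
For the repartitionAndSortWithinPartitions call above to compile, an implicit Ordering[DeviceKey] must be in scope; the transcript assumes one was already defined. A minimal sketch of such an ordering (assuming, as in the linked post, that events should be ordered by eventTime within each serialNum):

// ordering used by the shuffle: primary on serialNum, secondary on eventTime
implicit val deviceKeyOrdering: Ordering[DeviceKey] =
  Ordering.by((k: DeviceKey) => (k.serialNum, k.eventTime))

With this in scope, the values within each partition come back sorted by serial number and then by event time.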
From: Pariksheet Barapatre
Sent: Wednesday, March 29, 2017 9:02 AM
To: user
Hi,
I am referring to the web link http://codingjunkie.net/spark-secondary-sort/
(see also http://stackoverflow.com/questions/43038682/secondary-sort-using-apache-spark-1-6)
to implement secondary sort in my Spark job.
I have defined my key case class as
case class DeviceKey(serialNum: String, eventTime: Long)
is there a way to leverage the shuffle in Dataset/GroupedDataset so that the
Iterator[V] in flatMapGroups has a well defined ordering?
it is hard for me to see many good use cases for flatMapGroups and mapGroups
if you do not have sorting.
since spark has a sort based shuffle, not exposing this would be a missed
opportunity.
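
Until the shuffle ordering is exposed through that API, one workaround on the Dataset side is to repartition by the grouping column and sort within partitions before a mapPartitions pass. A rough sketch (hypothetical names, Spark 2.x assumed; `events` stands in for a Dataset of (serialNum, eventTime, value) tuples):

// assumes a SparkSession in scope as `spark` (as in spark-shell)
import spark.implicits._

// hypothetical input: Dataset[(String, Long, Double)] of (serialNum, eventTime, value)
val events = Seq(("a", 2L, 1.0), ("a", 1L, 2.0), ("b", 1L, 3.0)).toDS()

val processed = events
  .repartition($"_1")                  // co-locate all rows for a serialNum
  .sortWithinPartitions($"_1", $"_2")  // sort by serialNum, then eventTime
  .mapPartitions { rows =>
    // rows for each serialNum are now contiguous and time-ordered
    rows.map { case (serial, ts, value) => (serial, value) } // placeholder logic
  }

This does not hand you a per-key Iterator[V] the way flatMapGroups would, but within a partition each key's rows arrive together and in order, so per-key streaming logic can be written against the partition iterator.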
You should create the key as a tuple type. In your case, RDD[((id, timeStamp),
value)] is the proper way to do it.
Kevin
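
A sketch of that approach (names are hypothetical): the partitioner must hash on the id alone, so that all records for an id land in the same partition, while the tuple key's natural lexicographic ordering handles the secondary sort by timeStamp:

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// partitions on id only; the (id, timeStamp) tuple ordering does the rest
class IdPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    val id = key.asInstanceOf[(String, Long)]._1
    val rawMod = id.hashCode % numPartitions
    rawMod + (if (rawMod < 0) numPartitions else 0) // keep the result non-negative
  }
}

// records: RDD[((String, Long), String)], i.e. ((id, timeStamp), value)
def secondarySort(records: RDD[((String, Long), String)]): RDD[((String, Long), String)] =
  records.repartitionAndSortWithinPartitions(new IdPartitioner(8))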
--- Original Message ---
Sender : swetha
Date : 2015-08-12 09:37 (GMT+09:00)
Title : What is the optimal approach to do Secondary Sort in Spark?
Hi,
What is the optimal approach to do secondary sort in Spark? I have to first
sort by an Id in the key and further sort by the timeStamp which is present
in the value.
Thanks,
Swetha
there are use cases that require processing all the values for a key
reduce-side, even when they do not fit in memory. examples are algorithms
that need to process the values ordered, or algorithms that need to emit all
values again. basically this is what the original hadoop reduce operation
did so well: it allowed sorting of values (using secondary sort), and it
processed all values per key in a streaming fashion.
the library spark-sorted aims to bring these kind of operations back to
spark, by providing a way to process the values for each key in a streaming
fashion, sorted if needed.
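
Without a library, the streaming part can be approximated by hand: partition by the grouping key, sort within partitions, and walk each partition's iterator in one pass. A sketch (hypothetical names; `sorted` is assumed to be an RDD[((String, Long), Double)] already partitioned by id and sorted by (id, timeStamp), e.g. the output of repartitionAndSortWithinPartitions with an id-only partitioner as above):

// one pass, constant memory per partition: compare each record to its predecessor
val deltas = sorted.mapPartitions { it =>
  var prev: Option[((String, Long), Double)] = None
  it.flatMap { case rec @ ((id, _), value) =>
    val out = prev.collect {
      case ((prevId, _), prevValue) if prevId == id =>
        (id, value - prevValue) // change since the previous event for this id
    }
    prev = Some(rec)
    out
  }
}

This keeps the hadoop-style guarantee informally: within a partition the records for an id are contiguous and time-ordered, so per-key state never has to outlive the current key.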
Adding an issue in JIRA would help keep track of the feature request:
https://issues.apache.org/jira/browse/SPARK
On Sat, Sep 20, 2014 at 7:39 AM, Koert Kuipers wrote:
> now that spark has a sort based shuffle, can we expect a secondary sort
> soon? there are some use cases where getting a sorted iterator of values
> per key is helpful.