key, then sortWithinPartitions, and then groupBy. Since the data are
already hash-partitioned by key, Spark should not shuffle the data and
hence not change the sort order within each partition:
ds.repartition($"key").sortWithinPartitions($"code").groupBy($"key")
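
For context, a fuller sketch of that pipeline might look like the
following (the collect_list aggregation and the column names are taken
from the question quoted below and are assumptions, not part of the
original one-liner):

  import org.apache.spark.sql.functions.collect_list
  import spark.implicits._   // provides the $"..." column syntax

  val result = ds
    .repartition($"key")              // hash-partition the rows by key
    .sortWithinPartitions($"code")    // sort each partition locally, no shuffle
    .groupBy($"key")                  // data is already partitioned by key, so no further shuffle
    .agg(collect_list($"code_value")) // assumed aggregation: one list of code_value per key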
Enrico
On 26.03.20 at
Hi,
I have a dataframe which has data like:
key | code | code_value
1 | c1 | 11
1 | c2 | 12
1 | c2 | 9
1 | c3
Hi all,
I want to collect some rows into a list by using Spark's collect_list
function.
However, the number of rows going into the list is overflowing memory.
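
(For illustration, a call of the kind described here might look like
this minimal sketch; the dataframe name df and the exact column names
are assumptions based on the example data above:)

  import org.apache.spark.sql.functions.collect_list

  // assumed shape of the call: builds one in-memory list of code_value per key
  df.groupBy($"key").agg(collect_list($"code_value"))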
Is there any way to force the collected rows onto disk rather than
keeping them in memory, or else, instead of collecting them as a list,