Hi, I have a dataset of tuples with two fields ids and ratings and I need to find 10 elements with the highest rating in this dataset. I found a solution, but I think it's suboptimal and I think there should be a better way to do it.
The best thing that I came up with is to partition dataset by rating, sort locally and write the partitioned dataset to disk: dataset .partitionCustom(new Partitioner<Double>() { @Override public int partition(Double key, int numPartitions) { return key.intValue() % numPartitions; } }, 1) . // partition by rating .setParallelism(5) .sortPartition(1, Order.DESCENDING) // locally sort by rating .writeAsText("..."); // write the partitioned dataset to disk This will store tuples in sorted files with names 5, 4, 3, ... that contain ratings in ranges (5, 4], (4, 3], and so on. Then I can read sorted data from disk and and N elements with the highest rating. Is there a way to do the same but without writing a partitioned dataset to a disk? I tried to use "first(10)" but it seems to give top 10 items from a random partition. Is there a way to get top N elements from every partition? Then I could locally sort top values from every partition and find top 10 global values. Best regards, Ivan.