How to get top N elements in a DataSet?

Ivan Mushketyk Tue, 24 Jan 2017 01:12:03 -0800

Hi,

I have a dataset of tuples with two fields ids and ratings and I need to
find 10 elements with the highest rating in this dataset. I found a
solution, but I think it's suboptimal and I think there should be a better
way to do it.


The best thing that I came up with is to partition dataset by rating, sort
locally and write the partitioned dataset to disk:

dataset
.partitionCustom(new Partitioner<Double>() {
  @Override
  public int partition(Double key, int numPartitions) {
    return key.intValue() % numPartitions;
  }
}, 1) . // partition by rating
.setParallelism(5)
.sortPartition(1, Order.DESCENDING) // locally sort by rating
.writeAsText("..."); // write the partitioned dataset to disk

This will store tuples in sorted files with names 5, 4, 3, ... that contain
ratings in ranges (5, 4], (4, 3], and so on. Then I can read sorted data
from disk and and N elements with the highest rating.
Is there a way to do the same but without writing a partitioned dataset to
a disk?

I tried to use "first(10)" but it seems to give top 10 items from a random
partition. Is there a way to get top N elements from every partition? Then
I could locally sort top values from every partition and find top 10 global
values.

Best regards,
Ivan.

How to get top N elements in a DataSet?

Reply via email to