Hi Ivan, I think you can use MapPartition for that. So basically:
dataset // assuming some partitioning that can be reused to avoid a shuffle .sortPartition(1, Order.DESCENDING) .mapPartition(new ReturnFirstTen()) .sortPartition(1, Order.DESCENDING).parallelism(1) .mapPartition(new ReturnFirstTen()) Best, Fabian 2017-01-24 10:10 GMT+01:00 Ivan Mushketyk <ivan.mushke...@gmail.com>: > Hi, > > I have a dataset of tuples with two fields ids and ratings and I need to > find 10 elements with the highest rating in this dataset. I found a > solution, but I think it's suboptimal and I think there should be a better > way to do it. > > The best thing that I came up with is to partition dataset by rating, sort > locally and write the partitioned dataset to disk: > > dataset > .partitionCustom(new Partitioner<Double>() { > @Override > public int partition(Double key, int numPartitions) { > return key.intValue() % numPartitions; > } > }, 1) . // partition by rating > .setParallelism(5) > .sortPartition(1, Order.DESCENDING) // locally sort by rating > .writeAsText("..."); // write the partitioned dataset to disk > > This will store tuples in sorted files with names 5, 4, 3, ... that > contain ratings in ranges (5, 4], (4, 3], and so on. Then I can read sorted > data from disk and and N elements with the highest rating. > Is there a way to do the same but without writing a partitioned dataset to > a disk? > > I tried to use "first(10)" but it seems to give top 10 items from a random > partition. Is there a way to get top N elements from every partition? Then > I could locally sort top values from every partition and find top 10 global > values. > > Best regards, > Ivan. > > >