Thank you for the suggestion. However, this would involve converting my DataFrame to an RDD and back again later, which adds extra cost.
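For reference, here is how I read the scan described in the quoted suggestion, as a minimal sketch in plain Python with no Spark involved. The function name, the (city, country, population) tuple layout, and the sample numbers are my own assumptions for illustration, not real data and not anything from the thread.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Assumed row layout: (city, country, population).
Row = Tuple[str, str, int]


def largest_city_per_country(rows: Iterable[Row]) -> Iterator[Row]:
    """Scan the rows once, keeping the largest city seen so far per country."""
    best: Dict[str, Row] = {}  # country -> best row seen so far
    for city, country, population in rows:
        current = best.get(country)
        # Keep the new row only if it beats the current maximum.
        if current is None or population > current[2]:
            best[country] = (city, country, population)
    return iter(best.values())


# Illustrative rows with made-up round numbers.
rows = [
    ("Berlin", "Germany", 3_600_000),
    ("Hamburg", "Germany", 1_800_000),
    ("Paris", "France", 2_100_000),
    ("Lyon", "France", 500_000),
]

result = sorted(largest_city_per_country(rows))
for row in result:
    print(row)
```

If a function like this were passed to mapPartitions, all rows of a country would have to sit in the same partition, or the per-partition results would need to be merged again. Either way it runs on the RDD side, which is the conversion I would like to avoid.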
On Tue, Dec 20, 2022 at 7:30 AM Raghavendra Ganesh <raghavendr...@gmail.com> wrote:

> you can groupBy(country). and use mapPartitions method in which you can
> iterate over all rows keeping 2 variables for maxPopulationSoFar and
> corresponding city. Then return the city with max population.
> I think as others suggested, it may be possible to use Bucketing, it would
> give a more friendly SQL'ish way of doing and but not be the best in
> performance as it needs to order/sort.
> --
> Raghavendra
>
>
> On Mon, Dec 19, 2022 at 8:57 PM Oliver Ruebenacker <
> oliv...@broadinstitute.org> wrote:
>
>>
>> Hello,
>>
>> How can I retain from each group only the row for which one value is
>> the maximum of the group? For example, imagine a DataFrame containing all
>> major cities in the world, with three columns: (1) City name (2) Country
>> (3) population. How would I get a DataFrame that only contains the largest
>> city in each country? Thanks!
>>
>> Best, Oliver
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>> Flannick
>> Lab <http://www.flannicklab.org/>, Broad Institute
>> <http://www.broadinstitute.org/>
>>
>

--
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
Flannick Lab <http://www.flannicklab.org/>,
Broad Institute <http://www.broadinstitute.org/>