Re: [PySpark] Getting the best row from each group

Oliver Ruebenacker Tue, 20 Dec 2022 13:41:21 -0800

Thank you for the suggestion. This would, however, involve converting my
DataFrame to an RDD (and back again later), which adds conversion overhead.


On Tue, Dec 20, 2022 at 7:30 AM Raghavendra Ganesh <[email protected]>
wrote:

> You can groupBy(country) and use the mapPartitions method, in which you can
> iterate over all rows while keeping two variables, maxPopulationSoFar and
> the corresponding city. Then return the city with the max population.
> As others suggested, it may also be possible to use bucketing. That would
> give a friendlier, SQL-like way of doing it, but it would not be the best in
> performance, as it needs to order/sort.
> --
> Raghavendra
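
The single-pass scan described above can be sketched in plain Python,
independent of Spark. This is only an illustration of the idea: the function
name, the tuple layout (city, country, population), and the sample figures are
assumptions, and instead of exactly two variables it keeps one running best
row per country, so it also works when several countries are mixed in the
same input.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Assumption: each row is a (city, country, population) tuple.
Row = Tuple[str, str, int]


def largest_city_per_country(rows: Iterable[Row]) -> Iterator[Row]:
    """Yield, for each country seen, the row with the highest population."""
    best: Dict[str, Row] = {}
    for city, country, population in rows:
        current = best.get(country)
        # Replace the running maximum only when a strictly larger city
        # appears, so the first city seen wins a tie.
        if current is None or population > current[2]:
            best[country] = (city, country, population)
    yield from best.values()


# Made-up sample figures, for illustration only.
sample = [
    ("Osaka", "Japan", 2750000),
    ("Paris", "France", 2100000),
    ("Tokyo", "Japan", 13960000),
    ("Lyon", "France", 520000),
]
winners = sorted(largest_city_per_country(sample))
```

The function takes an iterator and yields rows, which is the shape that
RDD.mapPartitions expects. Used that way, the result is only correct if all
rows for a given country sit in the same partition.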
>
>
> On Mon, Dec 19, 2022 at 8:57 PM Oliver Ruebenacker <
> [email protected]> wrote:
>
>>
>>      Hello,
>>
>>   How can I retain from each group only the row for which one value is
>> the maximum of the group? For example, imagine a DataFrame containing all
>> major cities in the world, with three columns: (1) City name, (2) Country,
>> (3) Population. How would I get a DataFrame that only contains the largest
>> city in each country? Thanks!
>>
>>      Best, Oliver
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>> <http://www.broadinstitute.org/>
>>
>

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
Flannick Lab <http://www.flannicklab.org/>, Broad Institute
<http://www.broadinstitute.org/>
