I would just map to a pair using the id. Then do a reduceByKey where you compare
the scores and keep the highest. Then do .values() and that should do it.
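A rough PySpark sketch of that idea (just a sketch - it assumes an RDD of dicts keyed by "id" with a numeric "score"; swap in your actual key and comparison):

    # Sketch only: field names "id" and "score" are placeholders.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    records = sc.parallelize([
        {"id": 1, "score": 10},
        {"id": 1, "score": 30},
        {"id": 2, "score": 20},
    ])

    deduped = (
        records
        .map(lambda r: (r["id"], r))                                     # pair by id
        .reduceByKey(lambda a, b: a if a["score"] >= b["score"] else b)  # keep the higher score
        .values()                                                        # one record per id
    )

    print(deduped.collect())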
> On Jan 11, 2020, at 11:14 AM, Rishi Shah wrote:
Thanks everyone for your contributions on this topic. I wanted to check in
to see if anyone has discovered a different approach, or has an opinion on a
better approach, to deduplicating data using PySpark. I would really appreciate
any further insight on this.
Thanks,
-Rishi
On Wed, Jun 12, 2019 at 4:21 PM Yeik
Nicholas, thank you for your explanation.
I am also interested in the example that Rishi is asking for. I am sure
mapPartitions would work, but as Vladimir suggests it may not be the best
option in terms of performance.
@Vladimir Prus, are you aware of any example about writing a "custom
phy
Hi,
If your data frame is partitioned by column A, and you want deduplication
by columns A, B and C, then a faster way might be to sort each partition by
A, B and C and then do a linear scan - it is often faster than a group by on
all columns, which requires a shuffle. Sadly, there's no standard way to
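A rough sketch of that approach (assuming the DataFrame df is already partitioned by column A; A, B and C are placeholder column names): sortWithinPartitions keeps the sort local to each partition, and mapPartitions does the linear scan that drops consecutive duplicates.

    # Sketch only: df is assumed to be partitioned by "A" already, so the
    # local sort plus linear scan avoids the shuffle a groupBy would need.
    def drop_consecutive_dupes(rows):
        prev = None
        for row in rows:
            key = (row["A"], row["B"], row["C"])
            if key != prev:
                prev = key
                yield row

    deduped = (
        df.sortWithinPartitions("A", "B", "C")    # local sort, no shuffle
          .rdd
          .mapPartitions(drop_consecutive_dupes)
          .toDF(df.schema)
    )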
Hi All,
Just wanted to check back regarding the best way to perform deduplication. Is
using dropDuplicates the optimal way to get rid of duplicates? Would it be
better if we ran operations on the RDD directly?
Also, what if we want to keep the last value of the group while performing
deduplication?
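For the last-value case, would something like a window with row_number be the way to go, or is there a better option? A rough sketch of what I mean, where "id" is the dedup key and "ts" is just a placeholder ordering column:

    # Sketch only: keeps the row with the largest "ts" per id.
    from pyspark.sql import Window
    from pyspark.sql import functions as F

    w = Window.partitionBy("id").orderBy(F.col("ts").desc())

    last_per_group = (
        df.withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .drop("rn")
    )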
From doing some searching around in the Spark codebase, I found the
following:
https://github.com/apache/spark/blob/163a6e298213f216f74f4764e241ee6298ea30b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1452-L1474
So it appears there is no direct operation
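If I'm reading that rule correctly, dropDuplicates on a subset of columns seems to get rewritten into an Aggregate that groups on the dedup columns and takes first() of the remaining ones - roughly something like this at the DataFrame level (column name "A" is just a placeholder, and this is an analogue, not the actual internal code):

    # Rough DataFrame-level analogue of what the optimizer rule appears to
    # do for df.dropDuplicates(["A"]).
    from pyspark.sql import functions as F

    deduped = df.groupBy("A").agg(
        *[F.first(c).alias(c) for c in df.columns if c != "A"]
    )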
Hi,
I am looking for a high-level explanation (overview) of how dropDuplicates [1]
works.
[1]
https://github.com/apache/spark/blob/db24b04cad421ed508413d397c6beec01f723aee/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2326
Could someone please explain?
Thank you