Hi,

If your DataFrame is already partitioned by column A and you want to deduplicate by columns A, B and C, a faster approach can be to sort each partition by A, B and C and then do a linear scan over it; this is often faster than grouping by all the columns, which requires a shuffle. Sadly, there's no standard way to do it.
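A minimal PySpark sketch of the idea (assuming df is the DataFrame already partitioned by A; the column names A, B and C are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def drop_sorted_duplicates(rows):
    # Within each partition, rows arrive sorted by (A, B, C),
    # so duplicate keys are adjacent and a single pass suffices.
    prev_key = None
    for row in rows:
        key = (row["A"], row["B"], row["C"])
        if key != prev_key:
            yield row
        prev_key = key

sorted_df = df.sortWithinPartitions("A", "B", "C")  # sorts each partition in place, no shuffle
deduped = spark.createDataFrame(
    sorted_df.rdd.mapPartitions(drop_sorted_duplicates),
    sorted_df.schema,
)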
One way to express that scan is via mapPartitions, as in the sketch above, but it involves serialisation to/from Row. The best way is to write a custom physical exec operator, but that's not entirely trivial.

On Mon, 10 Jun 2019, 06:00 Rishi Shah, <rishishah.s...@gmail.com> wrote:

> Hi All,
>
> Just wanted to check back regarding the best way to perform deduplication. Is using dropDuplicates the optimal way to get rid of duplicates? Would it be better to run operations on the RDD directly?
>
> Also, what about if we want to keep the last value of the group while performing deduplication (based on some sorting criteria)?
>
> Thanks,
> Rishi
>
> On Mon, May 20, 2019 at 3:33 PM Nicholas Hakobian <nicholas.hakob...@rallyhealth.com> wrote:
>
>> From doing some searching around in the Spark codebase, I found the following:
>>
>> https://github.com/apache/spark/blob/163a6e298213f216f74f4764e241ee6298ea30b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1452-L1474
>>
>> So it appears there is no direct physical operation called dropDuplicates or Deduplicate; instead, an optimizer rule converts this logical operation into one that is equivalent to grouping by all the columns you want to deduplicate across (or all columns, if you are doing something like distinct) and taking the first() value of each group. So, using a PySpark code example:
>>
>> df = input_df.dropDuplicates(['col1', 'col2'])
>>
>> is effectively shorthand for saying something like:
>>
>> from pyspark.sql.functions import first, struct
>> df = input_df.groupBy('col1', 'col2').agg(first(struct(input_df.columns)).alias('data')).select('data.*')
>>
>> except I assume that it has some internal optimization, so it doesn't need to pack/unpack the column data and can just return the whole Row.
>>
>> Nicholas Szandor Hakobian, Ph.D.
>> Principal Data Scientist
>> Rally Health
>> nicholas.hakob...@rallyhealth.com
>>
>>
>> On Mon, May 20, 2019 at 11:38 AM Yeikel <em...@yeikel.com> wrote:
>>
>>> Hi,
>>>
>>> I am looking for a high-level explanation (overview) of how dropDuplicates [1] works.
>>>
>>> [1] https://github.com/apache/spark/blob/db24b04cad421ed508413d397c6beec01f723aee/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2326
>>>
>>> Could someone please explain?
>>>
>>> Thank you
>
> --
> Regards,
>
> Rishi Shah
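As for the question above about keeping the last value of each group: one common pattern (a sketch, not the only way, and it does incur a shuffle) is to rank rows within each group by the sort criteria in descending order with a window function and keep rank 1. The column names a, b and ts below are placeholders.

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# keep the row with the latest ts for each (a, b) group
w = Window.partitionBy("a", "b").orderBy(col("ts").desc())
latest = (df.withColumn("rn", row_number().over(w))
            .where(col("rn") == 1)
            .drop("rn"))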