Hi,

If your DataFrame is already partitioned by column A and you want to deduplicate by columns A, B and C, a faster approach can be to sort each partition by A, B and C and then do a linear scan over it; this is often faster than grouping by all the columns, which requires a shuffle. Sadly, there's no standard way to do it.
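A minimal PySpark sketch of the idea (assuming df is the DataFrame already partitioned by A; the column names A, B and C are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def drop_sorted_duplicates(rows):
    # Within each partition, rows arrive sorted by (A, B, C),
    # so duplicate keys are adjacent and a single pass suffices.
    prev_key = None
    for row in rows:
        key = (row["A"], row["B"], row["C"])
        if key != prev_key:
            yield row
        prev_key = key

sorted_df = df.sortWithinPartitions("A", "B", "C")  # sorts each partition in place, no shuffle
deduped = spark.createDataFrame(
    sorted_df.rdd.mapPartitions(drop_sorted_duplicates),
    sorted_df.schema,
)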
One way to express that scan is via mapPartitions, as in the sketch above, but it involves serialisation to/from Row. The best way is to write a custom physical exec operator, but that's not entirely trivial.

On Mon, 10 Jun 2019, 06:00 Rishi Shah, <rishishah.s...@gmail.com> wrote:

> Hi All,
>
> Just wanted to check back regarding the best way to perform deduplication. Is using dropDuplicates the optimal way to get rid of duplicates? Would it be better to run operations on the RDD directly?
>
> Also, what about if we want to keep the last value of the group while performing deduplication (based on some sorting criteria)?
>
> Thanks,
> Rishi
>
> On Mon, May 20, 2019 at 3:33 PM Nicholas Hakobian <nicholas.hakob...@rallyhealth.com> wrote:
>
>> From doing some searching around in the Spark codebase, I found the following:
>>
>> https://github.com/apache/spark/blob/163a6e298213f216f74f4764e241ee6298ea30b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1452-L1474
>>
>> So it appears there is no direct physical operation called dropDuplicates or Deduplicate; instead, an optimizer rule converts this logical operation into one that is equivalent to grouping by all the columns you want to deduplicate across (or all columns, if you are doing something like distinct) and taking the first() value of each group. So, using a PySpark code example:
>>
>> df = input_df.dropDuplicates(['col1', 'col2'])
>>
>> is effectively shorthand for saying something like:
>>
>> from pyspark.sql.functions import first, struct
>> df = input_df.groupBy('col1', 'col2').agg(first(struct(input_df.columns)).alias('data')).select('data.*')
>>
>> except I assume that it has some internal optimization, so it doesn't need to pack/unpack the column data and can just return the whole Row.
>>
>> Nicholas Szandor Hakobian, Ph.D.
>> Principal Data Scientist
>> Rally Health
>> nicholas.hakob...@rallyhealth.com
>>
>>
>> On Mon, May 20, 2019 at 11:38 AM Yeikel <em...@yeikel.com> wrote:
>>
>>> Hi,
>>>
>>> I am looking for a high-level explanation (overview) of how dropDuplicates [1] works.
>>>
>>> [1] https://github.com/apache/spark/blob/db24b04cad421ed508413d397c6beec01f723aee/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2326
>>>
>>> Could someone please explain?
>>>
>>> Thank you
>
> --
> Regards,
>
> Rishi Shah
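As for the question above about keeping the last value of each group: one common pattern (a sketch, not the only way, and it does incur a shuffle) is to rank rows within each group by the sort criteria in descending order with a window function and keep rank 1. The column names a, b and ts below are placeholders.

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# keep the row with the latest ts for each (a, b) group
w = Window.partitionBy("a", "b").orderBy(col("ts").desc())
latest = (df.withColumn("rn", row_number().over(w))
            .where(col("rn") == 1)
            .drop("rn"))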