Re: Removing duplicates from dataframe

Ted Yu Mon, 07 Dec 2015 10:21:07 -0800

bq. complete a shuffle stage due to lost executors

Have you taken a look at the logs for the lost executor(s)?

Which release of Spark are you using?

Cheers

On Mon, Dec 7, 2015 at 10:12 AM, <ross.cramb...@thomsonreuters.com> wrote:

> I have a PySpark app loading a large-ish (100GB) dataframe from JSON files
> and it turns out there are a number of duplicate JSON objects in the source
> data. I am trying to find the best way to remove these duplicates before
> using the dataframe.
>
> With both df.dropDuplicates() and df.sqlContext.sql('''SELECT DISTINCT
> *…''') the application is not able to complete a shuffle stage due to lost
> executors. Is there a more efficient way to remove these duplicate rows? If
> not, what settings can I tweak to help this succeed? I have tried both
> increasing and decreasing the default number of shuffle partitions (to 500
> and 100, respectively), but neither changes the behavior.
>
>
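
For reference, one alternative sometimes tried for the question above is to drop duplicates at the raw-text level, before the DataFrame is built, by keying each JSON line on a fixed-size digest of its canonical form. The sketch below is only an illustration under stated assumptions, not a confirmed fix for the lost executors: it assumes one JSON object per line, and the names canonical_key, dedupe_json_lines, path and num_partitions are hypothetical. Whether it actually behaves better than df.dropDuplicates() on this data is untested.

```python
import hashlib
import json


def canonical_key(json_text):
    """Return a fixed-size digest that is equal for equal JSON objects."""
    obj = json.loads(json_text)
    # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} serialize identically.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def dedupe_json_lines(sc, path, num_partitions):
    """Hypothetical helper: return an RDD of JSON lines with duplicates dropped.

    sc is an existing SparkContext; path and num_partitions are placeholders
    to be chosen for the actual job.
    """
    return (sc.textFile(path)
              .map(lambda line: (canonical_key(line), line))
              # Keep one line per key; reduceByKey combines on the map side
              # before shuffling.
              .reduceByKey(lambda a, b: a, num_partitions)
              .values())
```

The resulting RDD of strings would still need to be turned into a DataFrame afterwards. Note that the digest treats 1 and 1.0 as different values, so objects that differ only in number formatting are not merged.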
