Re: Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
Okay, maybe these errors are more helpful:

WARN server.TransportChannelHandler: Exception in connection from ip-10-0-0-138.ec2.internal/10.0.0.138:39723
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(
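A "Connection reset by peer" during a shuffle fetch generally means the remote executor died mid-transfer; on YARN a common culprit is the container being killed for exceeding its memory limit. One frequently suggested mitigation (a sketch only, with illustrative values, not a confirmed fix for this job) is to give each executor more off-heap headroom and more fetch retries:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.yarn.executor.memoryOverhead", "2048")  # MB of off-heap headroom per executor (illustrative value)
            .set("spark.shuffle.io.maxRetries", "10"))          # retry failed shuffle fetches before failing the stage
    sc = SparkContext(conf=conf)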

Re: Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
Here is the trace I get from the command line:

[Stage 4:> (60 + 60) / 200]
15/12/07 18:59:40 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 10.0.0.138:33822
15/12/07 18:59:40 WARN YarnSchedulerBackend$YarnS
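The "/ 200" in that progress bar is the default spark.sql.shuffle.partitions, so roughly 100GB of shuffle data would land in about 200 reduce tasks of roughly 500MB each, which can be enough to push an executor past its container limit. A hedged tweak, with an illustrative value, is to spread the shuffle across more, smaller tasks:

    # assumes an existing SQLContext named sqlContext
    sqlContext.setConf("spark.sql.shuffle.partitions", "800")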

Re: Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
I have looked through the logs and do not see any WARNINGs or ERRORs - the executors just seem to stop logging. I am running Spark 1.5.2 on YARN.

On Dec 7, 2015, at 1:20 PM, Ted Yu <yuzhih...@gmail.com> wrote:

bq. complete a shuffle stage due to lost executors

Have you taken a look at the log for the lost executor(s)?

Re: Removing duplicates from dataframe

2015-12-07 Thread Ted Yu
bq. complete a shuffle stage due to lost executors

Have you taken a look at the log for the lost executor(s)? Which release of Spark are you using?

Cheers

On Mon, Dec 7, 2015 at 10:12 AM, wrote:
> I have a pyspark app loading a large-ish (100GB) dataframe from JSON files
> and it turns out there are a number of duplicate JSON objects in the source data.

Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
I have a pyspark app loading a large-ish (100GB) dataframe from JSON files, and it turns out there are a number of duplicate JSON objects in the source data. I am trying to find the best way to remove these duplicates before using the dataframe. With both df.dropDuplicates() and df.sqlContext.sql(
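For reference, both routes mentioned above can be written roughly as follows. This is a minimal sketch against the Spark 1.5 API, with a hypothetical input path and table name rather than the poster's actual code:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext()
    sqlContext = SQLContext(sc)

    # Load the JSON source (path is hypothetical)
    df = sqlContext.read.json("s3://bucket/input/*.json")

    # Option 1: DataFrame API - drop rows that are duplicates across all columns
    deduped = df.dropDuplicates()

    # Option 2: SQL - register a temp table and SELECT DISTINCT over it
    df.registerTempTable("records")
    deduped_sql = sqlContext.sql("SELECT DISTINCT * FROM records")

Both forms plan the same kind of shuffle-based aggregation over every column, which is why deduplicating a 100GB input kicks off a heavyweight shuffle stage; if some subset of columns uniquely identifies a record, df.dropDuplicates(['id']) (column name illustrative) shuffles far less data.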