Okay, maybe these errors are more helpful:
WARN server.TransportChannelHandler: Exception in connection from
ip-10-0-0-138.ec2.internal/10.0.0.138:39723
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(
Here is the trace I get from the command line:
[Stage 4:> (60 + 60) /
200]15/12/07 18:59:40 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
ApplicationMaster has disassociated: 10.0.0.138:33822
15/12/07 18:59:40 WARN YarnSchedulerBackend$YarnS
I have looked through the logs and do not see any WARNINGs or ERRORs; the
executors just seem to stop logging.
I am running Spark 1.5.2 on YARN.
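
One possible explanation worth checking: when YARN kills a container for
exceeding its memory limits, the kill is recorded in the NodeManager log, not
in the executor's own log, which would be consistent with executors that
simply stop logging mid-shuffle. Below is a hedged sketch of settings that are
commonly raised for large shuffles on Spark 1.5 on YARN; the values are
illustrative, not tuned recommendations.

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            # Extra off-heap headroom per executor container, in MB.
            # Shuffle and netty buffers live here, and the default is
            # often too small for a wide shuffle over ~100GB of data.
            .set("spark.yarn.executor.memoryOverhead", "2048")
            # Retry shuffle block fetches longer before declaring the
            # remote executor lost (defaults: 3 retries, 5s wait).
            .set("spark.shuffle.io.maxRetries", "10")
            .set("spark.shuffle.io.retryWait", "15s"))
    sc = SparkContext(conf=conf)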
On Dec 7, 2015, at 1:20 PM, Ted Yu <yuzhih...@gmail.com> wrote:
bq. complete a shuffle stage due to lost executors
Have you taken a look at the log for the lost executor(s) ?
Which release of Spark are you using?
Cheers
On Mon, Dec 7, 2015 at 10:12 AM, wrote:
I have a pyspark app loading a large-ish (100GB) dataframe from JSON files,
and it turns out there are a number of duplicate JSON objects in the source
data. I am trying to find the best way to remove these duplicates before using
the dataframe.
With both df.dropDuplicates() and df.sqlContext.sql(
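
For reference, a minimal sketch of the two approaches mentioned; the input
path and the temp table name "dupes" are placeholders, not anything from the
original post.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="dedupe-sketch")
    sqlContext = SQLContext(sc)

    # Placeholder input path for the ~100GB of JSON source data.
    df = sqlContext.read.json("hdfs:///path/to/json/")

    # Option 1: DataFrame API. With no arguments this compares whole
    # rows; passing a list of key columns narrows the comparison.
    deduped = df.dropDuplicates()

    # Option 2: SQL DISTINCT over a registered temp table.
    df.registerTempTable("dupes")
    deduped_sql = sqlContext.sql("SELECT DISTINCT * FROM dupes")

Both routes hash-partition the full dataset, so either one produces the kind
of wide shuffle stage shown in the trace above.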