Same here, Ravi. See my post on a similar thread.
Are you running in YARN client mode?
On Aug 7, 2014 2:56 PM, "rpandya" <[email protected]> wrote:
> I'm running into a problem with executors failing, and it's not clear what's
> causing it. Any suggestions on how to diagnose & fix it would be appreciated.
>
> There are a variety of errors in the logs, and I don't see a consistent
> triggering error. I've tried varying the number of executors per machine
> (1/4/16 per 16-core/128GB machine w/200GB free disk) and it still fails.
>
> The relevant code is:
> val reads = fastqAsText.mapPartitionsWithIndex(runner.mapReads(_, _, seqDictBcast.value))
> val result = reads.coalesce(numMachines * coresPerMachine * 4, true).persist(StorageLevel.DISK_ONLY_2)
> log.info("SNAP output DebugString:\n" + result.toDebugString)
> log.info("produced " + result.count + " reads")
>
> The toDebugString output is:
> 2014-08-07 18:50:43 INFO SnapInputStage:198 - SNAP output DebugString:
> MappedRDD[10] at coalesce at SnapInputStage.scala:197 (640 partitions)
> CoalescedRDD[9] at coalesce at SnapInputStage.scala:197 (640 partitions)
> ShuffledRDD[8] at coalesce at SnapInputStage.scala:197 (640 partitions)
> MapPartitionsRDD[7] at coalesce at SnapInputStage.scala:197 (10 partitions)
> MapPartitionsRDD[6] at mapPartitionsWithIndex at SnapInputStage.scala:195 (10 partitions)
> MappedRDD[4] at map at SnapInputStage.scala:188 (10 partitions)
> CoalescedRDD[3] at coalesce at SnapInputStage.scala:188 (10 partitions)
> NewHadoopRDD[2] at newAPIHadoopFile at SnapInputStage.scala:182 (3003 partitions)
>
> The 10-partition stage works fine, takes about 1.4 hours, reads 40GB and
> writes 25GB per task. The next 640-partition stage is where the failures
> occur.
>
> Here are the first few errors from a recent run (sorted by time):
> work/hpcraviplvm10/app-20140807185713-0000/14/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
> work/hpcraviplvm10/app-20140807185713-0000/27/stderr: 14/08/07 20:32:18 ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(hpcraviplvm1,49545)
> work/hpcraviplvm1/app-20140807185713-0000/9/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
> work/hpcraviplvm2/app-20140807185713-0000/24/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
> work/hpcraviplvm2/app-20140807185713-0000/36/stderr: 14/08/07 20:32:18 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(hpcraviplvm1,49545)
> work/hpcraviplvma1/app-20140807185713-0000/26/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
> work/hpcraviplvma2/app-20140807185713-0000/15/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
> work/hpcraviplvma2/app-20140807185713-0000/18/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
> work/hpcraviplvma2/app-20140807185713-0000/23/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
> work/hpcraviplvma2/app-20140807185713-0000/33/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
>
> Thanks,
>
> Ravi Pandya
> Microsoft Research
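One way to narrow this down might be to tally the error lines by the remote peer they name, since in the excerpt above the only peer mentioned is ConnectionManagerId(hpcraviplvm1,49545). Below is a minimal sketch in plain Scala (no Spark dependency). The object and method names are made up for illustration, and the sample lines are copied from the quoted errors; in practice you would feed it the concatenated stderr files.

```scala
// Sketch only: count ERROR lines per remote peer they mention.
// PeerErrorTally and tally are hypothetical names, not part of Spark.
object PeerErrorTally {
  // Matches e.g. "ConnectionManagerId(hpcraviplvm1,49545)"
  private val Peer = """ConnectionManagerId\(([^,()]+),(\d+)\)""".r

  /** Count how many log lines mention each (host, port) peer. */
  def tally(lines: Seq[String]): Map[(String, Int), Int] =
    lines
      .flatMap(line => Peer.findFirstMatchIn(line)) // keep lines naming a peer
      .map(m => (m.group(1), m.group(2).toInt))     // extract (host, port)
      .groupBy(identity)
      .map { case (peer, hits) => peer -> hits.size }

  def main(args: Array[String]): Unit = {
    // Sample lines taken from the errors quoted above.
    val sample = Seq(
      "ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found",
      "ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(hpcraviplvm1,49545)",
      "ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(hpcraviplvm1,49545)"
    )
    // Most frequently named peers first.
    tally(sample).toSeq.sortBy(-_._2).foreach { case ((host, port), n) =>
      println(s"$host:$port mentioned in $n error line(s)")
    }
  }
}
```

If one host:port dominates the tally, the stderr of the executor that was listening on that port (around 20:32:18 in this run) is probably the first place to look for the original failure; the other errors may just be peers noticing it went away.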