Hi Mike,

Do you have access to your YARN NodeManager logs?  When executors die
randomly on YARN, it's often because they use more memory than allowed for
their YARN container.  You would see messages to the effect of "container
killed because physical memory limits exceeded".
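
A rough sketch of how you might check for this, assuming you can read the
NodeManager log on each worker. The log path and the count_memory_kills
helper are placeholders of mine, not anything from your cluster, and the path
differs between distributions. The NodeManager usually words the kill as the
container "is running beyond physical memory limits", so that is the phrase
searched for here:

```shell
#!/bin/sh
# Hypothetical helper: count the lines in a NodeManager log that report a
# container killed for exceeding its physical memory limit.
count_memory_kills() {
    # grep -c prints the number of matching lines; -i ignores case
    grep -ci "beyond physical memory limits" "$1"
}

# Assumed log location; override NM_LOG to match your installation.
NM_LOG="${NM_LOG:-/var/log/hadoop-yarn/yarn-yarn-nodemanager.log}"

if [ -r "$NM_LOG" ]; then
    echo "memory-limit kills in $NM_LOG: $(count_memory_kills "$NM_LOG")"
fi
```

If the count is nonzero on the hosts that lost executors, one thing to try is
giving each container more headroom beyond the executor heap, for example by
adding --conf spark.yarn.executor.memoryOverhead=1024 to the spark-shell
command line (the value is in MB and is picked here only for illustration).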

-Sandy

On Wed, Oct 1, 2014 at 8:46 PM, Xiangrui Meng wrote:

> The cost depends on the feature dimension, number of instances, number
> of classes, and number of partitions. Do you mind sharing those
> numbers? -Xiangrui
>
> On Wed, Oct 1, 2014 at 6:31 PM, Mike Bernico wrote:
> > Hi Everyone,
> >
> > I'm working on training MLlib's Naive Bayes to classify TF/IDF vectorized
> > docs using Spark 1.1.0.
> >
> > I've gotten this to work fine on a smaller set of data, but when I increase
> > the number of vectorized documents I get hung up on training.  The only
> > messages I'm seeing are below.  I'm pretty new to spark and I don't really
> > know where to go next to troubleshoot this.
> >
> > I'm running spark in yarn like this:
> > spark-shell --master yarn-client --executor-memory 7G --driver-memory 7G
> > --num-executors 3
> >
> > I have three workers, each with 64G of ram and 8 cores.
> >
> >
> >
> > scala> val model = NaiveBayes.train(training, lambda = 1.0)
> > 14/10/01 19:40:34 ERROR YarnClientClusterScheduler: Lost executor 2 on
> > rpl0000001273.<removed>: remote Akka client disassociated
> > 14/10/01 19:40:34 WARN TaskSetManager: Lost task 195.0 in stage 5.0 (TID
> > 2940, rpl0000001273.<removed>): ExecutorLostFailure (executor lost)
> > 14/10/01 19:40:34 WARN TaskSetManager: Lost task 190.0 in stage 5.0 (TID
> > 2782, rpl0000001272.<removed>): FetchFailed(BlockManagerId(2,
> > rpl0000001273.<removed>, 57359, 0), shuffleId=1, mapId=0, reduceId=190)
> > 14/10/01 19:40:35 WARN TaskSetManager: Lost task 195.1 in stage 5.0 (TID
> > 2941, rpl0000001272.<removed>): FetchFailed(BlockManagerId(2,
> > rpl0000001273.<removed>, 57359, 0), shuffleId=1, mapId=0, reduceId=195)
> > 14/10/01 19:40:36 WARN TaskSetManager: Lost task 185.0 in stage 5.0 (TID
> > 2780, rpl0000001277.<removed>): FetchFailed(BlockManagerId(2,
> > rpl0000001273.<removed>, 57359, 0), shuffleId=1, mapId=0, reduceId=185)
> > 14/10/01 19:46:24 ERROR YarnClientClusterScheduler: Lost executor 1 on
> > rpl0000001272.<removed>: remote Akka client disassociated
> > 14/10/01 19:46:24 WARN TaskSetManager: Lost task 78.0 in stage 5.1 (TID
> > 3377, rpl0000001272.<removed>): ExecutorLostFailure (executor lost)
> > 14/10/01 19:46:25 WARN TaskSetManager: Lost task 79.0 in stage 5.1 (TID
> > 3378, rpl0000001273.<removed>): FetchFailed(BlockManagerId(1,
> > rpl0000001272.<removed>, 60926, 0), shuffleId=1, mapId=5, reduceId=220)
> > 14/10/01 19:46:25 WARN TaskSetManager: Lost task 78.1 in stage 5.1 (TID
> > 3379, rpl0000001273.<removed>): FetchFailed(BlockManagerId(1,
> > rpl0000001272.<removed>, 60926, 0), shuffleId=1, mapId=5, reduceId=215)
> > 14/10/01 19:46:29 WARN TaskSetManager: Lost task 73.0 in stage 5.1 (TID
> > 3372, rpl0000001277.<removed>): FetchFailed(BlockManagerId(1,
> > rpl0000001272.<removed>, 60926, 0), shuffleId=1, mapId=9, reduceId=210)
> > 14/10/01 19:57:27 ERROR YarnClientClusterScheduler: Lost executor 3 on
> > rpl0000001277.<removed>: remote Akka client disassociated
> > 14/10/01 19:57:27 WARN TaskSetManager: Lost task 177.0 in stage 5.2 (TID
> > 4015, rpl0000001277.<removed>): ExecutorLostFailure (executor lost)
> > 14/10/01 19:57:27 ERROR ConnectionManager: Corresponding SendingConnection
> > to ConnectionManagerId(rpl0000001277.<removed>,41425) not found
> > 14/10/01 19:57:30 WARN TaskSetManager: Lost task 182.0 in stage 5.2 (TID
> > 4020, rpl0000001272.<removed>): FetchFailed(BlockManagerId(3,
> > rpl0000001277.<removed>, 41425, 0), shuffleId=1, mapId=2, reduceId=340)
> > 14/10/01 19:57:30 WARN TaskSetManager: Lost task 177.1 in stage 5.2 (TID
> > 4022, rpl0000001272.<removed>): FetchFailed(BlockManagerId(3,
> > rpl0000001277.<removed>, 41425, 0), shuffleId=1, mapId=2, reduceId=335)
> > 14/10/01 19:57:36 WARN TaskSetManager: Lost task 183.0 in stage 5.2 (TID
> > 4021, rpl0000001273.<removed>): FetchFailed(BlockManagerId(3,
> > rpl0000001277.<removed>, 41425, 0), shuffleId=1, mapId=8, reduceId=345)
> > 14/10/01 20:20:22 ERROR YarnClientClusterScheduler: Lost executor 4 on
> > rpl0000001273.<removed>: remote Akka client disassociated
> > 14/10/01 20:20:22 WARN TaskSetManager: Lost task 527.0 in stage 5.3 (TID
> > 5159, rpl0000001273.<removed>): ExecutorLostFailure (executor lost)
> > 14/10/01 20:20:23 WARN TaskSetManager: Lost task 517.0 in stage 5.3 (TID
> > 5149, rpl0000001272.<removed>): FetchFailed(BlockManagerId(4,
> > rpl0000001273.<removed>, 51049, 0), shuffleId=1, mapId=6, reduceId=690)
> > 14/10/01 20:20:23 WARN TaskSetManager: Lost task 527.1 in stage 5.3 (TID
> > 5160, rpl0000001272.<removed>): FetchFailed(BlockManagerId(4,
> > rpl0000001273.<removed>, 51049, 0), shuffleId=1, mapId=5, reduceId=700)
> > 14/10/01 20:20:25 WARN TaskSetManager: Lost task 522.0 in stage 5.3 (TID
> > 5154, rpl0000001277.<removed>): FetchFailed(BlockManagerId(4,
> > rpl0000001273.<removed>, 51049, 0), shuffleId=1, mapId=5, reduceId=695)
>
