I've been attempting to run a job based on MLlib's ALS implementation for a
while now and have hit an issue I'm having a lot of difficulty getting to
the bottom of.

On a moderate-sized set of input data it works fine, but against larger
sets (still well short of what I'd think of as big), one or two workers
get stuck spinning at 100% CPU and the job never recovers.

I don't believe this is down to memory pressure, as I see the same
behaviour at about the same input size even when the cluster is twice as
large. The GC logs also suggest things are proceeding reasonably: some
full GCs occur, but there's no sign of the process being stuck in GC.
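For context, the GC logging I'm looking at is enabled with the usual HotSpot flags passed via the executor JVM options; my setup is along these lines (the flags below are standard for Java 7/8-era JVMs, your exact settings may differ):

```
# in spark-defaults.conf (or passed with --conf)
spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
```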

After rebooting the instance that got into trouble, I can see that the
stderr log for the task is truncated in the middle of a log line at the
moment the CPU shoots to, and sticks at, 100%, but there are no other
signs of a problem.

I've run into the same issue on Spark 1.1.0 and 1.2.0, both in standalone
mode and running on YARN.

Any suggestions on further steps I could try to get a clearer diagnosis of
the issue would be much appreciated.
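One thing I'm planning to try next is grabbing a few thread dumps from the spinning executor to see where it's actually busy; something along these lines (the process-name grep is a guess based on how Spark executors tend to show up in `jps`, so adjust for your setup):

```shell
# Sketch: find the executor JVM and take three thread dumps a few seconds apart.
PID=$(jps -l 2>/dev/null | grep -i 'CoarseGrainedExecutorBackend' | awk '{print $1}' | head -n1)
if [ -n "$PID" ]; then
  for i in 1 2 3; do
    jstack "$PID" > "executor-threads-$i.txt"
    sleep 5
  done
fi
```

Comparing successive dumps should show whether the hot thread is stuck in the same stack frame each time (a spin) or making progress.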

Thanks,

Phil
