I've been attempting to run a job based on MLlib's ALS implementation for a while now and have hit an issue I'm having a lot of difficulty getting to the bottom of.
On a moderately sized set of input data it works fine, but against larger sets (still well short of what I'd think of as big), one or two workers get stuck spinning at 100% CPU and the job never recovers.

I don't believe this is down to memory pressure, as I see the same behaviour at about the same input size even when the cluster is twice as large. The GC logs also suggest things are proceeding reasonably: some Full GCs occur, but there's no sign of the process being GC-locked. After rebooting the instance that got into trouble, I can see that the task's stderr log is truncated in the middle of a log line at the moment the CPU shoots to, and sticks at, 100%, but there are no other signs of a problem.

I've run into the same issue on 1.1.0 and 1.2.0, both in standalone mode and running on YARN.

Any suggestions on further steps I could try to get a clearer diagnosis of the issue would be much appreciated.

Thanks,
Phil
