I'm seeing the same problem, but in a dev environment built from Docker scripts. The additional issue is that the worker processes don't die, so the Docker container never exits. I end up with worker containers that are no longer participating in the cluster.
On Fri, Jun 13, 2014 at 9:44 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:

> I have also had trouble with workers joining the working set. I have
> typically moved to a Mesos-based setup. Frankly, for high availability
> you are better off using a cluster manager.
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
> On Fri, Jun 13, 2014 at 8:57 AM, Yana Kadiyska <yana.kadiy...@gmail.com>
> wrote:
>
>> Hi, I see this has been asked before but has not gotten a satisfactory
>> answer, so I'll try again:
>>
>> (Here is the original thread I found:
>> http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3c1394044078706-2312.p...@n3.nabble.com%3E
>> )
>>
>> I have a set of workers dying and coming back again. The master prints
>> the following warning:
>>
>> "Got heartbeat from unregistered worker ...."
>>
>> What is the solution to this? Rolling the master is very undesirable for
>> me, as I have a Shark context sitting on top of it (it's meant to be
>> highly available).
>>
>> Insights appreciated -- I don't think an executor going down is
>> unexpected, but it does seem odd that it can't rejoin the working set.
>>
>> I'm running Spark 0.9.1 on CDH.