I did enable GC logging for both the worker and the supervisor and found nothing abnormal (pauses are minimal and the frequency is normal too). I tried a max spout pending of both 1000 and 500.
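For reference, the two settings discussed here can be expressed in storm.yaml roughly as follows. This is only a sketch: the GC flags are standard HotSpot options for the Java 6/7/8 JVMs of that era, the heap sizes and log paths are illustrative, and max spout pending can equally be set per-topology via `Config.setMaxSpoutPending`.

```yaml
# storm.yaml fragment (sketch) -- enable GC logging and cap in-flight tuples.
# Heap sizes and log file paths below are illustrative, not recommendations.
worker.childopts: "-Xmx768m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:worker-gc.log"
supervisor.childopts: "-Xmx256m -verbose:gc -XX:+PrintGCDetails -Xloggc:supervisor-gc.log"

# Max number of unacked tuples allowed in flight per spout task
# (the values tried in this thread were 1000 and 500).
topology.max.spout.pending: 500
```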
Fang

On Mon, Jun 15, 2015 at 11:36 AM, Binh Nguyen Van <[email protected]> wrote:

> Hi Fang,
>
> Did you check your GC log? Do you see anything abnormal?
> What is your current max spout pending setting?
>
> On Mon, Jun 15, 2015 at 11:28 AM, Fang Chen <[email protected]> wrote:
>
>> I also did this and found no success.
>>
>> Thanks,
>> Fang
>>
>> On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung <[email protected]> wrote:
>>
>>> After I wrote that, I realized you had tried an empty topology anyway. That should reduce any GC- or worker-initialization-related failures, though they are still possible. As Erik mentioned, check ZK. Also, I'm not sure if this is still required, but it used to be helpful to make sure your Storm nodes have each other listed in /etc/hosts.
>>>
>>> On Jun 12, 2015 8:59 AM, "Nathan Leung" <[email protected]> wrote:
>>>
>>>> Make sure your topology is starting up in the allotted time, and if not, try increasing the startup timeout.
>>>>
>>>> On Jun 12, 2015 2:46 AM, "Fang Chen" <[email protected]> wrote:
>>>>
>>>>> Hi Erik,
>>>>>
>>>>> Thanks for your reply! It's great to hear about real production usage. For our use case, we are really puzzled by the outcome so far. The initial investigation seems to indicate that workers don't die by themselves (I actually tried killing the supervisor, and the worker continued running beyond 30 minutes).
>>>>>
>>>>> The sequence of events is like this: the supervisor immediately complains that the worker "still has not started" for a few seconds right after launching the worker process, then goes silent --> after 26 minutes, nimbus complains that executors (related to the worker) are "not alive" and starts to reassign the topology --> after another ~500 milliseconds, the supervisor shuts down its worker --> other peer workers complain about netty issues, and the loop goes on.
>>>>>
>>>>> Could you kindly tell me what version of zookeeper is used with 0.9.4?
>>>>> And how many nodes are in the zookeeper cluster?
>>>>>
>>>>> I wonder if this is due to zookeeper issues.
>>>>>
>>>>> Thanks a lot,
>>>>> Fang
>>>>>
>>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <[email protected]> wrote:
>>>>>
>>>>>> Hey Fang,
>>>>>>
>>>>>> Yes, Groupon runs Storm 0.9.3 (with zeromq instead of netty) and Storm 0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes.
>>>>>>
>>>>>> One of the challenges with Storm is figuring out the root cause when things go haywire. You'll want to examine why the nimbus decided to restart your worker processes. That happens when workers die and the nimbus notices that Storm executors aren't alive. (There are logs in nimbus for this.) Then you'll want to dig into why the workers died by looking at logs on the worker hosts.
>>>>>>
>>>>>> - Erik
>>>>>>
>>>>>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>>>>>>
>>>>>>> We have been testing Storm from 0.9.0.1 through 0.9.4 (I have not tried 0.9.5 yet, but I don't see any significant differences there), and unfortunately we could not even get a clean run of over 30 minutes on a cluster of 5 high-end nodes. Zookeeper is also set up on these nodes, but on different disks.
>>>>>>>
>>>>>>> I have had huge trouble getting my data-analytics topology to run stably, so I tried the simplest topology I could think of: just an empty bolt, no I/O except for reading from a kafka queue.
>>>>>>>
>>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt (kafka topic partitions=1, spout task #=1, bolt #=20 with field grouping, msg size=1k): after 26 minutes, nimbus orders the topology killed as it believes the topology is dead; then after another 2 minutes, another kill, then another after another 4 minutes, and on and on.
>>>>>>>
>>>>>>> I can understand there might be issues in the coordination among nimbus, workers, and executors (e.g., heartbeats). But are there any doable workarounds? I hope there are, since so many of you are using it in production :-)
>>>>>>>
>>>>>>> I deeply appreciate any suggestions that could make even my toy topology work!
>>>>>>>
>>>>>>> Fang
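The timeouts Nathan and Erik refer to above map onto a handful of storm.yaml keys. The sketch below uses the standard 0.9.x config names; the values shown are illustrative (roughly the shipped defaults as I recall them, worth checking against your version's defaults.yaml), and the ZooKeeper hostnames are hypothetical.

```yaml
# storm.yaml fragment (sketch) -- knobs involved in the "not alive" / restart loop.

# How long the supervisor waits for a freshly launched worker to heartbeat
# before treating it as "still has not started" and eventually killing it.
supervisor.worker.start.timeout.secs: 120

# How long nimbus tolerates missing executor heartbeats before declaring
# executors "not alive" and reassigning the topology.
nimbus.task.timeout.secs: 30

# Heartbeats flow through ZooKeeper in 0.9.x, so a flaky ZK ensemble can
# produce the same symptoms. An odd-sized ensemble (3 or 5 nodes) is typical.
storm.zookeeper.servers:
  - "zk1.example.com"   # hypothetical hostnames
  - "zk2.example.com"
  - "zk3.example.com"
storm.zookeeper.session.timeout: 20000   # milliseconds
```

Raising the nimbus and ZK session timeouts is a common stopgap when heartbeats are being delayed (e.g., by ZK disk contention) rather than workers genuinely dying.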
