Just to be sure, are you using Storm or Storm Trident? Also, can you share the current setting of your supervisor.childopts?
On Mon, Jun 15, 2015 at 11:39 AM, Fang Chen <[email protected]> wrote:

> I did enable GC logging for both worker and supervisor and found nothing
> abnormal (pauses are minimal and frequency is normal too). I tried max
> spout pending of both 1000 and 500.
>
> Fang
>
> On Mon, Jun 15, 2015 at 11:36 AM, Binh Nguyen Van <[email protected]> wrote:
>
>> Hi Fang,
>>
>> Did you check your GC log? Do you see anything abnormal?
>> What is your current max spout pending setting?
>>
>> On Mon, Jun 15, 2015 at 11:28 AM, Fang Chen <[email protected]> wrote:
>>
>>> I also did this and found no success.
>>>
>>> Thanks,
>>> Fang
>>>
>>> On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung <[email protected]> wrote:
>>>
>>>> After I wrote that I realized you tried an empty topology anyway. This
>>>> should reduce any GC- or worker-initialization-related failures, though
>>>> they are still possible. As Erik mentioned, check ZK. Also, I'm not sure
>>>> if this is still required, but it used to be helpful to make sure your
>>>> storm nodes have each other listed in /etc/hosts.
>>>>
>>>> On Jun 12, 2015 8:59 AM, "Nathan Leung" <[email protected]> wrote:
>>>>
>>>>> Make sure your topology is starting up in the allotted time, and if
>>>>> not, try increasing the startup timeout.
>>>>>
>>>>> On Jun 12, 2015 2:46 AM, "Fang Chen" <[email protected]> wrote:
>>>>>
>>>>>> Hi Erik,
>>>>>>
>>>>>> Thanks for your reply! It's great to hear about real production
>>>>>> usages. For our use case, we are really puzzled by the outcome so far.
>>>>>> The initial investigation seems to indicate that workers don't die by
>>>>>> themselves (I actually tried killing the supervisor, and the worker
>>>>>> would continue running beyond 30 minutes).
>>>>>>
>>>>>> The sequence of events is like this: the supervisor immediately
>>>>>> complains that the worker "still has not started" for a few seconds
>>>>>> right after launching the worker process, then goes silent --> after
>>>>>> 26 minutes, nimbus complains that executors (related to the worker)
>>>>>> are "not alive" and starts to reassign the topology --> after another
>>>>>> ~500 milliseconds, the supervisor shuts down its worker --> other
>>>>>> peer workers complain about netty issues, and the loop goes on.
>>>>>>
>>>>>> Could you kindly tell me what version of zookeeper is used with
>>>>>> 0.9.4, and how many nodes are in the zookeeper cluster?
>>>>>>
>>>>>> I wonder if this is due to zookeeper issues.
>>>>>>
>>>>>> Thanks a lot,
>>>>>> Fang
>>>>>>
>>>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <[email protected]> wrote:
>>>>>>
>>>>>>> Hey Fang,
>>>>>>>
>>>>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and
>>>>>>> storm 0.9.4 (with netty) at scale, in clusters on the order of 30+
>>>>>>> nodes.
>>>>>>>
>>>>>>> One of the challenges with storm is figuring out the root cause when
>>>>>>> things go haywire. You'll wanna examine why the nimbus decided to
>>>>>>> restart your worker processes. That happens when workers die and the
>>>>>>> nimbus notices that storm executors aren't alive. (There are logs in
>>>>>>> nimbus for this.) Then you'll wanna dig into why the workers died by
>>>>>>> looking at logs on the worker hosts.
>>>>>>>
>>>>>>> - Erik
>>>>>>>
>>>>>>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>>>>>>>
>>>>>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not
>>>>>>>> tried 0.9.5 yet, but I don't see any significant differences
>>>>>>>> there), and unfortunately we could not even get a clean run for
>>>>>>>> over 30 minutes on a cluster of 5 high-end nodes. zookeeper is also
>>>>>>>> set up on these nodes but on different disks.
>>>>>>>>
>>>>>>>> I have had huge trouble giving my data analytics topology a stable
>>>>>>>> run, so I tried the simplest topology I can think of: just an empty
>>>>>>>> bolt, no I/O except for reading from a kafka queue.
>>>>>>>>
>>>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt
>>>>>>>> (kafka topic partition=1, spout task #=1, bolt #=20 with field
>>>>>>>> grouping, msg size=1k):
>>>>>>>> After 26 minutes, nimbus orders a kill of the topology as it
>>>>>>>> believes the topology is dead; then after another 2 minutes,
>>>>>>>> another kill, then another after another 4 minutes, and on and on.
>>>>>>>>
>>>>>>>> I can understand there might be issues in the coordination among
>>>>>>>> nimbus, worker, and executor (e.g., heartbeats). But are there any
>>>>>>>> doable workarounds? I wish there are, as so many of you are using
>>>>>>>> it in production :-)
>>>>>>>>
>>>>>>>> I deeply appreciate any suggestions that could even make my toy
>>>>>>>> topology work!
>>>>>>>>
>>>>>>>> Fang
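[Editor's note: the knobs discussed in this thread (startup timeout, "not alive" detection, max spout pending, GC logging) correspond to storm.yaml settings. The following is a minimal sketch using the key names from Storm 0.9.x's defaults.yaml; the values shown are illustrative assumptions for experimentation, not recommendations.]

```yaml
# storm.yaml -- illustrative values only; tune per cluster.

# How long nimbus waits after launching a task before treating it as failed
# (the "startup timeout" Nathan mentions), and how long without heartbeats
# before executors are declared "not alive" and the topology is reassigned.
nimbus.task.launch.secs: 240
nimbus.task.timeout.secs: 60

# Supervisor-side equivalents: how long a worker may take to start, and how
# long without a heartbeat before the supervisor shuts the worker down.
supervisor.worker.start.timeout.secs: 240
supervisor.worker.timeout.secs: 60

# Cap on unacked tuples per spout task ("max spout pending", per Binh).
topology.max.spout.pending: 500

# Enable GC logging in the worker JVMs so pauses can be ruled out.
worker.childopts: "-Xmx768m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
```

Heartbeats flow through ZooKeeper, so if ZK is slow or co-located with busy disks, raising these timeouts only masks the problem; checking ZK health (as Erik and Nathan suggest) comes first.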
