We use bare-bones Storm, not Trident, as Trident is too expensive for our use cases. The JVM options for the supervisor are listed below, but they may not be optimal in any sense.
supervisor.childopts: "-Xms2G -Xmx2G -XX:NewSize=1G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=6 -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:+UseCondCardMark -XX:CMSWaitDuration=5000 -XX:+CMSScavengeBeforeRemark -XX:+UnlockDiagnosticVMOptions -XX:ParGCCardsPerStrideChunk=4096 -XX:+ExplicitGCInvokesConcurrent -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:PrintFLSStatistics=1 -XX:+PrintPromotionFailure -XX:+PrintGCApplicationStoppedTime -XX:+PrintHeapAtGC -XX:+PrintSafepointStatistics -Xloggc:/usr/local/storm/logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote.port=9998 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"

On Mon, Jun 15, 2015 at 11:48 AM, Binh Nguyen Van <[email protected]> wrote:

> Just to be sure, are you using Storm or Storm Trident?
> Also, can you share the current setting of your supervisor.childopts?
>
> On Mon, Jun 15, 2015 at 11:39 AM, Fang Chen <[email protected]> wrote:
>
>> I did enable GC logging for both the worker and the supervisor and
>> found nothing abnormal (pauses are minimal and the frequency is normal
>> too). I tried a max spout pending of both 1000 and 500.
>>
>> Fang
>>
>> On Mon, Jun 15, 2015 at 11:36 AM, Binh Nguyen Van <[email protected]> wrote:
>>
>>> Hi Fang,
>>>
>>> Did you check your GC log? Do you see anything abnormal?
>>> What is your current max spout pending setting?
>>>
>>> On Mon, Jun 15, 2015 at 11:28 AM, Fang Chen <[email protected]> wrote:
>>>
>>>> I also did this and found no success.
>>>>
>>>> Thanks,
>>>> Fang
>>>>
>>>> On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung <[email protected]> wrote:
>>>>
>>>>> After I wrote that, I realized you tried an empty topology anyway.
>>>>> This should reduce any GC- or worker-initialization-related
>>>>> failures, though they are still possible.
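Fang's GC-log check can be scripted. Below is a minimal sketch that flags long stop-the-world pauses, assuming the `-XX:+PrintGCApplicationStoppedTime` output format produced by the childopts above; the path in the usage example matches the `-Xloggc` setting there.

```shell
# Sketch: list GC/safepoint pauses longer than a threshold (default 1.0 s)
# from a log produced with -XX:+PrintGCApplicationStoppedTime.
long_gc_pauses() {  # usage: long_gc_pauses /path/to/gc.log [threshold-secs]
  awk -v thr="${2:-1.0}" '
    /application threads were stopped/ {
      # The pause length is the field right after "stopped:".
      for (i = 1; i <= NF; i++)
        if ($i == "stopped:" && $(i + 1) + 0 > thr + 0)
          print
    }' "$1"
}

# e.g. long_gc_pauses /usr/local/storm/logs/gc.log 0.5
```

If this prints nothing at a 0.5 s threshold, multi-second GC pauses are unlikely to be what makes the heartbeats look dead.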
>>>>> As Erik mentioned, check ZK. Also, I'm not sure if this is still
>>>>> required, but it used to be helpful to make sure your Storm nodes
>>>>> have each other listed in /etc/hosts.
>>>>>
>>>>> On Jun 12, 2015 8:59 AM, "Nathan Leung" <[email protected]> wrote:
>>>>>
>>>>>> Make sure your topology is starting up in the allotted time, and
>>>>>> if not, try increasing the startup timeout.
>>>>>>
>>>>>> On Jun 12, 2015 2:46 AM, "Fang Chen" <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Erik,
>>>>>>>
>>>>>>> Thanks for your reply! It's great to hear about real production
>>>>>>> usage. For our use case, we are really puzzled by the outcome so
>>>>>>> far. The initial investigation seems to indicate that workers
>>>>>>> don't die by themselves (I actually tried killing the supervisor,
>>>>>>> and the worker continued running beyond 30 minutes).
>>>>>>>
>>>>>>> The sequence of events is like this: the supervisor immediately
>>>>>>> complains that the worker "still has not started" for a few
>>>>>>> seconds right after launching the worker process, then goes
>>>>>>> silent --> after 26 minutes, nimbus complains that the executors
>>>>>>> (related to the worker) are "not alive" and starts to reassign
>>>>>>> the topology --> after another ~500 milliseconds, the supervisor
>>>>>>> shuts down its worker --> other peer workers complain about netty
>>>>>>> issues. And the loop goes on.
>>>>>>>
>>>>>>> Could you kindly tell me what version of ZooKeeper is used with
>>>>>>> 0.9.4, and how many nodes are in the ZooKeeper cluster?
>>>>>>>
>>>>>>> I wonder if this is due to ZooKeeper issues.
>>>>>>>
>>>>>>> Thanks a lot,
>>>>>>> Fang
>>>>>>>
>>>>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hey Fang,
>>>>>>>>
>>>>>>>> Yes, Groupon runs Storm 0.9.3 (with zeromq instead of netty) and
>>>>>>>> Storm 0.9.4 (with netty) at scale, in clusters on the order of
>>>>>>>> 30+ nodes.
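Nathan's "startup timeout" suggestion maps onto a handful of storm.yaml settings. A sketch of the knobs involved (key names from Storm 0.9.x; the values shown are the usual defaults, so treat raising them as an experiment rather than a recommendation):

```yaml
# Hypothetical storm.yaml fragment -- tune for your cluster.
supervisor.worker.start.timeout.secs: 120   # how long a worker may take to start
                                            # before the supervisor gives up on it
nimbus.task.timeout.secs: 30                # heartbeat age before nimbus marks an
                                            # executor "not alive" and reassigns
nimbus.task.launch.secs: 120                # more lenient timeout applied right
                                            # after a task is launched
storm.zookeeper.session.timeout: 20000      # ZK session timeout (ms); too low and
                                            # long pauses look like dead workers
```

The 26-minute "not alive" symptom described in this thread is much longer than any of these defaults, which points at heartbeats never reaching ZooKeeper rather than at the timeouts themselves.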
>>>>>>>>
>>>>>>>> One of the challenges with Storm is figuring out the root cause
>>>>>>>> when things go haywire. You'll want to examine why the nimbus
>>>>>>>> decided to restart your worker processes. That happens when
>>>>>>>> workers die and the nimbus notices that storm executors aren't
>>>>>>>> alive. (There are logs in nimbus for this.) Then you'll want to
>>>>>>>> dig into why the workers died by looking at logs on the worker
>>>>>>>> hosts.
>>>>>>>>
>>>>>>>> - Erik
>>>>>>>>
>>>>>>>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> We have been testing Storm from 0.9.0.1 through 0.9.4 (I have
>>>>>>>>> not tried 0.9.5 yet, but I don't see any significant
>>>>>>>>> differences there), and unfortunately we could not even get a
>>>>>>>>> clean run of over 30 minutes on a cluster of 5 high-end nodes.
>>>>>>>>> ZooKeeper is also set up on these nodes, but on different
>>>>>>>>> disks.
>>>>>>>>>
>>>>>>>>> I have had huge trouble giving my data-analytics topology a
>>>>>>>>> stable run, so I tried the simplest topology I can think of:
>>>>>>>>> just an empty bolt, no I/O except for reading from a Kafka
>>>>>>>>> queue.
>>>>>>>>>
>>>>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt
>>>>>>>>> (Kafka topic partitions = 1, spout task # = 1, bolt # = 20 with
>>>>>>>>> fields grouping, msg size = 1k): after 26 minutes, nimbus
>>>>>>>>> orders the topology killed because it believes the topology is
>>>>>>>>> dead; then after another 2 minutes, another kill; then another
>>>>>>>>> after another 4 minutes, and on and on.
>>>>>>>>>
>>>>>>>>> I can understand there might be issues in the coordination
>>>>>>>>> among nimbus, workers, and executors (e.g., heartbeats). But
>>>>>>>>> are there any doable workarounds?
>>>>>>>>> I hope there are, as so many of you are using it in production
>>>>>>>>> :-)
>>>>>>>>>
>>>>>>>>> I deeply appreciate any suggestions that could make even my toy
>>>>>>>>> topology work!
>>>>>>>>>
>>>>>>>>> Fang
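Erik's advice to start from the nimbus's reasons for the restarts can begin with a log scan. A sketch, assuming the log path from the childopts in this thread and the "not alive" wording Fang quotes; the second pattern is an assumption about the reassignment message, so adjust both to whatever your nimbus.log actually contains:

```shell
# Sketch: pull the recent nimbus decisions that precede a worker restart.
scan_nimbus_log() {  # usage: scan_nimbus_log /path/to/nimbus.log
  grep -E "not alive|Setting new assignment" "$1" | tail -n 20
}

# e.g. scan_nimbus_log /usr/local/storm/logs/nimbus.log
```

Correlating these timestamps with the supervisor and worker logs on the affected host usually shows whether the worker died first or merely stopped heartbeating.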
