What's your max spout pending value for the topology? Also observe the CPU usage, i.e. how many cycles the worker process is consuming.
Thanks and Regards,
Devang

On 19 Jun 2015 02:46, "Fang Chen" <[email protected]> wrote:
> Tried. No effect.
>
> Thanks,
> Fang
>
> On Mon, Jun 15, 2015 at 9:33 PM, Binh Nguyen Van <[email protected]> wrote:
>
>> Can you try this?
>>
>> Remove the -XX:+CMSScavengeBeforeRemark flag and reduce your heap size
>> so that a YGC happens once every 2-3 seconds.
>> If that fixes the issue, then I think GC is the cause of your problem.
>>
>> On Mon, Jun 15, 2015 at 11:56 AM, Fang Chen <[email protected]> wrote:
>>
>>> We use Storm bare bones, not Trident, as it's too expensive for our use
>>> cases. The JVM options for the supervisor are listed below, but they
>>> might not be optimal in any sense.
>>>
>>> supervisor.childopts: "-Xms2G -Xmx2G -XX:NewSize=1G -XX:+UseParNewGC
>>> -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=6
>>> -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly
>>> -XX:+UseTLAB -XX:+UseCondCardMark -XX:CMSWaitDuration=5000
>>> -XX:+CMSScavengeBeforeRemark -XX:+UnlockDiagnosticVMOptions
>>> -XX:ParGCCardsPerStrideChunk=4096 -XX:+ExplicitGCInvokesConcurrent
>>> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution
>>> -XX:PrintFLSStatistics=1 -XX:+PrintPromotionFailure
>>> -XX:+PrintGCApplicationStoppedTime -XX:+PrintHeapAtGC
>>> -XX:+PrintSafepointStatistics -Xloggc:/usr/local/storm/logs/gc.log
>>> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10
>>> -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote.port=9998
>>> -Dcom.sun.management.jmxremote.ssl=false
>>> -Dcom.sun.management.jmxremote.authenticate=false"
>>>
>>> On Mon, Jun 15, 2015 at 11:48 AM, Binh Nguyen Van <[email protected]> wrote:
>>>
>>>> Just to be sure, are you using Storm or Storm Trident?
>>>> Also, can you share the current setting of your supervisor.childopts?
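[Editor's note: Binh's experiment above would amount to something like the supervisor.childopts below; this is a sketch only. The heap and new-generation sizes are illustrative guesses, to be tuned against the GC log until young collections fire roughly every 2-3 seconds.]

```yaml
# Sketch: -XX:+CMSScavengeBeforeRemark removed, heap shrunk so young GCs
# happen every few seconds. All sizes here are guesses, not recommendations.
supervisor.childopts: "-Xms1G -Xmx1G -XX:NewSize=256M -XX:+UseParNewGC
  -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
  -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps
  -XX:+PrintGCApplicationStoppedTime -Xloggc:/usr/local/storm/logs/gc.log"
```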
>>>>
>>>> On Mon, Jun 15, 2015 at 11:39 AM, Fang Chen <[email protected]> wrote:
>>>>
>>>>> I did enable GC logging for both the worker and the supervisor and
>>>>> found nothing abnormal (pauses are minimal and the frequency is normal
>>>>> too). I tried max spout pending values of both 1000 and 500.
>>>>>
>>>>> Fang
>>>>>
>>>>> On Mon, Jun 15, 2015 at 11:36 AM, Binh Nguyen Van <[email protected]> wrote:
>>>>>
>>>>>> Hi Fang,
>>>>>>
>>>>>> Did you check your GC log? Do you see anything abnormal?
>>>>>> What is your current max spout pending setting?
>>>>>>
>>>>>> On Mon, Jun 15, 2015 at 11:28 AM, Fang Chen <[email protected]> wrote:
>>>>>>
>>>>>>> I also did this and had no success.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Fang
>>>>>>>
>>>>>>> On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung <[email protected]> wrote:
>>>>>>>
>>>>>>>> After I wrote that, I realized you tried the empty topology anyway.
>>>>>>>> This should reduce any GC- or worker-initialization-related
>>>>>>>> failures, though they are still possible. As Erik mentioned, check
>>>>>>>> ZK. Also, I'm not sure if this is still required, but it used to be
>>>>>>>> helpful to make sure your Storm nodes have each other listed in
>>>>>>>> /etc/hosts.
>>>>>>>>
>>>>>>>> On Jun 12, 2015 8:59 AM, "Nathan Leung" <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Make sure your topology is starting up in the allotted time, and
>>>>>>>>> if not, try increasing the startup timeout.
>>>>>>>>>
>>>>>>>>> On Jun 12, 2015 2:46 AM, "Fang Chen" <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Erik,
>>>>>>>>>>
>>>>>>>>>> Thanks for your reply! It's great to hear about real production
>>>>>>>>>> usage. For our use case, we are really puzzled by the outcome so
>>>>>>>>>> far. The initial investigation seems to indicate that workers
>>>>>>>>>> don't die by themselves (I actually tried killing the supervisor
>>>>>>>>>> and the worker would continue running beyond 30 minutes).
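[Editor's note: the "pauses are minimal" check above can be automated. With -XX:+PrintGCApplicationStoppedTime enabled (as in the childopts earlier in the thread), each safepoint writes a "Total time for which application threads were stopped" line to the GC log. A minimal sketch of a scanner follows; the 1-second threshold and the sample lines are illustrative.]

```python
import re

# Matches lines emitted by -XX:+PrintGCApplicationStoppedTime, e.g.
# "Total time for which application threads were stopped: 0.0012345 seconds"
PAUSE_RE = re.compile(
    r"Total time for which application threads were stopped: ([0-9.]+) seconds"
)

def long_pauses(lines, threshold_secs=1.0):
    """Return stop-the-world pause durations at or above threshold_secs."""
    pauses = []
    for line in lines:
        m = PAUSE_RE.search(line)
        if m:
            secs = float(m.group(1))
            if secs >= threshold_secs:
                pauses.append(secs)
    return pauses

# Two synthetic log lines; only the 2.5 s pause crosses the 1 s threshold.
sample = [
    "2015-06-15T11:56:01.123+0000: Total time for which application "
    "threads were stopped: 0.0012345 seconds",
    "2015-06-15T11:56:05.456+0000: Total time for which application "
    "threads were stopped: 2.5000000 seconds",
]
print(long_pauses(sample))  # -> [2.5]
```

In practice you would feed it the rotated gc.log files and compare any long pauses against the nimbus/supervisor timeout settings.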
>>>>>>>>>>
>>>>>>>>>> The sequence of events is like this: the supervisor immediately
>>>>>>>>>> complains that the worker "still has not started" for a few
>>>>>>>>>> seconds right after launching the worker process, then goes
>>>>>>>>>> silent --> after 26 minutes, nimbus complains that the executors
>>>>>>>>>> (related to the worker) are "not alive" and starts to reassign
>>>>>>>>>> the topology --> after another ~500 milliseconds, the supervisor
>>>>>>>>>> shuts down its worker --> other peer workers complain about netty
>>>>>>>>>> issues. And the loop goes on.
>>>>>>>>>>
>>>>>>>>>> Could you kindly tell me what version of zookeeper is used with
>>>>>>>>>> 0.9.4? And how many nodes are in the zookeeper cluster?
>>>>>>>>>>
>>>>>>>>>> I wonder if this is due to zookeeper issues.
>>>>>>>>>>
>>>>>>>>>> Thanks a lot,
>>>>>>>>>> Fang
>>>>>>>>>>
>>>>>>>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Fang,
>>>>>>>>>>>
>>>>>>>>>>> Yes, Groupon runs Storm 0.9.3 (with zeromq instead of netty) and
>>>>>>>>>>> Storm 0.9.4 (with netty) at scale, in clusters on the order of
>>>>>>>>>>> 30+ nodes.
>>>>>>>>>>>
>>>>>>>>>>> One of the challenges with Storm is figuring out what the root
>>>>>>>>>>> cause is when things go haywire. You'll want to examine why the
>>>>>>>>>>> nimbus decided to restart your worker processes. That happens
>>>>>>>>>>> when workers die and the nimbus notices that Storm executors
>>>>>>>>>>> aren't alive. (There are logs in nimbus for this.) Then you'll
>>>>>>>>>>> want to dig into why the workers died by looking at logs on the
>>>>>>>>>>> worker hosts.
>>>>>>>>>>>
>>>>>>>>>>> - Erik
>>>>>>>>>>>
>>>>>>>>>>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> We have been testing Storm from 0.9.0.1 through 0.9.4 (I have
>>>>>>>>>>>> not tried 0.9.5 yet, but I don't see any significant
>>>>>>>>>>>> differences there), and unfortunately we could not even get a
>>>>>>>>>>>> clean run of over 30 minutes on a cluster of 5 high-end nodes.
>>>>>>>>>>>> Zookeeper is also set up on these nodes, but on different
>>>>>>>>>>>> disks.
>>>>>>>>>>>>
>>>>>>>>>>>> I have had huge trouble getting my data analytics topology to
>>>>>>>>>>>> run stably, so I tried the simplest topology I could think of:
>>>>>>>>>>>> just an empty bolt, no I/O except for reading from the kafka
>>>>>>>>>>>> queue.
>>>>>>>>>>>>
>>>>>>>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt
>>>>>>>>>>>> (kafka topic partition # = 1, spout task # = 1, bolt # = 20
>>>>>>>>>>>> with field grouping, msg size = 1k): after 26 minutes, nimbus
>>>>>>>>>>>> orders the topology killed as it believes the topology is dead;
>>>>>>>>>>>> then after another 2 minutes, another kill; then another after
>>>>>>>>>>>> another 4 minutes; and on and on.
>>>>>>>>>>>>
>>>>>>>>>>>> I can understand there might be issues in the coordination
>>>>>>>>>>>> among nimbus, workers, and executors (e.g., heartbeats). But
>>>>>>>>>>>> are there any doable workarounds? I wish there are, as so many
>>>>>>>>>>>> of you are using it in production :-)
>>>>>>>>>>>>
>>>>>>>>>>>> I deeply appreciate any suggestions that could even make my toy
>>>>>>>>>>>> topology work!
>>>>>>>>>>>>
>>>>>>>>>>>> Fang
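[Editor's note: for anyone hitting the same "still has not started" / "not alive" sequence, the timeouts involved are configurable in storm.yaml. The sketch below names the relevant keys with their values as I recall them from the 0.9.x defaults.yaml; treat the numbers as approximate, and raise them only as a diagnostic experiment, e.g. per Nathan's suggestion about the startup timeout.]

```yaml
# Storm 0.9.x defaults (approximate); shown only to name the knobs involved.
nimbus.task.timeout.secs: 30               # nimbus declares executors "not alive" after this
nimbus.supervisor.timeout.secs: 60         # nimbus gives up on a silent supervisor
supervisor.worker.start.timeout.secs: 120  # grace period for "still has not started"
supervisor.worker.timeout.secs: 30         # supervisor kills a worker with stale heartbeats
storm.zookeeper.session.timeout: 20000     # ZK session timeout, in milliseconds
topology.max.spout.pending: null           # per-topology cap on in-flight tuples (unset by default)
```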
