What's your max spout pending value for the topology? Also observe the CPU usage, i.e. how many cycles the worker process is consuming.
Thanks and Regards,
Devang

On 19 Jun 2015 02:46, "Fang Chen" <[email protected]> wrote:
> Tried. No effect.
>
> Thanks,
> Fang
>
> On Mon, Jun 15, 2015 at 9:33 PM, Binh Nguyen Van <[email protected]> wrote:
>
>> Can you try this?
>>
>> Remove the -XX:+CMSScavengeBeforeRemark flag and reduce your heap size
>> so that a YGC happens once every 2-3 seconds.
>> If that fixes the issue, then I think GC is the cause of your problem.
>>
>> On Mon, Jun 15, 2015 at 11:56 AM, Fang Chen <[email protected]> wrote:
>>
>>> We use Storm bare bones, not Trident, as it's too expensive for our use
>>> cases. The JVM options for the supervisor are listed below, but they
>>> might not be optimal in any sense.
>>>
>>> supervisor.childopts: "-Xms2G -Xmx2G -XX:NewSize=1G -XX:+UseParNewGC
>>> -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=6
>>> -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly
>>> -XX:+UseTLAB -XX:+UseCondCardMark -XX:CMSWaitDuration=5000
>>> -XX:+CMSScavengeBeforeRemark -XX:+UnlockDiagnosticVMOptions
>>> -XX:ParGCCardsPerStrideChunk=4096 -XX:+ExplicitGCInvokesConcurrent
>>> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution
>>> -XX:PrintFLSStatistics=1 -XX:+PrintPromotionFailure
>>> -XX:+PrintGCApplicationStoppedTime -XX:+PrintHeapAtGC
>>> -XX:+PrintSafepointStatistics -Xloggc:/usr/local/storm/logs/gc.log
>>> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10
>>> -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote.port=9998
>>> -Dcom.sun.management.jmxremote.ssl=false
>>> -Dcom.sun.management.jmxremote.authenticate=false"
>>>
>>> On Mon, Jun 15, 2015 at 11:48 AM, Binh Nguyen Van <[email protected]> wrote:
>>>
>>>> Just to be sure, are you using Storm or Storm Trident?
>>>> Also, can you share the current setting of your supervisor.childopts?
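[Editor's note: Binh's experiment above would amount to something like the supervisor.childopts below; this is a sketch only. The heap and new-generation sizes are illustrative guesses, to be tuned against the GC log until young collections fire roughly every 2-3 seconds.]

```yaml
# Sketch: -XX:+CMSScavengeBeforeRemark removed, heap shrunk so young GCs
# happen every few seconds. All sizes here are guesses, not recommendations.
supervisor.childopts: "-Xms1G -Xmx1G -XX:NewSize=256M -XX:+UseParNewGC
  -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
  -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps
  -XX:+PrintGCApplicationStoppedTime -Xloggc:/usr/local/storm/logs/gc.log"
```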
>>>>
>>>> On Mon, Jun 15, 2015 at 11:39 AM, Fang Chen <[email protected]> wrote:
>>>>
>>>>> I did enable GC logging for both the worker and the supervisor and
>>>>> found nothing abnormal (pauses are minimal and the frequency is normal
>>>>> too). I tried max spout pending values of both 1000 and 500.
>>>>>
>>>>> Fang
>>>>>
>>>>> On Mon, Jun 15, 2015 at 11:36 AM, Binh Nguyen Van <[email protected]> wrote:
>>>>>
>>>>>> Hi Fang,
>>>>>>
>>>>>> Did you check your GC log? Do you see anything abnormal?
>>>>>> What is your current max spout pending setting?
>>>>>>
>>>>>> On Mon, Jun 15, 2015 at 11:28 AM, Fang Chen <[email protected]> wrote:
>>>>>>
>>>>>>> I also did this and had no success.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Fang
>>>>>>>
>>>>>>> On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung <[email protected]> wrote:
>>>>>>>
>>>>>>>> After I wrote that, I realized you tried the empty topology anyway.
>>>>>>>> This should reduce any GC- or worker-initialization-related
>>>>>>>> failures, though they are still possible. As Erik mentioned, check
>>>>>>>> ZK. Also, I'm not sure if this is still required, but it used to be
>>>>>>>> helpful to make sure your Storm nodes have each other listed in
>>>>>>>> /etc/hosts.
>>>>>>>>
>>>>>>>> On Jun 12, 2015 8:59 AM, "Nathan Leung" <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Make sure your topology is starting up in the allotted time, and
>>>>>>>>> if not, try increasing the startup timeout.
>>>>>>>>>
>>>>>>>>> On Jun 12, 2015 2:46 AM, "Fang Chen" <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Erik,
>>>>>>>>>>
>>>>>>>>>> Thanks for your reply! It's great to hear about real production
>>>>>>>>>> usage. For our use case, we are really puzzled by the outcome so
>>>>>>>>>> far. The initial investigation seems to indicate that workers
>>>>>>>>>> don't die by themselves (I actually tried killing the supervisor
>>>>>>>>>> and the worker would continue running beyond 30 minutes).
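[Editor's note: the "pauses are minimal" check above can be automated. With -XX:+PrintGCApplicationStoppedTime enabled (as in the childopts earlier in the thread), each safepoint writes a "Total time for which application threads were stopped" line to the GC log. A minimal sketch of a scanner follows; the 1-second threshold and the sample lines are illustrative.]

```python
import re

# Matches lines emitted by -XX:+PrintGCApplicationStoppedTime, e.g.
# "Total time for which application threads were stopped: 0.0012345 seconds"
PAUSE_RE = re.compile(
    r"Total time for which application threads were stopped: ([0-9.]+) seconds"
)

def long_pauses(lines, threshold_secs=1.0):
    """Return stop-the-world pause durations at or above threshold_secs."""
    pauses = []
    for line in lines:
        m = PAUSE_RE.search(line)
        if m:
            secs = float(m.group(1))
            if secs >= threshold_secs:
                pauses.append(secs)
    return pauses

# Two synthetic log lines; only the 2.5 s pause crosses the 1 s threshold.
sample = [
    "2015-06-15T11:56:01.123+0000: Total time for which application "
    "threads were stopped: 0.0012345 seconds",
    "2015-06-15T11:56:05.456+0000: Total time for which application "
    "threads were stopped: 2.5000000 seconds",
]
print(long_pauses(sample))  # -> [2.5]
```

In practice you would feed it the rotated gc.log files and compare any long pauses against the nimbus/supervisor timeout settings.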
>>>>>>>>>>
>>>>>>>>>> The sequence of events is like this: the supervisor immediately
>>>>>>>>>> complains that the worker "still has not started" for a few
>>>>>>>>>> seconds right after launching the worker process, then goes
>>>>>>>>>> silent --> after 26 minutes, nimbus complains that the executors
>>>>>>>>>> (related to the worker) are "not alive" and starts to reassign
>>>>>>>>>> the topology --> after another ~500 milliseconds, the supervisor
>>>>>>>>>> shuts down its worker --> other peer workers complain about netty
>>>>>>>>>> issues. And the loop goes on.
>>>>>>>>>>
>>>>>>>>>> Could you kindly tell me what version of zookeeper is used with
>>>>>>>>>> 0.9.4? And how many nodes are in the zookeeper cluster?
>>>>>>>>>>
>>>>>>>>>> I wonder if this is due to zookeeper issues.
>>>>>>>>>>
>>>>>>>>>> Thanks a lot,
>>>>>>>>>> Fang
>>>>>>>>>>
>>>>>>>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Fang,
>>>>>>>>>>>
>>>>>>>>>>> Yes, Groupon runs Storm 0.9.3 (with zeromq instead of netty) and
>>>>>>>>>>> Storm 0.9.4 (with netty) at scale, in clusters on the order of
>>>>>>>>>>> 30+ nodes.
>>>>>>>>>>>
>>>>>>>>>>> One of the challenges with Storm is figuring out what the root
>>>>>>>>>>> cause is when things go haywire. You'll want to examine why the
>>>>>>>>>>> nimbus decided to restart your worker processes. That happens
>>>>>>>>>>> when workers die and the nimbus notices that Storm executors
>>>>>>>>>>> aren't alive. (There are logs in nimbus for this.) Then you'll
>>>>>>>>>>> want to dig into why the workers died by looking at logs on the
>>>>>>>>>>> worker hosts.
>>>>>>>>>>>
>>>>>>>>>>> - Erik
>>>>>>>>>>>
>>>>>>>>>>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> We have been testing Storm from 0.9.0.1 through 0.9.4 (I have
>>>>>>>>>>>> not tried 0.9.5 yet, but I don't see any significant
>>>>>>>>>>>> differences there), and unfortunately we could not even get a
>>>>>>>>>>>> clean run of over 30 minutes on a cluster of 5 high-end nodes.
>>>>>>>>>>>> Zookeeper is also set up on these nodes, but on different
>>>>>>>>>>>> disks.
>>>>>>>>>>>>
>>>>>>>>>>>> I have had huge trouble getting my data analytics topology to
>>>>>>>>>>>> run stably, so I tried the simplest topology I could think of:
>>>>>>>>>>>> just an empty bolt, no I/O except for reading from the kafka
>>>>>>>>>>>> queue.
>>>>>>>>>>>>
>>>>>>>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt
>>>>>>>>>>>> (kafka topic partition # = 1, spout task # = 1, bolt # = 20
>>>>>>>>>>>> with field grouping, msg size = 1k): after 26 minutes, nimbus
>>>>>>>>>>>> orders the topology killed as it believes the topology is dead;
>>>>>>>>>>>> then after another 2 minutes, another kill; then another after
>>>>>>>>>>>> another 4 minutes; and on and on.
>>>>>>>>>>>>
>>>>>>>>>>>> I can understand there might be issues in the coordination
>>>>>>>>>>>> among nimbus, workers, and executors (e.g., heartbeats). But
>>>>>>>>>>>> are there any doable workarounds? I wish there are, as so many
>>>>>>>>>>>> of you are using it in production :-)
>>>>>>>>>>>>
>>>>>>>>>>>> I deeply appreciate any suggestions that could even make my toy
>>>>>>>>>>>> topology work!
>>>>>>>>>>>>
>>>>>>>>>>>> Fang
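[Editor's note: for anyone hitting the same "still has not started" / "not alive" sequence, the timeouts involved are configurable in storm.yaml. The sketch below names the relevant keys with their values as I recall them from the 0.9.x defaults.yaml; treat the numbers as approximate, and raise them only as a diagnostic experiment, e.g. per Nathan's suggestion about the startup timeout.]

```yaml
# Storm 0.9.x defaults (approximate); shown only to name the knobs involved.
nimbus.task.timeout.secs: 30               # nimbus declares executors "not alive" after this
nimbus.supervisor.timeout.secs: 60         # nimbus gives up on a silent supervisor
supervisor.worker.start.timeout.secs: 120  # grace period for "still has not started"
supervisor.worker.timeout.secs: 30         # supervisor kills a worker with stale heartbeats
storm.zookeeper.session.timeout: 20000     # ZK session timeout, in milliseconds
topology.max.spout.pending: null           # per-topology cap on in-flight tuples (unset by default)
```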
