FWIW if you were using Xymon you would have been paged Monday at lunch time
that the load was crazy high on that box.  I would think Zabbix would
notify you that the load changed significantly as well, but I've never used
it.

When something has to chug data I get this:

[image: image.png]

Josh Luthman
24/7 Help Desk: 937-552-2340
Direct: 937-552-2343
1100 Wayne St
Suite 1337
Troy, OH 45373


On Fri, May 21, 2021 at 9:17 AM Dennis Burgess <dmburg...@linktechs.net>
wrote:

> Just wanted to type up a small experience that I had with our network
> monitoring system.  We have a number of servers that do processing jobs.
> These are 24 to 40 core Xeon servers that run Windows 10.  We put Windows 10
> on bare metal because we wanted the OS to control their power consumption,
> as they spend quite a bit of time waiting for jobs and not doing anything.
>
>
>
> Well, starting last Tuesday, around 11am CST, my Zabbix installation
> started telling me one of our UPSs was in an overload state.  We have the
> overload state set to 90%.  This is very odd for us, as most of the time we
> run them under 50% because we have a few power switches that allow us to
> switch between UPSs for devices without dual power supplies.  So I started
> to review my graphs:
>
>
>
>
>
> WTF is up with this?  The GREEN line is the current output percentage based
> on the UPS load.  Note that it started telling me around 11am on Tuesday
> that we were in an overload state, and I had no clue what could be causing
> this.  Did someone plug in something?  Like a vacuum or heater, and leave
> it on?  Why would anyone be in there?
>
>
>
> So I reviewed the security cameras: no one has been in our DC for over 30
> days.  So not that.  Then I looked at our processing servers, and sure
> enough, one of the .EXEs had 45 copies running, each consuming 1.2% of
> CPU.  Then I looked at several other servers and, guess what, same thing.
> I looked at our DB: the jobs completed, but the .exe did not close out.
> WTF.  I looked at our dev team logs and, sure enough, they updated our
> processing server .exe right at 11:10am on Monday.  Looking at my logs,
> guess what started to go up then.
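
A rough way to catch this kind of leftover-process pile-up is to group a process snapshot by executable name and total the CPU share. This is only a sketch: `worker.exe`, `other.exe`, the snapshot format, and the `max_copies` limit are all made up for illustration, and the sample data simply reuses the 45 copies at 1.2% each from the paragraph above.

```python
from collections import defaultdict

def summarize_by_name(snapshot):
    """Group (name, cpu_percent) pairs by name; return {name: (count, total_cpu)}."""
    totals = defaultdict(lambda: [0, 0.0])
    for name, cpu in snapshot:
        totals[name][0] += 1      # number of running copies
        totals[name][1] += cpu    # combined CPU share of those copies
    return {name: (count, round(cpu, 1)) for name, (count, cpu) in totals.items()}

def runaway_names(snapshot, max_copies=8):
    """Names with more running copies than max_copies (hypothetical limit)."""
    summary = summarize_by_name(snapshot)
    return sorted(name for name, (count, _) in summary.items() if count > max_copies)

# Hypothetical snapshot: 45 stuck workers at 1.2% each plus one unrelated process.
snapshot = [("worker.exe", 1.2)] * 45 + [("other.exe", 3.0)]
summary = summarize_by_name(snapshot)
suspects = runaway_names(snapshot)
```

Individually each copy looks harmless at 1.2%, which is why the per-name total is the number worth watching.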
>
>
>
> So I killed the .exe that was not closing out, and informed the devs that
> they need to clean up what they are doing.  Screenshot right after I killed
> the .exe on around 10 servers:
>
>
>
>
>
> So, this spike represents around 2000 watts of power usage that just
> DROPPED.  That's 18 amps that we stopped pulling.  Quite a bit!  This is
> what Windows 10 power saving, CPU bursting, etc. save us, and that is just
> on one UPS!  Furthermore, monitoring and setting triggers allowed us to
> identify an issue that we would normally never have known about.  Most of
> our processing servers were sitting around 60-80% CPU, so we adjusted
> their CPU monitors to trigger if the CPU is above 50% for over 10 minutes,
> instead of above 90% for over 5 minutes, as the latter never alarmed.
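
To see why the old rule stayed quiet, here is a minimal sketch of a "sustained above threshold" check in plain Python. It is not Zabbix trigger syntax; the function name, the one-sample-per-minute assumption, and the flat 54% series (45 copies at about 1.2% each) are illustrative assumptions only.

```python
def sustained_above(samples, threshold, minutes):
    """True if CPU% stayed above `threshold` for more than `minutes`
    consecutive samples (assumes one sample per minute)."""
    run = 0
    for value in samples:
        # Extend the current streak, or reset it when the value dips.
        run = run + 1 if value > threshold else 0
        if run > minutes:
            return True
    return False

# Hypothetical flat series: 45 stuck processes at ~1.2% each, for 30 minutes.
stuck = [45 * 1.2] * 30

old_rule = sustained_above(stuck, threshold=90, minutes=5)   # 90% for over 5 min
new_rule = sustained_above(stuck, threshold=50, minutes=10)  # 50% for over 10 min
```

A load that plateaus well below 90% never trips the old rule no matter how long it lasts, while the lower threshold with a longer window catches it without firing on short bursts.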
>
>
>
>
>
>
>
>
> *Dennis Burgess*
>
>
> * Mikrotik : **Trainer, Network Associate, Routing Engineer, Wireless
> Engineer, Traffic Control Engineer, Inter-Networking Engineer, Security
> Engineer, Enterprise Wireless Engineer*
>
> *Hurricane Electric: **IPv6 Sage Level*
>
> *Cambium: **ePMP*
>
>
>
> Author of "Learn RouterOS - Second Edition"
>
> *Link Technologies, Inc* -- Mikrotik & WISP Support Services
>
> *Office*: 314-735-0270  Website: http://www.linktechs.net
>
> Create Wireless Coverage’s with www.towercoverage.com
>
> Need MikroTik Cloud Management: https://cloud.linktechs.net
>
>
>
> --
> AF mailing list
> AF@af.afmug.com
> http://af.afmug.com/mailman/listinfo/af_af.afmug.com
>
