FWIW, if you were using Xymon you would have been paged Monday at lunchtime that the load was crazy high on that box. I would think Zabbix would also notify you that the load changed significantly, but I've never used it.
When something has to chug data I get this:

[image: image.png]

Josh Luthman
24/7 Help Desk: 937-552-2340
Direct: 937-552-2343
1100 Wayne St
Suite 1337
Troy, OH 45373

On Fri, May 21, 2021 at 9:17 AM Dennis Burgess <dmburg...@linktechs.net> wrote:

> Just wanted to type up a small experience that I had with our network monitoring system. We have a number of servers that do processing jobs. These are 24 to 40 core Xeon servers that run Windows 10. We ran Windows 10 on bare metal because we wanted the OS to control their power consumption, as they spend quite a bit of time waiting for jobs and not doing anything.
>
> Well, starting last Tuesday, around 11am CST, my Zabbix installation started telling me one of our UPSs was in an overload state. We have the overload state set to 90%. This is very odd for us, as most of the time we run them under 50%, since we have a few power switches that allow us to switch between UPSes for non-dual power supplied devices. So I started to review my graphs:
>
> WTF is up with this? The GREEN line is the current output percentage based on the UPS load. Note I stated it started telling me around 11am on Tuesday that we were in an overload state, and I am like, no clue what could be causing this. Did someone plug in something? Like a vacuum or heater, and leave it on? Why would anyone be in there?
>
> So I reviewed the security cameras: no one has been in our DC for over 30 days. So not that. Well, I looked at our processing servers, and sure enough, one of the .EXEs had 45 copies running, each consuming 1.2% of CPU. Then I looked at several other servers, and guess what, same thing… Looked at our DB: the jobs completed, but the .exe did not close out. WTF…
> I looked at our dev team logs and sure enough they updated our processing server .exe right at 11:10am on Monday. Looking at my logs, guess what started to go up then…
>
> So I killed the .exe that was not closing out, and informed the devs that they need to clean up what they are doing. Screenshot right after I killed the .exe on around 10 servers:
>
> So… this spike represents around 2000 watts of power usage that just DROPPED… That's 18 amps that we stopped pulling. Quite a bit! This is what Windows 10 power saving, CPU bursting, etc. saves us! Just on one UPS! Furthermore, monitoring and setting triggers allowed us to identify an issue that we would normally never have known about. We adjusted the CPU monitors on our processing servers, as most of them were around 60-80% used: they now trigger if the CPU is above 50% for over 10 minutes, vs. 90% for over 5 minutes, as the latter never alarmed.
>
> Dennis Burgess
>
> Mikrotik: Trainer, Network Associate, Routing Engineer, Wireless Engineer, Traffic Control Engineer, Inter-Networking Engineer, Security Engineer, Enterprise Wireless Engineer
> Hurricane Electric: IPv6 Sage Level
> Cambium: ePMP
>
> Author of "Learn RouterOS- Second Edition"
>
> Link Technologies, Inc -- Mikrotik & WISP Support Services
> Office: 314-735-0270 Website: http://www.linktechs.net
> Create Wireless Coverage's with www.towercoverage.com
> Need MikroTik Cloud Management: https://cloud.linktechs.net
--
AF mailing list
AF@af.afmug.com
http://af.afmug.com/mailman/listinfo/af_af.afmug.com