[AFMUG] Monitoring for the WIN

Dennis Burgess Fri, 21 May 2021 06:17:16 -0700

Just wanted to type up a small experience that I had with our network 
monitoring system.  We have a number of servers that do processing jobs.  These 
are 24 to 40 core Xeon servers that run Windows 10.  We ran Windows 10 on bare 
metal as we wanted the OS to control the power consumption on them, since they 
spend quite a bit of time waiting for jobs and not doing anything.


Well, starting last Tuesday, around 11am CST, my Zabbix installation started 
telling me one of our UPSs was in an overload state.  We have the overload 
trigger set to 90%.  This is very odd for us, as most of the time we run them 
under 50%, since we have a few power switches that allow us to switch between 
UPSs for non-dual power supplied devices.   So I started to review my graphs:

[cid:image001.png@01D74E19.62E82950]

WTF is up with this?  The GREEN line is the current output percentage based on 
the UPS load.  Note that, as I said, it started telling me around 11am on 
Tuesday that we were in an overload state, and I had no clue what could be 
causing this.  Did someone plug in something?  Like a vacuum or heater and 
leave it on?  Why would anyone be in there?

So I reviewed the security cameras: no one has been in our DC for over 30 days. 
So not that.  Well, I looked at our processing servers, and sure enough, one of 
the .EXEs had 45 copies running, each consuming 1.2% of CPU.  Then I looked at 
several other servers, guess what, same thing...  Looked at our DB: the jobs 
completed, but the .exe did not close out.  WTF...  I looked at our dev team 
logs and sure enough they updated our processing server .exe right at 11:10am 
on Monday.  Looking at my logs, guess what started to go up then...
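For scale, 45 copies at 1.2% each is about 54% of CPU on that one server.  A 
check like the one I did by hand could be sketched in Python as below.  This is 
only a rough illustration, not what we actually run: it assumes you feed it the 
output of `tasklist /FO CSV /NH`, and the image name worker.exe, the sample 
rows, and the max_expected limit are all made up for the example.

```python
import csv
import io
from collections import Counter


def count_images(tasklist_csv: str) -> Counter:
    """Count running processes per image name from `tasklist /FO CSV /NH` text."""
    counts = Counter()
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if row:  # skip blank lines
            counts[row[0].lower()] += 1
    return counts


def leftovers(counts: Counter, image: str, max_expected: int) -> int:
    """Return how many copies of `image` exceed `max_expected` (0 if none)."""
    return max(0, counts[image.lower()] - max_expected)


# Hypothetical sample rows; columns are
# "Image Name","PID","Session Name","Session#","Mem Usage"
SAMPLE = '''"worker.exe","4120","Services","0","18,204 K"
"worker.exe","5332","Services","0","18,116 K"
"worker.exe","6020","Services","0","18,340 K"
"svchost.exe","912","Services","0","9,880 K"
'''

if __name__ == "__main__":
    counts = count_images(SAMPLE)
    # Number of worker.exe copies, then how many are beyond the allowed one.
    print(counts["worker.exe"], leftovers(counts, "worker.exe", max_expected=1))
```

Feeding a count like this into the monitoring system would flag processes that 
pile up even when each copy uses very little CPU on its own.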

So I killed the .exe that was not closing out, and informed the devs that they 
need to clean up what they are doing.  Here is a screenshot from right after I 
killed the .exe on around 10 servers.

[cid:image002.png@01D74E19.62E82950]

So... This spike represents around 2000 watts of power usage that just DROPPED 
... That's 18 amps that we stopped pulling.  Quite a bit!!!  This is what 
Windows 10 power saving, CPU bursting, etc. save us!  Just on one UPS!   
Furthermore, monitoring and setting triggers allowed us to identify an issue 
that we would normally have never known about.  We also adjusted the CPU 
monitors on our processing servers.  Most of the servers were around 60-80% 
used, so the old trigger of 90% for over 5 minutes never alarmed.  They now 
trigger if the CPU is above 50% for over 10 min.
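To show why the old rule stayed quiet, here is a minimal Python sketch of an 
"above X% for over N minutes" check.  It is not our actual Zabbix trigger, just 
the same idea in plain code.  The function name, the one-sample-per-minute 
polling, and the flat 70% readings are assumptions for the example.

```python
from typing import Iterable, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, CPU percent)


def sustained_above(samples: Iterable[Sample], threshold: float, window_s: float) -> bool:
    """True if every sample in the most recent `window_s` seconds is above
    `threshold`, and the samples actually reach back that far."""
    ordered = sorted(samples)
    if not ordered:
        return False
    newest = ordered[-1][0]
    recent = [s for s in ordered if s[0] >= newest - window_s]
    covers_window = ordered[0][0] <= newest - window_s
    return covers_window and all(cpu > threshold for _, cpu in recent)


if __name__ == "__main__":
    # Hypothetical server stuck at 70% CPU, polled once a minute for 15 minutes.
    stuck = [(t * 60.0, 70.0) for t in range(16)]
    print(sustained_above(stuck, 50.0, 600.0))  # new rule: 50% for 10 min
    print(sustained_above(stuck, 90.0, 300.0))  # old rule: 90% for 5 min
```

With a server sitting at 70%, the new rule fires and the old one does not, 
which matches what we saw.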



Dennis Burgess

Mikrotik : Trainer, Network Associate, Routing Engineer, Wireless Engineer, 
Traffic Control Engineer, Inter-Networking Engineer, Security Engineer, 
Enterprise Wireless Engineer
Hurricane Electric: IPv6 Sage Level
Cambium: ePMP

Author of "Learn RouterOS- Second Edition"
Link Technologies, Inc -- Mikrotik & WISP Support Services
Office: 314-735-0270  Website: http://www.linktechs.net
Create Wireless Coverage's with www.towercoverage.com
Need MikroTik Cloud Management: https://cloud.linktechs.net

