Just wanted to type up a small experience I had with our network monitoring system. We have a number of servers that run processing jobs. These are 24- to 40-core Xeon servers running Windows 10. We put Windows 10 on bare metal because we wanted the OS to control their power consumption, as they spend quite a bit of time waiting for jobs and not doing anything.
Well, starting last Tuesday, around 11am CST, my Zabbix installation started telling me one of our UPSs was in an overload state. We have the overload state set to 90%. This is very odd for us, as most of the time we run them under 50%, and we have a few power switches that allow us to switch between UPSs for devices without dual power supplies. So I started to review my graphs:

[image001.png]

WTF is up with this? The GREEN line is the current output percentage based on the UPS load. Note that it started telling me around 11am on Tuesday that we were in an overload state, and I had no clue what could be causing it. Did someone plug something in, like a vacuum or heater, and leave it on? Why would anyone be in there? So I reviewed the security cameras: no one has been in our DC for over 30 days. So not that.

Then I looked at our processing servers and, sure enough, one of the .EXEs had 45 copies running, each consuming 1.2% of CPU. I looked at several other servers and, guess what, same thing. I looked at our DB: the jobs had completed, but the .exe did not close out. WTF.

I looked at our dev team logs and, sure enough, they updated our processing server .exe right at 11:10am on Monday. Looking at my logs, guess what started to go up then. So I killed the .exe that was not closing out and informed the devs that they need to clean up what they are doing.

Screenshot right after I killed the .exe on around 10 servers:

[image002.png]

So this drop represents around 2000 watts of power usage that just DROPPED. That's 18 amps that we stopped pulling. Quite a bit! This is what Windows 10 power saving, CPU bursting, etc. save us, just on one UPS! Furthermore, monitoring and setting triggers allowed us to identify an issue that we would normally never have known about.
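As a quick sanity check on the 2000 watts to 18 amps figure: current is just power divided by voltage. The sketch below assumes a single-phase feed and a power factor of 1, neither of which is stated above, and the function name and the candidate voltages are only for illustration.

```python
def amps_from_watts(watts: float, volts: float, power_factor: float = 1.0) -> float:
    """Current in amps for a given real power draw (I = P / (V * PF))."""
    return watts / (volts * power_factor)


if __name__ == "__main__":
    dropped_watts = 2000.0  # the load that disappeared after killing the stuck .exe
    # Candidate line voltages are assumptions; the actual feed voltage is not given.
    for volts in (110.0, 120.0):
        amps = amps_from_watts(dropped_watts, volts)
        print(f"{dropped_watts:.0f} W at {volts:.0f} V is about {amps:.1f} A")
```

At 110 V this works out to roughly 18 A, and at 120 V closer to 17 A, so the 18 amp figure is in the right range for a nominal 110-120 V circuit.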
We adjusted the CPU monitors on our processing servers, as most of them were around 60-80% used. They now trigger if the CPU is above 50% for over 10 minutes, instead of above 90% for over 5 minutes, as the latter never alarmed.

Dennis Burgess
Mikrotik: Trainer, Network Associate, Routing Engineer, Wireless Engineer, Traffic Control Engineer, Inter-Networking Engineer, Security Engineer, Enterprise Wireless Engineer
Hurricane Electric: IPv6 Sage Level
Cambium: ePMP
Author of "Learn RouterOS- Second Edition"
Link Technologies, Inc -- Mikrotik & WISP Support Services
Office: 314-735-0270
Website: http://www.linktechs.net
Create Wireless Coverage's with www.towercoverage.com
Need MikroTik Cloud Management: https://cloud.linktechs.net
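For anyone wondering why the 90% for 5 minutes rule stayed quiet while 50% for 10 minutes would have fired, here is a minimal sketch of a "sustained above threshold" check. It is not how Zabbix implements triggers, just the same idea as taking the minimum over a time window and comparing it to a limit. The function name, the one-sample-per-minute polling, and the 70% readings are all made up for illustration (70% was picked only because it sits inside the 60-80% range mentioned above).

```python
from typing import List, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, CPU utilisation in percent)


def sustained_above(samples: List[Sample], threshold_pct: float, window_s: int) -> bool:
    """True if every reading in the trailing window is above the threshold."""
    ordered = sorted(samples)
    if not ordered:
        return False
    newest_ts = ordered[-1][0]
    # Require enough history to cover the whole window before alarming.
    if newest_ts - ordered[0][0] < window_s:
        return False
    recent = [pct for ts, pct in ordered if ts >= newest_ts - window_s]
    return min(recent) > threshold_pct


if __name__ == "__main__":
    # Hypothetical data: one reading per minute for 15 minutes, stuck at 70%.
    stuck = [(60 * i, 70.0) for i in range(16)]
    print("old rule (90% for 5 min):", sustained_above(stuck, 90.0, 5 * 60))
    print("new rule (50% for 10 min):", sustained_above(stuck, 50.0, 10 * 60))
```

With readings like these the old rule never trips because the load never gets near 90%, while the new rule catches a server that sits well above its normal idle level for ten minutes straight.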
--
AF mailing list
AF@af.afmug.com
http://af.afmug.com/mailman/listinfo/af_af.afmug.com