Synopsis: if the sensors show missing data, reset the BMC before
rebooting the system to avoid the unable-to-boot long beep issue.

I found a reliably reproducible workaround for this problem that keeps
remote control of the system without the need to trip the mains
breaker.  It entirely prevents the long beep issue and allows the
system to be used in headless remote environments without remote mains
power cycle capability or remote hands intervention.

I have not had to disable the lm(4) sensor as previously advised for
the workaround, and I have reached the conclusion that this problem is
not caused by the driver itself in the first place, but by buggy BMC
firmware.

It is therefore advisable to contact Supermicro technical support again
and ask for a reliable BMC firmware update that does not exhibit the
problem.

After the system has been running for a while (the interval is not
deterministic, but it is above 30 minutes), the sensors start to
display wrong (missing) values and can no longer provide data points to
the BMC firmware.  This is seen both in direct and networked IPMI
access and in the web based management interface.  At this point a
reboot leaves the system unable to boot, with the dreaded long beep.
Only cutting mains power (power supply breaker or power distribution
unit) for a couple of seconds unblocks the system so that it boots
successfully again.  This, however, totally undermines the remote
control capabilities of the system: every reboot turns into a request
for someone to power cycle the mains by hand.

The workaround is to reset the BMC before attempting to reboot the
system.  It works over the network directly via IPMI and also via the
web based BMC interface.  The reset only reboots the IPMI controller
and its embedded firmware, not the system.  After a couple of minutes
the sensors poll correct data again and display it properly.  A system
reboot issued at this point succeeds as expected, and the system boots
up and works properly until some indeterminate longer time passes again
(from 1 hour to days) and the BMC gets stuck again (it does so with
certainty).  The indication is missing sensor data and, on reboot, a
failure to boot with the long beep.
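
The sequence above can be sketched as a small shell script.  This is
only a sketch under assumptions, not a tested tool: it assumes ipmitool
is installed and can reach the BMC, that "ipmitool sdr" prints one
"name | reading | status" line per sensor, and that a reading on the
"System Temp" sensor means the BMC has recovered.  The function names,
the 300 second timeout and the 15 second poll interval are made up for
illustration.

```sh
#!/bin/sh
# Sketch only: reset the BMC if its sensors are stuck, wait for them to
# come back, then reboot.  Commands are overridable for dry runs.
# Assumption: ipmitool is the tool in use.
: "${SENSOR_CMD:=ipmitool sdr}"
: "${RESET_CMD:=ipmitool mc reset cold}"
: "${REBOOT_CMD:=reboot}"

# Succeeds when the "System Temp" sensor reports an actual reading.
# A single sensor is checked because some sensors (the fans here) can
# show "no reading" even while the BMC is healthy.
sensors_ok() {
    $SENSOR_CMD 2>/dev/null |
        awk -F'|' '
            $1 ~ /^System Temp/ && $2 !~ /no reading/ { ok = 1 }
            END { exit(ok ? 0 : 1) }'
}

# Poll until sensors_ok succeeds; give up after $1 seconds,
# sleeping $2 seconds between attempts.
wait_for_sensors() {
    timeout=$1 interval=$2 waited=0
    until sensors_ok; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$interval"
        waited=$((waited + interval))
    done
}

# Reset the BMC only when the sensors are stuck, and reboot only once
# the sensors report data again.
safe_reboot() {
    if ! sensors_ok; then
        $RESET_CMD || return 1
        wait_for_sensors 300 15 || {
            echo "BMC sensors did not recover; not rebooting" >&2
            return 1
        }
    fi
    $REBOOT_CMD
}
```

After sourcing the file, running safe_reboot as root would perform the
reset only when needed.  For a BMC reached over the network,
SENSOR_CMD and RESET_CMD could be set to the ipmitool network form
instead, for example "ipmitool -I lanplus -H bmc.example.org -U admin
-E sdr", where the host and user names are placeholders.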

This is NOT OS specific, unless the driver polling the sensors causes
the sensors subsystem in the embedded controller OS to crash.  The only
factor found to affect it so far is how long the system has been
running without a mains power cycle.  It is a flaw in the BMC firmware,
and the real solution is to demand an updated firmware without this
fault from Supermicro.  It would help if more people voiced their
concerns about this, so that Supermicro technical support issues an
updated BMC firmware and publishes it on their web site.

Here is how it looks when the BMC is stuck:

$ ipmi-sensor                                                                 
System Temp      | no reading        | ns
CPU Temp         | no reading        | ns
CPU FAN          | no reading        | ns
SYS FAN          | no reading        | ns
CPU Vcore        | no reading        | ns
Vichcore         | no reading        | ns
+3.3VCC          | no reading        | ns
VDIMM            | no reading        | ns
+5 V             | no reading        | ns
+12 V            | no reading        | ns
+3.3VSB          | no reading        | ns
VBAT             | no reading        | ns
Chassis Intru    | no reading        | ns
PS Status        | 0x00              | ok

$ ipmi-sensor-detail
System Temp      | na         |            | na    | na        | na        | na        | na        | na        | na
CPU Temp         | na         |            | na    | na        | na        | na        | na        | na        | na
CPU FAN          | na         |            | na    | na        | na        | na        | na        | na        | na
SYS FAN          | na         |            | na    | na        | na        | na        | na        | na        | na
CPU Vcore        | na         |            | na    | na        | na        | na        | na        | na        | na
Vichcore         | na         |            | na    | na        | na        | na        | na        | na        | na
+3.3VCC          | na         |            | na    | na        | na        | na        | na        | na        | na
VDIMM            | na         |            | na    | na        | na        | na        | na        | na        | na
+5 V             | na         |            | na    | na        | na        | na        | na        | na        | na
+12 V            | na         |            | na    | na        | na        | na        | na        | na        | na
+3.3VSB          | na         |            | na    | na        | na        | na        | na        | na        | na
VBAT             | na         |            | na    | na        | na        | na        | na        | na        | na
Chassis Intru    | na         | discrete   | na    | na        | na        | na        | na        | na        | na
PS Status        | 0x0        | discrete   | 0x00ff| na        | na        | na        | na        | na        | na
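
To spot this state from a script, it is not enough to look for any "no
reading" line, because the two fan sensors show no reading on this
board even after a successful BMC reset.  The following is a minimal
sketch, assuming sensor lines in the "name | reading | status" form
shown above arrive on standard input.  The function name and the
comma-separated ignore list are hypothetical, chosen for illustration.

```sh
#!/bin/sh
# Sketch only: print the names of sensors that report "no reading",
# skipping the sensors named in the comma-separated list given as $1.
# Input is "name | reading | status" lines on stdin.
missing_sensors() {
    awk -F'|' -v skip="${1:-}" '
        BEGIN {
            # Build a lookup table of sensor names to ignore.
            n = split(skip, s, ",")
            for (i = 1; i <= n; i++) ignore[s[i]] = 1
        }
        {
            # Strip the column padding from the sensor name.
            name = $1
            sub(/[ \t]+$/, "", name)
            if ($2 ~ /no reading/ && !(name in ignore)) print name
        }'
}
```

Assuming ipmitool is the tool behind the listing above, it could be
used as: ipmitool sdr | missing_sensors "CPU FAN,SYS FAN".  Empty
output would mean that every sensor expected to report data does so.
Any names printed point to a stuck BMC.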

Here is how it looks after BMC reset:

$ ipmi-reset  
Sent cold reset command to MC

~75 seconds later:

$ ipmi-sensor 
System Temp      | 38 degrees C      | ok
CPU Temp         | 38 degrees C      | ok
CPU FAN          | no reading        | ns
SYS FAN          | no reading        | ns
CPU Vcore        | 1.10 Volts        | ok
Vichcore         | 1.04 Volts        | ok
+3.3VCC          | 3.31 Volts        | ok
VDIMM            | 1.53 Volts        | ok
+5 V             | 5.09 Volts        | ok
+12 V            | 12.03 Volts       | ok
+3.3VSB          | 3.28 Volts        | ok
VBAT             | 3.12 Volts        | ok
Chassis Intru    | 0x00              | ok
PS Status        | 0x00              | ok

$ ipmi-sensor-detail
System Temp      | 38.000     | degrees C  | ok    | -9.000    | -7.000    | -5.000    | 75.000    | 77.000    | 79.000
CPU Temp         | 38.000     | degrees C  | ok    | -11.000   | -8.000    | -5.000    | 85.000    | 90.000    | 95.000
CPU FAN          | na         |            | na    | na        | na        | na        | na        | na        | na
SYS FAN          | na         |            | na    | na        | na        | na        | na        | na        | na
CPU Vcore        | 1.096      | Volts      | ok    | 0.640     | 0.664     | 0.688     | 1.344     | 1.408     | 1.472
Vichcore         | 1.040      | Volts      | ok    | 0.808     | 0.824     | 0.840     | 1.160     | 1.176     | 1.192
+3.3VCC          | 3.312      | Volts      | ok    | 2.816     | 2.880     | 2.944     | 3.584     | 3.648     | 3.712
VDIMM            | 1.528      | Volts      | ok    | 1.312     | 1.328     | 1.344     | 1.648     | 1.664     | 1.680
+5 V             | 5.088      | Volts      | ok    | 4.096     | 4.320     | 4.576     | 5.344     | 5.600     | 5.632
+12 V            | 12.031     | Volts      | ok    | 10.706    | 10.600    | 10.494    | 13.091    | 13.197    | 13.303
+3.3VSB          | 3.280      | Volts      | ok    | 2.816     | 2.880     | 2.944     | 3.584     | 3.648     | 3.712
VBAT             | 3.120      | Volts      | ok    | 2.560     | 2.624     | 2.688     | 3.328     | 3.392     | 3.456
Chassis Intru    | 0x0        | discrete   | 0x0000| na        | na        | na        | na        | na        | na
PS Status        | 0x0        | discrete   | 0x00ff| na        | na        | na        | na        | na        | na

The workaround applies to this specific main board:

MBD-X7SPA-HF-D525-O

The main board was bought brand new in May 2011, in its original
packaging, from an official retailer carrying Supermicro products, and
it uses memory modules from the qualified vendor list.

http://www.supermicro.com/products/motherboard/ATOM/ICH9/X7SPA-HF-D525.cfm

The BMC and BIOS firmware are the latest available from the Supermicro
web site:

Firmware Revision: 03.16
Firmware Build Time: 2014-06-30

Supermicro X7SPA/X7SPE/X7SPT Series BIOS Date:07/19/13 BIOS Rev:1.2b            

Hopefully this helps with further diagnostics and, in the meantime,
serves as a workaround that lets people with boards showing the same
problem operate them remotely until a BMC firmware fixing the issue is
available.

Regards,
Anton
