We have been trying to find documentation on how to tell Xen to forward MCE 
information to the Linux kernel in dom0, so that a system administrator 
can be notified when the system has bad memory.  However, from what I 
can tell, this has not been documented anywhere.

If anyone knows of documentation (or knows the answer) on how to 
monitor corrected and uncorrected errors when running a modern Xen, 
it would be appreciated.


To clarify (and for people not familiar):

    When running an old Xen (for example, Xen 4.1) on a system, Linux in dom0 would 
load the EDAC modules, for example: amd64_edac_mod, edac_mce_amd, and edac_core.
    Once the modules were loaded, the error counts for ECC memory and PCI could be 
found in these "files":
               /sys/devices/system/edac/mc/mc0/ce_count
               /sys/devices/system/edac/mc/mc0/ue_count
               /sys/devices/system/edac/pci/pci0/npe_count
               /sys/devices/system/edac/pci/pci0/pe_count
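
For anyone who wants to compare against their own dom0, here is a minimal sketch of how those four counters could be read in one go. The function name and the "None means missing" convention are my own choices for illustration, and it assumes only the mc0/pci0 paths listed above (a machine with more memory controllers would have mc1, mc2, and so on).

```python
from pathlib import Path

# Counter files named in this message, relative to the EDAC sysfs root.
COUNTERS = (
    "mc/mc0/ce_count",
    "mc/mc0/ue_count",
    "pci/pci0/npe_count",
    "pci/pci0/pe_count",
)


def read_edac_counts(root="/sys/devices/system/edac"):
    """Return {relative_path: int or None}.

    None means the file is missing or unreadable, which is what I would
    expect to see when the EDAC modules are not loaded in dom0.
    """
    counts = {}
    for rel in COUNTERS:
        try:
            counts[rel] = int((Path(root) / rel).read_text().strip())
        except (OSError, ValueError):
            counts[rel] = None
    return counts


if __name__ == "__main__":
    for name, value in read_edac_counts().items():
        print(f"{name}: {'unavailable' if value is None else value}")
```

On a dom0 where the EDAC drivers do not load, every entry would presumably come back as unavailable.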

    However, in 2009-02, "cegger" wrote MCA/MCE_in_Xen, a proposal for having 
Xen start checking this information itself.
    Xen started accessing the EDAC information (now called "MCE") at some point 
after that, which blocks the Linux kernel in dom0 from accessing it.
    (I also found what appear to be related slides from a presentation from 
2012 at: 
https://lkml.iu.edu/hypermail/linux/kernel/1206.3/01304/xen_vMCE_design_%28v0_2%29.pdf
 )

    Now, the Linux kernel compile option CONFIG_XEN_MCE_LOG=y is documented 
as: "Allow kernel fetching MCE error from Xen platform and converting it into 
Linux mcelog format for mcelog tools".
       I imagine there must be some way on the Xen side to make this work, given 
that CONFIG_XEN_MCE_LOG got into the Linux kernel and is enabled by 
default in distributions.
       (Notes: mcelog seems to have been replaced with rasdaemon, but I 
believe that it pulls information using the same kernel API as mcelog.  So I 
believe the final report of whether you are having memory errors is now obtained 
by running "ras-mc-ctl --errors" instead of looking in 
/sys/devices/system/edac/mc and /sys/devices/system/edac/pci.)
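
To rule out the kernel side, here is a rough sketch of how one could confirm that the running dom0 kernel was actually built with CONFIG_XEN_MCE_LOG. The helper names are made up for illustration, and the two config locations (/boot/config-<release> and /proc/config.gz) are assumptions that depend on the distribution. This only shows how the kernel was built; it says nothing about whether Xen is forwarding anything.

```python
import gzip
import platform
from pathlib import Path


def config_option(text, name):
    """Return the value of kernel config option `name` ('y', 'm', ...).

    Returns None if the option is absent or commented out
    ("# CONFIG_FOO is not set").
    """
    prefix = name + "="
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line[len(prefix):]
    return None


def load_kernel_config():
    """Try the usual config locations; which one exists varies by distro."""
    boot = Path("/boot") / f"config-{platform.release()}"
    if boot.exists():
        return boot.read_text()
    proc = Path("/proc/config.gz")
    if proc.exists():
        return gzip.decompress(proc.read_bytes()).decode()
    return None


if __name__ == "__main__":
    text = load_kernel_config()
    if text is None:
        print("kernel config not found")
    else:
        print("CONFIG_XEN_MCE_LOG =", config_option(text, "CONFIG_XEN_MCE_LOG"))
```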
    I suspect that to check whether it is working on a modern system, one would 
run "ras-mc-ctl --status" and get something implying that the Xen MCE interface 
is working, instead of just "ras-mc-ctl: drivers not loaded."
    Somewhere it was said that adding the Xen boot parameter "mce=1" in GRUB 
would cause Xen to forward the info to the Linux kernel, but that conflicts 
with recent changes to the documentation.  I also tested setting "mce=1", and 
nothing appeared to change.


Any help is appreciated.
