On Thu, 26 Sep 2013, Sherry Hurwitz wrote:
> We have failed to reproduce a hang while loading microcode.

I got an offer from a Debian user to test it over the weekend; let's hope
he has more luck(?) at hitting the issue.  If he does, we should get
sysrq+t dumps of the hung system.
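
For anyone else who manages to reproduce it, something along these lines
should be enough to capture the dump.  This is only a sketch: the helper
name dump_task_states is made up for illustration, and the path argument
exists only so it can be dry-run against an ordinary file.  The real
trigger file needs root.

```shell
# Hypothetical helper: ask the kernel to dump all task states (sysrq+t).
# The trigger path is a parameter only so the function can be dry-run
# against a regular file; writing to the real one requires root.
dump_task_states() {
    trigger="${1:-/proc/sysrq-trigger}"
    if [ ! -w "$trigger" ]; then
        echo "cannot write to $trigger (need root?)" >&2
        return 1
    fi
    # 't' requests the task-state dump; the output goes to the kernel log
    echo t > "$trigger"
}
```

Run it from a second shell while the script is stuck, then save the
kernel log, e.g. "dump_task_states && dmesg > sysrq-t.txt".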

> We have tested with kernel and AMD family combinations with
> normal and error condition so error paths were taken.  Obviously
> there are factors we are missing that the users are hitting.

Yeah, and it is not likely to be caused by a distro kernel patch, as the
users hit the issue using non-distro kernels :-(

Maybe it is on the firmware-loader side, but one user did wait 1 hour for
the thing to get unstuck, and that would have taken care of any possible
firmware-loader timeouts.
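
To double-check that reasoning on an affected box, one could compare the
time waited against the configured firmware-loader timeout.  A rough
sketch follows.  I am assuming the timeout is exposed in seconds at
/sys/class/firmware/timeout (only present when the userspace helper is
built in) and that 0 means "no timeout"; the function name is made up.

```shell
# Hypothetical check: has the wait exceeded the firmware-loader timeout?
# Assumptions: the timeout is in seconds in /sys/class/firmware/timeout,
# and a value of 0 means the loader never times out.
fw_timeout_elapsed() {
    waited="$1"
    timeout_file="${2:-/sys/class/firmware/timeout}"
    # return 2 when the file is missing (e.g. no userspace helper support)
    [ -r "$timeout_file" ] || return 2
    timeout=$(cat "$timeout_file")
    [ "$timeout" -gt 0 ] && [ "$waited" -gt "$timeout" ]
}
```

With a one-hour wait that would be "fw_timeout_elapsed 3600", which
returns 0 if a finite timeout should already have fired.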

> Any suggestions on how we improve the test matrix would be
> helpful.  We will continue the investigation but any insights are appreciated.
> 
> NOTE: kernels before 3.0 only load 1 (2k) size of microcode patch and
> therefore do not support microcode loading of family 14h, 15h, and 16h.
> Also, in a test request on another thread you suggested someone with
> family 15h revC0 to load microcode twice with an earlier patch and then
> the latest, but there has only been 1 microcode patch level published for
> revB2 so that test won't work.

Well, it is the only thing I could think of, other than some nasty race
condition...

> kernel    cpu family      results       conditions
> ---------------------------------------------------------------------------
> 2.6.38    fam10h          load passed   normal
> 2.6.38    fam15h revC0    load failed   2.6.38 cannot handle 4k patches
> 3.5.2     fam10h          load passed   normal
> 3.5.2     fam15h revB2    load passed   loaded 637 then second load 63d
> 3.5.2     fam15h revC0    load passed   normal
> 3.5.2     fam15h revC0    load failed   used a corrupted bin file

I just looked, and the 2.6.38 hang happened on i686 with an unidentified
3-core AMD processor, and the 3.5.2 hang on x86-64 PREEMPT, on a fam15h
model 2 stepping 0, 32-core AMD processor (Linux 3.5.2 (SMP w/32 CPU cores;
PREEMPT)).  No patterns there.
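
Given the quoted note that kernels before 3.0 only handle the 2k patch
size, a test harness could at least separate "cannot load at all" from
"hangs" by guarding on the kernel version first.  A minimal sketch, with a
made-up function name, that only looks at the major version number:

```shell
# Hypothetical guard: per the note quoted above, kernels before 3.0 only
# load the 2k patch size, so the larger patches are not expected to load.
# Takes a version string such as "3.5.2"; defaults to the running kernel.
kernel_handles_large_patches() {
    version="${1:-$(uname -r)}"
    major="${version%%.*}"    # text before the first dot
    [ "$major" -ge 3 ]
}
```

For example, "kernel_handles_large_patches 2.6.38" returns non-zero, and
the same call with 3.5.2 returns zero.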

BTW, the userspace script that users reported to have hung is this:

grep -q "^vendor_id[[:blank:]]*:[[:blank:]]*.*AuthenticAMD" /proc/cpuinfo && {
if modprobe -q --first-time microcode ; then
    echo "Updating microcode on all online processors..." >&2
else
    # we have to trigger the microcode update manually
    if [ -e /sys/devices/system/cpu/microcode/reload ] ; then
        echo "Updating microcode on all online processors..." >&2
        echo 1 > /sys/devices/system/cpu/microcode/reload || {
            echo "Kernel reported failure while updating microcode!" >&2
        }
    else
        # Try all online processors, broken kernels need this,
        # fixed kernels will accept it only on the BSP and update
        # all processors anyway, and -EINVAL all others... but we
        # don't know which one is the BSP, so we try all of them
        # and hide errors, the kernel will log any real problem.
        echo "Using per-core interface to update microcode on online processors..." >&2
        find /sys/devices/system/cpu -noleaf -type f \
            -path '/sys/devices/system/cpu/cpu*/microcode/reload' | \
            while read i ; do echo -n 1 2>/dev/null >"$i" || true ; done
    fi
fi
}


The hangs were reported with the microcode driver already loaded, so that
modprobe line fails and the script goes on to the manual reload branch.
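
It would help to know which of the two reload interfaces the script ended
up using on the affected systems.  The following sketch mirrors the checks
in the script without writing anything; the function name and the optional
root argument are my own additions so it can be tried against a scratch
directory.

```shell
# Hypothetical probe: report which reload interface the script above would
# use, without triggering an update. Mirrors the script's checks: the
# system-wide reload file first, then the per-CPU ones.
pick_reload_interface() {
    cpu_root="${1:-/sys/devices/system/cpu}"
    if [ -e "$cpu_root/microcode/reload" ]; then
        echo global
        return 0
    fi
    for f in "$cpu_root"/cpu*/microcode/reload; do
        # an unmatched glob stays literal, so -e is false in that case
        if [ -e "$f" ]; then
            echo per-cpu
            return 0
        fi
    done
    echo none
    return 1
}
```

Including the output of "pick_reload_interface" in a bug report would tell
us whether the hang is tied to one interface or the other.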

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh