On Sat, Feb 14, 2015 at 11:18:40AM +0800, Daniel J Blueman wrote:
> When ECC interrupts occur on memory controllers after EDAC_MAX_MCS (16), the

I knew this artificial limit would come back to bite us someday :-\

> kernel fatally dereferences unallocated structures [1]; this occurs on at
> least NumaConnect systems.
> 
> Minimally fix by checking if a memory controller info structure is allocated;
> candidate for stable.
> 
> Signed-off-by: Daniel J Blueman <dan...@numascale.com>
> 
> -- [1]
> 
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000320
> IP: [<ffffffff819f714f>] decode_bus_error+0x2f/0x2b0
> PGD 2f8b5a3067 PUD 2f8b5a2067 PMD 0
> Oops: 0000 [#2] SMP
> Modules linked in:
> CPU: 224 PID: 11930 Comm: stream_c.exe.gn Tainted: G   D    3.19.0 #1

CPU 224?! What node is that? :)

> ---
>  drivers/edac/amd64_edac.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 17638d7..baccc0e 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -2175,7 +2175,7 @@ static void __log_bus_error(struct mem_ctl_info *mci, struct err_info *err,
>  static inline void decode_bus_error(int node_id, struct mce *m)
>  {
>       struct mem_ctl_info *mci = mcis[node_id];
> -     struct amd64_pvt *pvt = mci->pvt_info;
> +     struct amd64_pvt *pvt;
>       u8 ecc_type = (m->status >> 45) & 0x3;
>       u8 xec = XEC(m->status, 0x1f);
>       u16 ec = EC(m->status);
> @@ -2190,6 +2190,11 @@ static inline void decode_bus_error(int node_id, struct mce *m)
>       if (xec && xec != F10_NBSL_EXT_ERR_ECC)
>               return;
>  
> +     /* Unable to decode on memory controllers after EDAC_MAX_MCS, as no mci is allocated */
> +     if (!mci)
> +             return;
> +     pvt = mci->pvt_info;

Hmm, so we have all the facilities to fix that properly, IINM:
edac_mc_find(), add_mc_to_global_list() and so on.

Would looking through the list of the memory controllers help instead,
i.e. if you do:

static inline void decode_bus_error(int node_id, struct mce *m)
{
        struct mem_ctl_info *mci = edac_mc_find(node_id);
        if (!mci)
                return;

?

Then we can get rid of that local mcis dumbness and do it properly...
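
To make the idea concrete, here is a rough userspace sketch of a list-based
lookup replacing the fixed-size array. It is not the actual EDAC code: the
structs are cut-down stand-ins, and add_mc() and find_mc() are hypothetical
helpers that only mimic what add_mc_to_global_list() and edac_mc_find() are
assumed to do. decode_bus_error() drops the struct mce argument and returns
an int here purely so the outcome is visible.

```c
#include <stddef.h>

/* Hypothetical, stripped-down stand-ins for the kernel structures. */
struct amd64_pvt {
	int dummy;
};

struct mem_ctl_info {
	int mc_idx;                   /* memory controller (node) index */
	struct amd64_pvt *pvt_info;
	struct mem_ctl_info *next;    /* stand-in for the kernel list linkage */
};

/* Global list of registered controllers; no EDAC_MAX_MCS-style cap. */
static struct mem_ctl_info *mc_list;

/* Hypothetical helper: register a controller on the global list. */
static void add_mc(struct mem_ctl_info *mci)
{
	mci->next = mc_list;
	mc_list = mci;
}

/* Hypothetical helper: walk the list and return the matching controller. */
static struct mem_ctl_info *find_mc(int idx)
{
	struct mem_ctl_info *mci;

	for (mci = mc_list; mci; mci = mci->next)
		if (mci->mc_idx == idx)
			return mci;

	return NULL;
}

/* Returns 1 if the error could be attributed to a controller, 0 if not. */
static int decode_bus_error(int node_id)
{
	struct mem_ctl_info *mci = find_mc(node_id);
	struct amd64_pvt *pvt;

	/* Unknown node: bail out instead of dereferencing NULL. */
	if (!mci)
		return 0;

	pvt = mci->pvt_info;
	(void)pvt;                    /* real decoding would use pvt here */

	return 1;
}
```

With something like this, a node index above 16 is just another list entry,
and an unregistered node is skipped quietly instead of oopsing.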

Thanks.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.