I think sb_edac.c (and probably other EDAC stuff) lacks PCI domain
support.  I notice messages like this:

  [   14.370256] pci 0000:ff:13.5: [8086:6fad] type 00 class 0x088000
  [   14.980481] pci 0000:bf:13.5: [8086:6fad] type 00 class 0x088000
  [   15.590646] pci 0000:7f:13.5: [8086:6fad] type 00 class 0x088000
  [   16.200498] pci 0000:3f:13.5: [8086:6fad] type 00 class 0x088000
  [   17.928243] pci 0001:ff:13.5: [8086:6fad] type 00 class 0x088000
  [   18.538876] pci 0001:bf:13.5: [8086:6fad] type 00 class 0x088000
  [   19.149211] pci 0001:7f:13.5: [8086:6fad] type 00 class 0x088000
  [   19.759431] pci 0001:3f:13.5: [8086:6fad] type 00 class 0x088000
  ...
  [   54.298058] EDAC sbridge: Duplicated device for 8086:6fad
  [   54.298062] EDAC sbridge: Failed to register device with error -19.

on a large system (see [1]).  It looks like sbridge_get_onedevice()
looks up things based on the PCI bus number, but it ignores the PCI
domain (aka segment) number, and I suspect it thinks 0000:ff:13.5 and
0001:ff:13.5 are duplicates.

  sbridge_get_all_devices
    while (...)
      do
        sbridge_get_onedevice
          pdev = pci_get_device(...)
          sbridge_dev = get_sbridge_dev(pdev->bus->number, ...)
          if (sbridge_dev->pdev[sbridge_dev->i_devs])
            printk("Duplicated device ...")
            return -ENODEV                   # -19
      while (pdev ...)

It looks like 88ae80aa609c ("EDAC, skx_edac: Handle systems with
segmented PCI busses") fixes a similar problem; maybe that should
be applied elsewhere in EDAC as well?

Why doesn't EDAC use the standard pci_register_driver() interface?
That would avoid issues like this.  It would also avoid the potential
conflict of another driver operating on the device at the same time.

[1] https://bugzilla.kernel.org/attachment.cgi?id=277759 (attachment
to unrelated bug https://bugzilla.kernel.org/show_bug.cgi?id=200765)

Reply via email to