On Thursday, 6 June 2019 10:07:54 PM AEST Oliver wrote:
> On Thu, Jun 6, 2019 at 5:17 PM Alistair Popple <alist...@popple.id.au> wrote:
> > I have been hitting EEH address errors testing this with some network
> > cards which map/unmap DMA addresses more frequently. For example:
> >
> > PHB4 PHB#5 Diag-data (Version: 1)
> > brdgCtl:     00000002
> > RootSts:     00060020 00402000 a0220008 00100107 00000800
> > PhbSts:      0000001c00000000 0000001c00000000
> > Lem:         0000000100000080 0000000000000000 0000000000000080
> > PhbErr:      0000028000000000 0000020000000000 2148000098000240 a008400000000000
> > RxeTceErr:   2000000000000000 2000000000000000 c000000000000000 0000000000000000
> > PblErr:      0000000000020000 0000000000020000 0000000000000000 0000000000000000
> > RegbErr:     0000004000000000 0000004000000000 61000c4800000000 0000000000000000
> > PE[000] A/B: 8300b03800000000 8000000000000000
> >
> > Interestingly the PE[000] A/B data is the same across different cards
> > and drivers.
>
> TCE page fault due to permissions so odds are the DMA address was unmapped.
>
> What cards did you get this with? I tried with one of the common
> BCM5719 NICs and generated network traffic by using rsync to copy a
> linux git tree to the system and it worked fine.

Personally I've seen it with the BCM5719 with the driver modified to set a
DMA mask of 48 bits instead of 64 and using scp to copy a random 1GB file to
the system repeatedly until it crashes.

I have also had reports of someone hitting the same error using a Mellanox
CX-5 adaptor with a similar driver modification.

- Alistair