On 6/18/19 1:39 AM, Alexey Kardashevskiy wrote:
On 18/06/2019 14:26, Shawn Anastasio wrote:
On 6/12/19 2:15 PM, Shawn Anastasio wrote:
On 6/12/19 2:07 AM, Alexey Kardashevskiy wrote:
On 12/06/2019 15:05, Shawn Anastasio wrote:
On 6/5/19 11:11 PM, Shawn Anastasio wrote:
On 5/30/19 2:03 AM, Alexey Kardashevskiy wrote:
This is an attempt to allow DMA masks between 32..59 which are not large
enough to use either a PHB3 bypass mode or a sketchy bypass. Depending
on the max order, up to 40 is usually available.
This is based on v5.2-rc2.
Please comment. Thanks.
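For readers less familiar with DMA masks, here is a minimal sketch of what an n-bit mask looks like numerically. `dma_bit_mask` mirrors the arithmetic of the kernel's `DMA_BIT_MASK(n)` macro; `in_targeted_range` is a hypothetical helper, and treating the 32..59 range as inclusive on both ends is an assumption, not something the cover letter states.

```python
def dma_bit_mask(bits):
    """Return a mask with the low `bits` bits set, like the kernel's DMA_BIT_MASK(n)."""
    if not 1 <= bits <= 64:
        raise ValueError("bits must be in 1..64")
    return (1 << bits) - 1


def in_targeted_range(bits):
    """Hypothetical helper: is this width in the 32..59 range from the cover letter?

    Assumption: both endpoints are treated as inclusive.
    """
    return 32 <= bits <= 59


# Widths mentioned in this thread: 32, 40 and 42 bits, plus full 64-bit.
for bits in (32, 40, 42, 64):
    print(f"{bits}-bit mask: {dma_bit_mask(bits):#018x} "
          f"in 32..59 range: {in_targeted_range(bits)}")
```

A device limited to 40 or 42 bits falls inside that range, while a full 64-bit mask does not.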
I have tested this patch set with an AMD GPU that's limited to <64bit
DMA (I believe it's 40 or 42 bit). It successfully allows the card to
operate without falling back to 32-bit DMA mode as it does without
the patches.
Relevant kernel log message:
```
[ 0.311211] pci 0033:01 : [PE# 00] Enabling 64-bit DMA bypass
```
Tested-by: Shawn Anastasio <sh...@anastas.io>
After a few days of further testing, I've started to run into stability
issues with the patch applied and used with an AMD GPU. Specifically,
the system sometimes spontaneously crashes. These are not just EEH errors either;
the whole system shuts down in what looks like a checkstop.
Perhaps some subtle corruption is occurring?
Have you tried this?
https://patchwork.ozlabs.org/patch/1113506/
I have not. I'll give it a shot and try it out for a few days to see
if I'm able to reproduce the crashes.
A few days later, I was able to reproduce the checkstop while
watching a video in mpv. At this point the system had ~4 days of
uptime, and this wasn't the first video I had watched during that time.
This is with https://patchwork.ozlabs.org/patch/1113506/ applied, too.
Any logs left? What was the reason for the checkstop and what is the
hardware? "lscpu" and "lspci -vv" would help for starters. Thanks,
The machine is a Talos II with 2x 8 core DD2.2 Sforza modules.
I've added the output of lscpu and lspci below. As for logs,
it doesn't seem there are any kernel logs of the event.
The opal-gard utility shows some error records which I have
also included below.
opal-gard:
```
$ sudo ./opal-gard show 1
Record ID: 0x00000001
========================
Error ID: 0x9000000b
Error Type: Fatal (0xe3)
Path Type: physical
>Sys, Instance #0
>Node, Instance #0
>Proc, Instance #1
>EQ, Instance #0
>EX, Instance #0
$ sudo ./opal-gard show 2
Record ID: 0x00000002
========================
Error ID: 0x90000021
Error Type: Fatal (0xe3)
Path Type: physical
>Sys, Instance #0
>Node, Instance #0
>Proc, Instance #1
>EQ, Instance #2
>EX, Instance #1
```
lscpu:
```
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 52
On-line CPU(s) list: 0-3,8-31,36-47,52-63
Thread(s) per core: 4
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Model: 2.2 (pvr 004e 1202)
Model name: POWER9, altivec supported
CPU max MHz: 3800.0000
CPU min MHz: 2154.0000
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 10240K
NUMA node0 CPU(s): 0-3,8-31
NUMA node8 CPU(s): 36-47,52-63
```
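As a reading aid for the lscpu output above, the following sketch expands the "On-line CPU(s) list" string and lists the IDs missing from it. `expand_cpu_list` is a hypothetical helper, and the assumption that CPU IDs run from 0 to 63 is inferred only from the highest ID in the list. It makes no claim about why those IDs are absent.

```python
def expand_cpu_list(spec):
    """Expand a list such as '0-3,8-31' into individual CPU IDs."""
    cpus = []
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        # A bare number like '5' has no '-' and is a range of one.
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


# Copied from the "On-line CPU(s) list" line of the lscpu output.
online = expand_cpu_list("0-3,8-31,36-47,52-63")

# Assumption: IDs span 0..63, since 63 is the highest ID listed.
missing = sorted(set(range(64)) - set(online))

print(len(online))  # 52, matching the "CPU(s): 52" line
print(missing)      # IDs 4-7, 32-35 and 48-51
```

The missing IDs form three groups of four, which lines up with the reported 4 threads per core.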
lspci -vv:
Output at: https://upaste.anastas.io/IwVQzt