[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2012-02-29 Thread che...@lemote.com
> On Tue, 2012-02-21 at 18:37 +0800, Chen Jie wrote:
>> On Feb 17, 2012, at 5:27 PM, Chen Jie wrote:
>> >> One good way to test the GART is to go over the GPU GART table and
>> >> write a dword using the GPU at the end of each page, something like
>> >> 0xCAFEDEAD or some value that is unlikely to be already set. Then go
>> >> over all the pages and check that the GPU writes succeeded. Abusing the
>> >> scratch register write-back feature is the easiest way to try that.
>> > I'm planning to add a GART table check procedure at resume, which
>> > will go over the GPU GART table:
>> > 1. read(backup) a dword at end of each GPU page
>> > 2. write a mark by GPU and check it
>> > 3. restore the original dword
>> The attached validateGART.patch does the job:
>> * It currently only works on the mips64 platform.
>> * To use it, apply all_in_vram.patch first, which will allocate the CP
>> ring, ih and ib in VRAM and hard-code no_wb=1.
>>
>> The GART test routine will be invoked in r600_resume. We've tried it,
>> and found that when the lockup happened the GART table was still good
>> before userspace restarted. The related dmesg follows:
>> [ 1521.820312] [drm] r600_gart_table_validate(): Validate GART Table
>> at 90004004, 32768 entries, Dummy
>> Page[0x0e004000-0x0e007fff]
>> [ 1522.019531] [drm] r600_gart_table_validate(): Sweep 32768
>> entries(valid=8544, invalid=24224, total=32768).
>> ...
>> [ 1531.156250] PM: resume of devices complete after 9396.588 msecs
>> [ 1532.152343] Restarting tasks ... done.
>> [ 1544.468750] radeon :01:05.0: GPU lockup CP stall for more than
>> 10003msec
>> [ 1544.472656] [ cut here ]
>> [ 1544.480468] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:243
>> radeon_fence_wait+0x25c/0x314()
>> [ 1544.488281] GPU lockup (waiting for 0x0002136B last fence id
>> 0x0002136A)
>> ...
>> [ 1544.886718] radeon :01:05.0: Wait for MC idle timedout !
>> [ 1545.046875] radeon :01:05.0: Wait for MC idle timedout !
>> [ 1545.062500] radeon :01:05.0: WB disabled
>> [ 1545.097656] [drm] ring test succeeded in 0 usecs
>> [ 1545.105468] [drm] ib test succeeded in 0 usecs
>> [ 1545.109375] [drm] Enabling audio support
>> [ 1545.113281] [drm] r600_gart_table_validate(): Validate GART Table
>> at 90004004, 32768 entries, Dummy
>> Page[0x0e004000-0x0e007fff]
>> [ 1545.125000] [drm:r600_gart_table_validate] *ERROR* Iter=0:
>> unexpected value 0x745aaad1(expect 0xDEADBEEF)
>> entry=0x0e008067, orignal=0x745aaad1
>> ...
>> /* System blocked here. */
>>
>> Any idea?
>
> I know lockups are frustrating; my only idea is that the memory controller
> locks up because of some failing PCI <-> system RAM transaction.
>
>>
>> BTW, we find the following in r600_pcie_gart_enable()
>> (drivers/gpu/drm/radeon/r600.c):
>> WREG32(VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR,
>> (u32)(rdev->dummy_page.addr >> 12));
>>
>> On our platform, PAGE_SIZE is 16K; is this a problem?
>
> No this should be handled properly.
>
>> Also in radeon_gart_unbind() and radeon_gart_restore(), the logic
>> should change to:
>>   for (j = 0; j < (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); j++, t++) {
>>   radeon_gart_set_page(rdev, t, page_base);
>> - page_base += RADEON_GPU_PAGE_SIZE;
>> + if (page_base != rdev->dummy_page.addr)
>> + page_base += RADEON_GPU_PAGE_SIZE;
>>   }
>> ???
>
> No need to do so, dummy page will be 16K too, so it's fine.
Really? When the CPU page is 16K and the GPU page is 4K, suppose the dummy
page is 0x8e004000; then there are four kinds of address in the GART:
0x8e004000, 0x8e005000, 0x8e006000, 0x8e007000. The value written into
VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is 0x8e004 (0x8e004000 >> 12). I
don't know how VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR works, but I
think 0x8e005000, 0x8e006000 and 0x8e007000 cannot be handled correctly.

>
> Cheers,
> Jerome
>
>

Huacai Chen



[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2012-02-29 Thread che...@lemote.com
> On Mon, 2012-02-27 at 10:44 +0800, Chen Jie wrote:
>> Hi,
>>
>> For this occasional GPU lockup when returning from STR/STD, I found the
>> following (when the problem happens):
>>
>> The value of SRBM_STATUS is either 0x20002040 or 0x20003040,
>> which means:
>> * HI_RQ_PENDING(There is a HI/BIF request pending in the SRBM)
>> * MCDW_BUSY(Memory Controller Block is Busy)
>> * BIF_BUSY(Bus Interface is Busy)
>> * MCDX_BUSY(Memory Controller Block is Busy), when the value is 0x20003040
>> Are MCDW_BUSY and MCDX_BUSY two memory channels? What is the
>> relationship among GART-mapped memory, on-board video memory, and
>> MCDX/MCDW?
>>
>> CP_STAT: the CSF_RING_BUSY is always set.
>
> Once the memory controller fails to do a PCI transaction, the CP
> will be stuck. At least if the ring is in system memory; if the ring is
> in VRAM the CP might be stuck too, because everything goes
> through the MC anyway.
>
I've tried the rs600 method for GPU reset (use rs600_bm_disable() to
disable the PCI MASTER bit and re-enable it after reset), but it doesn't
solve the problem. Then I found that r100_bm_disable() does more things,
e.g. writing the GPU register R_30_BUS_CNTL. In r600_reg.h there is
a register R600_BUS_CNTL; does this register have a similar function?
But I don't know how to use it...

Huacai Chen

>>
>> There are many CP_PACKET2(0x8000) entries in the CP ring (more than
>> three hundred), e.g.
>> r[131800]=0x00028000
>> r[131801]=0xc0016800
>> r[131802]=0x0140
>> r[131803]=0x79c5
>> r[131804]=0x304a
>> r[131805] ... r[132143]=0x8000
>> r[132144]=0x
>> After the first reset, the GPU will lock up again; this time, typically
>> there are 320 dwords in the CP ring -- 319 CP_PACKET2 with 0xc0033d00
>> at the end.
>> Is this normal?
>>
>> BTW, is there any way for X to switch to NOACCEL mode when the problem
>> happens? Then users would have a chance to save their documents and
>> reboot the machine.
>
> I have been meaning to patch the ddx to fall back to sw after a GPU lockup.
> But this is useless in today's world, where everything is composited, i.e.
> the screen is updated using the 3D driver, for which there is no easy
> way to suddenly migrate to software rendering. I will still probably
> do the ddx patch at some point.
>
> Cheers,
> Jerome
>
>




[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2011-12-07 Thread che...@lemote.com
When "MC timeout" happens at GPU reset, we found that the 12th and 13th
bits of R_000E50_SRBM_STATUS are 1. From the kernel code, these
two bits are defined as:
#define G_000E50_MCDX_BUSY(x)  (((x) >> 12) & 1)
#define G_000E50_MCDW_BUSY(x)  (((x) >> 13) & 1)

Could you please tell me what they mean? And if possible,
I want to know the functionality of these 5 registers in detail:
#define R_000E60_SRBM_SOFT_RESET   0x0E60
#define R_000E50_SRBM_STATUS   0x0E50
#define R_008020_GRBM_SOFT_RESET0x8020
#define R_008010_GRBM_STATUS0x8010
#define R_008014_GRBM_STATUS2   0x8014

A bit more info: if I reset the MC after resetting the CP (this is what
Linux 2.6.34 did, but it was removed in 2.6.35), then the "MC timeout"
disappears, but there is still a "ring test failed".

Huacai Chen

> 2011/11/8  :
>> And, I want to know something:
>> 1, Does GPU use MC to access GTT?
>
> Yes.  All GPU clients (display, 3D, etc.) go through the MC to access
> memory (vram or gart).
>
>> 2, What can cause MC timeout?
>
> Lots of things.  Some GPU client still active, some GPU client hung or
> not properly initialized.
>
> Alex
>
>>
>>> Hi,
>>>
>>> Some status update.
>>> On Sep 29, 2011, at 5:17 PM, Chen Jie wrote:
 Hi,
 Add more information.
 We got occasionally "GPU lockup" after resuming from suspend(on mipsel
 platform with a mips64 compatible CPU and rs780e, the kernel is
 3.1.0-rc8
 64bit).  Related kernel message:
 /* return from STR */
 [  156.152343] radeon :01:05.0: WB enabled
 [  156.187500] [drm] ring test succeeded in 0 usecs
 [  156.187500] [drm] ib test succeeded in 0 usecs
 [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
 [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
 [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
 [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
 [  156.597656] ata1.00: configured for UDMA/133
 [  156.613281] usb 1-5: reset high speed USB device number 4 using
 ehci_hcd
 [  157.027343] usb 3-2: reset low speed USB device number 2 using
 ohci_hcd
 [  157.609375] usb 3-3: reset low speed USB device number 3 using
 ohci_hcd
 [  157.683593] r8169 :02:00.0: eth0: link up
 [  165.621093] PM: resume of devices complete after 9679.556 msecs
 [  165.628906] Restarting tasks ... done.
 [  177.085937] radeon :01:05.0: GPU lockup CP stall for more than
 10019msec
 [  177.089843] [ cut here ]
 [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
 radeon_fence_wait+0x25c/0x33c()
 [  177.105468] GPU lockup (waiting for 0x13C3 last fence id
 0x13AD)
 [  177.113281] Modules linked in: psmouse serio_raw
 [  177.117187] Call Trace:
 [  177.121093] [] dump_stack+0x8/0x34
 [  177.125000] [] warn_slowpath_common+0x78/0xa0
 [  177.132812] [] warn_slowpath_fmt+0x38/0x44
 [  177.136718] [] radeon_fence_wait+0x25c/0x33c
 [  177.144531] [] ttm_bo_wait+0x108/0x220
 [  177.148437] []
 radeon_gem_wait_idle_ioctl+0x80/0x114
 [  177.156250] [] drm_ioctl+0x2e4/0x3fc
 [  177.160156] [] radeon_kms_compat_ioctl+0x28/0x38
 [  177.167968] [] compat_sys_ioctl+0x120/0x35c
 [  177.171875] [] handle_sys+0x118/0x138
 [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
 [  177.187500] radeon :01:05.0: GPU softreset
 [  177.191406] radeon :01:05.0:   R_008010_GRBM_STATUS=0xF57C2030
 [  177.195312] radeon :01:05.0:   R_008014_GRBM_STATUS2=0x0003
 [  177.203125] radeon :01:05.0:   R_000E50_SRBM_STATUS=0x20023040
 [  177.363281] radeon :01:05.0: Wait for MC idle timedout !
 [  177.367187] radeon :01:05.0:
 R_008020_GRBM_SOFT_RESET=0x7FEE
 [  177.390625] radeon :01:05.0:
 R_008020_GRBM_SOFT_RESET=0x0001
 [  177.414062] radeon :01:05.0:   R_008010_GRBM_STATUS=0xA0003030
 [  177.417968] radeon :01:05.0:   R_008014_GRBM_STATUS2=0x0003
 [  177.425781] radeon :01:05.0:   R_000E50_SRBM_STATUS=0x2002B040
 [  177.433593] radeon :01:05.0: GPU reset succeed
 [  177.605468] radeon :01:05.0: Wait for MC idle timedout !
 [  177.761718] radeon :01:05.0: Wait for MC idle timedout !
 [  177.804687] radeon :01:05.0: WB enabled
 [  178.00] [drm:r600_ring_test] *ERROR* radeon: ring test failed
 (scratch(0x8504)=0xCAFEDEAD)
>>> After pinning the ring in VRAM, it warned of an ib test failure. It
>>> seems something is wrong with accessing memory through the GTT.
>>>
>>> We dumped the GART table just after stopping the CP, compared it with
>>> the one dumped just after r600_pcie_gart_enable(), and didn't find any
>>> difference.
>>>
>>> Any idea?
>>>
 [  178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
 [ 

[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2011-12-08 Thread che...@lemote.com
Thank you for your reply.

I found that CP_RB_WPTR changed when the "ring test failed", so I think the
CP is active, but what it gets from the ring buffer is wrong. So I want to
know whether there is a way to check the content the GPU fetches from the
ring buffer.

BTW, when I use "echo shutdown > /sys/power/disk; echo disk >
/sys/power/state" to hibernate, there is occasionally a "GPU
reset", just as with suspend. However, if I use "echo reboot >
/sys/power/disk; echo disk > /sys/power/state" to hibernate and
wake up automatically, there is no "GPU reset" after hundreds of tests.
What does this imply? Does the power loss break something?

Best regards,

Huacai Chen


> 2011/12/7  :
>> When "MC timeout" happens at GPU reset, we found the 12th and 13th
>> bits of R_000E50_SRBM_STATUS is 1. From kernel code we found these
>> two bits are like this:
>> #define G_000E50_MCDX_BUSY(x)  (((x) >> 12) & 1)
>> #define G_000E50_MCDW_BUSY(x)  (((x) >> 13) & 1)
>>
>> Could you please tell me what does they mean? And if possible,
>
> They refer to sub-blocks in the memory controller.  I don't really
> know offhand what the names mean.
>
>> I want to know the functionalities of these 5 registers in detail:
>> #define R_000E60_SRBM_SOFT_RESET   0x0E60
>> #define R_000E50_SRBM_STATUS   0x0E50
>> #define R_008020_GRBM_SOFT_RESET0x8020
>> #define R_008010_GRBM_STATUS0x8010
>> #define R_008014_GRBM_STATUS2   0x8014
>>
>> A bit more info: If I reset the MC after resetting CP (this is what
>> Linux-2.6.34 does, but removed since 2.6.35), then "MC timeout" will
>> disappear, but there is still "ring test failed".
>
> The bits are defined in r600d.h.  As to the acronyms:
> BIF - Bus InterFace
> CG - clocks
> DC - Display Controller
> GRBM - Graphics block (3D engine)
> HDP - Host Data Path (CPU access to vram via the PCI BAR)
> IH, RLC - Interrupt controller
> MC - Memory controller
> ROM - ROM
> SEM - semaphore controller
>
> When you reset the MC, you will probably have to reset just about
> everything else since most blocks depend on the MC for access to
> memory.  If you do reset the MC, you should do it prior to calling
> asic_init so you make sure all the hw gets re-initialized properly.
> Additionally, you should probably reset the GRBM either via
> SRBM_SOFT_RESET or the individual sub-blocks via GRBM_SOFT_RESET.
>
> Alex
>
>>
>> Huacai Chen
>>
>>> 2011/11/8  :
 And, I want to know something:
 1, Does GPU use MC to access GTT?
>>>
>>> Yes.  All GPU clients (display, 3D, etc.) go through the MC to access
>>> memory (vram or gart).
>>>
 2, What can cause MC timeout?
>>>
>>> Lots of things.  Some GPU client still active, some GPU client hung or
>>> not properly initialized.
>>>
>>> Alex
>>>

> Hi,
>
> Some status update.
> On Sep 29, 2011, at 5:17 PM, Chen Jie wrote:
>> Hi,
>> Add more information.
>> We got occasionally "GPU lockup" after resuming from suspend(on
>> mipsel
>> platform with a mips64 compatible CPU and rs780e, the kernel is
>> 3.1.0-rc8
>> 64bit).  Related kernel message:
>> /* return from STR */
>> [  156.152343] radeon :01:05.0: WB enabled
>> [  156.187500] [drm] ring test succeeded in 0 usecs
>> [  156.187500] [drm] ib test succeeded in 0 usecs
>> [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
>> [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
>> [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
>> [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl
>> 300)
>> [  156.597656] ata1.00: configured for UDMA/133
>> [  156.613281] usb 1-5: reset high speed USB device number 4 using
>> ehci_hcd
>> [  157.027343] usb 3-2: reset low speed USB device number 2 using
>> ohci_hcd
>> [  157.609375] usb 3-3: reset low speed USB device number 3 using
>> ohci_hcd
>> [  157.683593] r8169 :02:00.0: eth0: link up
>> [  165.621093] PM: resume of devices complete after 9679.556 msecs
>> [  165.628906] Restarting tasks ... done.
>> [  177.085937] radeon :01:05.0: GPU lockup CP stall for more
>> than
>> 10019msec
>> [  177.089843] [ cut here ]
>> [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
>> radeon_fence_wait+0x25c/0x33c()
>> [  177.105468] GPU lockup (waiting for 0x13C3 last fence id
>> 0x13AD)
>> [  177.113281] Modules linked in: psmouse serio_raw
>> [  177.117187] Call Trace:
>> [  177.121093] [] dump_stack+0x8/0x34
>> [  177.125000] [] warn_slowpath_common+0x78/0xa0
>> [  177.132812] [] warn_slowpath_fmt+0x38/0x44
>> [  177.136718] [] radeon_fence_wait+0x25c/0x33c
>> [  177.144531] [] ttm_bo_wait+0x108/0x220
>> [  177.148437] []
>> radeon_gem_wait_idle_ioctl+0x80/0x114
>> [  177.156250]

[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2011-12-16 Thread che...@lemote.com
> On Thu, 2011-12-08 at 19:35 +0800, chenhc at lemote.com wrote:
>>
>> I found that CP_RB_WPTR changed when the "ring test failed", so I think
>> the CP is active, but what it gets from the ring buffer is wrong.
>
> CP_RB_WPTR is normally only changed by the CPU after adding commands to
> the ring buffer, so I'm afraid that may not be a valid conclusion.
>
>
I'm sorry, I made a spelling mistake. In fact, CP_RB_RPTR and
CP_RB_WPTR both changed, so I think the CP is active.

>> Then, I want to know whether there is a way to check the content that
>> GPU get from ring buffer.
>
> See the r100_debugfs_cp_csq_fifo() function, which generates the output
> for /sys/kernel/debug/dri/0/r100_cp_csq_fifo.
>
Hmmm, I don't think this function can be used for r600 (or easily adapted
into a similar one for R600), because I haven't found CSQ registers in the
r600 code.

>
>> BTW, when I use "echo shutdown > /sys/power/disk; echo disk >
>> /sys/power/state" to do a hibernation, there will be occasionally "GPU
>> reset" just like suspend. However, if I use "echo reboot >
>> /sys/power/disk; echo disk > /sys/power/state" to do a hibernation and
>> wakeup automatically, there is no "GPU reset" after hundreds of tests.
>> What does this imply? Power loss cause something break?
>
> Yeah, it sounds like the resume code doesn't properly re-initialize
> something that's preserved on a warm boot but lost on a cold boot.
>
>
> --
> Earthling Michel D




[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2011-11-08 Thread che...@lemote.com
And, I want to know something:
1, Does GPU use MC to access GTT?
2, What can cause MC timeout?

> Hi,
>
> Some status update.
> On Sep 29, 2011, at 5:17 PM, Chen Jie wrote:
>> Hi,
>> Add more information.
>> We got occasionally "GPU lockup" after resuming from suspend(on mipsel
>> platform with a mips64 compatible CPU and rs780e, the kernel is
>> 3.1.0-rc8
>> 64bit).  Related kernel message:
>> /* return from STR */
>> [  156.152343] radeon :01:05.0: WB enabled
>> [  156.187500] [drm] ring test succeeded in 0 usecs
>> [  156.187500] [drm] ib test succeeded in 0 usecs
>> [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
>> [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
>> [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
>> [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> [  156.597656] ata1.00: configured for UDMA/133
>> [  156.613281] usb 1-5: reset high speed USB device number 4 using
>> ehci_hcd
>> [  157.027343] usb 3-2: reset low speed USB device number 2 using
>> ohci_hcd
>> [  157.609375] usb 3-3: reset low speed USB device number 3 using
>> ohci_hcd
>> [  157.683593] r8169 :02:00.0: eth0: link up
>> [  165.621093] PM: resume of devices complete after 9679.556 msecs
>> [  165.628906] Restarting tasks ... done.
>> [  177.085937] radeon :01:05.0: GPU lockup CP stall for more than
>> 10019msec
>> [  177.089843] [ cut here ]
>> [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
>> radeon_fence_wait+0x25c/0x33c()
>> [  177.105468] GPU lockup (waiting for 0x13C3 last fence id
>> 0x13AD)
>> [  177.113281] Modules linked in: psmouse serio_raw
>> [  177.117187] Call Trace:
>> [  177.121093] [] dump_stack+0x8/0x34
>> [  177.125000] [] warn_slowpath_common+0x78/0xa0
>> [  177.132812] [] warn_slowpath_fmt+0x38/0x44
>> [  177.136718] [] radeon_fence_wait+0x25c/0x33c
>> [  177.144531] [] ttm_bo_wait+0x108/0x220
>> [  177.148437] []
>> radeon_gem_wait_idle_ioctl+0x80/0x114
>> [  177.156250] [] drm_ioctl+0x2e4/0x3fc
>> [  177.160156] [] radeon_kms_compat_ioctl+0x28/0x38
>> [  177.167968] [] compat_sys_ioctl+0x120/0x35c
>> [  177.171875] [] handle_sys+0x118/0x138
>> [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
>> [  177.187500] radeon :01:05.0: GPU softreset
>> [  177.191406] radeon :01:05.0:   R_008010_GRBM_STATUS=0xF57C2030
>> [  177.195312] radeon :01:05.0:   R_008014_GRBM_STATUS2=0x0003
>> [  177.203125] radeon :01:05.0:   R_000E50_SRBM_STATUS=0x20023040
>> [  177.363281] radeon :01:05.0: Wait for MC idle timedout !
>> [  177.367187] radeon :01:05.0:
>> R_008020_GRBM_SOFT_RESET=0x7FEE
>> [  177.390625] radeon :01:05.0: R_008020_GRBM_SOFT_RESET=0x0001
>> [  177.414062] radeon :01:05.0:   R_008010_GRBM_STATUS=0xA0003030
>> [  177.417968] radeon :01:05.0:   R_008014_GRBM_STATUS2=0x0003
>> [  177.425781] radeon :01:05.0:   R_000E50_SRBM_STATUS=0x2002B040
>> [  177.433593] radeon :01:05.0: GPU reset succeed
>> [  177.605468] radeon :01:05.0: Wait for MC idle timedout !
>> [  177.761718] radeon :01:05.0: Wait for MC idle timedout !
>> [  177.804687] radeon :01:05.0: WB enabled
>> [  178.00] [drm:r600_ring_test] *ERROR* radeon: ring test failed
>> (scratch(0x8504)=0xCAFEDEAD)
> After pinned ring in VRAM, it warned an ib test failure. It seems
> something wrong with accessing through GTT.
>
> We dump gart table just after stopped cp, and compare gart table with
> the dumped one just after r600_pcie_gart_enable, and don't find any
> difference.
>
> Any idea?
>
>> [  178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
>> [  178.988281] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't
>> schedule
>> IB(5).
>> [  178.996093] [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
>> [  179.003906] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't
>> schedule
>> IB(6).
>> ...
>
>
>
> Regards,
> -- Chen Jie
>




[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

2012-03-01 Thread che...@lemote.com
Status update:
In r600.c I found that for RS780 the num_*_threads are set like this:
sq_thread_resource_mgmt = (NUM_PS_THREADS(79) |
                           NUM_VS_THREADS(78) |
                           NUM_GS_THREADS(4) |
                           NUM_ES_THREADS(31));

But according to the documents, each of them should be a multiple of 4,
and in r600_blit_kms.c they are 136, 48, 4 and 4. I want to know why
79, 78, 4 and 31 are used here.

Huacai Chen

> On Wed, 2012-02-29 at 12:49 +0800, chenhc at lemote.com wrote:
>> > On Tue, 2012-02-21 at 18:37 +0800, Chen Jie wrote:
>> >> On Feb 17, 2012, at 5:27 PM, Chen Jie wrote:
>> >> >> One good way to test gart is to go over GPU gart table and write a
>> >> >> dword using the GPU at end of each page something like 0xCAFEDEAD
>> >> >> or somevalue that is unlikely to be already set. And then go over
>> >> >> all the page and check that GPU write succeed. Abusing the scratch
>> >> >> register write back feature is the easiest way to try that.
>> >> > I'm planning to add a GART table check procedure when resume, which
>> >> > will go over GPU gart table:
>> >> > 1. read(backup) a dword at end of each GPU page
>> >> > 2. write a mark by GPU and check it
>> >> > 3. restore the original dword
>> >> Attachment validateGART.patch do the job:
>> >> * It current only works for mips64 platform.
>> >> * To use it, apply all_in_vram.patch first, which will allocate CP
>> >> ring, ih, ib in VRAM and hard code no_wb=1.
>> >>
>> >> The gart test routine will be invoked in r600_resume. We've tried it,
>> >> and find that when lockup happened the gart table was good before
>> >> userspace restarting. The related dmesg follows:
>> >> [ 1521.820312] [drm] r600_gart_table_validate(): Validate GART Table
>> >> at 90004004, 32768 entries, Dummy
>> >> Page[0x0e004000-0x0e007fff]
>> >> [ 1522.019531] [drm] r600_gart_table_validate(): Sweep 32768
>> >> entries(valid=8544, invalid=24224, total=32768).
>> >> ...
>> >> [ 1531.156250] PM: resume of devices complete after 9396.588 msecs
>> >> [ 1532.152343] Restarting tasks ... done.
>> >> [ 1544.468750] radeon :01:05.0: GPU lockup CP stall for more than
>> >> 10003msec
>> >> [ 1544.472656] [ cut here ]
>> >> [ 1544.480468] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:243
>> >> radeon_fence_wait+0x25c/0x314()
>> >> [ 1544.488281] GPU lockup (waiting for 0x0002136B last fence id
>> >> 0x0002136A)
>> >> ...
>> >> [ 1544.886718] radeon :01:05.0: Wait for MC idle timedout !
>> >> [ 1545.046875] radeon :01:05.0: Wait for MC idle timedout !
>> >> [ 1545.062500] radeon :01:05.0: WB disabled
>> >> [ 1545.097656] [drm] ring test succeeded in 0 usecs
>> >> [ 1545.105468] [drm] ib test succeeded in 0 usecs
>> >> [ 1545.109375] [drm] Enabling audio support
>> >> [ 1545.113281] [drm] r600_gart_table_validate(): Validate GART Table
>> >> at 90004004, 32768 entries, Dummy
>> >> Page[0x0e004000-0x0e007fff]
>> >> [ 1545.125000] [drm:r600_gart_table_validate] *ERROR* Iter=0:
>> >> unexpected value 0x745aaad1(expect 0xDEADBEEF)
>> >> entry=0x0e008067, orignal=0x745aaad1
>> >> ...
>> >> /* System blocked here. */
>> >>
>> >> Any idea?
>> >
>> > I know lockup are frustrating, my only idea is the memory controller
>> > is lockup because of some failing pci <-> system ram transaction.
>> >
>> >>
>> >> BTW, we find the following in r600_pcie_gart_enable()
>> >> (drivers/gpu/drm/radeon/r600.c):
>> >> WREG32(VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR,
>> >> (u32)(rdev->dummy_page.addr >> 12));
>> >>
>> >> On our platform, PAGE_SIZE is 16K, does it have any problem?
>> >
>> > No this should be handled properly.
>> >
>> >> Also in radeon_gart_unbind() and radeon_gart_restore(), the logic
>> >> should change to:
>> >>   for (j = 0; j < (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); j++, t++) {
>> >>   radeon_gart_set_page(rdev, t, page_base);
>> >> - page_base += RADEON_GPU_PAGE_SIZE;
>> >> + if (page_base != rdev->dummy_page.addr)
>> >> + page_base += RADEON_GPU_PAGE_SIZE;
>> >>   }
>> >> ???
>> >
>> > No need to do so, dummy page will be 16K too, so it's fine.
>> Really? When CPU page is 16K and GPU page is 4k, suppose the dummy page
>> is 0x8e004000, then there are four types of address in GART:0x8e004000,
>> 0x8e005000, 0x8e006000, 0x8e007000. The value which written in
>> VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is 0x8e004 (0x8e004000 >> 12). I
>> don't know how VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR works, but I
>> think 0x8e005000, 0x8e006000 and 0x8e007000 cannot be handled correctly.
>
> When radeon_gart_unbind initializes a GART entry to point to the dummy
> page, it's just to have something safe in the GART table.
>
> VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is the page address used when
> a fault happens. It's like a sandbox for the MC. It doesn't
> conflict in any way to have GART table entries point to