Hi chaps, On 9 August 2010 14:41, Daniel J Blueman <daniel.blueman at gmail.com> wrote: > On 8 August 2010 17:36, Daniel J Blueman <daniel.blueman at gmail.com> wrote: >> On 5 August 2010 16:23, Alex Deucher <alexdeucher at gmail.com> wrote: >>> On Thu, Aug 5, 2010 at 9:50 AM, Jerome Glisse <glisse at freedesktop.org> >>> wrote: >>>> On 08/05/2010 06:52 AM, Daniel J Blueman wrote: >>>>> >>>>> After around 10-30 mins of typical usage (ie cairo rendering to >>>>> surfaces + compiz) on my Dell Studio 15 (model 1557), my R600 radeon >>>>> locks up, giving a blank screen. >>>>> >>>>> The only information I have is a backtrace [1]. 'radeontool regs' >>>>> doesn't give anything - what information, /sys files etc would be >>>>> useful here, and would it be useful to log a freedesktop.org bug >>>>> report at this stage, assuming I haven't found a reliable reproducer? >>>>> >>>>> OS is Ubuntu 10.10 development release, mesa 7.8.2, libdrm2 2.4.21. >>>>> >>>>> Thanks, >>>>> ? Daniel >>>>> >>>>> --- [1] >>>>> >>>>> radeon 0000:02:00.0: GPU lockup CP stall for more than 1000msec >>>>> ------------[ cut here ]------------ >>>>> WARNING: at >>>>> /home/kernel-ppa/COD/linux/drivers/gpu/drm/radeon/radeon_fence.c:235 >>>>> radeon_fence_wait+0x2b7/0x320 [radeon]() >>>>> Hardware name: Studio 1557 >>>>> GPU lockup (waiting for 0x00008C35 last fence id 0x00008C33) >>>>> Modules linked in: binfmt_misc parport_pc ppdev kvm_intel kvm >>>>> microcode snd_hda_codec_atihdmi joydev ipt_REJECT xt_limit xt_tcpudp >>>>> ipt_addrtype xt_state snd_hda_codec_idt ip6table_filter ip6_tables >>>>> snd_hda_intel nf_nat_irc nf_conntrack_irc snd_hda_codec nf_nat_ftp >>>>> nf_nat snd_hwdep nf_conntrack_ipv4 snd_seq_midi arc4 snd_pcm >>>>> snd_rawmidi nf_defrag_ipv4 radeon nf_conntrack_ftp nf_conntrack >>>>> iptable_filter snd_seq_midi_event snd_seq ip_tables ttm iwlagn video >>>>> snd_timer uvcvideo x_tables snd_seq_device output drm_kms_helper lp >>>>> dell_laptop iwlcore drm videodev dcdbas dell_wmi v4l1_compat mac80211 >>>>> i7core_edac parport v4l2_compat_ioctl32 snd psmouse edac_core >>>>> i2c_algo_bit soundcore cfg80211 snd_page_alloc serio_raw sdhci_pci >>>>> sdhci ahci libahci led_class r8169 mii btrfs zlib_deflate crc32c >>>>> libcrc32c >>>>> Pid: 1624, comm: Xorg Tainted: G ? ? ?D ? ? 2.6.35-020635-generic #020635 >>>>> Call Trace: >>>>> ? radeon_fence_wait+0x2b7/0x320 [radeon] >>>>> warn_slowpath_common+0x90/0xc0 >>>>> warn_slowpath_fmt+0x6e/0x70 >>>>> ? schedule_timeout+0x15a/0x2e0 >>>>> ? r600_irq_set+0x27d/0xc00 [radeon] >>>>> ? radeon_ring_commit+0xa3/0xb0 [radeon] >>>>> ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon] >>>>> ? r600_gpu_is_lockup+0x1cb/0x220 [radeon] >>>>> radeon_fence_wait+0x2b7/0x320 [radeon] >>>>> ? autoremove_wake_function+0x0/0x40 >>>>> radeon_sync_obj_wait+0x11/0x20 [radeon] >>>>> ttm_bo_wait+0x102/0x1b0 [ttm] >>>>> ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] >>>>> radeon_move_blit+0x124/0x170 [radeon] >>>>> radeon_bo_move+0xda/0x1a0 [radeon] >>>>> ttm_bo_handle_move_mem+0xec/0x370 [ttm] >>>>> ttm_bo_evict+0x1cc/0x270 [ttm] >>>>> ? drm_mm_split_at_start+0x1d/0x80 [drm] >>>>> ttm_mem_evict_first+0xed/0x180 [ttm] >>>>> ? ttm_bo_man_get_node+0xd5/0xe0 [ttm] >>>>> ttm_bo_mem_force_space+0xab/0x110 [ttm] >>>>> ttm_bo_mem_space+0x305/0x370 [ttm] >>>>> ttm_bo_move_buffer+0xcf/0x140 [ttm] >>>>> ? drm_mm_split_at_start+0x1d/0x80 [drm] >>>>> ttm_bo_validate+0xd5/0x100 [ttm] >>>>> ttm_bo_init+0x1f2/0x240 [ttm] >>>>> radeon_bo_create+0x121/0x240 [radeon] >>>>> ? radeon_ttm_bo_destroy+0x0/0x80 [radeon] >>>>> radeon_gem_object_create+0x89/0x100 [radeon] >>>>> ? drm_gem_handle_create+0x8b/0xa0 [drm] >>>>> radeon_gem_create_ioctl+0x58/0xe0 [radeon] >>>>> drm_ioctl+0x283/0x460 [drm] >>>>> ? radeon_gem_create_ioctl+0x0/0xe0 [radeon] >>>>> ? do_readv_writev+0x16b/0x1e0 >>>>> vfs_ioctl+0x3a/0xc0 >>>>> do_vfs_ioctl+0x6d/0x1f0 >>>>> sys_ioctl+0x87/0xa0 >>>>> ? sys_writev+0x55/0xb0 >>>>> system_call_fastpath+0x16/0x1b >>>>> ---[ end trace c0bc12025fa8386c ]--- >>>>> radeon 0000:02:00.0: GPU softreset >>>>> radeon 0000:02:00.0: ? R_008010_GRBM_STATUS=0xA0003028 >>>>> radeon 0000:02:00.0: ? R_008014_GRBM_STATUS2=0x00000002 >>>>> radeon 0000:02:00.0: ? R_000E50_SRBM_STATUS=0x200000C0 >>>>> radeon 0000:02:00.0: ? R_008020_GRBM_SOFT_RESET=0x00007FEE >>>>> radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001 >>>>> radeon 0000:02:00.0: ? R_008010_GRBM_STATUS=0x00003028 >>>>> radeon 0000:02:00.0: ? R_008014_GRBM_STATUS2=0x00000002 >>>>> radeon 0000:02:00.0: ? R_000E50_SRBM_STATUS=0x200000C0 >>>>> radeon 0000:02:00.0: GPU reset succeed >>>>> Clocks initialized ! >>>>> ring test succeeded in 1 usecs >>>>> ib test succeeded in 1 usecs >>>> >>>> SO it keep reseting ? According to log there is a GPU lockup but then >>>> a successfull GPU reset so GPU should resume fine (that's what log says) >>>> Best is to open a bug and attach full dmesg + lspci -vv and context >>>> in which the lockup happen >>> >>> Perhaps the lockup timeout is too short or interrupts are being >>> delivered late or something? ?There's a fdo or kernel bugzilla entry >>> where removing the lockup timeout fixed the issue. ?Does something >>> like this: >>> >>> --- a/drivers/gpu/drm/radeon/radeon_fence.c >>> +++ b/drivers/gpu/drm/radeon/radeon_fence.c >>> @@ -237,9 +237,11 @@ retry: >>> ? ? ? ? ? ? ? ? ? ? ? ? * as signaled for now >>> ? ? ? ? ? ? ? ? ? ? ? ? */ >>> ? ? ? ? ? ? ? ? ? ? ? ?rdev->gpu_lockup = true; >>> +#if 0 >>> ? ? ? ? ? ? ? ? ? ? ? ?r = radeon_gpu_reset(rdev); >>> ? ? ? ? ? ? ? ? ? ? ? ?if (r) >>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?return r; >>> +#endif >>> ? ? ? ? ? ? ? ? ? ? ? ?WREG32(rdev->fence_drv.scratch_reg, fence->seq); >>> ? ? ? ? ? ? ? ? ? ? ? ?rdev->gpu_lockup = false; >>> ? ? ? ? ? ? ? ?} >>> >>> fix the issue? >> >> Correction, RV710. >> >> Well, RADEON_FENCE_JIFFIES_TIMEOUT plus 1000ms is already >> excruciatingly long for the command processor to not have responded, >> but the issue I'm really hitting here is that X if left hanging after >> GPU reset. The GPU lockup was being reported at: >> >> [<ffffffffa03b66d7>] ? radeon_fence_wait+0x2b7/0x320 [radeon] >> [<ffffffff81061370>] warn_slowpath_common+0x90/0xc0 >> [<ffffffff8106149e>] warn_slowpath_fmt+0x6e/0x70 >> [<ffffffff81575f8a>] ? schedule_timeout+0x15a/0x2e0 >> [<ffffffffa03f5a9d>] ? r600_irq_set+0x27d/0xc00 [radeon] >> [<ffffffffa03ce1f3>] ? radeon_ring_commit+0xa3/0xb0 [radeon] >> [<ffffffffa03d5412>] ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon] >> [<ffffffffa040063b>] ? r600_gpu_is_lockup+0x1cb/0x220 [radeon] >> [<ffffffffa03b66d7>] radeon_fence_wait+0x2b7/0x320 [radeon] >> [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40 >> [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon] >> [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm] >> [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] >> [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon] >> >> X consistently hangs after GPU reset at (with sysrq-T): >> >> [<ffffffff81576838>] __mutex_lock_slowpath+0xe8/0x170 >> [<ffffffff8157647b>] mutex_lock+0x2b/0x50 >> [<ffffffffa0415aba>] radeon_pm_set_clocks+0x3a/0x2d0 [radeon] >> [<ffffffff812abc51>] ? idr_get_new_above+0x11/0x40 >> [<ffffffffa01679ab>] ? drm_mode_object_get+0x6b/0xc0 [drm] >> [<ffffffff811410ae>] ? kmem_cache_alloc_notrace+0x8e/0xd0 >> [<ffffffffa041662d>] radeon_pm_compute_clocks+0xcd/0x240 [radeon] >> [<ffffffffa03a9d21>] atombios_crtc_mode_fixup+0x31/0x60 [radeon] >> [<ffffffffa02ba13a>] drm_crtc_helper_set_mode+0x16a/0x370 [drm_kms_helper] >> [<ffffffffa02ba3bd>] drm_helper_resume_force_mode+0x7d/0x150 [drm_kms_helper] >> [<ffffffffa03a2f35>] radeon_gpu_reset+0x85/0xd0 [radeon] >> [<ffffffffa03b66e7>] radeon_fence_wait+0x2c7/0x320 [radeon] >> [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40 >> [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon] >> [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm] >> [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] >> [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon] >> >> I couldn't reproduce it (but did hit it a number of times over the >> last two weeks), so have been analysing with what we have. The kernel >> the backtrace is against isn't built with debug, so: >> >> radeon_pm_set_clocks(): >> ? 8ca80: ? ? ? 55 ? ? ? ? ? ? ? ? ? ? ?push ? %rbp >> ? 8ca81: ? ? ? 48 89 e5 ? ? ? ? ? ? ? ?mov ? ?%rsp,%rbp >> ? 8ca84: ? ? ? 41 57 ? ? ? ? ? ? ? ? ? push ? %r15 >> ? 8ca86: ? ? ? 41 56 ? ? ? ? ? ? ? ? ? push ? %r14 >> ? 8ca88: ? ? ? 41 55 ? ? ? ? ? ? ? ? ? push ? %r13 >> ? 8ca8a: ? ? ? 41 54 ? ? ? ? ? ? ? ? ? push ? %r12 >> ? 8ca8c: ? ? ? 53 ? ? ? ? ? ? ? ? ? ? ?push ? %rbx >> ? 8ca8d: ? ? ? 48 83 ec 48 ? ? ? ? ? ? sub ? ?$0x48,%rsp >> ? 8ca91: ? ? ? e8 00 00 00 00 ? ? ? ? ?callq ?8ca96 >> <radeon_pm_set_clocks+0x16> >> ? 8ca96: ? ? ? 49 89 fc ? ? ? ? ? ? ? ?mov ? ?%rdi,%r12 >> ? 8ca99: ? ? ? 48 8b 7f 08 ? ? ? ? ? ? mov ? ?0x8(%rdi),%rdi >> ? 8ca9d: ? ? ? 48 83 c7 20 ? ? ? ? ? ? add ? ?$0x20,%rdi >> ? 8caa1: ? ? ? e8 00 00 00 00 ? ? ? ? ?callq ?8caa6 >> <radeon_pm_set_clocks+0x26> >> ? 8caa6: ? ? ? 49 8d 84 24 a0 1c 00 ? ?lea ? ?0x1ca0(%r12),%rax >> ? 8caad: ? ? ? 00 >> ? 8caae: ? ? ? 48 89 c7 ? ? ? ? ? ? ? ?mov ? ?%rax,%rdi >> ? 8cab1: ? ? ? 48 89 45 90 ? ? ? ? ? ? mov ? ?%rax,-0x70(%rbp) >> ? 8cab5: ? ? ? e8 00 00 00 00 ? ? ? ? ?callq ?8caba >> <radeon_pm_set_clocks+0x3a> >> -> 8caba: ? ? ? 49 8d 8c 24 e8 09 00 ? ?lea ? ?0x9e8(%r12),%rcx >> ? 8cac1: ? ? ? 00 >> ? 8cac2: ? ? ? 48 89 cf ? ? ? ? ? ? ? ?mov ? ?%rcx,%rdi >> ? 8cac5: ? ? ? 48 89 4d 98 ? ? ? ? ? ? mov ? ?%rcx,-0x68(%rbp) >> ? 8cac9: ? ? ? e8 00 00 00 00 ? ? ? ? ?callq ?8cace >> <radeon_pm_set_clocks+0x4e> >> ? 8cace: ? ? ? 41 83 7c 24 78 1a ? ? ? cmpl ? $0x1a,0x78(%r12) >> ? 8cad4: ? ? ? 0f 86 46 01 00 00 ? ? ? jbe ? ?8cc20 >> <radeon_pm_set_clocks+0x1a0> >> ? 8cada: ? ? ? 41 80 bc 24 d0 0d 00 ? ?cmpb ? $0x0,0xdd0(%r12) >> ? 8cae1: ? ? ? 00 00 >> >> This corresponds to: >> >> static void radeon_pm_set_clocks(struct radeon_device *rdev) >> { >> ? ? ? ?int i; >> >> ? ? ? ?mutex_lock(&rdev->ddev->struct_mutex); >> ? ? ? ?mutex_lock(&rdev->vram_mutex); >> -> ? ? ?mutex_lock(&rdev->cp.mutex); >> >> We know r12 is the rdev struct pointer (%rdi is the first function >> argument gets loaded into r12), and 0x9e8 is plausable as the offset >> of cp.mutex, as vram_mutex is further out and plausable as 0x1ca0, so >> the code is trying to acquire cp.mutex it seems. >> >> From the kernel log, the command proc ring test completed after the >> start of the GPU reset, which is the last callsite to unlock cp.mutex, >> ie >> <radeon_fence_wait->radeon_gpu_reset->radeon_resume->rv770_resume->rv770_startup->r600_cp_resume->radeon_ring_test->r600_ring_test->radeon_ring_unlock_commit> >> before the hang at >> <radeon_fence_wait->radeon_gpu_reset->->drm_helper_resume_force_mode->drm_crtc_helper_set_mode->mode_fixup->atombios_crtc_mode_fixup->radeon_pm_compute_clocks->radeon_pm_set_clocks>. >> >> Looking through all the calls between these two spots, I just can't >> see anything that _could_ have locked cp.mutex, unless I'm misreading >> the disassembly of mutex_lock. >> >> Any thoughts? > > Well, looking further back down the call chain, radeon_bo_create takes > the vram_mutex across the ttm_bo_init call, so I was duped by the > compiler's instruction scheduling. > > I looked at various changes to the locking scheme which did address > the X hang (thus blackouts), but all attempts failed with lockdep > violations somewhere, so there is no minimal fix without revisiting > recent locking scheme changes [1,2,3]. > > I'll pass on some other minor fixes I've cooked up in the meantime. > Note that I am running with the kernel boot options 'dynpm=1 > radeon.dynclks=1' to stop everything getting toasty; let me know if > you need more info for this issue. > > Thanks, > ?Daniel > > [1] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37917 > [2] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37923 > [3] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37937
I'm hitting this issue again on 2.6.36-rc1; it reproduces the 1000ms timeout and subsequent hang every time on my RV710 when starting a game on doom3 (eg http://ftp.jeuxlinux.fr/files/doom3-linux-1.1.1286-demo.x86.run), even without the dympm/dynclks options. It looks like the locking strategy of the Matthew's patches needs reworking. I was able to reproduce the hang by forcing a GPU reset also. What else would help with this? Thanks, Daniel -- Daniel J Blueman