Steinar Bang <s...@dod.no> writes:
>>>>>> Ben Hutchings <b...@decadent.org.uk>:
>
>> Please send a readable photograph of this text.
>
> The problem occurred for the third time, and I couldn't find the camera,
> so I'm typing in what's shown on the console.
>
> This time it had happened while the machine was sitting unmanned and I
> can't say it had anything to do with the screen saver, unless someone
> unintentionally has moved the mouse.
>
> I also note that it says "invalid opcode".  This machine has an Intel P4
> CPU.  Is it too old for the current kernels?
>
> Console text follows:
> [523708.506472] ------------[ cut here ]-----------
> [523708.506472] kernel BUG at
> /build/build-linux_3.2.32-1-i386-Z3rOrf/linux-3.2.32/kernel/workqueue.c:1040!

This should not be a BUG IMHO, and it has in fact been made easier to
debug in newer kernels:

commit f5b2552b4ebbeadcadde1532d7bbd3f850719046
Author: Dan Carpenter <dan.carpen...@oracle.com>
Date:   Fri Apr 13 22:06:58 2012 +0300

    workqueue: change BUG_ON() to WARN_ON()

    This BUG_ON() can be triggered if you call schedule_work() before
    calling INIT_WORK().  It is a bug definitely, but it's nicer to just
    print a stack trace and return.

    Reported-by: Matt Renzelmann <m...@cs.wisc.edu>
    Signed-off-by: Dan Carpenter <dan.carpen...@oracle.com>
    Signed-off-by: Tejun Heo <t...@kernel.org>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5abf42f..66ec08d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1032,7 +1032,10 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 	cwq = get_cwq(gcwq->cpu, wq);
 	trace_workqueue_queue_work(cpu, cwq, work);
 
-	BUG_ON(!list_empty(&work->entry));
+	if (WARN_ON(!list_empty(&work->entry))) {
+		spin_unlock_irqrestore(&gcwq->lock, flags);
+		return;
+	}
 
 	cwq->nr_in_flight[cwq->work_color]++;
 	work_flags = work_color_to_flags(cwq->work_color);


Is there any chance that could be included in the Debian wheezy kernels,
although I guess it does not meet the stable requirements?

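
For anyone wondering why an uninitialized or overwritten work item trips
that check, here is a small userspace sketch. It is not kernel code:
struct fake_work, fake_queue_work() and the list helpers are hypothetical
stand-ins I made up for illustration, and the locking and the pending bit
that the real queue_work path handles are left out. It only models the
list test and the "warn and return" behaviour of the commit above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical model of a work item; only the list linkage is kept. */
struct fake_work {
	struct list_head entry;
};

/* An initialized, unqueued node points at itself. */
static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static bool list_is_empty(const struct list_head *head)
{
	return head->next == head;
}

static void list_add_tail_node(struct list_head *node, struct list_head *head)
{
	node->next = head;
	node->prev = head->prev;
	head->prev->next = node;
	head->prev = node;
}

/*
 * Model of the patched check: instead of halting (BUG_ON), print a
 * warning and refuse to queue. Returns true if the item was queued.
 */
static bool fake_queue_work(struct list_head *worklist, struct fake_work *work)
{
	assert(worklist != NULL && work != NULL);

	if (!list_is_empty(&work->entry)) {
		fprintf(stderr, "warning: work entry not empty (next=%p prev=%p)\n",
			(void *)work->entry.next, (void *)work->entry.prev);
		return false;
	}

	list_add_tail_node(&work->entry, worklist);
	return true;
}
```

A zeroed item (never initialized, or overwritten with zeroes) has
entry.next == NULL, which is not the address of the entry itself, so the
"not empty" test fires even though the item was never on any list. An
item that is already linked into a list fails the same test.
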
> [523708.506472] invalid opcode: 0000 [#1] SMP
> [523708.506472] Modules linked in: mperf speedstep_lib ip6table_filter
> ip6_tables cpufreq_powersave iptable_filter ip_tables cpufreq_stats
> cpufreq_conservative cpufreq_userspace ebtable_nat ebtables x_tables ppdev lp
> bnep rfcomm bluetooth rfkill crc16 binfmt_misc fuse nfsd nfs nfs_acl
> auth_rpcgss fscache lockd sunrpc loop snd_intel8x0 snd_ac97_codec i915
> snd_pcm_oss snd_mixer_oss snd_pcm video snd_page_alloc drm_kms_helper
> snd_seq_midi snd_seq_midi_event psmouse snd_rawmidi snd_seq snd_seq_device
> snd_timer snd pcspkr drm i2c_i801 i2c_algo_bit soundcore ac97_bus i2c_core
> iTCO_wdt serio_raw evdev parport_pc iTCO_vendor_support parport processor
> thermal_sys rng_core button shpchp usbhid hid ext3 mbcache jbd dm_mod sg
> sd_mod sr_mod cdrom crc_t10dif ata_generic floppy ata_piix libata uhci_hcd
> ehci_hcd tg3 usbcore libphy scsi_mod usb_common [last unloaded:
> scsi_wait_scan]
> [523708.506472]
> [523708.506472] Pid: 0, comm: swapper/0 Not tainted 3.2.0-4-686-pae #1 Debian
> 3.2.32-1 Hewlett-Packard HP d530 CMT(DZ036T)/085Ch
> [523708.506472] EIP: 0060:[<c10494b1>] EFLAGS: 00010013 CPU: 0
> [523708.506472] EIP is at __queue_work+0x193/0x1f4
> [523708.506472] EAX: f739e56c EBX: f708c800 ECX: 00000020 EDX: f739e568
> [523708.506472] ESI: c14b5240 EDI: 00000010 EBP: 00000046 ESP: f5809f60
> [523708.506472] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> [523708.506472] Process swapper/0 (pid: 0, ti=f5808000 task=c13defe0
> task.ti=c13d8000)
> [523708.506472] Stack:
> [523708.506472] f739e568 f085fe80 00000000 f085fe80 00000000 00000010
> f7398000 c1049555
> [523708.506472] f739e568 f739e000 f0871400 f85abe17 c11e601f 0c00a511
> 00008000 00001930
> [523708.506472] f739e568 00000006 f739e028 00000046 00000046 f71147c0
> f58068d4 00000010
> [523708.506472] Call Trace:
> [523708.506472] [<c1049555>] ? queue_work_on+0x25/0x30
> [523708.506472] [<f85abe17>] ? i8xx_irq_handler+0x6b/0x1dc [i915]

I took a quick look at this, and my guess is that i8xx_irq_handler tries
to queue an error event through i915_handle_error() here.  The
error_work work_struct is initialized in intel_irq_init(), so I cannot
see how the error can happen unless something scribbles over it at some
point.  Which may be what happens here?  That would be a lot easier to
see if we could have queue_work fail with a warning instead.

Maybe add a few extra debugging tests to i915_handle_error() to see if
this is indeed what happens here?  Completely untested of course:

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 32e1bda..614f3f4 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1414,6 +1414,19 @@ static void i915_report_and_clear_eir(struct drm_device *dev)
 	}
 }
 
+/* debugging helper only... */
+static bool safe_queue_work(struct workqueue_struct *wq, struct work_struct *work)
+{
+	if (WARN_ON(!test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)) &&
+		    !list_empty(&work->entry))) {
+		pr_err("work->data=0x%08lx, &work->entry=%p, work->entry.next=%p, work->entry.prev=%p\n",
+		       *work_data_bits(work), &work->entry, work->entry.next, work->entry.prev );
+		return false;
+	}
+
+	return queue_work(wq, work);
+}
+
 /**
  * i915_handle_error - handle an error interrupt
  * @dev: drm device
@@ -1444,7 +1457,7 @@ void i915_handle_error(struct drm_device *dev, bool wedged)
 			wake_up_all(&ring->irq_queue);
 	}
 
-	queue_work(dev_priv->wq, &dev_priv->error_work);
+	safe_queue_work(dev_priv->wq, &dev_priv->error_work);
 }
 
 static void i915_pageflip_stall_check(struct drm_device *dev, int pipe)


Bjørn
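
To show what the condition in the debugging helper would and would not
catch, here is a userspace sketch of the same test. Everything in it is
hypothetical: struct model_work, MODEL_PENDING_MASK and
model_looks_corrupted() are made-up names, and treating bit 0 of the
data word as the pending flag is my assumption about how
WORK_STRUCT_PENDING_BIT is laid out. It is untested against a real
kernel, just like the patch above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical model of a work item: a data word plus list linkage. */
struct model_work {
	unsigned long data;
	struct list_head entry;
};

/* Assumption: bit 0 of data plays the role of the pending flag. */
#define MODEL_PENDING_MASK 1UL

/* Model of initialization: no flags set, entry points at itself. */
static void model_init_work(struct model_work *w)
{
	assert(w != NULL);
	w->data = 0;
	w->entry.next = &w->entry;
	w->entry.prev = &w->entry;
}

/*
 * Same shape as the helper's WARN_ON() condition: an item that is not
 * marked pending should not be linked into any list, so "not pending
 * and entry not self-pointing" is taken as a sign of corruption.
 */
static bool model_looks_corrupted(const struct model_work *w)
{
	bool pending = (w->data & MODEL_PENDING_MASK) != 0;
	bool linked = w->entry.next != &w->entry;

	return !pending && linked;
}
```

One consequence of combining the two tests: if whatever scribbles over
the struct also happens to set the pending bit, this check stays silent.
In the real driver the call would then go on to queue_work(), which, as
far as I can tell, simply declines to queue an item that is already
marked pending.
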