Hi On Wed, 21 Dec 2016, Takashi Iwai <ti...@suse.de> wrote: > On Tue, 20 Dec 2016 14:04:56 +0100, > Takashi Sakamoto wrote: >> >> Hi, >> >> On Dec 20 2016 20:47, Ioan-Adrian Ratiu wrote: >> > On Tue, 20 Dec 2016, Takashi Sakamoto <o-taka...@sakamocchi.jp> wrote: >> >> --- >> >> >> >> Hi, >> >> >> >>> Commit 16200948d835 ("ALSA: usb-audio: Fix race at stopping the stream") >> >>> fixes a race-codition but it also introduces another really nasty data >> >>> race >> >>> regression which makes my usb sound card [1] completely useless, throwing >> >>> the kernel into a panic if anything from userspace tries to start >> >>> playback. >> >>> >> >>> The problem is this: ep->data_subs is now set to NULL every time inside >> >>> wait_clear_urbs(). ep->data_subs is initalized only in one place in >> >>> start_endpoints(), then it is immediately wiped by the pre-existing call >> >>> to >> >>> wait_clear_urbs() inside snd_usb_endpoint_start(). >> >>> >> >>> To ilustrate, this is what happens in the non-irq part of the code: >> >>> >> >>> Step 1 (inside start_endpoints): ep->data_subs = subs; >> >>> Step 2 (inside start_endpoints): snd_usb_endpoint_start(ep, can_sleep); >> >>> Step 3 (inside snd_usb_endpoint_start): wait_clear_urbs(ep); >> >>> Step 4 (inside wait_clear_urbs): ep->data_subs = NULL; >> >>> >> >>> Here's a simplified call stack for the IRQ part (full one at the end): >> >>> >> >>> (NULL dereference, param "subs" is passed NULL value of ep->data_subs) >> >>> retire_playback_urb >> >>> retire_outbound_urb >> >>> snd_complete_urb >> >>> (...) >> >>> xhci_irq >> >>> >> >>> Looking at the git log there seems to be quite a lot of history in this >> >>> part of the codebase, dating back to 2012 or earlier. My fix is based on >> >>> 015618b90 ("ALSA: snd-usb: Fix URB cancellation at stream start") and >> >>> e9ba389c5 ("ALSA: usb-audio: Fix scheduling-while-atomic bug in PCM >> >>> capture >> >>> stream") with a few modifications of my own. My idea is to do the >> >>> ep->data_subs wiping before endpoint initialization in a manner similar >> >>> to these older commits by using stop_endpoints() in snd_usb_pcm_prepare() >> >>> and at the same time keep the ep->data_subs = NULL in wait_clear_urbs() >> >>> to >> >>> not trigger the recently fixed stream stopping race again. >> >>> >> >>> Full call stack: >> >>> >> >>> [ 131.093240] BUG: unable to handle kernel NULL pointer dereference at >> >>> 0000000000000010 >> >>> [ 131.094313] IP: retire_playback_urb+0x1b/0x160 [snd_usb_audio] >> >>> [ 131.095046] PGD 0 >> >>> [ 131.095047] >> >>> [ 131.096509] Oops: 0000 [#1] PREEMPT SMP >> >>> [ 131.097255] Modules linked in: fuse snd_usb_audio snd_usbmidi_lib >> >>> snd_rawmidi snd_seq_device ctr ccm arc4 ath9k intel_rapl ath9k_common >> >>> x86_pkg_temp_thermal ath9k_hw intel_powerclamp coretemp joydev mousedev >> >>> ath kvm_intel mac80211 kvm >> >>> input_leds hid_generic psmouse irqbypass usbhid hid crct10dif_pclmul >> >>> crc32_pclmul serio_raw crc32c_intel atkbd ghash_clmulni_intel >> >>> snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic >> >>> snd_hda_intel pcbc aesni_intel libps2 cfg8 >> >>> 0211 aes_x86_64 crypto_simd dell_laptop glue_helper r8169 dell_smbios >> >>> snd_hda_codec sr_mod cryptd led_class mii rfkill snd_hwdep cdrom >> >>> snd_hda_core ac dcdbas i8042 xhci_pci serio xhci_hcd snd_pcm tpm_tis >> >>> battery tpm_tis_core lpc_ich tpm >> >>> snd_timer evdev shpchp i2c_i801 sch_fq_codel >> >>> [ 131.105551] CPU: 2 PID: 165 Comm: irq/29-xhci_hcd Not tainted >> >>> 4.9.0-gd824cdc58ba0 #10 >> >>> [ 131.107516] Hardware name: Dell Inc. Inspiron 3521/018DYG, BIOS A14 >> >>> 07/31/2015 >> >>> [ 131.109592] task: ffff880154a70000 task.stack: ffffc90000f48000 >> >>> [ 131.111746] RIP: 0010:retire_playback_urb+0x1b/0x160 [snd_usb_audio] >> >>> [ 131.113899] RSP: 0018:ffffc90000f4bc10 EFLAGS: 00010082 >> >>> [ 131.116080] RAX: ffffffffa04cabe0 RBX: 0000000000000000 RCX: >> >>> 0000000000000000 >> >>> [ 131.118284] RDX: 0000000000000000 RSI: ffff8801435a4c00 RDI: >> >>> 0000000000000000 >> >>> [ 131.120505] RBP: ffffc90000f4bc40 R08: 0000000000000001 R09: >> >>> 0000000000000001 >> >>> [ 131.122807] R10: 0000000000000001 R11: ffffffff82f60d6d R12: >> >>> ffff8801432a0238 >> >>> [ 131.125265] R13: ffff8801435a4c00 R14: 0000000000000000 R15: >> >>> ffff8801535c83b8 >> >>> [ 131.127723] FS: 0000000000000000(0000) GS:ffff88015a000000(0000) >> >>> knlGS:0000000000000000 >> >>> [ 131.130217] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> >>> [ 131.132568] CR2: 0000000000000010 CR3: 0000000001e0b000 CR4: >> >>> 00000000001406e0 >> >>> [ 131.135007] Call Trace: >> >>> [ 131.137329] snd_complete_urb+0x190/0x2b0 [snd_usb_audio] >> >>> [ 131.139526] __usb_hcd_giveback_urb+0x8e/0x160 >> >>> [ 131.141948] usb_hcd_giveback_urb+0x43/0x110 >> >>> [ 131.144104] xhci_giveback_urb_in_irq.isra.22+0x7d/0xb0 [xhci_hcd] >> >>> [ 131.146521] finish_td.constprop.38+0x1de/0x2f0 [xhci_hcd] >> >>> [ 131.148736] xhci_irq+0x13a2/0x1ca0 [xhci_hcd] >> >>> [ 131.150972] ? trace_hardirqs_on+0xd/0x10 >> >>> [ 131.153209] ? _raw_spin_unlock_irq+0x2c/0x50 >> >>> [ 131.155390] ? irq_thread+0xb5/0x1d0 >> >>> [ 131.157771] xhci_msi_irq+0x11/0x20 [xhci_hcd] >> >>> [ 131.159938] irq_forced_thread_fn+0x2f/0x70 >> >>> [ 131.162072] ? irq_thread+0xb5/0x1d0 >> >>> [ 131.164141] irq_thread+0x149/0x1d0 >> >>> [ 131.166132] ? irq_finalize_oneshot.part.2+0xe0/0xe0 >> >>> [ 131.168142] ? wake_threads_waitq+0x30/0x30 >> >>> [ 131.170149] kthread+0x10f/0x150 >> >>> [ 131.172144] ? irq_thread_dtor+0xc0/0xc0 >> >>> [ 131.174139] ? kthread_create_on_node+0x40/0x40 >> >>> [ 131.176116] ret_from_fork+0x2a/0x40 >> >>> [ 131.178074] Code: 8b 77 64 4c 89 e7 e8 e5 fe ff ff eb c3 0f 1f 00 0f >> >>> 1f 44 00 00 55 31 d2 48 89 e5 41 57 41 56 41 55 41 54 53 48 89 fb 48 83 >> >>> ec 08 <48> 8b 47 10 48 8b 4f 70 48 c7 c7 88 7b 4d a0 4c 8b a0 78 01 00 >> >>> [ 131.182382] RIP: retire_playback_urb+0x1b/0x160 [snd_usb_audio] RSP: >> >>> ffffc90000f4bc10 >> >>> [ 131.184562] CR2: 0000000000000010 >> >>> >> >>> [1] 041e:3232 Creative Technology SoundBlaster X-FI HD >> >>> [2] >> >>> http://mailman.alsa-project.org/pipermail/alsa-devel/2016-December/115425.html >> >> >> >> I can regenerate this bug with my EMU 0404 USB. My understanding of this >> >> bug is >> >> quite similar to your perspective. This bug is quite critical. In my >> >> understanding, we encounter this bug in all of cases in which >> >> snd-usb-audio is >> >> applied for PCM playback substream. >> >> >> >> Below workaround can also suppress this bug. >> >> >> >> diff --git a/sound/usb/endpoint.c b/sound/usb/endpoint.c >> >> index a2cdf33..9b34f76 100644 >> >> --- a/sound/usb/endpoint.c >> >> +++ b/sound/usb/endpoint.c >> >> @@ -537,11 +537,14 @@ static int wait_clear_urbs(struct snd_usb_endpoint >> >> *ep) >> >> alive, ep->ep_num); >> >> clear_bit(EP_FLAG_STOPPING, &ep->flags); >> >> >> >> - ep->data_subs = NULL; >> >> ep->sync_slave = NULL; >> >> ep->retire_data_urb = NULL; >> >> ep->prepare_data_urb = NULL; >> >> >> >> + msleep(200); >> >> + >> >> + ep->data_subs = NULL; >> >> + >> >> return 0; >> >> } >> >> >> >> If initialization of 'struct snd_usb_endpoint.data_subs' can be done >> >> after all >> >> of queued URBs are processed and corresponding complete interrupts are >> >> cought, >> >> we can solve this critical bug. >> >> >> >>> Signed-off-by: Ioan-Adrian Ratiu <a...@adirat.com> >> >>> --- >> >>> sound/usb/endpoint.c | 11 ++--------- >> >>> sound/usb/endpoint.h | 2 +- >> >>> sound/usb/pcm.c | 13 ++++++------- >> >>> 3 files changed, 9 insertions(+), 17 deletions(-) >> >>> >> >>> diff --git a/sound/usb/endpoint.c b/sound/usb/endpoint.c >> >>> index a2cdf3370afe..4465f324c2c2 100644 >> >>> --- a/sound/usb/endpoint.c >> >>> +++ b/sound/usb/endpoint.c >> >>> @@ -920,9 +920,7 @@ int snd_usb_endpoint_set_params(struct >> >>> snd_usb_endpoint *ep, >> >>> /** >> >>> * snd_usb_endpoint_start: start an snd_usb_endpoint >> >>> * >> >>> - * @ep: the endpoint to start >> >>> - * @can_sleep: flag indicating whether the operation is executed in >> >>> - * non-atomic context >> >>> + * @ep: the endpoint to start >> >>> * >> >>> * A call to this function will increment the use count of the endpoint. >> >>> * In case it is not already running, the URBs for this endpoint will be >> >>> @@ -932,7 +930,7 @@ int snd_usb_endpoint_set_params(struct >> >>> snd_usb_endpoint *ep, >> >>> * >> >>> * Returns an error if the URB submission failed, 0 in all other cases. >> >>> */ >> >>> -int snd_usb_endpoint_start(struct snd_usb_endpoint *ep, bool can_sleep) >> >>> +int snd_usb_endpoint_start(struct snd_usb_endpoint *ep) >> >>> { >> >>> int err; >> >>> unsigned int i; >> >>> @@ -944,11 +942,6 @@ int snd_usb_endpoint_start(struct snd_usb_endpoint >> >>> *ep, bool can_sleep) >> >>> if (++ep->use_count != 1) >> >>> return 0; >> >>> >> >>> - /* just to be sure */ >> >>> - deactivate_urbs(ep, false); >> >>> - if (can_sleep) >> >>> - wait_clear_urbs(ep); >> >>> - >> >>> ep->active_mask = 0; >> >>> ep->unlink_mask = 0; >> >>> ep->phase = 0; >> >>> diff --git a/sound/usb/endpoint.h b/sound/usb/endpoint.h >> >>> index 6428392d8f62..584f295d7c77 100644 >> >>> --- a/sound/usb/endpoint.h >> >>> +++ b/sound/usb/endpoint.h >> >>> @@ -18,7 +18,7 @@ int snd_usb_endpoint_set_params(struct >> >>> snd_usb_endpoint *ep, >> >>> struct audioformat *fmt, >> >>> struct snd_usb_endpoint *sync_ep); >> >>> >> >>> -int snd_usb_endpoint_start(struct snd_usb_endpoint *ep, bool >> >>> can_sleep); >> >>> +int snd_usb_endpoint_start(struct snd_usb_endpoint *ep); >> >>> void snd_usb_endpoint_stop(struct snd_usb_endpoint *ep); >> >>> void snd_usb_endpoint_sync_pending_stop(struct snd_usb_endpoint *ep); >> >>> int snd_usb_endpoint_activate(struct snd_usb_endpoint *ep); >> >>> diff --git a/sound/usb/pcm.c b/sound/usb/pcm.c >> >>> index 34c6d4f2c0b6..db26f767f851 100644 >> >>> --- a/sound/usb/pcm.c >> >>> +++ b/sound/usb/pcm.c >> >>> @@ -218,7 +218,7 @@ int snd_usb_init_pitch(struct snd_usb_audio *chip, >> >>> int iface, >> >>> } >> >>> } >> >>> >> >>> -static int start_endpoints(struct snd_usb_substream *subs, bool >> >>> can_sleep) >> >>> +static int start_endpoints(struct snd_usb_substream *subs) >> >>> { >> >>> int err; >> >>> >> >>> @@ -231,7 +231,7 @@ static int start_endpoints(struct snd_usb_substream >> >>> *subs, bool can_sleep) >> >>> dev_dbg(&subs->dev->dev, "Starting data EP @%p\n", ep); >> >>> >> >>> ep->data_subs = subs; >> >>> - err = snd_usb_endpoint_start(ep, can_sleep); >> >>> + err = snd_usb_endpoint_start(ep); >> >>> if (err < 0) { >> >>> clear_bit(SUBSTREAM_FLAG_DATA_EP_STARTED, >> >>> &subs->flags); >> >>> return err; >> >>> @@ -260,7 +260,7 @@ static int start_endpoints(struct snd_usb_substream >> >>> *subs, bool can_sleep) >> >>> dev_dbg(&subs->dev->dev, "Starting sync EP @%p\n", ep); >> >>> >> >>> ep->sync_slave = subs->data_endpoint; >> >>> - err = snd_usb_endpoint_start(ep, can_sleep); >> >>> + err = snd_usb_endpoint_start(ep); >> >>> if (err < 0) { >> >>> clear_bit(SUBSTREAM_FLAG_SYNC_EP_STARTED, >> >>> &subs->flags); >> >>> return err; >> >>> @@ -809,8 +809,7 @@ static int snd_usb_pcm_prepare(struct >> >>> snd_pcm_substream *substream) >> >>> goto unlock; >> >>> } >> >>> >> >>> - snd_usb_endpoint_sync_pending_stop(subs->sync_endpoint); >> >>> - snd_usb_endpoint_sync_pending_stop(subs->data_endpoint); >> >>> + stop_endpoints(subs, true); >> >>> >> >>> ret = set_format(subs, subs->cur_audiofmt); >> >>> if (ret < 0) >> >>> @@ -850,7 +849,7 @@ static int snd_usb_pcm_prepare(struct >> >>> snd_pcm_substream *substream) >> >>> /* for playback, submit the URBs now; otherwise, the first >> >>> hwptr_done >> >>> * updates for all URBs would happen at the same time when >> >>> starting */ >> >>> if (subs->direction == SNDRV_PCM_STREAM_PLAYBACK) >> >>> - ret = start_endpoints(subs, true); >> >>> + return start_endpoints(subs); >> >>> >> >>> unlock: >> >>> snd_usb_unlock_shutdown(subs->stream->chip); >> >>> @@ -1666,7 +1665,7 @@ static int >> >>> snd_usb_substream_capture_trigger(struct snd_pcm_substream *substream >> >>> >> >>> switch (cmd) { >> >>> case SNDRV_PCM_TRIGGER_START: >> >>> - err = start_endpoints(subs, false); >> >>> + err = start_endpoints(subs); >> >>> if (err < 0) >> >>> return err; >> >> >> >> This patch works better, but not the best. It's a bit intrusive. >> > >> > I disagree. Being intrusive is not a good reason to reject a patch, >> > especially if the alternative is to insert random magic number delays >> > in hopes of synchronizing process and irq contexts to avoid critical >> > errors. >> >> I apologize if you got some negative opinions in my previous message. >> >> My intention to show the ugly workaround is not to merge it instead of >> yours, but clear what causes this bug. >> >> For example, we can also apply below patch for this bug, to solve it >> roughly. >> >> diff --git a/sound/usb/endpoint.c b/sound/usb/endpoint.c >> index a2cdf33..6ec4eb1 100644 >> --- a/sound/usb/endpoint.c >> +++ b/sound/usb/endpoint.c >> @@ -162,7 +162,7 @@ int snd_usb_endpoint_next_packet_size(struct >> snd_usb_endpoint *ep) >> static void retire_outbound_urb(struct snd_usb_endpoint *ep, >> struct snd_urb_ctx *urb_ctx) >> { >> - if (ep->retire_data_urb) >> + if (ep->retire_data_urb && ep->data_subs) >> ep->retire_data_urb(ep->data_subs, urb_ctx->urb); >> } >> >> But this is just a workaround and can't fix it thoroughly. The main >> issue is synchronization between interrupt/process contexts. >> >> > Please take the time to fully analyze my patch and let's have a >> > discussion based on it, not reject it outright and replace it with >> > a quick and dirty delay hack. >> >> OK. I'll deliberately check it again so that I have no overlooks. I >> with this thread also catch the other developers enough helpful to >> you. (I just eventually caught your patch in LKML and not developer >> for this category of devices.) > > Sorry for the late reply, as I've been (still) off and had bad net > connections. > > About your fix: Sakamoto-san is right, it's no good way to fix this > kind or problem. The easiest option right now is just to revert my > previous fix, as it obviously introduces another regression. The > correct fix will be taken after that. > > I'm going to prepare a revert patch and ask Linus to take it before > rc1.
I agree with reverting the initial commit decision because my problem disappears with it. But can you please state a reason for why my patch is "no good way to fix"? Being too intrusive is not a good reason. Ionel > > > thanks, > > Takashi > > > >> >> > Ionel >> > >> >> >> >> What we need to synchronize process contexts to interrupt context till >> >> all of >> >> queued URBs are handled. This seems to be a main purpose of >> >> wait_clear_bits, >> >> while current implementation of snd_complete_urb()/wait_clear_bits() is >> >> not >> >> enough. >> >> >> >> Thanks for your reporting. >> >> >> >> Takashi Sakamoto >> >> -- >> >> 2.9.3 >> >> >> Thanks >> >> Takashi Sakamoto >>