[PATCH] dm crypt: defer the decryption to a tasklet, when being called with interrupts disabled
On some specific hardware on early boot we occasionally get [ 1193.920255][T0] BUG: sleeping function called from invalid context at mm/mempool.c:381 [ 1193.936616][T0] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/69 [ 1193.953233][T0] no locks held by swapper/69/0. [ 1193.965871][T0] irq event stamp: 575062 [ 1193.977724][T0] hardirqs last enabled at (575061): [] tick_nohz_idle_exit+0xe2/0x3e0 [ 1194.002762][T0] hardirqs last disabled at (575062): [] flush_smp_call_function_from_idle+0x4f/0x80 [ 1194.029035][T0] softirqs last enabled at (575050): [] asm_call_irq_on_stack+0x12/0x20 [ 1194.054227][T0] softirqs last disabled at (575043): [] asm_call_irq_on_stack+0x12/0x20 [ 1194.079389][T0] CPU: 69 PID: 0 Comm: swapper/69 Not tainted 5.10.6-cloudflare-kasan-2021.1.4-dev #1 [ 1194.104103][T0] Hardware name: NULL R162-Z12-CD/MZ12-HD4-CD, BIOS R10 06/04/2020 [ 1194.119591][T0] Call Trace: [ 1194.130233][T0] dump_stack+0x9a/0xcc [ 1194.141617][T0] ___might_sleep.cold+0x180/0x1b0 [ 1194.153825][T0] mempool_alloc+0x16b/0x300 [ 1194.165313][T0] ? remove_element+0x160/0x160 [ 1194.176961][T0] ? blk_mq_end_request+0x4b/0x490 [ 1194.188778][T0] crypt_convert+0x27f6/0x45f0 [dm_crypt] [ 1194.201024][T0] ? rcu_read_lock_sched_held+0x3f/0x70 [ 1194.212906][T0] ? module_assert_mutex_or_preempt+0x3e/0x70 [ 1194.225318][T0] ? __module_address.part.0+0x1b/0x3a0 [ 1194.237212][T0] ? is_kernel_percpu_address+0x5b/0x190 [ 1194.249238][T0] ? crypt_iv_tcw_ctr+0x4a0/0x4a0 [dm_crypt] [ 1194.261593][T0] ? is_module_address+0x25/0x40 [ 1194.272905][T0] ? static_obj+0x8a/0xc0 [ 1194.283582][T0] ? lockdep_init_map_waits+0x26a/0x700 [ 1194.295570][T0] ? __raw_spin_lock_init+0x39/0x110 [ 1194.307330][T0] kcryptd_crypt_read_convert+0x31c/0x560 [dm_crypt] [ 1194.320496][T0] ? kcryptd_queue_crypt+0x1be/0x380 [dm_crypt] [ 1194.333203][T0] blk_update_request+0x6d7/0x1500 [ 1194.344841][T0] ? blk_mq_trigger_softirq+0x190/0x190 [ 1194.356831][T0] blk_mq_end_request+0x4b/0x490 [ 1194.367994][T0] ? blk_mq_trigger_softirq+0x190/0x190 [ 1194.379693][T0] flush_smp_call_function_queue+0x24b/0x560 [ 1194.391847][T0] flush_smp_call_function_from_idle+0x59/0x80 [ 1194.403969][T0] do_idle+0x287/0x450 [ 1194.413891][T0] ? arch_cpu_idle_exit+0x40/0x40 [ 1194.424716][T0] ? lockdep_hardirqs_on_prepare+0x286/0x3f0 [ 1194.436399][T0] ? _raw_spin_unlock_irqrestore+0x39/0x40 [ 1194.447759][T0] cpu_startup_entry+0x19/0x20 [ 1194.458038][T0] secondary_startup_64_no_verify+0xb0/0xbb IO completion can be queued to a different CPU by the block subsystem as a "call single function/data". The CPU may run these routines from the idle task, but it does so with interrupts disabled. It is not a good idea to do decryption with irqs disabled even in an idle task context, so just defer it to a tasklet as with requests from hard irqs. Fixes: 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd workqueues") Cc: # v5.9+ Signed-off-by: Ignat Korchagin --- drivers/md/dm-crypt.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c index 53791138d78b..c56b01bfdb60 100644 --- a/drivers/md/dm-crypt.c +++ b/drivers/md/dm-crypt.c @@ -2091,8 +2091,11 @@ static void kcryptd_queue_crypt(struct dm_crypt_io *io) if ((bio_data_dir(io->base_bio) == READ && test_bit(DM_CRYPT_NO_READ_WORKQUEUE, &cc->flags)) || (bio_data_dir(io->base_bio) == WRITE && test_bit(DM_CRYPT_NO_WRITE_WORKQUEUE, &cc->flags))) { - if (in_irq()) { - /* Crypto API's "skcipher_walk_first() refuses to work in hard IRQ context */ + /* in_irq(): Crypto API's "skcipher_walk_first() refuses to work in hard IRQ context. +* irqs_disabled(): the kernel may run some IO completion from the idle thread, but + it is being executed with irqs disabled. +*/ + if (in_irq() || irqs_disabled()) { tasklet_init(&io->tasklet, kcryptd_crypt_tasklet, (unsigned long)&io->work); tasklet_schedule(&io->tasklet); return; -- 2.20.1
Re: [PATCH 5.10 045/152] dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq
On Mon, Jan 18, 2021 at 10:44 PM Pavel Machek wrote: > > > > Fix this by allocating crypto requests with GFP_ATOMIC mask in > > interrupt context. > ... > > This one is wrong. > > > > +++ b/drivers/md/dm-crypt.c > > @@ -1454,13 +1454,16 @@ static int crypt_convert_block_skcipher( > > - if (!ctx->r.req) > > - ctx->r.req = mempool_alloc(&cc->req_pool, GFP_NOIO); > > + if (!ctx->r.req) { > > + ctx->r.req = mempool_alloc(&cc->req_pool, in_interrupt() ? > > GFP_ATOMIC : GFP_NOIO); > > Good so far. Ugly but good. > > > -static void crypt_alloc_req_aead(struct crypt_config *cc, > > +static int crypt_alloc_req_aead(struct crypt_config *cc, > >struct convert_context *ctx) > > { > > - if (!ctx->r.req_aead) > > - ctx->r.req_aead = mempool_alloc(&cc->req_pool, GFP_NOIO); > > + if (!ctx->r.req) { > > + ctx->r.req = mempool_alloc(&cc->req_pool, in_interrupt() ? > > GFP_ATOMIC : GFP_NOIO); > > + if (!ctx->r.req) > > + return -ENOMEM; > > + } > > But this one can't be good. We are now allocating different field in > the structure! Good catch! Sorry for the copy-paste. It is actually not a big deal, because this is not a structure, but a union: as long as the mempool was initialized with the correct size, it should be no different. Nevertheless, I'll send the patch to fix the typo. Regards, Ignat > Pavel > > -- > DENX Software Engineering GmbH, Managing Director:Wolfgang Denk > HRB 165235 Munich, Office: Kirchenstr.5, D-82194Groebenzell, Germany >
[PATCH] dm crypt: fix invalid copy paste in crypt_alloc_req_aead
In commit d68b295 ("dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq") I wrongly copy pasted crypto request allocation code from crypt_alloc_req_skcipher to crypt_alloc_req_aead. It is OK from runtime perspective as both simple encryption request pointer and AEAD request pointer are part of a union, but may confuse code reviewers. Fixes: d68b295 ("dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq") Cc: sta...@vger.kernel.org # v5.9+ Reported-by: Pavel Machek Signed-off-by: Ignat Korchagin --- drivers/md/dm-crypt.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c index 8c874710f0bc..5a55617a08e6 100644 --- a/drivers/md/dm-crypt.c +++ b/drivers/md/dm-crypt.c @@ -1481,9 +1481,9 @@ static int crypt_alloc_req_skcipher(struct crypt_config *cc, static int crypt_alloc_req_aead(struct crypt_config *cc, struct convert_context *ctx) { - if (!ctx->r.req) { - ctx->r.req = mempool_alloc(&cc->req_pool, in_interrupt() ? GFP_ATOMIC : GFP_NOIO); - if (!ctx->r.req) + if (!ctx->r.req_aead) { + ctx->r.req_aead = mempool_alloc(&cc->req_pool, in_interrupt() ? GFP_ATOMIC : GFP_NOIO); + if (!ctx->r.req_aead) return -ENOMEM; } -- 2.20.1
[PATCH] dm crypt: do not call bio_endio() from the dm-crypt tasklet
fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 398.370160][ C74] == When the decryption fully completes in the tasklet, dm-crypt will call bio_endio(), which in turn will call clone_endio() from dm.c core code. That function frees the resources associated with the bio, including per bio private structures. For dm-crypt it will free the current struct dm_crypt_io, which contains our tasklet object, causing use-after-free, when the tasklet is being dequeued by the kernel. To avoid this, do not call bio_endio() from the current tasklet context, but delay its execution to the dm-crypt IO workqueue. Fixes: 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd workqueues") Cc: # v5.9+ Signed-off-by: Ignat Korchagin --- drivers/md/dm-crypt.c | 24 +++- 1 file changed, 23 insertions(+), 1 deletion(-) diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c index 53791138d78b..90f109d8a4a7 100644 --- a/drivers/md/dm-crypt.c +++ b/drivers/md/dm-crypt.c @@ -1691,6 +1691,12 @@ static void crypt_inc_pending(struct dm_crypt_io *io) atomic_inc(&io->io_pending); } +static void kcryptd_io_bio_endio(struct work_struct *work) +{ + struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work); + bio_endio(io->base_bio); +} + /* * One of the bios was finished. Check for completion of * the whole request and correctly clean up the buffer. @@ -1713,7 +1719,23 @@ static void crypt_dec_pending(struct dm_crypt_io *io) kfree(io->integrity_metadata); base_bio->bi_status = error; - bio_endio(base_bio); + + /* +* If we are running this function from our tasklet, +* we can't call bio_endio() here, because it will call +* clone_endio() from dm.c, which in turn will +* free the current struct dm_crypt_io structure with +* our tasklet. In this case we need to delay bio_endio() +* execution to after the tasklet is done and dequeued. +*/ + if (tasklet_trylock(&io->tasklet)) { + tasklet_unlock(&io->tasklet); + bio_endio(base_bio); + return; + } + + INIT_WORK(&io->work, kcryptd_io_bio_endio); + queue_work(cc->io_queue, &io->work); } /* -- 2.20.1
[PATCH 1/2] dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq
/0xe50 [ 210.194313][C0] ? blk_queue_enter+0x6d0/0x6d0 [ 210.195372][C0] ? __bio_add_page+0x246/0x3d0 [ 210.196418][C0] ? bio_iov_iter_get_pages+0x7dd/0xbe0 [ 210.197611][C0] ? submit_bio+0xe2/0x460 [ 210.198481][C0] ? submit_bio_noacct+0xe50/0xe50 [ 210.199496][C0] ? free_unref_page_commit.constprop.0+0x130/0x330 [ 210.200825][C0] ? __blkdev_direct_IO_simple+0x43b/0x7e0 [ 210.202050][C0] ? bd_link_disk_holder+0x690/0x690 [ 210.203239][C0] ? put_pages_list+0x210/0x210 [ 210.204341][C0] ? scan_shadow_nodes+0xb0/0xb0 [ 210.205472][C0] ? _raw_write_lock_irqsave+0xe0/0xe0 [ 210.206698][C0] ? bd_may_claim+0xc0/0xc0 [ 210.207715][C0] ? zero_user_segments.constprop.0+0x2e0/0x2e0 [ 210.209092][C0] ? blkdev_direct_IO+0xd16/0x1020 [ 210.210200][C0] ? pagevec_lookup_range_tag+0x28/0x60 [ 210.211416][C0] ? __filemap_fdatawait_range+0xc4/0x1f0 [ 210.212669][C0] ? page_cache_next_miss+0x1e0/0x1e0 [ 210.213842][C0] ? generic_file_buffered_read+0x520/0x9e0 [ 210.215128][C0] ? delete_from_page_cache_batch+0x850/0x850 [ 210.216470][C0] ? bd_abort_claiming+0xd0/0xd0 [ 210.217531][C0] ? file_remove_privs+0x74/0x430 [ 210.218589][C0] ? filemap_check_errors+0x50/0xe0 [ 210.219705][C0] ? generic_file_direct_write+0x1a3/0x480 [ 210.220979][C0] ? __generic_file_write_iter+0x1d9/0x530 [ 210.38][C0] ? blkdev_write_iter+0x20d/0x3e0 [ 210.223328][C0] ? bd_unlink_disk_holder+0x360/0x360 [ 210.224464][C0] ? new_sync_write+0x37b/0x620 [ 210.225511][C0] ? new_sync_read+0x610/0x610 [ 210.226539][C0] ? _cond_resched+0x17/0x80 [ 210.227539][C0] ? inode_security+0x58/0x100 [ 210.228582][C0] ? security_file_permission+0x54/0x450 [ 210.229796][C0] ? vfs_write+0x524/0x770 [ 210.230758][C0] ? __x64_sys_pwrite64+0x197/0x1f0 [ 210.231890][C0] ? vfs_write+0x770/0x770 [ 210.232869][C0] ? do_syscall_64+0x33/0x40 [ 210.233839][C0] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fix this by allocating crypto requests with GFP_ATOMIC mask in interrupt context Fixes: 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd workqueues") Reported-by: Maciej S. Szmigiero Cc: # v5.9+ Signed-off-by: Ignat Korchagin --- drivers/md/dm-crypt.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c index 5f9f9b3a226d..777b5c71a2f7 100644 --- a/drivers/md/dm-crypt.c +++ b/drivers/md/dm-crypt.c @@ -1460,7 +1460,7 @@ static void crypt_alloc_req_skcipher(struct crypt_config *cc, unsigned key_index = ctx->cc_sector & (cc->tfms_count - 1); if (!ctx->r.req) - ctx->r.req = mempool_alloc(&cc->req_pool, GFP_NOIO); + ctx->r.req = mempool_alloc(&cc->req_pool, in_interrupt() ? GFP_ATOMIC : GFP_NOIO); skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]); @@ -1477,7 +1477,7 @@ static void crypt_alloc_req_aead(struct crypt_config *cc, struct convert_context *ctx) { if (!ctx->r.req_aead) - ctx->r.req_aead = mempool_alloc(&cc->req_pool, GFP_NOIO); + ctx->r.req_aead = mempool_alloc(&cc->req_pool, in_interrupt() ? GFP_ATOMIC : GFP_NOIO); aead_request_set_tfm(ctx->r.req_aead, cc->cipher_tfm.tfms_aead[0]); -- 2.20.1
[PATCH 2/2] dm crypt: do not wait for backlogged crypto request completion in softirq
Commit 39d42fa96ba1b7d2544db3f8ed5da8fb0d5cb877 made it possible for some code paths in dm-crypt to be executed in softirq context, when the underlying driver processes IO requests in interrupt/softirq context. When Crypto API backlogs a crypto request, dm-crypt uses wait_for_completion to avoid sending further requests to an already overloaded crypto driver. However, if the code is executing in softirq context, we might get the following stacktrace: [ 210.235213][C0] BUG: scheduling while atomic: fio/2602/0x0102 [ 210.236701][C0] Modules linked in: [ 210.237566][C0] CPU: 0 PID: 2602 Comm: fio Tainted: GW 5.10.0+ #50 [ 210.239292][C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 [ 210.241233][C0] Call Trace: [ 210.241946][C0] [ 210.242561][C0] dump_stack+0x7d/0xa3 [ 210.243466][C0] __schedule_bug.cold+0xb3/0xc2 [ 210.244539][C0] __schedule+0x156f/0x20d0 [ 210.245518][C0] ? io_schedule_timeout+0x140/0x140 [ 210.246660][C0] schedule+0xd0/0x270 [ 210.247541][C0] schedule_timeout+0x1fb/0x280 [ 210.248586][C0] ? usleep_range+0x150/0x150 [ 210.249624][C0] ? unpoison_range+0x3a/0x60 [ 210.250632][C0] ? kasan_kmalloc.constprop.0+0x82/0xa0 [ 210.251949][C0] ? unpoison_range+0x3a/0x60 [ 210.252958][C0] ? __prepare_to_swait+0xa7/0x190 [ 210.254067][C0] do_wait_for_common+0x2ab/0x370 [ 210.255158][C0] ? usleep_range+0x150/0x150 [ 210.256192][C0] ? bit_wait_io_timeout+0x160/0x160 [ 210.257358][C0] ? blk_update_request+0x757/0x1150 [ 210.258582][C0] ? _raw_spin_lock_irq+0x82/0xd0 [ 210.259674][C0] ? _raw_read_unlock_irqrestore+0x30/0x30 [ 210.260917][C0] wait_for_completion+0x4c/0x90 [ 210.261971][C0] crypt_convert+0x19a6/0x4c00 [ 210.263033][C0] ? _raw_spin_lock_irqsave+0x87/0xe0 [ 210.264193][C0] ? kasan_set_track+0x1c/0x30 [ 210.265191][C0] ? crypt_iv_tcw_ctr+0x4a0/0x4a0 [ 210.266283][C0] ? kmem_cache_free+0x104/0x470 [ 210.267363][C0] ? crypt_endio+0x91/0x180 [ 210.268327][C0] kcryptd_crypt_read_convert+0x30e/0x420 [ 210.269565][C0] blk_update_request+0x757/0x1150 [ 210.270563][C0] blk_mq_end_request+0x4b/0x480 [ 210.271680][C0] blk_done_softirq+0x21d/0x340 [ 210.272775][C0] ? _raw_spin_lock+0x81/0xd0 [ 210.273847][C0] ? blk_mq_stop_hw_queue+0x30/0x30 [ 210.275031][C0] ? _raw_read_lock_irq+0x40/0x40 [ 210.276182][C0] __do_softirq+0x190/0x611 [ 210.277203][C0] ? handle_edge_irq+0x221/0xb60 [ 210.278340][C0] asm_call_irq_on_stack+0x12/0x20 [ 210.279514][C0] [ 210.280164][C0] do_softirq_own_stack+0x37/0x40 [ 210.281281][C0] irq_exit_rcu+0x110/0x1b0 [ 210.282286][C0] common_interrupt+0x74/0x120 [ 210.283376][C0] asm_common_interrupt+0x1e/0x40 [ 210.284496][C0] RIP: 0010:_aesni_enc1+0x65/0xb0 Fix this by making crypt_convert function reentrant from the point of a single bio and make dm-crypt defer further bio processing to a workqueue, if Crypto API backlogs a request in interrupt context. Fixes: 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd workq ueues") Cc: # v5.9+ Signed-off-by: Ignat Korchagin --- drivers/md/dm-crypt.c | 102 +++--- 1 file changed, 97 insertions(+), 5 deletions(-) diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c index 777b5c71a2f7..6df907bd6c7c 100644 --- a/drivers/md/dm-crypt.c +++ b/drivers/md/dm-crypt.c @@ -1529,13 +1529,19 @@ static void crypt_free_req(struct crypt_config *cc, void *req, struct bio *base_ * Encrypt / decrypt data from one bio to another one (can be the same one) */ static blk_status_t crypt_convert(struct crypt_config *cc, -struct convert_context *ctx, bool atomic) +struct convert_context *ctx, bool atomic, bool reset_pending) { unsigned int tag_offset = 0; unsigned int sector_step = cc->sector_size >> SECTOR_SHIFT; int r; - atomic_set(&ctx->cc_pending, 1); + /* +* if reset_pending is set we are dealing with the bio for the first time, +* else we're continuing to work on the previous bio, so don't mess with +* the cc_pending counter +*/ + if (reset_pending) + atomic_set(&ctx->cc_pending, 1); while (ctx->iter_in.bi_size && ctx->iter_out.bi_size) { @@ -1553,7 +1559,24 @@ static blk_status_t crypt_convert(struct crypt_config *cc, * but the driver request queue is full, let's wait. */ case -EBUSY: - wait_for_completion(&ctx->restart); + if (in_interrupt()) { + if (try_wait_for_completion(&ctx->restart)) { +
Re: [PATCH 1/2] dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq
On Wed, Dec 30, 2020 at 7:36 AM Hillf Danton wrote: > > On Tue, 29 Dec 2020 22:57:13 + > > > > Fixes: 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd > > workqueues") > > Looks like a seperate fix to this commit is needed if what can be found > at (Subject: [patch 00/12] UBS: Cleanup in_interupt/in_irq/in_atomic() usage) > https://lore.kernel.org/lkml/20201014145215.518912...@linutronix.de/ I think the above request should be satisfied by device mapper core code itself rather than individual DM module implementations, as the execution context in the module is dependent on the underlying block driver: some block drivers complete requests in task contexts and some in interrupt - but the underlying block drivers should be transparent to the modules. The device mapper core code can pass context information to the modules if we are to avoid in_*irq() marcos in the code. > is correct. > > > Reported-by: Maciej S. Szmigiero > > Cc: # v5.9+ > > Signed-off-by: Ignat Korchagin > > --- > > drivers/md/dm-crypt.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c > > index 5f9f9b3a226d..777b5c71a2f7 100644 > > --- a/drivers/md/dm-crypt.c > > +++ b/drivers/md/dm-crypt.c > > @@ -1460,7 +1460,7 @@ static void crypt_alloc_req_skcipher(struct > > crypt_config *cc, > > unsigned key_index = ctx->cc_sector & (cc->tfms_count - 1); > > > > if (!ctx->r.req) > > - ctx->r.req = mempool_alloc(&cc->req_pool, GFP_NOIO); > > + ctx->r.req = mempool_alloc(&cc->req_pool, in_interrupt() ? > > GFP_ATOMIC : GFP_NOIO); > > > > skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]); > > > > @@ -1477,7 +1477,7 @@ static void crypt_alloc_req_aead(struct crypt_config > > *cc, > >struct convert_context *ctx) > > { > > if (!ctx->r.req_aead) > > - ctx->r.req_aead = mempool_alloc(&cc->req_pool, GFP_NOIO); > > + ctx->r.req_aead = mempool_alloc(&cc->req_pool, in_interrupt() > > ? GFP_ATOMIC : GFP_NOIO); > > > > aead_request_set_tfm(ctx->r.req_aead, cc->cipher_tfm.tfms_aead[0]); > > > > -- > > 2.20.1
[PATCH v2 1/2] dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq
/0xe50 [ 210.194313][C0] ? blk_queue_enter+0x6d0/0x6d0 [ 210.195372][C0] ? __bio_add_page+0x246/0x3d0 [ 210.196418][C0] ? bio_iov_iter_get_pages+0x7dd/0xbe0 [ 210.197611][C0] ? submit_bio+0xe2/0x460 [ 210.198481][C0] ? submit_bio_noacct+0xe50/0xe50 [ 210.199496][C0] ? free_unref_page_commit.constprop.0+0x130/0x330 [ 210.200825][C0] ? __blkdev_direct_IO_simple+0x43b/0x7e0 [ 210.202050][C0] ? bd_link_disk_holder+0x690/0x690 [ 210.203239][C0] ? put_pages_list+0x210/0x210 [ 210.204341][C0] ? scan_shadow_nodes+0xb0/0xb0 [ 210.205472][C0] ? _raw_write_lock_irqsave+0xe0/0xe0 [ 210.206698][C0] ? bd_may_claim+0xc0/0xc0 [ 210.207715][C0] ? zero_user_segments.constprop.0+0x2e0/0x2e0 [ 210.209092][C0] ? blkdev_direct_IO+0xd16/0x1020 [ 210.210200][C0] ? pagevec_lookup_range_tag+0x28/0x60 [ 210.211416][C0] ? __filemap_fdatawait_range+0xc4/0x1f0 [ 210.212669][C0] ? page_cache_next_miss+0x1e0/0x1e0 [ 210.213842][C0] ? generic_file_buffered_read+0x520/0x9e0 [ 210.215128][C0] ? delete_from_page_cache_batch+0x850/0x850 [ 210.216470][C0] ? bd_abort_claiming+0xd0/0xd0 [ 210.217531][C0] ? file_remove_privs+0x74/0x430 [ 210.218589][C0] ? filemap_check_errors+0x50/0xe0 [ 210.219705][C0] ? generic_file_direct_write+0x1a3/0x480 [ 210.220979][C0] ? __generic_file_write_iter+0x1d9/0x530 [ 210.38][C0] ? blkdev_write_iter+0x20d/0x3e0 [ 210.223328][C0] ? bd_unlink_disk_holder+0x360/0x360 [ 210.224464][C0] ? new_sync_write+0x37b/0x620 [ 210.225511][C0] ? new_sync_read+0x610/0x610 [ 210.226539][C0] ? _cond_resched+0x17/0x80 [ 210.227539][C0] ? inode_security+0x58/0x100 [ 210.228582][C0] ? security_file_permission+0x54/0x450 [ 210.229796][C0] ? vfs_write+0x524/0x770 [ 210.230758][C0] ? __x64_sys_pwrite64+0x197/0x1f0 [ 210.231890][C0] ? vfs_write+0x770/0x770 [ 210.232869][C0] ? do_syscall_64+0x33/0x40 [ 210.233839][C0] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fix this by allocating crypto requests with GFP_ATOMIC mask in interrupt context Fixes: 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd workqueues") Reported-by: Maciej S. Szmigiero Cc: # v5.9+ Signed-off-by: Ignat Korchagin --- drivers/md/dm-crypt.c | 33 +++-- 1 file changed, 23 insertions(+), 10 deletions(-) diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c index 53791138d78b..e4fd690c70e1 100644 --- a/drivers/md/dm-crypt.c +++ b/drivers/md/dm-crypt.c @@ -1454,13 +1454,16 @@ static int crypt_convert_block_skcipher(struct crypt_config *cc, static void kcryptd_async_done(struct crypto_async_request *async_req, int error); -static void crypt_alloc_req_skcipher(struct crypt_config *cc, +static int crypt_alloc_req_skcipher(struct crypt_config *cc, struct convert_context *ctx) { unsigned key_index = ctx->cc_sector & (cc->tfms_count - 1); - if (!ctx->r.req) - ctx->r.req = mempool_alloc(&cc->req_pool, GFP_NOIO); + if (!ctx->r.req) { + ctx->r.req = mempool_alloc(&cc->req_pool, in_interrupt() ? GFP_ATOMIC : GFP_NOIO); + if (!ctx->r.req) + return -ENOMEM; + } skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]); @@ -1471,13 +1474,18 @@ static void crypt_alloc_req_skcipher(struct crypt_config *cc, skcipher_request_set_callback(ctx->r.req, CRYPTO_TFM_REQ_MAY_BACKLOG, kcryptd_async_done, dmreq_of_req(cc, ctx->r.req)); + + return 0; } -static void crypt_alloc_req_aead(struct crypt_config *cc, +static int crypt_alloc_req_aead(struct crypt_config *cc, struct convert_context *ctx) { - if (!ctx->r.req_aead) - ctx->r.req_aead = mempool_alloc(&cc->req_pool, GFP_NOIO); + if (!ctx->r.req) { + ctx->r.req = mempool_alloc(&cc->req_pool, in_interrupt() ? GFP_ATOMIC : GFP_NOIO); + if (!ctx->r.req) + return -ENOMEM; + } aead_request_set_tfm(ctx->r.req_aead, cc->cipher_tfm.tfms_aead[0]); @@ -1488,15 +1496,17 @@ static void crypt_alloc_req_aead(struct crypt_config *cc, aead_request_set_callback(ctx->r.req_aead, CRYPTO_TFM_REQ_MAY_BACKLOG, kcryptd_async_done, dmreq_of_req(cc, ctx->r.req_aead)); + + return 0; } -static void crypt_alloc_req(struct crypt_config *cc, +static int crypt_alloc_req(struct crypt_config *cc, struct convert_context *ctx) { if (crypt_integrity_aead(cc)) - crypt_alloc_req_aead(cc, ctx); + return crypt_alloc_req_aead(cc, ctx); else - cr
[PATCH v2 2/2] dm crypt: do not wait for backlogged crypto request completion in softirq
Commit 39d42fa96ba1b7d2544db3f8ed5da8fb0d5cb877 made it possible for some code paths in dm-crypt to be executed in softirq context, when the underlying driver processes IO requests in interrupt/softirq context. When Crypto API backlogs a crypto request, dm-crypt uses wait_for_completion to avoid sending further requests to an already overloaded crypto driver. However, if the code is executing in softirq context, we might get the following stacktrace: [ 210.235213][C0] BUG: scheduling while atomic: fio/2602/0x0102 [ 210.236701][C0] Modules linked in: [ 210.237566][C0] CPU: 0 PID: 2602 Comm: fio Tainted: GW 5.10.0+ #50 [ 210.239292][C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 [ 210.241233][C0] Call Trace: [ 210.241946][C0] [ 210.242561][C0] dump_stack+0x7d/0xa3 [ 210.243466][C0] __schedule_bug.cold+0xb3/0xc2 [ 210.244539][C0] __schedule+0x156f/0x20d0 [ 210.245518][C0] ? io_schedule_timeout+0x140/0x140 [ 210.246660][C0] schedule+0xd0/0x270 [ 210.247541][C0] schedule_timeout+0x1fb/0x280 [ 210.248586][C0] ? usleep_range+0x150/0x150 [ 210.249624][C0] ? unpoison_range+0x3a/0x60 [ 210.250632][C0] ? kasan_kmalloc.constprop.0+0x82/0xa0 [ 210.251949][C0] ? unpoison_range+0x3a/0x60 [ 210.252958][C0] ? __prepare_to_swait+0xa7/0x190 [ 210.254067][C0] do_wait_for_common+0x2ab/0x370 [ 210.255158][C0] ? usleep_range+0x150/0x150 [ 210.256192][C0] ? bit_wait_io_timeout+0x160/0x160 [ 210.257358][C0] ? blk_update_request+0x757/0x1150 [ 210.258582][C0] ? _raw_spin_lock_irq+0x82/0xd0 [ 210.259674][C0] ? _raw_read_unlock_irqrestore+0x30/0x30 [ 210.260917][C0] wait_for_completion+0x4c/0x90 [ 210.261971][C0] crypt_convert+0x19a6/0x4c00 [ 210.263033][C0] ? _raw_spin_lock_irqsave+0x87/0xe0 [ 210.264193][C0] ? kasan_set_track+0x1c/0x30 [ 210.265191][C0] ? crypt_iv_tcw_ctr+0x4a0/0x4a0 [ 210.266283][C0] ? kmem_cache_free+0x104/0x470 [ 210.267363][C0] ? crypt_endio+0x91/0x180 [ 210.268327][C0] kcryptd_crypt_read_convert+0x30e/0x420 [ 210.269565][C0] blk_update_request+0x757/0x1150 [ 210.270563][C0] blk_mq_end_request+0x4b/0x480 [ 210.271680][C0] blk_done_softirq+0x21d/0x340 [ 210.272775][C0] ? _raw_spin_lock+0x81/0xd0 [ 210.273847][C0] ? blk_mq_stop_hw_queue+0x30/0x30 [ 210.275031][C0] ? _raw_read_lock_irq+0x40/0x40 [ 210.276182][C0] __do_softirq+0x190/0x611 [ 210.277203][C0] ? handle_edge_irq+0x221/0xb60 [ 210.278340][C0] asm_call_irq_on_stack+0x12/0x20 [ 210.279514][C0] [ 210.280164][C0] do_softirq_own_stack+0x37/0x40 [ 210.281281][C0] irq_exit_rcu+0x110/0x1b0 [ 210.282286][C0] common_interrupt+0x74/0x120 [ 210.283376][C0] asm_common_interrupt+0x1e/0x40 [ 210.284496][C0] RIP: 0010:_aesni_enc1+0x65/0xb0 Fix this by making crypt_convert function reentrant from the point of a single bio and make dm-crypt defer further bio processing to a workqueue, if Crypto API backlogs a request in interrupt context. Fixes: 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd workq ueues") Cc: # v5.9+ Signed-off-by: Ignat Korchagin --- drivers/md/dm-crypt.c | 102 +++--- 1 file changed, 97 insertions(+), 5 deletions(-) diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c index e4fd690c70e1..0d939fbe51af 100644 --- a/drivers/md/dm-crypt.c +++ b/drivers/md/dm-crypt.c @@ -1539,13 +1539,19 @@ static void crypt_free_req(struct crypt_config *cc, void *req, struct bio *base_ * Encrypt / decrypt data from one bio to another one (can be the same one) */ static blk_status_t crypt_convert(struct crypt_config *cc, -struct convert_context *ctx, bool atomic) +struct convert_context *ctx, bool atomic, bool reset_pending) { unsigned int tag_offset = 0; unsigned int sector_step = cc->sector_size >> SECTOR_SHIFT; int r; - atomic_set(&ctx->cc_pending, 1); + /* +* if reset_pending is set we are dealing with the bio for the first time, +* else we're continuing to work on the previous bio, so don't mess with +* the cc_pending counter +*/ + if (reset_pending) + atomic_set(&ctx->cc_pending, 1); while (ctx->iter_in.bi_size && ctx->iter_out.bi_size) { @@ -1566,7 +1572,24 @@ static blk_status_t crypt_convert(struct crypt_config *cc, * but the driver request queue is full, let's wait. */ case -EBUSY: - wait_for_completion(&ctx->restart); + if (in_interrupt()) { + if (try_wait_for_completion(&ctx->restart)) { +
[PATCH v2 0/2] dm crypt: some fixes to support dm-crypt running in softirq context
Changes from v1: 0001: Handle memory allocation failure for GFP_ATOMIC Ignat Korchagin (2): dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq dm crypt: do not wait for backlogged crypto request completion in softirq drivers/md/dm-crypt.c | 135 +- 1 file changed, 120 insertions(+), 15 deletions(-) -- 2.20.1
Re: [PATCH 1/2] dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq
Yes, good call! Reposted with allocation failure handling. Thanks, Ignat On Wed, Dec 30, 2020 at 6:11 PM Mikulas Patocka wrote: > > Hi > > This patch doesn't handle allocation failure gracefully. > > Mikulas > > > > On Tue, 29 Dec 2020, Ignat Korchagin wrote: > > > Commit 39d42fa96ba1b7d2544db3f8ed5da8fb0d5cb877 made it possible for some > > code > > paths in dm-crypt to be executed in softirq context, when the underlying > > driver > > processes IO requests in interrupt/softirq context. > > > > In this case sometimes when allocating a new crypto request we may get a > > stacktrace like below: > > > > [ 210.103008][C0] BUG: sleeping function called from invalid context > > at mm/mempool.c:381 > > [ 210.104746][C0] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, > > pid: 2602, name: fio > > [ 210.106599][C0] CPU: 0 PID: 2602 Comm: fio Tainted: GW > > 5.10.0+ #50 > > [ 210.108331][C0] Hardware name: QEMU Standard PC (i440FX + PIIX, > > 1996), BIOS 0.0.0 02/06/2015 > > [ 210.110212][C0] Call Trace: > > [ 210.110921][C0] > > [ 210.111527][C0] dump_stack+0x7d/0xa3 > > [ 210.112411][C0] ___might_sleep.cold+0x122/0x151 > > [ 210.113527][C0] mempool_alloc+0x16b/0x2f0 > > [ 210.114524][C0] ? __queue_work+0x515/0xde0 > > [ 210.115553][C0] ? mempool_resize+0x700/0x700 > > [ 210.116586][C0] ? crypt_endio+0x91/0x180 > > [ 210.117479][C0] ? blk_update_request+0x757/0x1150 > > [ 210.118513][C0] ? blk_mq_end_request+0x4b/0x480 > > [ 210.119572][C0] ? blk_done_softirq+0x21d/0x340 > > [ 210.120628][C0] ? __do_softirq+0x190/0x611 > > [ 210.121626][C0] crypt_convert+0x29f9/0x4c00 > > [ 210.122668][C0] ? _raw_spin_lock_irqsave+0x87/0xe0 > > [ 210.123824][C0] ? kasan_set_track+0x1c/0x30 > > [ 210.124858][C0] ? crypt_iv_tcw_ctr+0x4a0/0x4a0 > > [ 210.125930][C0] ? kmem_cache_free+0x104/0x470 > > [ 210.126973][C0] ? crypt_endio+0x91/0x180 > > [ 210.127947][C0] kcryptd_crypt_read_convert+0x30e/0x420 > > [ 210.129165][C0] blk_update_request+0x757/0x1150 > > [ 210.130231][C0] blk_mq_end_request+0x4b/0x480 > > [ 210.131294][C0] blk_done_softirq+0x21d/0x340 > > [ 210.132332][C0] ? _raw_spin_lock+0x81/0xd0 > > [ 210.133289][C0] ? blk_mq_stop_hw_queue+0x30/0x30 > > [ 210.134399][C0] ? _raw_read_lock_irq+0x40/0x40 > > [ 210.135458][C0] __do_softirq+0x190/0x611 > > [ 210.136409][C0] ? handle_edge_irq+0x221/0xb60 > > [ 210.137447][C0] asm_call_irq_on_stack+0x12/0x20 > > [ 210.138507][C0] > > [ 210.139118][C0] do_softirq_own_stack+0x37/0x40 > > [ 210.140191][C0] irq_exit_rcu+0x110/0x1b0 > > [ 210.141151][C0] common_interrupt+0x74/0x120 > > [ 210.142171][C0] asm_common_interrupt+0x1e/0x40 > > [ 210.143206][C0] RIP: 0010:_aesni_enc1+0x65/0xb0 > > [ 210.144313][C0] Code: 38 dc c2 41 0f 28 52 d0 66 0f 38 dc c2 41 0f > > 28 52 e0 66 0f 38 dc c2 41 0f 28 52 f0 66 0f 38 dc c2 41 0f 28 12 66 0f 38 > > dc c2 <41> 0f 28 52 10 66 0f 38 dc c2 41 0f 28 52 20 66 0f 38 dc c2 41 0f > > [ 210.148542][C0] RSP: 0018:88810dbe6db0 EFLAGS: 0286 > > [ 210.149842][C0] RAX: 9a90cdc0 RBX: RCX: > > 0200 > > [ 210.151576][C0] RDX: 888101e5f240 RSI: 888101e5f240 RDI: > > 888b5020 > > [ 210.153339][C0] RBP: 88810dbe6e20 R08: R09: > > 0020 > > [ 210.155063][C0] R10: 888b5090 R11: 111021b7cdcc R12: > > 9e87cd40 > > [ 210.156791][C0] R13: 888b5210 R14: 888101e5f0d8 R15: > > > > [ 210.158497][C0] ? aesni_set_key+0x1e0/0x1e0 > > [ 210.159523][C0] ? aesni_enc+0xf/0x20 > > [ 210.160408][C0] ? glue_xts_req_128bit+0x181/0x6f0 > > [ 210.161571][C0] ? aesni_set_key+0x1e0/0x1e0 > > [ 210.162560][C0] ? glue_ctr_req_128bit+0x630/0x630 > > [ 210.163706][C0] ? kasan_save_stack+0x37/0x50 > > [ 210.164761][C0] ? kasan_save_stack+0x20/0x50 > > [ 210.165786][C0] ? get_page_from_freelist+0x2052/0x36a0 > > [ 210.167024][C0] ? __blkdev_direct_IO_simple+0x43b/0x7e0 > > [ 210.168288][C0] ? blkdev_direct_IO+0xd16/0x1020 > > [ 210.169396][C0] ? generic_file_direct_write+0x1a3/0x480 > > [ 210.170648][C0] ? __generic_file_write_iter+0x1d9/0x530 > > [ 210.171882][C0] ? bl
Re: dm-crypt with no_read_workqueue and no_write_workqueue + btrfs scrub = BUG()
On Wed, Dec 23, 2020 at 3:37 PM Maciej S. Szmigiero wrote: > > On 14.12.2020 19:11, Maciej S. Szmigiero wrote: > > Hi, > > > > I hit a reproducible BUG() when scrubbing a btrfs fs on top of > > a dm-crypt device with no_read_workqueue and no_write_workqueue > > flags enabled. > > Still happens on the current torvalds/master. > > Due to this bug it is not possible to use btrfs on top of > a dm-crypt device with no_read_workqueue and no_write_workqueue > flags enabled. > > @Ignat: > Can you have a look at this as the person who added these flags? I've been looking into this for the last couple of days because of other reports [1]. Just finished testing a possible solution. Will submit soon. > It looks like to me that the skcipher API might not be safe to > call from a softirq context, after all. It is less about skcipher API and more about how dm-crypt uses it as well as some assumptions that it is always running in context which can sleep. > Maciej Ignat [1]: https://github.com/cloudflare/linux/issues/1
Re: dm-crypt with no_read_workqueue and no_write_workqueue + btrfs scrub = BUG()
On Wed, Dec 23, 2020 at 9:20 PM Maciej S. Szmigiero wrote: > > On 23.12.2020 22:09, Ignat Korchagin wrote: > (..) > > I've been looking into this for the last couple of days because of > > other reports [1]. > > Just finished testing a possible solution. Will submit soon. > > Thanks for looking into it. > > By the way, on a bare metal I am actually hitting a different problem > (scheduling while atomic) when scrubbing a btrfs filesystem, just as one > of your users from that GitHub report had [1]. That is because dm-crypt calls "wait_for_completion" in rare cases when Crypto API (cryptd) backlogs the decryption request. I've reproduced that one as well (even with no FS). We never hit these problems in the original testing probably due to the fact we mostly used xtsproxy custom crypto module, which is totally synchronous. I did test it later with standard crypto, but did not encounter these problems as well most likely because it is also depends which storage driver underneath we are using: most of them do not submit read requests to dm-crypt in irq/softirq context in the first place > I've pasted that backtrace in my original Dec 14 message. > > Thanks, > Maciej > > [1]: https://github.com/cloudflare/linux/issues/1#issuecomment-736734243 Regards, Ignat
Re: dm-crypt with no_read_workqueue and no_write_workqueue + btrfs scrub = BUG()
On Wed, Dec 23, 2020 at 8:57 PM Herbert Xu wrote: > > On Wed, Dec 23, 2020 at 04:37:34PM +0100, Maciej S. Szmigiero wrote: > > > > It looks like to me that the skcipher API might not be safe to > > call from a softirq context, after all. > > skcipher is safe to use in a softirq. The problem is only in > dm-crypt where it tries to allocate memory with GFP_NOIO. Hm.. After eliminating the GFP_NOIO (as well as some other sleeping paths) from dm-crypt softirq code I still hit an occasional crash in my extreme setup (QEMU with 1 CPU and cryptd_max_cpu_qlen set to 1) (decoded with stacktrace_decode.sh): [ 89.324723] BUG: kernel NULL pointer dereference, address: 0008 [ 89.325713] #PF: supervisor write access in kernel mode [ 89.326460] #PF: error_code(0x0002) - not-present page [ 89.327211] PGD 0 P4D 0 [ 89.327589] Oops: 0002 [#1] PREEMPT SMP PTI [ 89.328200] CPU: 0 PID: 21 Comm: kworker/0:1 Not tainted 5.10.0+ #79 [ 89.329109] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 [ 89.330284] Workqueue: cryptd cryptd_queue_worker [ 89.330999] RIP: 0010:crypto_dequeue_request (/cfsetup_build/./include/linux/list.h:112 /cfsetup_build/./include/linux/list.h:135 /cfsetup_build/./include/linux/list.h:146 /cfsetup_build/crypto/algapi.c:957) [ 89.331757] Code: e9 c9 d0 a8 48 c7 c7 f9 c9 d0 a8 e8 c2 88 fe ff 4c 8b 23 48 c7 c6 e9 c9 d0 a8 48 c7 c7 f9 c9 d0 a8 49 8b 14 24 49 8b 44 24 08 <48> 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 49 89 04 24 48 All code 0: e9 c9 d0 a8 48jmpq 0x48a8d0ce 5: c7 c7 f9 c9 d0 a8mov$0xa8d0c9f9,%edi b: e8 c2 88 fe ffcallq 0xfffe88d2 10: 4c 8b 23 mov(%rbx),%r12 13: 48 c7 c6 e9 c9 d0 a8 mov$0xa8d0c9e9,%rsi 1a: 48 c7 c7 f9 c9 d0 a8 mov$0xa8d0c9f9,%rdi 21: 49 8b 14 24 mov(%r12),%rdx 25: 49 8b 44 24 08mov0x8(%r12),%rax 2a:* 48 89 42 08 mov%rax,0x8(%rdx) <-- trapping instruction 2e: 48 89 10 mov%rdx,(%rax) 31: 48 b8 00 01 00 00 00 movabs $0xdead0100,%rax 38: 00 ad de 3b: 49 89 04 24 mov%rax,(%r12) 3f: 48rex.W Code starting with the faulting instruction === 0: 48 89 42 08 mov%rax,0x8(%rdx) 4: 48 89 10 mov%rdx,(%rax) 7: 48 b8 00 01 00 00 00 movabs $0xdead0100,%rax e: 00 ad de 11: 49 89 04 24 mov%rax,(%r12) 15: 48rex.W [ 89.334414] RSP: 0018:ba64c00bbe68 EFLAGS: 00010246 [ 89.335165] RAX: RBX: 9b9d6fc28d88 RCX: [ 89.336182] RDX: RSI: a8d0c9e9 RDI: a8d0c9f9 [ 89.337204] RBP: R08: a906e708 R09: 0058 [ 89.338208] R10: a9068720 R11: fc00 R12: 9b9a43797478 [ 89.339216] R13: 0020 R14: 9b9d6fc28e00 R15: [ 89.340231] FS: () GS:9b9d6fc0() knlGS: [ 89.341376] CS: 0010 DS: ES: CR0: 80050033 [ 89.342207] CR2: 0008 CR3: 00014cd76002 CR4: 00170ef0 [ 89.343238] Call Trace: [ 89.343609] cryptd_queue_worker (/cfsetup_build/crypto/cryptd.c:172) [ 89.344218] process_one_work (/cfsetup_build/./arch/x86/include/asm/preempt.h:26 /cfsetup_build/kernel/workqueue.c:2284) [ 89.344821] ? rescuer_thread (/cfsetup_build/kernel/workqueue.c:2364) [ 89.345399] worker_thread (/cfsetup_build/./include/linux/list.h:282 /cfsetup_build/kernel/workqueue.c:2422) [ 89.345923] ? rescuer_thread (/cfsetup_build/kernel/workqueue.c:2364) [ 89.346504] kthread (/cfsetup_build/kernel/kthread.c:292) [ 89.346986] ? kthread_create_worker_on_cpu (/cfsetup_build/kernel/kthread.c:245) [ 89.347713] ret_from_fork (/cfsetup_build/arch/x86/entry/entry_64.S:302) [ 89.348255] Modules linked in: [ 89.348708] CR2: 0008 [ 89.349197] ---[ end trace b7e9618b4122ed3b ]--- [ 89.349863] RIP: 0010:crypto_dequeue_request (/cfsetup_build/./include/linux/list.h:112 /cfsetup_build/./include/linux/list.h:135 /cfsetup_build/./include/linux/list.h:146 /cfsetup_build/crypto/algapi.c:957) [ 89.350606] Code: e9 c9 d0 a8 48 c7 c7 f9 c9 d0 a8 e8 c2 88 fe ff 4c 8b 23 48 c7 c6 e9 c9 d0 a8 48 c7 c7 f9 c9 d0 a8 49 8b 14 24 49 8b 44 24 08 <48> 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 49 89 04 24 48 All code 0: e9 c9 d0 a8 48jmpq 0x48a8d0ce 5: c7 c7 f9 c9 d0 a8mov$0xa8d0c9f9,%edi b: e8 c2 88 fe ffcallq 0xfffe88d2 10: 4c 8b 23 mov(%rbx),%r12 13: 48 c7 c6 e9 c9 d0 a8 mov$0xa8d0c9e9,%rsi 1a: 48 c7 c7 f9 c9 d0 a8 mov$0xa8d0c9f9,%rdi 21: 49 8b 14 24 mov(%r12),%rdx 25: 49 8b 44 24 08mov0x8(%r12),%rax 2a:* 48 89 42 08 mov%rax,0x8(%rdx) <--
Re: [PATCH v2 1/2] dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq
On Fri, Jan 1, 2021 at 10:00 AM Mikulas Patocka wrote: > > > > On Wed, 30 Dec 2020, Ignat Korchagin wrote: > > > diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c > > index 53791138d78b..e4fd690c70e1 100644 > > --- a/drivers/md/dm-crypt.c > > +++ b/drivers/md/dm-crypt.c > > @@ -1539,7 +1549,10 @@ static blk_status_t crypt_convert(struct > > crypt_config *cc, > > > > while (ctx->iter_in.bi_size && ctx->iter_out.bi_size) { > > > > - crypt_alloc_req(cc, ctx); > > + r = crypt_alloc_req(cc, ctx); > > + if (r) > > + return BLK_STS_RESOURCE; > > + > > atomic_inc(&ctx->cc_pending); > > > > if (crypt_integrity_aead(cc)) > > -- > > 2.20.1 > > I'm not quite convinced that returning BLK_STS_RESOURCE will help. The > block layer will convert this value back to -ENOMEM and return it to the > caller, resulting in an I/O error. > > Note that GFP_ATOMIC allocations may fail anytime and you must handle > allocation failure gracefully - i.e. process the request without any > error. > > An acceptable solution would be to punt the request to a workqueue and do > GFP_NOIO allocation from the workqueue. Or add the request to some list > and process the list when some other request completes. We can do the workqueue, if that's the desired behaviour. The second patch from this patchset already adds the code for the request to be retried from the workqueue if crypt_convert returns BLK_STS_DEV_RESOURCE, so all is needed is to change returning BLK_STS_RESOURCE to BLK_STS_DEV_RESOURCE here. Does that sound reasonable? > You should write a test that simulates allocation failure and verify that > the kernel handles it gracefully without any I/O error. I already have the test for the second patch in the set, but will adapt it to make sure the allocation failure codepath is covered as well. > Mikulas >
[PATCH v3 1/2] dm crypt: do not wait for backlogged crypto request completion in softirq
Commit 39d42fa96ba1b7d2544db3f8ed5da8fb0d5cb877 made it possible for some code paths in dm-crypt to be executed in softirq context, when the underlying driver processes IO requests in interrupt/softirq context. When Crypto API backlogs a crypto request, dm-crypt uses wait_for_completion to avoid sending further requests to an already overloaded crypto driver. However, if the code is executing in softirq context, we might get the following stacktrace: [ 210.235213][C0] BUG: scheduling while atomic: fio/2602/0x0102 [ 210.236701][C0] Modules linked in: [ 210.237566][C0] CPU: 0 PID: 2602 Comm: fio Tainted: GW 5.10.0+ #50 [ 210.239292][C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 [ 210.241233][C0] Call Trace: [ 210.241946][C0] [ 210.242561][C0] dump_stack+0x7d/0xa3 [ 210.243466][C0] __schedule_bug.cold+0xb3/0xc2 [ 210.244539][C0] __schedule+0x156f/0x20d0 [ 210.245518][C0] ? io_schedule_timeout+0x140/0x140 [ 210.246660][C0] schedule+0xd0/0x270 [ 210.247541][C0] schedule_timeout+0x1fb/0x280 [ 210.248586][C0] ? usleep_range+0x150/0x150 [ 210.249624][C0] ? unpoison_range+0x3a/0x60 [ 210.250632][C0] ? kasan_kmalloc.constprop.0+0x82/0xa0 [ 210.251949][C0] ? unpoison_range+0x3a/0x60 [ 210.252958][C0] ? __prepare_to_swait+0xa7/0x190 [ 210.254067][C0] do_wait_for_common+0x2ab/0x370 [ 210.255158][C0] ? usleep_range+0x150/0x150 [ 210.256192][C0] ? bit_wait_io_timeout+0x160/0x160 [ 210.257358][C0] ? blk_update_request+0x757/0x1150 [ 210.258582][C0] ? _raw_spin_lock_irq+0x82/0xd0 [ 210.259674][C0] ? _raw_read_unlock_irqrestore+0x30/0x30 [ 210.260917][C0] wait_for_completion+0x4c/0x90 [ 210.261971][C0] crypt_convert+0x19a6/0x4c00 [ 210.263033][C0] ? _raw_spin_lock_irqsave+0x87/0xe0 [ 210.264193][C0] ? kasan_set_track+0x1c/0x30 [ 210.265191][C0] ? crypt_iv_tcw_ctr+0x4a0/0x4a0 [ 210.266283][C0] ? kmem_cache_free+0x104/0x470 [ 210.267363][C0] ? crypt_endio+0x91/0x180 [ 210.268327][C0] kcryptd_crypt_read_convert+0x30e/0x420 [ 210.269565][C0] blk_update_request+0x757/0x1150 [ 210.270563][C0] blk_mq_end_request+0x4b/0x480 [ 210.271680][C0] blk_done_softirq+0x21d/0x340 [ 210.272775][C0] ? _raw_spin_lock+0x81/0xd0 [ 210.273847][C0] ? blk_mq_stop_hw_queue+0x30/0x30 [ 210.275031][C0] ? _raw_read_lock_irq+0x40/0x40 [ 210.276182][C0] __do_softirq+0x190/0x611 [ 210.277203][C0] ? handle_edge_irq+0x221/0xb60 [ 210.278340][C0] asm_call_irq_on_stack+0x12/0x20 [ 210.279514][C0] [ 210.280164][C0] do_softirq_own_stack+0x37/0x40 [ 210.281281][C0] irq_exit_rcu+0x110/0x1b0 [ 210.282286][C0] common_interrupt+0x74/0x120 [ 210.283376][C0] asm_common_interrupt+0x1e/0x40 [ 210.284496][C0] RIP: 0010:_aesni_enc1+0x65/0xb0 Fix this by making crypt_convert function reentrant from the point of a single bio and make dm-crypt defer further bio processing to a workqueue, if Crypto API backlogs a request in interrupt context. Fixes: 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd workq ueues") Cc: # v5.9+ Signed-off-by: Ignat Korchagin --- drivers/md/dm-crypt.c | 103 -- 1 file changed, 98 insertions(+), 5 deletions(-) diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c index 392337f16ecf..a3326dadfd4d 100644 --- a/drivers/md/dm-crypt.c +++ b/drivers/md/dm-crypt.c @@ -1529,13 +1529,19 @@ static void crypt_free_req(struct crypt_config *cc, void *req, struct bio *base_ * Encrypt / decrypt data from one bio to another one (can be the same one) */ static blk_status_t crypt_convert(struct crypt_config *cc, -struct convert_context *ctx, bool atomic) +struct convert_context *ctx, bool atomic, bool reset_pending) { unsigned int tag_offset = 0; unsigned int sector_step = cc->sector_size >> SECTOR_SHIFT; int r; - atomic_set(&ctx->cc_pending, 1); + /* +* if reset_pending is set we are dealing with the bio for the first time, +* else we're continuing to work on the previous bio, so don't mess with +* the cc_pending counter +*/ + if (reset_pending) + atomic_set(&ctx->cc_pending, 1); while (ctx->iter_in.bi_size && ctx->iter_out.bi_size) { @@ -1553,7 +1559,25 @@ static blk_status_t crypt_convert(struct crypt_config *cc, * but the driver request queue is full, let's wait. */ case -EBUSY: - wait_for_completion(&ctx->restart); + if (in_interrupt()) { + if (try_wait_for_completion(&ctx->restart)) { +
[PATCH v3 0/2] dm crypt: some fixes to support dm-crypt running in softirq context
Changes from v1: * 0001: handle memory allocation failure for GFP_ATOMIC Changes from v2: * reordered patches * 0002: crypt_convert will be retried from a workqueue, when a crypto request allocation fails with GFP_ATOMIC instead of just returning an IO error to the user Ignat Korchagin (2): dm crypt: do not wait for backlogged crypto request completion in softirq dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq drivers/md/dm-crypt.c | 138 +- 1 file changed, 123 insertions(+), 15 deletions(-) -- 2.20.1
[PATCH v3 2/2] dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq
/0xe50 [ 210.194313][C0] ? blk_queue_enter+0x6d0/0x6d0 [ 210.195372][C0] ? __bio_add_page+0x246/0x3d0 [ 210.196418][C0] ? bio_iov_iter_get_pages+0x7dd/0xbe0 [ 210.197611][C0] ? submit_bio+0xe2/0x460 [ 210.198481][C0] ? submit_bio_noacct+0xe50/0xe50 [ 210.199496][C0] ? free_unref_page_commit.constprop.0+0x130/0x330 [ 210.200825][C0] ? __blkdev_direct_IO_simple+0x43b/0x7e0 [ 210.202050][C0] ? bd_link_disk_holder+0x690/0x690 [ 210.203239][C0] ? put_pages_list+0x210/0x210 [ 210.204341][C0] ? scan_shadow_nodes+0xb0/0xb0 [ 210.205472][C0] ? _raw_write_lock_irqsave+0xe0/0xe0 [ 210.206698][C0] ? bd_may_claim+0xc0/0xc0 [ 210.207715][C0] ? zero_user_segments.constprop.0+0x2e0/0x2e0 [ 210.209092][C0] ? blkdev_direct_IO+0xd16/0x1020 [ 210.210200][C0] ? pagevec_lookup_range_tag+0x28/0x60 [ 210.211416][C0] ? __filemap_fdatawait_range+0xc4/0x1f0 [ 210.212669][C0] ? page_cache_next_miss+0x1e0/0x1e0 [ 210.213842][C0] ? generic_file_buffered_read+0x520/0x9e0 [ 210.215128][C0] ? delete_from_page_cache_batch+0x850/0x850 [ 210.216470][C0] ? bd_abort_claiming+0xd0/0xd0 [ 210.217531][C0] ? file_remove_privs+0x74/0x430 [ 210.218589][C0] ? filemap_check_errors+0x50/0xe0 [ 210.219705][C0] ? generic_file_direct_write+0x1a3/0x480 [ 210.220979][C0] ? __generic_file_write_iter+0x1d9/0x530 [ 210.38][C0] ? blkdev_write_iter+0x20d/0x3e0 [ 210.223328][C0] ? bd_unlink_disk_holder+0x360/0x360 [ 210.224464][C0] ? new_sync_write+0x37b/0x620 [ 210.225511][C0] ? new_sync_read+0x610/0x610 [ 210.226539][C0] ? _cond_resched+0x17/0x80 [ 210.227539][C0] ? inode_security+0x58/0x100 [ 210.228582][C0] ? security_file_permission+0x54/0x450 [ 210.229796][C0] ? vfs_write+0x524/0x770 [ 210.230758][C0] ? __x64_sys_pwrite64+0x197/0x1f0 [ 210.231890][C0] ? vfs_write+0x770/0x770 [ 210.232869][C0] ? do_syscall_64+0x33/0x40 [ 210.233839][C0] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fix this by allocating crypto requests with GFP_ATOMIC mask in interrupt context Fixes: 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd workqueues") Reported-by: Maciej S. Szmigiero Cc: # v5.9+ Signed-off-by: Ignat Korchagin --- drivers/md/dm-crypt.c | 35 +-- 1 file changed, 25 insertions(+), 10 deletions(-) diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c index a3326dadfd4d..1f471dd75144 100644 --- a/drivers/md/dm-crypt.c +++ b/drivers/md/dm-crypt.c @@ -1454,13 +1454,16 @@ static int crypt_convert_block_skcipher(struct crypt_config *cc, static void kcryptd_async_done(struct crypto_async_request *async_req, int error); -static void crypt_alloc_req_skcipher(struct crypt_config *cc, +static int crypt_alloc_req_skcipher(struct crypt_config *cc, struct convert_context *ctx) { unsigned key_index = ctx->cc_sector & (cc->tfms_count - 1); - if (!ctx->r.req) - ctx->r.req = mempool_alloc(&cc->req_pool, GFP_NOIO); + if (!ctx->r.req) { + ctx->r.req = mempool_alloc(&cc->req_pool, in_interrupt() ? GFP_ATOMIC : GFP_NOIO); + if (!ctx->r.req) + return -ENOMEM; + } skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]); @@ -1471,13 +1474,18 @@ static void crypt_alloc_req_skcipher(struct crypt_config *cc, skcipher_request_set_callback(ctx->r.req, CRYPTO_TFM_REQ_MAY_BACKLOG, kcryptd_async_done, dmreq_of_req(cc, ctx->r.req)); + + return 0; } -static void crypt_alloc_req_aead(struct crypt_config *cc, +static int crypt_alloc_req_aead(struct crypt_config *cc, struct convert_context *ctx) { - if (!ctx->r.req_aead) - ctx->r.req_aead = mempool_alloc(&cc->req_pool, GFP_NOIO); + if (!ctx->r.req) { + ctx->r.req = mempool_alloc(&cc->req_pool, in_interrupt() ? GFP_ATOMIC : GFP_NOIO); + if (!ctx->r.req) + return -ENOMEM; + } aead_request_set_tfm(ctx->r.req_aead, cc->cipher_tfm.tfms_aead[0]); @@ -1488,15 +1496,17 @@ static void crypt_alloc_req_aead(struct crypt_config *cc, aead_request_set_callback(ctx->r.req_aead, CRYPTO_TFM_REQ_MAY_BACKLOG, kcryptd_async_done, dmreq_of_req(cc, ctx->r.req_aead)); + + return 0; } -static void crypt_alloc_req(struct crypt_config *cc, +static int crypt_alloc_req(struct crypt_config *cc, struct convert_context *ctx) { if (crypt_integrity_aead(cc)) - crypt_alloc_req_aead(cc, ctx); + return crypt_alloc_req_aead(cc, ctx); else - cr
Re: Linux 4.19 and GCC 9
Hi Greg, > > For us it seems applying the following 4 mainline patches makes 4.19.x > > branch perf compile with GCC-9: > > > > 4d0f16d059ddb91424480d88473f7392f24aebdc: perf ui helpline: Use > > strlcpy() as a shorter form of strncpy() + explicit set nul > > b6313899f4ed2e76b8375cf8069556f5b94fbff0: perf help: Remove needless > > use of strncpy() > > 5192bde7d98c99f2cd80225649e3c2e7493722f7: perf header: Fix unchecked > > usage of strncpy() > > 97acec7df172cd1e450f81f5e293c0aa145a2797: perf data: Fix 'strncat may > > truncate' build failure with recent gcc > > > > I also checked that 4.19.49 compiles fine with GCC 9, although with a > > lot of warnings, mostly from objtool, like "warning: objtool: > > sock_register()+0xd: sibling call from callable instruction with > > modified stack frame". But it's a start. > > > > Can we apply the above-mentioned patches, please? > I'll look into these after the next round of kernels are released. I Did you by any chance forget to queue these patches? :) (the build is still broken for GCC 9.1) > guess I'll go find a distro that has gcc9 on it to actually test > things... BTW, Arch already has GCC 9.1 package, so no need to compile your own anymore for testing: https://www.archlinux.org/packages/core/x86_64/gcc/ Regards, Ignat
Re: Linux 4.19 and GCC 9
Hi Greg, For us it seems applying the following 4 mainline patches makes 4.19.x branch perf compile with GCC-9: 4d0f16d059ddb91424480d88473f7392f24aebdc: perf ui helpline: Use strlcpy() as a shorter form of strncpy() + explicit set nul b6313899f4ed2e76b8375cf8069556f5b94fbff0: perf help: Remove needless use of strncpy() 5192bde7d98c99f2cd80225649e3c2e7493722f7: perf header: Fix unchecked usage of strncpy() 97acec7df172cd1e450f81f5e293c0aa145a2797: perf data: Fix 'strncat may truncate' build failure with recent gcc I also checked that 4.19.49 compiles fine with GCC 9, although with a lot of warnings, mostly from objtool, like "warning: objtool: sock_register()+0xd: sibling call from callable instruction with modified stack frame". But it's a start. Can we apply the above-mentioned patches, please? Regards, Ignat On Mon, Jun 10, 2019 at 8:45 AM Greg KH wrote: > > On Mon, Jun 10, 2019 at 12:21:51AM -0700, Ivan Babrou wrote: > > Looks like 4.19.49 received some patches for GCC 9+, but unfortunately > > perf still doesn't want to compile: > > > > [07:15:32]In file included from /usr/include/string.h:635, > > [07:15:32] from util/debug.h:7, > > [07:15:32] from builtin-help.c:15: > > [07:15:32]In function 'strncpy', > > [07:15:32] inlined from 'add_man_viewer' at builtin-help.c:192:2, > > [07:15:32] inlined from 'perf_help_config' at builtin-help.c:284:3: > > [07:15:32]/usr/include/x86_64-linux-gnu/bits/string3.h:126:10: error: > > '__builtin_strncpy' output truncated before terminating nul copying as > > many bytes from a string as its length [-Werror=stringop-truncation] > > [07:15:32] 126 | return __builtin___strncpy_chk (__dest, __src, __len, > > __bos (__dest)); > > [07:15:32] | ^~ > > [07:15:32]builtin-help.c: In function 'perf_help_config': > > [07:15:32]builtin-help.c:187:15: note: length computed here > > [07:15:32] 187 | size_t len = strlen(name); > > [07:15:32] | ^~~~ > > [07:15:32]cc1: all warnings being treated as errors > > > Any chance in finding a patch in Linus's tree that resolves this? I > don't have gcc9 on my systems here yet to test this. > > thanks, > > greg k-h > > -- > You received this message because you are subscribed to the Google Groups > "kernel-team" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to kernel-team+unsubscr...@cloudflare.com. > To view this discussion on the web visit > https://groups.google.com/a/cloudflare.com/d/msgid/kernel-team/20190610074510.GA24746%40kroah.com.
Re: Linux 4.19 and GCC 9
On Mon, Jun 10, 2019 at 3:49 PM Greg KH wrote: > > > > > I typically compile a bare-bones GCC for those things, it is quite quick. > > Pointers to how to do that is appreciated. It's been years since I had > to build gcc "from scratch". This is how we do it, but we use it for some other projects as well, so need ligcc and c++ support. I suspect for kernel-only there may be a more lightweight approach (for example, by dropping c++): Env: Debian Stretch (we run in a simple official docker container with build-essential and make installed) - but probably should work on any distro Assuming the sources are extracted into $(BUILDDIR)/gcc-$(VERSION) cd $(BUILDDIR)/gcc-$(VERSION) ./contrib/download_prerequisites cd .. mkdir gcc-build cd gcc-build ../gcc-$(VERSION)/configure --enable-languages=c,c++ --build=x86_64-linux-gnu --disable-multilib make -j sudo make install (or install into alternative dir and point Linux build system there) Regards, Ignat
[PATCH v2 1/3] um/kconfig: introduce CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS
For statically linked UML build it is important to take into account the standard C-library implementation. Some implementations, notably glibc have caveats: even when linked statically, the final program might require some runtime dependencies, if certain functions are used within the code. Consider the following program: int main(void) { getpwent(); return 0; } Compiling this program and linking statically with glibc produces the following warning from the linker: /usr/sbin/ld: /tmp/ccuthw1o.o: in function `main': test.c:(.text+0x5): warning: Using 'getpwent' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking We will use the flag to detect such C-library implementation build time and possibly disable static linking for UML to avoid producing a binary with unexpected behaviour and dependencies. Signed-off-by: Ignat Korchagin --- init/Kconfig | 6 ++ scripts/cc-can-link.sh | 5 +++-- 2 files changed, 9 insertions(+), 2 deletions(-) diff --git a/init/Kconfig b/init/Kconfig index a46aa8f3174d..717a3ada765e 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -57,6 +57,12 @@ config CC_CAN_LINK_STATIC default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) -static $(m64-flag)) if 64BIT default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) -static $(m32-flag)) +config CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS + bool + depends on UML && CC_CAN_LINK_STATIC + default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) -static -Xlinker --fatal-warnings $(m64-flag)) if 64BIT + default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) -static -Xlinker --fatal-warnings $(m32-flag)) + config CC_HAS_ASM_GOTO def_bool $(success,$(srctree)/scripts/gcc-goto.sh $(CC)) diff --git a/scripts/cc-can-link.sh b/scripts/cc-can-link.sh index 6efcead31989..e5011a46103e 100755 --- a/scripts/cc-can-link.sh +++ b/scripts/cc-can-link.sh @@ -2,10 +2,11 @@ # SPDX-License-Identifier: GPL-2.0 cat << "END" | $@ -x c - -o /dev/null >/dev/null 2>&1 -#include +#include +#include int main(void) { - printf(""); + getpwent(); return 0; } END -- 2.20.1
[PATCH v2 3/3] um: allow static linking for non-glibc implementations
It is possible to produce a statically linked UML binary with UML_NET_VECTOR, UML_NET_VDE and UML_NET_PCAP options enabled using alternative libc implementations, which do not rely on NSS, such as musl. Allow static linking in this case. Signed-off-by: Ignat Korchagin --- arch/um/Kconfig | 2 +- arch/um/drivers/Kconfig | 3 --- 2 files changed, 1 insertion(+), 4 deletions(-) diff --git a/arch/um/Kconfig b/arch/um/Kconfig index 9318dc6d1a0c..af7ed63f9c74 100644 --- a/arch/um/Kconfig +++ b/arch/um/Kconfig @@ -67,7 +67,7 @@ config FORBID_STATIC_LINK config STATIC_LINK bool "Force a static link" - depends on !FORBID_STATIC_LINK + depends on CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS || (!UML_NET_VECTOR && !UML_NET_VDE && !UML_NET_PCAP) help This option gives you the ability to force a static link of UML. Normally, UML is linked as a shared binary. This is inconvenient for diff --git a/arch/um/drivers/Kconfig b/arch/um/drivers/Kconfig index 9160ead56e33..72d417055782 100644 --- a/arch/um/drivers/Kconfig +++ b/arch/um/drivers/Kconfig @@ -234,7 +234,6 @@ config UML_NET_DAEMON config UML_NET_VECTOR bool "Vector I/O high performance network devices" depends on UML_NET - select FORBID_STATIC_LINK help This User-Mode Linux network driver uses multi-message send and receive functions. The host running the UML guest must have @@ -246,7 +245,6 @@ config UML_NET_VECTOR config UML_NET_VDE bool "VDE transport (obsolete)" depends on UML_NET - select FORBID_STATIC_LINK help This User-Mode Linux network transport allows one or more running UMLs on a single host to communicate with each other and also @@ -294,7 +292,6 @@ config UML_NET_MCAST config UML_NET_PCAP bool "pcap transport (obsolete)" depends on UML_NET - select FORBID_STATIC_LINK help The pcap transport makes a pcap packet stream on the host look like an ethernet device inside UML. This is useful for making -- 2.20.1
[PATCH v2 2/3] um: some fixes to build UML with musl
musl toolchain and headers are a bit more strict. These fixes enable building UML with musl as well as seem not to break on glibc. Signed-off-by: Ignat Korchagin --- arch/um/drivers/daemon_user.c | 1 + arch/um/drivers/pcap_user.c | 12 ++-- arch/um/drivers/slip_user.c | 2 +- arch/um/drivers/vector_user.c | 4 +--- arch/um/os-Linux/util.c | 2 +- arch/x86/um/user-offsets.c| 2 +- 6 files changed, 11 insertions(+), 12 deletions(-) diff --git a/arch/um/drivers/daemon_user.c b/arch/um/drivers/daemon_user.c index 3695821d06a2..785baedc3555 100644 --- a/arch/um/drivers/daemon_user.c +++ b/arch/um/drivers/daemon_user.c @@ -7,6 +7,7 @@ */ #include +#include #include #include #include diff --git a/arch/um/drivers/pcap_user.c b/arch/um/drivers/pcap_user.c index bbd20638788a..52ddda3e3b10 100644 --- a/arch/um/drivers/pcap_user.c +++ b/arch/um/drivers/pcap_user.c @@ -32,7 +32,7 @@ static int pcap_user_init(void *data, void *dev) return 0; } -static int pcap_open(void *data) +static int pcap_user_open(void *data) { struct pcap_data *pri = data; __u32 netmask; @@ -44,14 +44,14 @@ static int pcap_open(void *data) if (pri->filter != NULL) { err = dev_netmask(pri->dev, &netmask); if (err < 0) { - printk(UM_KERN_ERR "pcap_open : dev_netmask failed\n"); + printk(UM_KERN_ERR "pcap_user_open : dev_netmask failed\n"); return -EIO; } pri->compiled = uml_kmalloc(sizeof(struct bpf_program), UM_GFP_KERNEL); if (pri->compiled == NULL) { - printk(UM_KERN_ERR "pcap_open : kmalloc failed\n"); + printk(UM_KERN_ERR "pcap_user_open : kmalloc failed\n"); return -ENOMEM; } @@ -59,14 +59,14 @@ static int pcap_open(void *data) (struct bpf_program *) pri->compiled, pri->filter, pri->optimize, netmask); if (err < 0) { - printk(UM_KERN_ERR "pcap_open : pcap_compile failed - " + printk(UM_KERN_ERR "pcap_user_open : pcap_compile failed - " "'%s'\n", pcap_geterr(pri->pcap)); goto out; } err = pcap_setfilter(pri->pcap, pri->compiled); if (err < 0) { - printk(UM_KERN_ERR "pcap_open : pcap_setfilter " + printk(UM_KERN_ERR "pcap_user_open : pcap_setfilter " "failed - '%s'\n", pcap_geterr(pri->pcap)); goto out; } @@ -127,7 +127,7 @@ int pcap_user_read(int fd, void *buffer, int len, struct pcap_data *pri) const struct net_user_info pcap_user_info = { .init = pcap_user_init, - .open = pcap_open, + .open = pcap_user_open, .close = NULL, .remove = pcap_remove, .add_address= NULL, diff --git a/arch/um/drivers/slip_user.c b/arch/um/drivers/slip_user.c index 8016d32b6809..482a19c5105c 100644 --- a/arch/um/drivers/slip_user.c +++ b/arch/um/drivers/slip_user.c @@ -9,7 +9,7 @@ #include #include #include -#include +#include #include #include #include diff --git a/arch/um/drivers/vector_user.c b/arch/um/drivers/vector_user.c index c4a0f26b2824..45d4164ad355 100644 --- a/arch/um/drivers/vector_user.c +++ b/arch/um/drivers/vector_user.c @@ -18,9 +18,7 @@ #include #include #include -#include #include -#include #include #include #include @@ -332,7 +330,7 @@ static struct vector_fds *user_init_unix_fds(struct arglist *ifspec, int id) } switch (id) { case ID_BESS: - if (connect(fd, remote_addr, sizeof(struct sockaddr_un)) < 0) { + if (connect(fd, (const struct sockaddr *) remote_addr, sizeof(struct sockaddr_un)) < 0) { printk(UM_KERN_ERR "bess open:cannot connect to %s %i", remote_addr->sun_path, -errno); goto unix_cleanup; } diff --git a/arch/um/os-Linux/util.c b/arch/um/os-Linux/util.c index ecf2f390fad2..07327425d06e 100644 --- a/arch/um/os-Linux/util.c +++ b/arch/um/os-Linux/util.c @@ -10,7 +10,7 @@ #include #include #include -#include +#include #include #include #include diff --git a/arch/x86/um/user-offsets.c b/arch/x86/um/user-offsets.c index c51dd8363d25..bae61554abcc 100644 --- a/arch/x86/um/user-offsets.c +++ b/arch/x86/um/user-offsets.c @@ -2,7 +2,7 @@ #include #include #include -#include +#include #include #include #define __FRAME_OFFSETS -- 2.20.1
[PATCH v2 0/3] um: allow static linking for non-glibc libc implementations
This is a continuation of [1]. Since I was able to produce a working UML binary with UML_NET_VECTOR linked with musl with the changes included in the patches here. I was compiling on Arch Linux, so hopefully all the latest versions of the compiler, libraries and binutils. I also tested allyesconfig with both musl and glibc. The compilation succeeds with both, however both binaries (glibc one being dynamically linked) segfault on start. This is probably of some incompatible config option/module being included and not related to musl/glibc. [1]: https://patchwork.ozlabs.org/project/linux-um/patch/20200624212319.403689-1-ig...@cloudflare.com/ Ignat Korchagin (3): um/kconfig: introduce CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS um: some fixes to build UML with musl um: allow static linking for non-glibc implementations arch/um/Kconfig | 2 +- arch/um/drivers/Kconfig | 3 --- arch/um/drivers/daemon_user.c | 1 + arch/um/drivers/pcap_user.c | 12 ++-- arch/um/drivers/slip_user.c | 2 +- arch/um/drivers/vector_user.c | 4 +--- arch/um/os-Linux/util.c | 2 +- arch/x86/um/user-offsets.c| 2 +- init/Kconfig | 6 ++ scripts/cc-can-link.sh| 5 +++-- 10 files changed, 21 insertions(+), 18 deletions(-) -- 2.20.1
Re: [PATCH v2] dm crypt: add flags to optionally bypass dm-crypt workqueues
On Tue, Jun 30, 2020 at 3:51 AM Damien Le Moal wrote: > > On 2020/06/27 6:03, Ignat Korchagin wrote: > > This is a follow up from [1]. Consider the following script: > > > > sudo modprobe brd rd_nr=1 rd_size=4194304 > > > > echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | \ > > sudo dmsetup create eram0 > > > > echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 > > no_write_workqueue' | \ > > sudo dmsetup create eram0-inline-write > > > > echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 > > no_read_workqueue' | \ > > sudo dmsetup create eram0-inline-read > > > > devices="/dev/ram0 /dev/mapper/eram0 /dev/mapper/eram0-inline-read " > > devices+="/dev/mapper/eram0-inline-write" > > > > for dev in $devices; do > > echo "reading from $dev" > > sudo fio --filename=$dev --readwrite=read --bs=4k --direct=1 \ > > --loops=100 --runtime=3m --name=plain | grep READ > > done > > > > for dev in $devices; do > > echo "writing to $dev" > > sudo fio --filename=$dev --readwrite=write --bs=4k --direct=1 \ > > --loops=100 --runtime=3m --name=plain | grep WRITE > > done > > > > This script creates a ramdisk (to eliminate hardware bias in the benchmark) > > and > > three dm-crypt instances on top. All dm-crypt instances use the NULL cipher > > to eliminate potentially expensive crypto bias (the NULL cipher just uses > > memcpy > > for "encyption"). The first instance is the current dm-crypt implementation > > from > > 5.8-rc2, the two others have new optional flags enabled, which bypass > > kcryptd > > workqueues for reads and writes respectively and write sorting for writes. > > On > > my VM (Debian in VirtualBox with 4 cores on 2.8 GHz Quad-Core Intel Core > > i7) I > > get the following output (formatted for better readability): > > > > reading from /dev/ram0 > >READ: bw=508MiB/s (533MB/s), 508MiB/s-508MiB/s (533MB/s-533MB/s), > > io=89.3GiB (95.9GB), run=18-18msec > > > > reading from /dev/mapper/eram0 > >READ: bw=80.6MiB/s (84.5MB/s), 80.6MiB/s-80.6MiB/s (84.5MB/s-84.5MB/s), > > io=14.2GiB (15.2GB), run=18-18msec > > > > reading from /dev/mapper/eram0-inline-read > >READ: bw=295MiB/s (309MB/s), 295MiB/s-295MiB/s (309MB/s-309MB/s), > > io=51.8GiB (55.6GB), run=18-18msec > > > > reading from /dev/mapper/eram0-inline-write > >READ: bw=114MiB/s (120MB/s), 114MiB/s-114MiB/s (120MB/s-120MB/s), > > io=20.1GiB (21.5GB), run=18-18msec > > > > writing to /dev/ram0 > > WRITE: bw=516MiB/s (541MB/s), 516MiB/s-516MiB/s (541MB/s-541MB/s), > > io=90.7GiB (97.4GB), run=180001-180001msec > > > > writing to /dev/mapper/eram0 > > WRITE: bw=40.4MiB/s (42.4MB/s), 40.4MiB/s-40.4MiB/s (42.4MB/s-42.4MB/s), > > io=7271MiB (7624MB), run=180001-180001msec > > > > writing to /dev/mapper/eram0-inline-read > > WRITE: bw=38.9MiB/s (40.8MB/s), 38.9MiB/s-38.9MiB/s (40.8MB/s-40.8MB/s), > > io=7000MiB (7340MB), run=180001-180001msec > > > > writing to /dev/mapper/eram0-inline-write > > WRITE: bw=277MiB/s (290MB/s), 277MiB/s-277MiB/s (290MB/s-290MB/s), > > io=48.6GiB (52.2GB), run=18-18msec > > > > Current dm-crypt implementation creates a significant IO performance > > overhead > > (at least on small IO block sizes) for both latency and throughput. We > > suspect > > offloading IO request processing into workqueues and async threads is more > > harmful these days with the modern fast storage. I also did some digging > > into > > the dm-crypt git history and much of this async processing is not needed > > anymore, because the reasons it was added are mostly gone from the kernel. > > More > > details can be found in [2] (see "Git archeology" section). > > > > This change adds no_(read|write)_workqueue flags separately for read and > > write > > BIOs, which direct dm-crypt not to offload crypto operations into kcryptd > > workqueues and process everything inline. In addition, writes are not > > buffered > > to be sorted in the dm-crypt red-black tree, but dispatched immediately. For > > cases, where crypto operations cannot happen inline (hard interrupt context, > > for example the read path of some NVME drivers), we offload the work to a > > tasklet rather than a workqueue. > > > > These flags ensure inline BIO processing in the
Re: [dm-crypt] [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
On Wed, Jun 24, 2020 at 5:24 PM Eric Biggers wrote: > > On Wed, Jun 24, 2020 at 09:24:07AM +0100, Ignat Korchagin wrote: > > On Wed, Jun 24, 2020 at 6:04 AM Eric Biggers wrote: > > > > > > On Fri, Jun 19, 2020 at 05:41:32PM +0100, Ignat Korchagin wrote: > > > > Sometimes extra thread offloading imposed by dm-crypt hurts IO latency. > > > > This is > > > > especially visible on busy systems with many processes/threads. > > > > Moreover, most > > > > Crypto API implementaions are async, that is they offload crypto > > > > operations on > > > > their own, so this dm-crypt offloading is excessive. > > > > > > This really should say "some Crypto API implementations are async" > > > instead of > > > "most Crypto API implementations are async". > > > > The most accurate would probably be: most hardware-accelerated Crypto > > API implementations are async > > > > > Notably, the AES-NI implementation of AES-XTS is synchronous if you call > > > it in a > > > context where SIMD instructions are usable. It's only asynchronous when > > > SIMD is > > > not usable. (This seems to have been missed in your blog post.) > > > > No, it was not. This is exactly why we made xts-proxy Crypto API > > module as a second patch. But it seems now it does not make a big > > difference if a used Crypto API implementation is synchronous as well > > (based on some benchmarks outlined in the cover letter to this patch). > > I think the v2 of this patch will not require a synchronous Crypto > > API. This is probably a right thing to do, as the "inline" flag should > > control the way how dm-crypt itself handles requests, not how Crypto > > API handles requests. If a user wants to ensure a particular > > synchronous Crypto API implementation, they can already reconfigure > > dm-crypt and specify the implementation with a "capi:" prefix in the > > the dm table description. > > I think you're missing the point. Although xts-aes-aesni has the > CRYPTO_ALG_ASYNC bit set, the actual implementation processes the request > synchronously if SIMD instructions are currently usable. That's always the > case > in dm-crypt, as far as I can tell. This algorithm has the ASYNC flag only > because it's not synchronous when called in hardIRQ context. > > That's why your "xts-proxy" doesn't make a difference, and why it's misleading > to suggest that the crypto API is doing its own queueing when you're primarily > talking about xts-aes-aesni. The crypto API definitely can do its own > queueing, > mainly with hardware drivers. But it doesn't in this common and relevant > case. I think we're talking about the same things but from different points of view. I would like to clarify that the whole post and this change does not have the intention to focus on aesni (or any x86-specific crypto optimizations). In fact it is quite the opposite: we want to optimize the generic dm-crypt regardless of which crypto is used (that's why I just used a NULL cipher in the cover letter). We also have some arm64 machines [1] and I bet they would benefit here as well. The important point my post tries to make is that the original workqueue offloading in dm-crypt was added because the Crypto API was synchronous back in the day and, exactly as you say, you may not be able to use some hw-accelerated crypto in hard IRQ context as well as doing non-hw crypto in hard IRQ context is a bad idea. Now, most Crypto API are smart enough to figure out on their own if they should process the request inline or offload it to a workqueue, so the workarounds in the dm-crypt itself most likely are not needed. Though, the generic Crypto API "cipher walk" function does refuse to "walk" the buffers in hard IRQ context, so the "tasklet" functionality is still required. But from the dm-crypt perspective - it should not take into account if a particular xts-aes-aesni implementation is MOSTLY synchronous - those are details of the implementation of this particular cipher dm-crypt has no visibility into. So it would be right to say in my opinion if the cipher has the CRYPTO_ALG_ASYNC flag set - it can offload the crypto request to a workqueue at any time. How often does it do it - that's another story and probably should be reviewed elsewhere, if it does it too often. Ignat [1]: https://blog.cloudflare.com/arm-takes-wing/ > - Eric
[RFC PATCH] Revert "um: Make CONFIG_STATIC_LINK actually static"
This reverts commit 3363179385629c1804ea846f4e72608c2201a81e. This change is too restrictive. I've been running UML statically linked kernel with UML_NET_VECTOR networking in a docker "FROM: scratch" container just fine. As long as we don't reference network peers by hostname and use only IP addresses, NSS is not needed, so not used. In other words, it is possible to have statically linked UML and UML_NET_VECTOR (and other networking types) and use it, although with some restrictions, so let's not disable it. Additionally, it should be at least theoretically possible to use another libc (like musl, bionic etc) for static linking. I was able with some hacks to compile UML against musl, although the executable segfaults for now. But this option prevents even the research to be done. Signed-off-by: Ignat Korchagin --- arch/um/Kconfig | 8 +--- arch/um/drivers/Kconfig | 3 --- 2 files changed, 1 insertion(+), 10 deletions(-) diff --git a/arch/um/Kconfig b/arch/um/Kconfig index 96ab7026b037..817a4c838a06 100644 --- a/arch/um/Kconfig +++ b/arch/um/Kconfig @@ -62,12 +62,9 @@ config NR_CPUS source "arch/$(HEADER_ARCH)/um/Kconfig" -config FORBID_STATIC_LINK - bool - config STATIC_LINK bool "Force a static link" - depends on !FORBID_STATIC_LINK + default n help This option gives you the ability to force a static link of UML. Normally, UML is linked as a shared binary. This is inconvenient for @@ -76,9 +73,6 @@ config STATIC_LINK Additionally, this option enables using higher memory spaces (up to 2.75G) for UML. - NOTE: This option is incompatible with some networking features which - depend on features that require being dynamically loaded (like NSS). - config LD_SCRIPT_STATIC bool default y diff --git a/arch/um/drivers/Kconfig b/arch/um/drivers/Kconfig index 9160ead56e33..72d417055782 100644 --- a/arch/um/drivers/Kconfig +++ b/arch/um/drivers/Kconfig @@ -234,7 +234,6 @@ config UML_NET_DAEMON config UML_NET_VECTOR bool "Vector I/O high performance network devices" depends on UML_NET - select FORBID_STATIC_LINK help This User-Mode Linux network driver uses multi-message send and receive functions. The host running the UML guest must have @@ -246,7 +245,6 @@ config UML_NET_VECTOR config UML_NET_VDE bool "VDE transport (obsolete)" depends on UML_NET - select FORBID_STATIC_LINK help This User-Mode Linux network transport allows one or more running UMLs on a single host to communicate with each other and also @@ -294,7 +292,6 @@ config UML_NET_MCAST config UML_NET_PCAP bool "pcap transport (obsolete)" depends on UML_NET - select FORBID_STATIC_LINK help The pcap transport makes a pcap packet stream on the host look like an ethernet device inside UML. This is useful for making -- 2.20.1
Re: [RFC PATCH 0/1] dm-crypt excessive overhead
Do you think it may be better to break it in two flags: one for read path and one for write? So, depending on the needs and workflow these could be enabled independently? Regards, Ignat On Tue, Jun 23, 2020 at 4:01 PM Mike Snitzer wrote: > > On Sun, Jun 21 2020 at 8:45pm -0400, > Damien Le Moal wrote: > > > On 2020/06/20 1:56, Mike Snitzer wrote: > > > On Fri, Jun 19 2020 at 12:41pm -0400, > > > Ignat Korchagin wrote: > > > > > >> This is a follow up from the long-forgotten [1], but with some more > > >> convincing > > >> evidence. Consider the following script: > > >> > > >> #!/bin/bash -e > > >> > > >> # create 4G ramdisk > > >> sudo modprobe brd rd_nr=1 rd_size=4194304 > > >> > > >> # create a dm-crypt device with NULL cipher on top of /dev/ram0 > > >> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | sudo > > >> dmsetup create eram0 > > >> > > >> # create a dm-crypt device with NULL cipher and custom force_inline flag > > >> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 > > >> force_inline' | sudo dmsetup create inline-eram0 > > >> > > >> # read all data from /dev/ram0 > > >> sudo dd if=/dev/ram0 bs=4k iflag=direct | sha256sum > > >> > > >> # read the same data from /dev/mapper/eram0 > > >> sudo dd if=/dev/mapper/eram0 bs=4k iflag=direct | sha256sum > > >> > > >> # read the same data from /dev/mapper/inline-eram0 > > >> sudo dd if=/dev/mapper/inline-eram0 bs=4k iflag=direct | sha256sum > > >> > > >> This script creates a ramdisk (to eliminate hardware bias in the > > >> benchmark) and > > >> two dm-crypt instances on top. Both dm-crypt instances use the NULL > > >> cipher > > >> to eliminate potentially expensive crypto bias (the NULL cipher just > > >> uses memcpy > > >> for "encyption"). The first instance is the current dm-crypt > > >> implementation from > > >> 5.8-rc1, the second is the dm-crypt instance with a custom new flag > > >> enabled from > > >> the patch attached to this thread. On my VM (Debian in VirtualBox with 4 > > >> cores > > >> on 2.8 GHz Quad-Core Intel Core i7) I get the following output > > >> (formatted for > > >> better readability): > > >> > > >> # plain ram0 > > >> 1048576+0 records in > > >> 1048576+0 records out > > >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.2305 s, 202 MB/s > > >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca - > > >> > > >> # eram0 (current dm-crypt) > > >> 1048576+0 records in > > >> 1048576+0 records out > > >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 53.2212 s, 80.7 MB/s > > >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca - > > >> > > >> # inline-eram0 (patched dm-crypt) > > >> 1048576+0 records in > > >> 1048576+0 records out > > >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.3472 s, 201 MB/s > > >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca - > > >> > > >> As we can see, current dm-crypt implementation creates a significant IO > > >> performance overhead (at least on small IO block sizes) for both latency > > >> and > > >> throughput. We suspect offloading IO request processing into workqueues > > >> and > > >> async threads is more harmful these days with the modern fast storage. I > > >> also > > >> did some digging into the dm-crypt git history and much of this async > > >> processing > > >> is not needed anymore, because the reasons it was added are mostly gone > > >> from the > > >> kernel. More details can be found in [2] (see "Git archeology" section). > > >> > > >> We have been running the attached patch on different hardware > > >> generations in > > >> more than 200 datacentres on both SATA SSDs and NVME SSDs and so far > > >> were very > > >> happy with the performance benefits. > > >> > > >> [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html > > >> [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/ > > >> > > >> Ignat Korchagin
Re: [RFC PATCH 0/1] dm-crypt excessive overhead
On Tue, Jun 23, 2020 at 4:34 PM Mike Snitzer wrote: > > On Fri, Jun 19 2020 at 9:23pm -0400, > Herbert Xu wrote: > > > On Fri, Jun 19, 2020 at 02:39:39PM -0400, Mikulas Patocka wrote: > > > > > > I'm looking at this and I'd like to know why does the crypto API fail in > > > hard-irq context and why does it work in tasklet context. What's the exact > > > reason behind this? > > > > You're not supposed to do any real work in IRQ handlers. All > > the substantial work should be postponed to softirq context. > > > > Why do you need to do work in hard IRQ context? > > Ignat, think you may have missed Herbert's question? > > My understanding is that you're doing work in hard IRQ context (via > tasklet) precisely to avoid overhead of putting to a workqueue? Did > you try using a workqueue and it didn't work adequately? If so, do you > have a handle on why that is? E.g. was it due to increased latency? or > IO completion occurring on different cpus that submitted (are you > leaning heavily on blk-mq's ability to pin IO completion to same cpu as > IO was submitted?) > > I'm fishing here but I'd just like to tease out the details for _why_ > you need to do work from hard IRQ via tasklet so that I can potentially > defend it if needed. I may be misunderstanding the terminology, but tasklets execute in soft IRQ, don't they? What we care about is to execute the decryption as fast as possible, but we can't do it in a hard IRQ context (that is, the interrupt context where other interrupts are disabled). As far as I understand, tasklets are executed right after the hard IRQ context, but with interrupts enabled - which is the first safe-ish place to do more lengthy processing without the risk of missing an interrupt. Workqueues instead of tasklets - is the way how it is mostly implemented now. But that creates additional IO latency, most probably due to the fact that we're introducing CPU scheduling latency into the overall read IO latency. This is confirmed by the fact that our busier production systems have much worse and more important - spiky and unstable p99 read latency, which somewhat correlates to high CPU scheduling latency reported by metrics. So, by inlining crypto or using a tasklet we're effectively prioritising IO encryption/decryption. What we want to avoid is mixing unpredicted additional latency from an unrelated subsystem (CPU scheduler), because our expectation is that the total latency should be real HW io latency + crypto operation latency (which is usually quite stable). I hope this makes sense. > > Thanks, > Mike >
Re: [PATCH v2 2/3] um: some fixes to build UML with musl
On Tue, Jul 14, 2020 at 9:40 AM Anton Ivanov wrote: > > > On 04/07/2020 09:52, Ignat Korchagin wrote: > > musl toolchain and headers are a bit more strict. These fixes enable > > building > > UML with musl as well as seem not to break on glibc. > > > > Signed-off-by: Ignat Korchagin > > --- > > arch/um/drivers/daemon_user.c | 1 + > > arch/um/drivers/pcap_user.c | 12 ++-- > > arch/um/drivers/slip_user.c | 2 +- > > arch/um/drivers/vector_user.c | 4 +--- > > arch/um/os-Linux/util.c | 2 +- > > arch/x86/um/user-offsets.c| 2 +- > > 6 files changed, 11 insertions(+), 12 deletions(-) > > > > diff --git a/arch/um/drivers/daemon_user.c b/arch/um/drivers/daemon_user.c > > index 3695821d06a2..785baedc3555 100644 > > --- a/arch/um/drivers/daemon_user.c > > +++ b/arch/um/drivers/daemon_user.c > > @@ -7,6 +7,7 @@ > >*/ > > > > #include > > +#include > > #include > > #include > > #include > > diff --git a/arch/um/drivers/pcap_user.c b/arch/um/drivers/pcap_user.c > > index bbd20638788a..52ddda3e3b10 100644 > > --- a/arch/um/drivers/pcap_user.c > > +++ b/arch/um/drivers/pcap_user.c > > @@ -32,7 +32,7 @@ static int pcap_user_init(void *data, void *dev) > > return 0; > > } > > > > -static int pcap_open(void *data) > > +static int pcap_user_open(void *data) > > This change in the function name was introduced on purpose to avoid name > clash in some version of libpcap which export pcap_open Yes > > > > { > > struct pcap_data *pri = data; > > __u32 netmask; > > @@ -44,14 +44,14 @@ static int pcap_open(void *data) > > if (pri->filter != NULL) { > > err = dev_netmask(pri->dev, &netmask); > > if (err < 0) { > > - printk(UM_KERN_ERR "pcap_open : dev_netmask > > failed\n"); > > + printk(UM_KERN_ERR "pcap_user_open : dev_netmask > > failed\n"); > > return -EIO; > > } > > > > pri->compiled = uml_kmalloc(sizeof(struct bpf_program), > > UM_GFP_KERNEL); > > if (pri->compiled == NULL) { > > - printk(UM_KERN_ERR "pcap_open : kmalloc failed\n"); > > + printk(UM_KERN_ERR "pcap_user_open : kmalloc > > failed\n"); > > return -ENOMEM; > > } > > > > @@ -59,14 +59,14 @@ static int pcap_open(void *data) > > (struct bpf_program *) pri->compiled, > > pri->filter, pri->optimize, netmask); > > if (err < 0) { > > - printk(UM_KERN_ERR "pcap_open : pcap_compile failed - > > " > > + printk(UM_KERN_ERR "pcap_user_open : pcap_compile > > failed - " > > "'%s'\n", pcap_geterr(pri->pcap)); > > goto out; > > } > > > > err = pcap_setfilter(pri->pcap, pri->compiled); > > if (err < 0) { > > - printk(UM_KERN_ERR "pcap_open : pcap_setfilter " > > + printk(UM_KERN_ERR "pcap_user_open : pcap_setfilter " > > "failed - '%s'\n", pcap_geterr(pri->pcap)); > > goto out; > > } > > @@ -127,7 +127,7 @@ int pcap_user_read(int fd, void *buffer, int len, > > struct pcap_data *pri) > > > > const struct net_user_info pcap_user_info = { > > .init = pcap_user_init, > > - .open = pcap_open, > > + .open = pcap_user_open, > > .close = NULL, > > .remove = pcap_remove, > > .add_address= NULL, > > diff --git a/arch/um/drivers/slip_user.c b/arch/um/drivers/slip_user.c > > index 8016d32b6809..482a19c5105c 100644 > > --- a/arch/um/drivers/slip_user.c > > +++ b/arch/um/drivers/slip_user.c > > @@ -9,7 +9,7 @@ > > #include > > #include > > #include > > -#include > > +#include > > #include > > #include > > #include > > diff --git a/arch/um/drivers/vector_user.c b/arch/um/drivers/vector_user.c > > index c4a0f26b2824..45d4164ad355 100644 > > ---
Re: [PATCH v2 3/3] um: allow static linking for non-glibc implementations
On Wed, Jul 15, 2020 at 9:44 AM Brendan Higgins wrote: > > On Sat, Jul 4, 2020 at 1:52 AM Ignat Korchagin wrote: > > > > It is possible to produce a statically linked UML binary with > > UML_NET_VECTOR, > > UML_NET_VDE and UML_NET_PCAP options enabled using alternative libc > > implementations, which do not rely on NSS, such as musl. > > > > Allow static linking in this case. > > > > Signed-off-by: Ignat Korchagin > > One minor issue below. Other than that: > > Reviewed-by: Brendan Higgins > > > --- > > arch/um/Kconfig | 2 +- > > arch/um/drivers/Kconfig | 3 --- > > 2 files changed, 1 insertion(+), 4 deletions(-) > > > > diff --git a/arch/um/Kconfig b/arch/um/Kconfig > > index 9318dc6d1a0c..af7ed63f9c74 100644 > > --- a/arch/um/Kconfig > > +++ b/arch/um/Kconfig > > @@ -67,7 +67,7 @@ config FORBID_STATIC_LINK > > Doesn't look like FORBID_STATIC_LINK is used anymore, so you should > probably drop it as well. Right, good catch! I will repost the series with this adjusted as well cc Masahiro Yamada as mentioned for +1 on the changes to can-link.sh script. Seems I also need to rebase now to accommodate changes in b816b3db15f68690ee72a4a414624f8e82942b25 ("kbuild: fix CONFIG_CC_CAN_LINK(_STATIC) for cross-compilation with Clang") > With the preceding changes, in this patchset, you can revert my patch > like you did in the RFC - or not, your choice. I am not offended by > people reverting my commits. I just don't like it when people break > allyesconfig. :-) > > > config STATIC_LINK > > bool "Force a static link" > > - depends on !FORBID_STATIC_LINK > > + depends on CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS || (!UML_NET_VECTOR > > && !UML_NET_VDE && !UML_NET_PCAP) > > help > > This option gives you the ability to force a static link of UML. > > Normally, UML is linked as a shared binary. This is inconvenient > > for > > diff --git a/arch/um/drivers/Kconfig b/arch/um/drivers/Kconfig > > index 9160ead56e33..72d417055782 100644 > > --- a/arch/um/drivers/Kconfig > > +++ b/arch/um/drivers/Kconfig > > @@ -234,7 +234,6 @@ config UML_NET_DAEMON > > config UML_NET_VECTOR > > bool "Vector I/O high performance network devices" > > depends on UML_NET > > - select FORBID_STATIC_LINK > > help > > This User-Mode Linux network driver uses multi-message send > > and receive functions. The host running the UML guest must have > > @@ -246,7 +245,6 @@ config UML_NET_VECTOR > > config UML_NET_VDE > > bool "VDE transport (obsolete)" > > depends on UML_NET > > - select FORBID_STATIC_LINK > > help > > This User-Mode Linux network transport allows one or more running > > UMLs on a single host to communicate with each other and also > > @@ -294,7 +292,6 @@ config UML_NET_MCAST > > config UML_NET_PCAP > > bool "pcap transport (obsolete)" > > depends on UML_NET > > - select FORBID_STATIC_LINK > > help > > The pcap transport makes a pcap packet stream on the host look > > like an ethernet device inside UML. This is useful for making > > -- > > 2.20.1 > >
[PATCH v3 0/3] um: allow static linking for non-glibc libc implementations
Changes from v2: * rebased to accomodate b816b3db15f68690ee72a4a414624f8e82942b25 ("kbuild: fix CONFIG_CC_CAN_LINK(_STATIC) for cross-compilation with Clang") * removed unused FORBID_STATIC_LINK config option * cc-ed Masahiro Yamada for can-link.sh approve This is a continuation of [1]. Since I was able to produce a working UML binary with UML_NET_VECTOR linked with musl with the changes included in the patches here. I was compiling on Arch Linux, so hopefully all the latest versions of the compiler, libraries and binutils. I also tested allyesconfig with both musl and glibc. The compilation succeeds with both, however both binaries (glibc one being dynamically linked) segfault on start. This is probably of some incompatible config option/module being included and not related to musl/glibc. [1]: https://patchwork.ozlabs.org/project/linux-um/patch/20200624212319.403689-1-ig...@cloudflare.com/ Ignat Korchagin (3): um/kconfig: introduce CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS um: some fixes to build UML with musl um: allow static linking for non-glibc implementations arch/um/Kconfig | 5 + arch/um/drivers/Kconfig | 3 --- arch/um/drivers/daemon_user.c | 1 + arch/um/drivers/pcap_user.c | 12 ++-- arch/um/drivers/slip_user.c | 2 +- arch/um/drivers/vector_user.c | 4 +--- arch/um/os-Linux/util.c | 2 +- arch/x86/um/user-offsets.c| 2 +- init/Kconfig | 6 ++ scripts/cc-can-link.sh| 5 +++-- 10 files changed, 21 insertions(+), 21 deletions(-) -- 2.20.1
[PATCH v3 2/3] um: some fixes to build UML with musl
musl toolchain and headers are a bit more strict. These fixes enable building UML with musl as well as seem not to break on glibc. Signed-off-by: Ignat Korchagin Tested-by: Brendan Higgins --- arch/um/drivers/daemon_user.c | 1 + arch/um/drivers/pcap_user.c | 12 ++-- arch/um/drivers/slip_user.c | 2 +- arch/um/drivers/vector_user.c | 4 +--- arch/um/os-Linux/util.c | 2 +- arch/x86/um/user-offsets.c| 2 +- 6 files changed, 11 insertions(+), 12 deletions(-) diff --git a/arch/um/drivers/daemon_user.c b/arch/um/drivers/daemon_user.c index 3695821d06a2..785baedc3555 100644 --- a/arch/um/drivers/daemon_user.c +++ b/arch/um/drivers/daemon_user.c @@ -7,6 +7,7 @@ */ #include +#include #include #include #include diff --git a/arch/um/drivers/pcap_user.c b/arch/um/drivers/pcap_user.c index bbd20638788a..52ddda3e3b10 100644 --- a/arch/um/drivers/pcap_user.c +++ b/arch/um/drivers/pcap_user.c @@ -32,7 +32,7 @@ static int pcap_user_init(void *data, void *dev) return 0; } -static int pcap_open(void *data) +static int pcap_user_open(void *data) { struct pcap_data *pri = data; __u32 netmask; @@ -44,14 +44,14 @@ static int pcap_open(void *data) if (pri->filter != NULL) { err = dev_netmask(pri->dev, &netmask); if (err < 0) { - printk(UM_KERN_ERR "pcap_open : dev_netmask failed\n"); + printk(UM_KERN_ERR "pcap_user_open : dev_netmask failed\n"); return -EIO; } pri->compiled = uml_kmalloc(sizeof(struct bpf_program), UM_GFP_KERNEL); if (pri->compiled == NULL) { - printk(UM_KERN_ERR "pcap_open : kmalloc failed\n"); + printk(UM_KERN_ERR "pcap_user_open : kmalloc failed\n"); return -ENOMEM; } @@ -59,14 +59,14 @@ static int pcap_open(void *data) (struct bpf_program *) pri->compiled, pri->filter, pri->optimize, netmask); if (err < 0) { - printk(UM_KERN_ERR "pcap_open : pcap_compile failed - " + printk(UM_KERN_ERR "pcap_user_open : pcap_compile failed - " "'%s'\n", pcap_geterr(pri->pcap)); goto out; } err = pcap_setfilter(pri->pcap, pri->compiled); if (err < 0) { - printk(UM_KERN_ERR "pcap_open : pcap_setfilter " + printk(UM_KERN_ERR "pcap_user_open : pcap_setfilter " "failed - '%s'\n", pcap_geterr(pri->pcap)); goto out; } @@ -127,7 +127,7 @@ int pcap_user_read(int fd, void *buffer, int len, struct pcap_data *pri) const struct net_user_info pcap_user_info = { .init = pcap_user_init, - .open = pcap_open, + .open = pcap_user_open, .close = NULL, .remove = pcap_remove, .add_address= NULL, diff --git a/arch/um/drivers/slip_user.c b/arch/um/drivers/slip_user.c index 8016d32b6809..482a19c5105c 100644 --- a/arch/um/drivers/slip_user.c +++ b/arch/um/drivers/slip_user.c @@ -9,7 +9,7 @@ #include #include #include -#include +#include #include #include #include diff --git a/arch/um/drivers/vector_user.c b/arch/um/drivers/vector_user.c index c4a0f26b2824..45d4164ad355 100644 --- a/arch/um/drivers/vector_user.c +++ b/arch/um/drivers/vector_user.c @@ -18,9 +18,7 @@ #include #include #include -#include #include -#include #include #include #include @@ -332,7 +330,7 @@ static struct vector_fds *user_init_unix_fds(struct arglist *ifspec, int id) } switch (id) { case ID_BESS: - if (connect(fd, remote_addr, sizeof(struct sockaddr_un)) < 0) { + if (connect(fd, (const struct sockaddr *) remote_addr, sizeof(struct sockaddr_un)) < 0) { printk(UM_KERN_ERR "bess open:cannot connect to %s %i", remote_addr->sun_path, -errno); goto unix_cleanup; } diff --git a/arch/um/os-Linux/util.c b/arch/um/os-Linux/util.c index ecf2f390fad2..07327425d06e 100644 --- a/arch/um/os-Linux/util.c +++ b/arch/um/os-Linux/util.c @@ -10,7 +10,7 @@ #include #include #include -#include +#include #include #include #include diff --git a/arch/x86/um/user-offsets.c b/arch/x86/um/user-offsets.c index c51dd8363d25..bae61554abcc 100644 --- a/arch/x86/um/user-offsets.c +++ b/arch/x86/um/user-offsets.c @@ -2,7 +2,7 @@ #include #include #include -#include +#include #include #include #define __FRAME_OFFSETS -- 2.20.1
[PATCH v3 3/3] um: allow static linking for non-glibc implementations
It is possible to produce a statically linked UML binary with UML_NET_VECTOR, UML_NET_VDE and UML_NET_PCAP options enabled using alternative libc implementations, which do not rely on NSS, such as musl. Allow static linking in this case. Signed-off-by: Ignat Korchagin Reviewed-by: Brendan Higgins Tested-by: Brendan Higgins --- arch/um/Kconfig | 5 + arch/um/drivers/Kconfig | 3 --- 2 files changed, 1 insertion(+), 7 deletions(-) diff --git a/arch/um/Kconfig b/arch/um/Kconfig index 9318dc6d1a0c..beb98b3b9f75 100644 --- a/arch/um/Kconfig +++ b/arch/um/Kconfig @@ -62,12 +62,9 @@ config NR_CPUS source "arch/$(HEADER_ARCH)/um/Kconfig" -config FORBID_STATIC_LINK - bool - config STATIC_LINK bool "Force a static link" - depends on !FORBID_STATIC_LINK + depends on CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS || (!UML_NET_VECTOR && !UML_NET_VDE && !UML_NET_PCAP) help This option gives you the ability to force a static link of UML. Normally, UML is linked as a shared binary. This is inconvenient for diff --git a/arch/um/drivers/Kconfig b/arch/um/drivers/Kconfig index 9160ead56e33..72d417055782 100644 --- a/arch/um/drivers/Kconfig +++ b/arch/um/drivers/Kconfig @@ -234,7 +234,6 @@ config UML_NET_DAEMON config UML_NET_VECTOR bool "Vector I/O high performance network devices" depends on UML_NET - select FORBID_STATIC_LINK help This User-Mode Linux network driver uses multi-message send and receive functions. The host running the UML guest must have @@ -246,7 +245,6 @@ config UML_NET_VECTOR config UML_NET_VDE bool "VDE transport (obsolete)" depends on UML_NET - select FORBID_STATIC_LINK help This User-Mode Linux network transport allows one or more running UMLs on a single host to communicate with each other and also @@ -294,7 +292,6 @@ config UML_NET_MCAST config UML_NET_PCAP bool "pcap transport (obsolete)" depends on UML_NET - select FORBID_STATIC_LINK help The pcap transport makes a pcap packet stream on the host look like an ethernet device inside UML. This is useful for making -- 2.20.1
[PATCH v3 1/3] um/kconfig: introduce CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS
For statically linked UML build it is important to take into account the standard C-library implementation. Some implementations, notably glibc have caveats: even when linked statically, the final program might require some runtime dependencies, if certain functions are used within the code. Consider the following program: int main(void) { getpwent(); return 0; } Compiling this program and linking statically with glibc produces the following warning from the linker: /usr/sbin/ld: /tmp/ccuthw1o.o: in function `main': test.c:(.text+0x5): warning: Using 'getpwent' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking We will use the flag to detect such C-library implementation build time and possibly disable static linking for UML to avoid producing a binary with unexpected behaviour and dependencies. Signed-off-by: Ignat Korchagin Reviewed-by: Brendan Higgins Tested-by: Brendan Higgins --- init/Kconfig | 6 ++ scripts/cc-can-link.sh | 5 +++-- 2 files changed, 9 insertions(+), 2 deletions(-) diff --git a/init/Kconfig b/init/Kconfig index 0498af567f70..0a1ec56c9f33 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -57,6 +57,12 @@ config CC_CAN_LINK_STATIC default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) $(CLANG_FLAGS) $(m64-flag) -static) if 64BIT default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) $(CLANG_FLAGS) $(m32-flag) -static) +config CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS + bool + depends on UML && CC_CAN_LINK_STATIC + default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) $(CLANG_FLAGS) $(m64-flag) -static -Xlinker --fatal-warnings) if 64BIT + default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) $(CLANG_FLAGS) $(m32-flag) -static -Xlinker --fatal-warnings) + config CC_HAS_ASM_GOTO def_bool $(success,$(srctree)/scripts/gcc-goto.sh $(CC)) diff --git a/scripts/cc-can-link.sh b/scripts/cc-can-link.sh index 6efcead31989..e5011a46103e 100755 --- a/scripts/cc-can-link.sh +++ b/scripts/cc-can-link.sh @@ -2,10 +2,11 @@ # SPDX-License-Identifier: GPL-2.0 cat << "END" | $@ -x c - -o /dev/null >/dev/null 2>&1 -#include +#include +#include int main(void) { - printf(""); + getpwent(); return 0; } END -- 2.20.1
[PATCH v4 2/3] um: some fixes to build UML with musl
musl toolchain and headers are a bit more strict. These fixes enable building UML with musl as well as seem not to break on glibc. Signed-off-by: Ignat Korchagin Tested-by: Brendan Higgins --- arch/um/drivers/daemon_user.c | 1 + arch/um/drivers/pcap_user.c | 12 ++-- arch/um/drivers/slip_user.c | 2 +- arch/um/drivers/vector_user.c | 4 +--- arch/um/os-Linux/util.c | 2 +- arch/x86/um/user-offsets.c| 2 +- 6 files changed, 11 insertions(+), 12 deletions(-) diff --git a/arch/um/drivers/daemon_user.c b/arch/um/drivers/daemon_user.c index 3695821d06a2..785baedc3555 100644 --- a/arch/um/drivers/daemon_user.c +++ b/arch/um/drivers/daemon_user.c @@ -7,6 +7,7 @@ */ #include +#include #include #include #include diff --git a/arch/um/drivers/pcap_user.c b/arch/um/drivers/pcap_user.c index bbd20638788a..52ddda3e3b10 100644 --- a/arch/um/drivers/pcap_user.c +++ b/arch/um/drivers/pcap_user.c @@ -32,7 +32,7 @@ static int pcap_user_init(void *data, void *dev) return 0; } -static int pcap_open(void *data) +static int pcap_user_open(void *data) { struct pcap_data *pri = data; __u32 netmask; @@ -44,14 +44,14 @@ static int pcap_open(void *data) if (pri->filter != NULL) { err = dev_netmask(pri->dev, &netmask); if (err < 0) { - printk(UM_KERN_ERR "pcap_open : dev_netmask failed\n"); + printk(UM_KERN_ERR "pcap_user_open : dev_netmask failed\n"); return -EIO; } pri->compiled = uml_kmalloc(sizeof(struct bpf_program), UM_GFP_KERNEL); if (pri->compiled == NULL) { - printk(UM_KERN_ERR "pcap_open : kmalloc failed\n"); + printk(UM_KERN_ERR "pcap_user_open : kmalloc failed\n"); return -ENOMEM; } @@ -59,14 +59,14 @@ static int pcap_open(void *data) (struct bpf_program *) pri->compiled, pri->filter, pri->optimize, netmask); if (err < 0) { - printk(UM_KERN_ERR "pcap_open : pcap_compile failed - " + printk(UM_KERN_ERR "pcap_user_open : pcap_compile failed - " "'%s'\n", pcap_geterr(pri->pcap)); goto out; } err = pcap_setfilter(pri->pcap, pri->compiled); if (err < 0) { - printk(UM_KERN_ERR "pcap_open : pcap_setfilter " + printk(UM_KERN_ERR "pcap_user_open : pcap_setfilter " "failed - '%s'\n", pcap_geterr(pri->pcap)); goto out; } @@ -127,7 +127,7 @@ int pcap_user_read(int fd, void *buffer, int len, struct pcap_data *pri) const struct net_user_info pcap_user_info = { .init = pcap_user_init, - .open = pcap_open, + .open = pcap_user_open, .close = NULL, .remove = pcap_remove, .add_address= NULL, diff --git a/arch/um/drivers/slip_user.c b/arch/um/drivers/slip_user.c index 8016d32b6809..482a19c5105c 100644 --- a/arch/um/drivers/slip_user.c +++ b/arch/um/drivers/slip_user.c @@ -9,7 +9,7 @@ #include #include #include -#include +#include #include #include #include diff --git a/arch/um/drivers/vector_user.c b/arch/um/drivers/vector_user.c index c4a0f26b2824..45d4164ad355 100644 --- a/arch/um/drivers/vector_user.c +++ b/arch/um/drivers/vector_user.c @@ -18,9 +18,7 @@ #include #include #include -#include #include -#include #include #include #include @@ -332,7 +330,7 @@ static struct vector_fds *user_init_unix_fds(struct arglist *ifspec, int id) } switch (id) { case ID_BESS: - if (connect(fd, remote_addr, sizeof(struct sockaddr_un)) < 0) { + if (connect(fd, (const struct sockaddr *) remote_addr, sizeof(struct sockaddr_un)) < 0) { printk(UM_KERN_ERR "bess open:cannot connect to %s %i", remote_addr->sun_path, -errno); goto unix_cleanup; } diff --git a/arch/um/os-Linux/util.c b/arch/um/os-Linux/util.c index ecf2f390fad2..07327425d06e 100644 --- a/arch/um/os-Linux/util.c +++ b/arch/um/os-Linux/util.c @@ -10,7 +10,7 @@ #include #include #include -#include +#include #include #include #include diff --git a/arch/x86/um/user-offsets.c b/arch/x86/um/user-offsets.c index c51dd8363d25..bae61554abcc 100644 --- a/arch/x86/um/user-offsets.c +++ b/arch/x86/um/user-offsets.c @@ -2,7 +2,7 @@ #include #include #include -#include +#include #include #include #define __FRAME_OFFSETS -- 2.20.1
[PATCH v4 3/3] um: allow static linking for non-glibc implementations
It is possible to produce a statically linked UML binary with UML_NET_VECTOR, UML_NET_VDE and UML_NET_PCAP options enabled using alternative libc implementations, which do not rely on NSS, such as musl. Allow static linking in this case. Signed-off-by: Ignat Korchagin Reviewed-by: Brendan Higgins Tested-by: Brendan Higgins --- arch/um/Kconfig | 6 +++--- arch/um/drivers/Kconfig | 6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/um/Kconfig b/arch/um/Kconfig index 9318dc6d1a0c..d888b859c167 100644 --- a/arch/um/Kconfig +++ b/arch/um/Kconfig @@ -62,12 +62,12 @@ config NR_CPUS source "arch/$(HEADER_ARCH)/um/Kconfig" -config FORBID_STATIC_LINK - bool +config MAY_HAVE_RUNTIME_DEPS +bool config STATIC_LINK bool "Force a static link" - depends on !FORBID_STATIC_LINK + depends on CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS || !MAY_HAVE_RUNTIME_DEPS help This option gives you the ability to force a static link of UML. Normally, UML is linked as a shared binary. This is inconvenient for diff --git a/arch/um/drivers/Kconfig b/arch/um/drivers/Kconfig index 9160ead56e33..2e7b8e0e7194 100644 --- a/arch/um/drivers/Kconfig +++ b/arch/um/drivers/Kconfig @@ -234,7 +234,7 @@ config UML_NET_DAEMON config UML_NET_VECTOR bool "Vector I/O high performance network devices" depends on UML_NET - select FORBID_STATIC_LINK + select MAY_HAVE_RUNTIME_DEPS help This User-Mode Linux network driver uses multi-message send and receive functions. The host running the UML guest must have @@ -246,7 +246,7 @@ config UML_NET_VECTOR config UML_NET_VDE bool "VDE transport (obsolete)" depends on UML_NET - select FORBID_STATIC_LINK + select MAY_HAVE_RUNTIME_DEPS help This User-Mode Linux network transport allows one or more running UMLs on a single host to communicate with each other and also @@ -294,7 +294,7 @@ config UML_NET_MCAST config UML_NET_PCAP bool "pcap transport (obsolete)" depends on UML_NET - select FORBID_STATIC_LINK + select MAY_HAVE_RUNTIME_DEPS help The pcap transport makes a pcap packet stream on the host look like an ethernet device inside UML. This is useful for making -- 2.20.1
[PATCH v4 1/3] um/kconfig: introduce CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS
For statically linked UML build it is important to take into account the standard C-library implementation. Some implementations, notably glibc have caveats: even when linked statically, the final program might require some runtime dependencies, if certain functions are used within the code. Consider the following program: int main(void) { getpwent(); return 0; } Compiling this program and linking statically with glibc produces the following warning from the linker: /usr/sbin/ld: /tmp/ccuthw1o.o: in function `main': test.c:(.text+0x5): warning: Using 'getpwent' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking We will use the flag to detect such C-library implementation build time and possibly disable static linking for UML to avoid producing a binary with unexpected behaviour and dependencies. Signed-off-by: Ignat Korchagin Reviewed-by: Brendan Higgins Tested-by: Brendan Higgins --- init/Kconfig | 6 ++ scripts/cc-can-link.sh | 5 +++-- 2 files changed, 9 insertions(+), 2 deletions(-) diff --git a/init/Kconfig b/init/Kconfig index 0498af567f70..0a1ec56c9f33 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -57,6 +57,12 @@ config CC_CAN_LINK_STATIC default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) $(CLANG_FLAGS) $(m64-flag) -static) if 64BIT default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) $(CLANG_FLAGS) $(m32-flag) -static) +config CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS + bool + depends on UML && CC_CAN_LINK_STATIC + default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) $(CLANG_FLAGS) $(m64-flag) -static -Xlinker --fatal-warnings) if 64BIT + default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) $(CLANG_FLAGS) $(m32-flag) -static -Xlinker --fatal-warnings) + config CC_HAS_ASM_GOTO def_bool $(success,$(srctree)/scripts/gcc-goto.sh $(CC)) diff --git a/scripts/cc-can-link.sh b/scripts/cc-can-link.sh index 6efcead31989..e5011a46103e 100755 --- a/scripts/cc-can-link.sh +++ b/scripts/cc-can-link.sh @@ -2,10 +2,11 @@ # SPDX-License-Identifier: GPL-2.0 cat << "END" | $@ -x c - -o /dev/null >/dev/null 2>&1 -#include +#include +#include int main(void) { - printf(""); + getpwent(); return 0; } END -- 2.20.1
[PATCH v4 0/3] um: allow static linking for non-glibc libc implementations
Changes from v3: * restored FORBID_STATIC_LINK and renamed it to MAY_HAVE_RUNTIME_DEPS, so we don't have to maintain a static list of options, which may produce runtime dependencies This is a continuation of [1]. Since I was able to produce a working UML binary with UML_NET_VECTOR linked with musl with the changes included in the patches here. I was compiling on Arch Linux, so hopefully all the latest versions of the compiler, libraries and binutils. I also tested allyesconfig with both musl and glibc. The compilation succeeds with both, however both binaries (glibc one being dynamically linked) segfault on start. This is probably of some incompatible config option/module being included and not related to musl/glibc. [1]: https://patchwork.ozlabs.org/project/linux-um/patch/20200624212319.403689-1-ig...@cloudflare.com/ Ignat Korchagin (3): um/kconfig: introduce CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS um: some fixes to build UML with musl um: allow static linking for non-glibc implementations arch/um/Kconfig | 6 +++--- arch/um/drivers/Kconfig | 6 +++--- arch/um/drivers/daemon_user.c | 1 + arch/um/drivers/pcap_user.c | 12 ++-- arch/um/drivers/slip_user.c | 2 +- arch/um/drivers/vector_user.c | 4 +--- arch/um/os-Linux/util.c | 2 +- arch/x86/um/user-offsets.c| 2 +- init/Kconfig | 6 ++ scripts/cc-can-link.sh| 5 +++-- 10 files changed, 26 insertions(+), 20 deletions(-) -- 2.20.1
Re: [PATCH v3 3/3] um: allow static linking for non-glibc implementations
On Thu, Jul 16, 2020 at 10:10 AM Johannes Berg wrote: > > On Wed, 2020-07-15 at 21:11 +0100, Ignat Korchagin wrote: > > It is possible to produce a statically linked UML binary with > > UML_NET_VECTOR, > > UML_NET_VDE and UML_NET_PCAP options enabled using alternative libc > > implementations, which do not rely on NSS, such as musl. > > > > Allow static linking in this case. > > > > Signed-off-by: Ignat Korchagin > > Reviewed-by: Brendan Higgins > > Tested-by: Brendan Higgins > > --- > > arch/um/Kconfig | 5 + > > arch/um/drivers/Kconfig | 3 --- > > 2 files changed, 1 insertion(+), 7 deletions(-) > > > > diff --git a/arch/um/Kconfig b/arch/um/Kconfig > > index 9318dc6d1a0c..beb98b3b9f75 100644 > > --- a/arch/um/Kconfig > > +++ b/arch/um/Kconfig > > @@ -62,12 +62,9 @@ config NR_CPUS > > > > source "arch/$(HEADER_ARCH)/um/Kconfig" > > > > -config FORBID_STATIC_LINK > > - bool > > - > > config STATIC_LINK > > bool "Force a static link" > > - depends on !FORBID_STATIC_LINK > > + depends on CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS || (!UML_NET_VECTOR && > > !UML_NET_VDE && !UML_NET_PCAP) > > Come to think of it, in a way "FORBID_STATIC_LINK" was nicer because > there didn't need to be a single list of "what has dynamic dependencies" > like here the list of UML_NET_VECTOR, UML_NET_VDE, UML_NET_PCAP. > > Maybe it could be > > config MAY_HAVE_NON_STATIC_RUNTIME_DEPS > bool > > config STATIC_LINK > ... > depends on !MAY_HAVE_NON_STATIC_RUNTIME_DEPS || > CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS > > > and then UML_NET_VECTOR et al can > > select MAY_HAVE_NON_STATIC_RUNTIME_DEPS > > so that the knowledge is still distributed to the corresponding options? Yes, makes sense. I've shortened it to MAY_HAVE_RUNTIME_DEPS and reposted the series. > johannes >
[PATCH v3] dm crypt: add flags to optionally bypass dm-crypt workqueues
Changes from v2: * dropped nobacklog boolean - ciphers are OK to backlog requests * moved some conditionals inline dropping the extra local variables * renamed "noresched" -> "atomic" This is a follow up from [1]. Consider the following script: sudo modprobe brd rd_nr=1 rd_size=4194304 echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | \ sudo dmsetup create eram0 echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 no_write_workqueue' | \ sudo dmsetup create eram0-inline-write echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 no_read_workqueue' | \ sudo dmsetup create eram0-inline-read devices="/dev/ram0 /dev/mapper/eram0 /dev/mapper/eram0-inline-read " devices+="/dev/mapper/eram0-inline-write" for dev in $devices; do echo "reading from $dev" sudo fio --filename=$dev --readwrite=read --bs=4k --direct=1 \ --loops=100 --runtime=3m --name=plain | grep READ done for dev in $devices; do echo "writing to $dev" sudo fio --filename=$dev --readwrite=write --bs=4k --direct=1 \ --loops=100 --runtime=3m --name=plain | grep WRITE done This script creates a ramdisk (to eliminate hardware bias in the benchmark) and three dm-crypt instances on top. All dm-crypt instances use the NULL cipher to eliminate potentially expensive crypto bias (the NULL cipher just uses memcpy for "encyption"). The first instance is the current dm-crypt implementation from 5.8-rc2, the two others have new optional flags enabled, which bypass kcryptd workqueues for reads and writes respectively and write sorting for writes. On my VM (Debian in VirtualBox with 4 cores on 2.8 GHz Quad-Core Intel Core i7) I get the following output (formatted for better readability): reading from /dev/ram0 READ: bw=508MiB/s (533MB/s), 508MiB/s-508MiB/s (533MB/s-533MB/s), io=89.3GiB (95.9GB), run=18-18msec reading from /dev/mapper/eram0 READ: bw=80.6MiB/s (84.5MB/s), 80.6MiB/s-80.6MiB/s (84.5MB/s-84.5MB/s), io=14.2GiB (15.2GB), run=18-18msec reading from /dev/mapper/eram0-inline-read READ: bw=295MiB/s (309MB/s), 295MiB/s-295MiB/s (309MB/s-309MB/s), io=51.8GiB (55.6GB), run=18-18msec reading from /dev/mapper/eram0-inline-write READ: bw=114MiB/s (120MB/s), 114MiB/s-114MiB/s (120MB/s-120MB/s), io=20.1GiB (21.5GB), run=18-18msec writing to /dev/ram0 WRITE: bw=516MiB/s (541MB/s), 516MiB/s-516MiB/s (541MB/s-541MB/s), io=90.7GiB (97.4GB), run=180001-180001msec writing to /dev/mapper/eram0 WRITE: bw=40.4MiB/s (42.4MB/s), 40.4MiB/s-40.4MiB/s (42.4MB/s-42.4MB/s), io=7271MiB (7624MB), run=180001-180001msec writing to /dev/mapper/eram0-inline-read WRITE: bw=38.9MiB/s (40.8MB/s), 38.9MiB/s-38.9MiB/s (40.8MB/s-40.8MB/s), io=7000MiB (7340MB), run=180001-180001msec writing to /dev/mapper/eram0-inline-write WRITE: bw=277MiB/s (290MB/s), 277MiB/s-277MiB/s (290MB/s-290MB/s), io=48.6GiB (52.2GB), run=18-18msec Current dm-crypt implementation creates a significant IO performance overhead (at least on small IO block sizes) for both latency and throughput. We suspect offloading IO request processing into workqueues and async threads is more harmful these days with the modern fast storage. I also did some digging into the dm-crypt git history and much of this async processing is not needed anymore, because the reasons it was added are mostly gone from the kernel. More details can be found in [2] (see "Git archeology" section). This change adds no_(read|write)_workqueue flags separately for read and write BIOs, which direct dm-crypt not to offload crypto operations into kcryptd workqueues and process everything inline. In addition, writes are not buffered to be sorted in the dm-crypt red-black tree, but dispatched immediately. For cases, where crypto operations cannot happen inline (hard interrupt context, for example the read path of some NVME drivers), we offload the work to a tasklet rather than a workqueue. These flags ensure inline BIO processing in the dm-crypt module only. It is worth noting that some Crypto API implementations may offload encryption into their own workqueues, which are independent of the dm-crypt and its configuration. However upon enabling no_(read|write)_workqueue flags dm-crypt will instruct Crypto API not to backlog crypto requests. [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/ Signed-off-by: Ignat Korchagin --- drivers/md/dm-crypt.c | 50 --- 1 file changed, 42 insertions(+), 8 deletions(-) diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c index 000ddfab5ba0..7536ecb2c95d 100644 --- a/drivers/md/dm-crypt.c +++ b/drivers/md/dm-crypt.c @@ -69,6 +69,7 @@ struct dm_crypt_io { u8 *integrity_metadata; bool integrity_metadata_from_pool; struct work_struc
Re: [dm-devel] [PATCH v2] dm crypt: add flags to optionally bypass dm-crypt workqueues
On Mon, Jul 6, 2020 at 3:28 PM Bob Liu wrote: > > Hi Ignat, > > On 6/27/20 5:03 AM, Ignat Korchagin wrote: > > This is a follow up from [1]. Consider the following script: > > > > sudo modprobe brd rd_nr=1 rd_size=4194304 > > > > Did you test null_blk device? I didn't get result as expected using null_blk. I've modified the script in the patch description to use a null_blk device instead of /dev/ram0, but I created the block device with queue_mode=0. My results are as follows: reading from /dev/nullb0 READ: bw=527MiB/s (552MB/s), 527MiB/s-527MiB/s (552MB/s-552MB/s), io=92.6GiB (99.4GB), run=18-18msec reading from /dev/mapper/eram0 READ: bw=112MiB/s (117MB/s), 112MiB/s-112MiB/s (117MB/s-117MB/s), io=19.7GiB (21.1GB), run=180001-180001msec reading from /dev/mapper/eram0-inline-read READ: bw=288MiB/s (302MB/s), 288MiB/s-288MiB/s (302MB/s-302MB/s), io=50.6GiB (54.3GB), run=18-18msec reading from /dev/mapper/eram0-inline-write READ: bw=123MiB/s (129MB/s), 123MiB/s-123MiB/s (129MB/s-129MB/s), io=21.6GiB (23.1GB), run=18-18msec writing to /dev/nullb0 WRITE: bw=510MiB/s (535MB/s), 510MiB/s-510MiB/s (535MB/s-535MB/s), io=89.7GiB (96.3GB), run=180001-180001msec writing to /dev/mapper/eram0 WRITE: bw=45.9MiB/s (48.1MB/s), 45.9MiB/s-45.9MiB/s (48.1MB/s-48.1MB/s), io=8264MiB (8666MB), run=180001-180001msec writing to /dev/mapper/eram0-inline-read WRITE: bw=42.5MiB/s (44.6MB/s), 42.5MiB/s-42.5MiB/s (44.6MB/s-44.6MB/s), io=7658MiB (8030MB), run=180001-180001msec writing to /dev/mapper/eram0-inline-write WRITE: bw=259MiB/s (272MB/s), 259MiB/s-259MiB/s (272MB/s-272MB/s), io=45.6GiB (48.0GB), run=18-18msec Still see the same improvement. Can you retry your test with a queue_mode=0 null_blk device? Thanks, Ignat > 1. > # fio --filename=/dev/nullb0 --readwrite=readwrite --bs=4k --direct=1 --loops >=100 --name=plain > Run status group 0 (all jobs): > READ: bw=390MiB/s (409MB/s), 390MiB/s-390MiB/s (409MB/s-409MB/s), > io=10.6GiB (11.3GB), run=27744-27744msec > WRITE: bw=390MiB/s (409MB/s), 390MiB/s-390MiB/s (409MB/s-409MB/s), > io=10.6GiB (11.3GB), run=27744-27744msec > > 2. > Create enctrypted-ram0 based on null_blk(without this patch): > # cryptsetup open --header crypthdr.img /dev/nullb0 encrypted-ram0 > # fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k > --direct=1 --loops=100 --name=crypt > Run status group 0 (all jobs): > READ: bw=180MiB/s (188MB/s), 180MiB/s-180MiB/s (188MB/s-188MB/s), > io=4534MiB (4754MB), run=25246-25246msec > WRITE: bw=179MiB/s (188MB/s), 179MiB/s-179MiB/s (188MB/s-188MB/s), > io=4528MiB (4748MB), run=25246-25246msec > > 3. > Create enctrypted-ram0 based on null_blk(with this patch): > # cryptsetup open --header crypthdr.img /dev/nullb0 encrypted-ram0 > # fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k > --direct=1 --loops=100 --name=crypt.patched > Run status group 0 (all jobs): > READ: bw=149MiB/s (156MB/s), 149MiB/s-149MiB/s (156MB/s-156MB/s), > io=4128MiB (4329MB), run=27753-27753msec > WRITE: bw=149MiB/s (156MB/s), 149MiB/s-149MiB/s (156MB/s-156MB/s), > io=4124MiB (4324MB), run=27753-27753msec > > Looks like the result is worse after this patch, or I may miss something.. > > Regards, > Bob > > > > echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | \ > > sudo dmsetup create eram0 > > > > echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 > > no_write_workqueue' | \ > > sudo dmsetup create eram0-inline-write > > > > echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 > > no_read_workqueue' | \ > > sudo dmsetup create eram0-inline-read > > > > devices="/dev/ram0 /dev/mapper/eram0 /dev/mapper/eram0-inline-read " > > devices+="/dev/mapper/eram0-inline-write" > > > > for dev in $devices; do > > echo "reading from $dev" > > sudo fio --filename=$dev --readwrite=read --bs=4k --direct=1 \ > > --loops=100 --runtime=3m --name=plain | grep READ > > done > > > > for dev in $devices; do > > echo "writing to $dev" > > sudo fio --filename=$dev --readwrite=write --bs=4k --direct=1 \ > > --loops=100 --runtime=3m --name=plain | grep WRITE > > done > > > > This script creates a ramdisk (to eliminate hardware bias in the benchmark) > > and > > three dm-crypt instances on top. All dm-crypt instances use the NULL cipher > > to eliminate potentially expensive crypto bias (the NULL cipher just uses > > memcpy > > for "encyption"). The first instance is the current dm-crypt implementation &g
Re: [RFC PATCH] Revert "um: Make CONFIG_STATIC_LINK actually static"
On Tue, Jun 30, 2020 at 11:47 PM Brendan Higgins wrote: > > On Wed, Jun 24, 2020 at 2:23 PM Ignat Korchagin wrote: > > > > This reverts commit 3363179385629c1804ea846f4e72608c2201a81e. > > > > This change is too restrictive. I've been running UML statically linked > > kernel > > with UML_NET_VECTOR networking in a docker "FROM: scratch" container just > > fine. > > As long as we don't reference network peers by hostname and use only IP > > addresses, NSS is not needed, so not used. In other words, it is possible to > > have statically linked UML and UML_NET_VECTOR (and other networking types) > > and > > use it, although with some restrictions, so let's not disable it. > > > > Additionally, it should be at least theoretically possible to use another > > libc > > (like musl, bionic etc) for static linking. I was able with some hacks to > > compile UML against musl, although the executable segfaults for now. But > > this > > option prevents even the research to be done. > > The reason that we removed support for static linking when these > networking options are enabled is because gcc doesn't support loading > NSS when statically linked, which consequently breaks allyesconfig for > UML under gcc. That is still the case with your revert. Yes, sure. But I'm not only "researching", but using UML as a "router" in one of my dev setups. 3363179385629c1804ea846f4e72608c2201a81e mentions UML_NET_VECTOR incompatibility (and some other networking options), which I had enabled and actually my whole dev networking is based around UML_NET_VECTOR: I have two interfaces - one in raw mode and one doing ipsec. All this was running in an empty "FROM: scratch" container and obviously linked statically. If the static linking breaks some other config options in allyesconfig - that's another story, but I wanted to point out that config options mentioned in the commit message worked just fine (if not trying to resolve hostnames). In other words: if you don't resolve - glibc will not try to load NSS. glibc-nss is a known problem and I would assume most people trying to do static linking are aware of this - that is, if they choose this path they are willing to live with the consequences. That's why completely disabling the possibility sounds too restrictive for me. > I fully support the goal behind what you are trying to do. However, I > do not want to see this patch accepted unless it is accompanied by an > alternative change that still allows UML to compile under > allyesconfig. If I succeed linking it to musl (or other non-glibc lib), would that be enough? > You said that in the current state, researching a solution is > possible? Can you not research a solution with your patch applied to > your own branch? As a side note: I tried to revert this patch and statically link 5.7 UML with glibc, but the binary still segfaults on start, so I would assume it is not related to my previous attempts linking with musl. Regards, Ignat > > Cheers
Re: ipmi_msghandler crashes in 4.19
We're rolling out 4.19.18 across the fleet. Hopefully, we'll not see it anymore, but if we do, we'll let you know. Regards, Ignat On Tue, Jan 29, 2019 at 10:29 AM Greg KH wrote: > > On Tue, Jan 15, 2019 at 10:36:42AM -0800, Ivan Babrou wrote: > > Hey, > > > > We've upgraded some machines from 4.14 to 4.19 and started seeing rare > > crashes like these: > > > > [75855.909507] BUG: unable to handle kernel NULL pointer dereference > > at 0d00 > > [75855.925667] PGD 0 P4D 0 > > [75855.936359] Oops: [#1] SMP PTI > > [75855.947951] CPU: 0 PID: 10 Comm: ksoftirqd/0 Tainted: G O > > 4.19.13-cloudflare-2019.1.4 #2019.1.4 > > [75855.966028] Hardware name: Quanta Cloud Technology Inc. QuantaPlex > > T42S-2U(LBG-4) -/T42S-2U MB (Lewisburg-4), BIOS 3A11.Q10 06/29/2018 > > [75855.994246] RIP: 0010:__srcu_read_unlock+0xe/0x20 > > [75856.006851] Code: 01 48 63 c8 65 48 ff 04 ca f0 83 44 24 fc 00 c3 > > 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f0 83 44 24 fc 00 > > 48 63 f6 <48> 8b 87 e8 0c 00 00 65 48 ff 44 f0 10 c3 0f 1f 40 00 0f 1f > > 44 00 > > [75856.041551] RSP: 0018:ba00cc66fd48 EFLAGS: 00010286 > > [75856.054564] RAX: RBX: RCX: > > > > [75856.069449] RDX: 0001 RSI: RDI: > > 0018 > > [75856.084168] RBP: a28276abb200 R08: a29119772540 R09: > > > > [75856.098756] R10: 000c1425 R11: a29120a201c8 R12: > > a29118d57e08 > > [75856.113422] R13: dead0200 R14: dead0100 R15: > > a27dcbafa400 > > [75856.127798] FS: () GS:a29120a0() > > knlGS: > > [75856.138973] perf: interrupt took too long (7735 > 7677), lowering > > kernel.perf_event_max_sample_rate to 25000 > > [75856.143083] CS: 0010 DS: ES: CR0: 80050033 > > [75856.172956] CR2: 0d00 CR3: 00187ca0a005 CR4: > > 007606f0 > > [75856.187116] DR0: DR1: DR2: > > > > [75856.201312] DR3: DR6: fffe0ff0 DR7: > > 0400 > > [75856.215274] PKRU: 5554 > > [75856.224621] Call Trace: > > [75856.230942] perf: interrupt took too long (9748 > 9668), lowering > > kernel.perf_event_max_sample_rate to 2 > > [75856.233560] deliver_response+0x88/0xd0 [ipmi_msghandler] > > [75856.261744] deliver_local_response+0xe/0x30 [ipmi_msghandler] > > [75856.273937] handle_one_recv_msg+0x164/0xbf0 [ipmi_msghandler] > > [75856.285962] ? __switch_to_asm+0x34/0x70 > > [75856.295957] ? __switch_to_asm+0x40/0x70 > > [75856.306011] ? __switch_to_asm+0x34/0x70 > > [75856.315872] ? __switch_to_asm+0x40/0x70 > > [75856.325562] ? __switch_to_asm+0x34/0x70 > > [75856.325565] ? __switch_to_asm+0x40/0x70 > > [75856.325567] ? __switch_to_asm+0x34/0x70 > > [75856.325569] ? __switch_to_asm+0x40/0x70 > > [75856.325578] handle_new_recv_msgs+0x16d/0x1e0 [ipmi_msghandler] > > [75856.325583] ? __switch_to_asm+0x34/0x70 > > [75856.381815] tasklet_action_common.isra.21+0x4e/0xf0 > > [75856.381823] __do_softirq+0xd8/0x2d2 > > [75856.399498] ? sort_range+0x20/0x20 > > [75856.399506] run_ksoftirqd+0x1a/0x20 > > [75856.415184] smpboot_thread_fn+0xc5/0x160 > > [75856.415190] kthread+0x113/0x130 > > [75856.430502] ? kthread_create_worker_on_cpu+0x70/0x70 > > [75856.430512] ret_from_fork+0x35/0x40 > > [75856.446793] Modules linked in: xt_connlimit nf_conncount xt_bpf > > xt_hashlimit cls_flow cls_u32 sch_htb sch_fq md_mod dm_crypt > > algif_skcipher af_alg dm_mod dax ip6table_nat nf_nat_ipv6 > > ip6table_mangle ip6table_security ip6table_raw ip6table_filter > > ip6_tables xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_TPROXY > > nf_tproxy_ipv6 nf_tproxy_ipv4 xt_connmark iptable_mangle xt_owner > > xt_CT xt_socket nf_socket_ipv4 nf_socket_ipv6 iptable_raw > > nfnetlink_log xt_NFLOG xt_tcpudp xt_comment xt_conntrack nf_conntrack > > nf_defrag_ipv6 nf_defrag_ipv4 xt_mark xt_multiport xt_set > > iptable_filter bpfilter ip_set_hash_netport ip_set_hash_net > > ip_set_hash_ip ip_set nfnetlink 8021q garp mrp stp llc skx_edac > > x86_pkg_temp_thermal kvm_intel kvm irqbypass crc32_pclmul crc32c_intel > > ipmi_ssif pcbc aesni_intel aes_x86_64 crypto_simd sfc(O) > > [75856.446862] cryptd glue_helper mdio ipmi_si xhci_pci i40e tpm_crb > > ioatdma ipmi_devintf xhci_hcd dca ipmi_msghandler tpm_tis tpm_tis_core > > tpm efivarfs ip_tables x_tables > > [75856.569103] CR2: 0d00 > > [75856.569124] ---[ end trace 604e13a0789ee766 ]--- > > > > [117620.868720] general protection fault: [#1] SMP PTI > > [117620.911871] CPU: 0 PID: 10 Comm: ksoftirqd/0 Tainted: G > > O 4.19.0-cloudflare-2018.10.3 #1 > > [117620.937885] Hardware name: Quanta Computer Inc QuantaPlex > > T41S-2U/S2S-MB, BIOS S2S_3B10.03 06/21/2018 > > [117620.963750] RIP: 0010:__srcu_read_unlock+0xe/0x20 > > [117620.984950] Code: 01 48 63 c8 65 48 ff 04 ca f0 83 44 24 fc
[PATCH v2] dm crypt: add flags to optionally bypass dm-crypt workqueues
This is a follow up from [1]. Consider the following script: sudo modprobe brd rd_nr=1 rd_size=4194304 echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | \ sudo dmsetup create eram0 echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 no_write_workqueue' | \ sudo dmsetup create eram0-inline-write echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 no_read_workqueue' | \ sudo dmsetup create eram0-inline-read devices="/dev/ram0 /dev/mapper/eram0 /dev/mapper/eram0-inline-read " devices+="/dev/mapper/eram0-inline-write" for dev in $devices; do echo "reading from $dev" sudo fio --filename=$dev --readwrite=read --bs=4k --direct=1 \ --loops=100 --runtime=3m --name=plain | grep READ done for dev in $devices; do echo "writing to $dev" sudo fio --filename=$dev --readwrite=write --bs=4k --direct=1 \ --loops=100 --runtime=3m --name=plain | grep WRITE done This script creates a ramdisk (to eliminate hardware bias in the benchmark) and three dm-crypt instances on top. All dm-crypt instances use the NULL cipher to eliminate potentially expensive crypto bias (the NULL cipher just uses memcpy for "encyption"). The first instance is the current dm-crypt implementation from 5.8-rc2, the two others have new optional flags enabled, which bypass kcryptd workqueues for reads and writes respectively and write sorting for writes. On my VM (Debian in VirtualBox with 4 cores on 2.8 GHz Quad-Core Intel Core i7) I get the following output (formatted for better readability): reading from /dev/ram0 READ: bw=508MiB/s (533MB/s), 508MiB/s-508MiB/s (533MB/s-533MB/s), io=89.3GiB (95.9GB), run=18-18msec reading from /dev/mapper/eram0 READ: bw=80.6MiB/s (84.5MB/s), 80.6MiB/s-80.6MiB/s (84.5MB/s-84.5MB/s), io=14.2GiB (15.2GB), run=18-18msec reading from /dev/mapper/eram0-inline-read READ: bw=295MiB/s (309MB/s), 295MiB/s-295MiB/s (309MB/s-309MB/s), io=51.8GiB (55.6GB), run=18-18msec reading from /dev/mapper/eram0-inline-write READ: bw=114MiB/s (120MB/s), 114MiB/s-114MiB/s (120MB/s-120MB/s), io=20.1GiB (21.5GB), run=18-18msec writing to /dev/ram0 WRITE: bw=516MiB/s (541MB/s), 516MiB/s-516MiB/s (541MB/s-541MB/s), io=90.7GiB (97.4GB), run=180001-180001msec writing to /dev/mapper/eram0 WRITE: bw=40.4MiB/s (42.4MB/s), 40.4MiB/s-40.4MiB/s (42.4MB/s-42.4MB/s), io=7271MiB (7624MB), run=180001-180001msec writing to /dev/mapper/eram0-inline-read WRITE: bw=38.9MiB/s (40.8MB/s), 38.9MiB/s-38.9MiB/s (40.8MB/s-40.8MB/s), io=7000MiB (7340MB), run=180001-180001msec writing to /dev/mapper/eram0-inline-write WRITE: bw=277MiB/s (290MB/s), 277MiB/s-277MiB/s (290MB/s-290MB/s), io=48.6GiB (52.2GB), run=18-18msec Current dm-crypt implementation creates a significant IO performance overhead (at least on small IO block sizes) for both latency and throughput. We suspect offloading IO request processing into workqueues and async threads is more harmful these days with the modern fast storage. I also did some digging into the dm-crypt git history and much of this async processing is not needed anymore, because the reasons it was added are mostly gone from the kernel. More details can be found in [2] (see "Git archeology" section). This change adds no_(read|write)_workqueue flags separately for read and write BIOs, which direct dm-crypt not to offload crypto operations into kcryptd workqueues and process everything inline. In addition, writes are not buffered to be sorted in the dm-crypt red-black tree, but dispatched immediately. For cases, where crypto operations cannot happen inline (hard interrupt context, for example the read path of some NVME drivers), we offload the work to a tasklet rather than a workqueue. These flags ensure inline BIO processing in the dm-crypt module only. It is worth noting that some Crypto API implementations may offload encryption into their own workqueues, which are independent of the dm-crypt and its configuration. However upon enabling no_(read|write)_workqueue flags dm-crypt will instruct Crypto API not to backlog crypto requests. [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/ Signed-off-by: Ignat Korchagin --- drivers/md/dm-crypt.c | 68 +-- 1 file changed, 52 insertions(+), 16 deletions(-) diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c index 000ddfab5ba0..6924eb49b1df 100644 --- a/drivers/md/dm-crypt.c +++ b/drivers/md/dm-crypt.c @@ -69,6 +69,7 @@ struct dm_crypt_io { u8 *integrity_metadata; bool integrity_metadata_from_pool; struct work_struct work; + struct tasklet_struct tasklet; struct convert_context ctx; @@ -127,7 +128,8 @@ struct iv_elephant_private { * and encrypts / decrypts at the same time. */ enum flags { D
[RFC PATCH 0/1] dm-crypt excessive overhead
This is a follow up from the long-forgotten [1], but with some more convincing evidence. Consider the following script: #!/bin/bash -e # create 4G ramdisk sudo modprobe brd rd_nr=1 rd_size=4194304 # create a dm-crypt device with NULL cipher on top of /dev/ram0 echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | sudo dmsetup create eram0 # create a dm-crypt device with NULL cipher and custom force_inline flag echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 force_inline' | sudo dmsetup create inline-eram0 # read all data from /dev/ram0 sudo dd if=/dev/ram0 bs=4k iflag=direct | sha256sum # read the same data from /dev/mapper/eram0 sudo dd if=/dev/mapper/eram0 bs=4k iflag=direct | sha256sum # read the same data from /dev/mapper/inline-eram0 sudo dd if=/dev/mapper/inline-eram0 bs=4k iflag=direct | sha256sum This script creates a ramdisk (to eliminate hardware bias in the benchmark) and two dm-crypt instances on top. Both dm-crypt instances use the NULL cipher to eliminate potentially expensive crypto bias (the NULL cipher just uses memcpy for "encyption"). The first instance is the current dm-crypt implementation from 5.8-rc1, the second is the dm-crypt instance with a custom new flag enabled from the patch attached to this thread. On my VM (Debian in VirtualBox with 4 cores on 2.8 GHz Quad-Core Intel Core i7) I get the following output (formatted for better readability): # plain ram0 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.2305 s, 202 MB/s 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca - # eram0 (current dm-crypt) 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 53.2212 s, 80.7 MB/s 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca - # inline-eram0 (patched dm-crypt) 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.3472 s, 201 MB/s 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca - As we can see, current dm-crypt implementation creates a significant IO performance overhead (at least on small IO block sizes) for both latency and throughput. We suspect offloading IO request processing into workqueues and async threads is more harmful these days with the modern fast storage. I also did some digging into the dm-crypt git history and much of this async processing is not needed anymore, because the reasons it was added are mostly gone from the kernel. More details can be found in [2] (see "Git archeology" section). We have been running the attached patch on different hardware generations in more than 200 datacentres on both SATA SSDs and NVME SSDs and so far were very happy with the performance benefits. [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/ Ignat Korchagin (1): Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target drivers/md/dm-crypt.c | 55 +-- 1 file changed, 43 insertions(+), 12 deletions(-) -- 2.20.1
[RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
Sometimes extra thread offloading imposed by dm-crypt hurts IO latency. This is especially visible on busy systems with many processes/threads. Moreover, most Crypto API implementaions are async, that is they offload crypto operations on their own, so this dm-crypt offloading is excessive. This adds a new flag, which directs dm-crypt not to offload crypto operations and process everything inline. For cases, where crypto operations cannot happen inline (hard interrupt context, for example the read path of the NVME driver), we offload the work to a tasklet rather than a workqueue. Signed-off-by: Ignat Korchagin --- drivers/md/dm-crypt.c | 55 +-- 1 file changed, 43 insertions(+), 12 deletions(-) diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c index 000ddfab5ba0..5a9bac4fdffb 100644 --- a/drivers/md/dm-crypt.c +++ b/drivers/md/dm-crypt.c @@ -69,6 +69,7 @@ struct dm_crypt_io { u8 *integrity_metadata; bool integrity_metadata_from_pool; struct work_struct work; + struct tasklet_struct tasklet; struct convert_context ctx; @@ -127,7 +128,7 @@ struct iv_elephant_private { * and encrypts / decrypts at the same time. */ enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID, -DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD }; +DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD, DM_CRYPT_FORCE_INLINE = (sizeof(unsigned long) * 8 - 1) }; enum cipher_flags { CRYPT_MODE_INTEGRITY_AEAD, /* Use authenticated mode for cihper */ @@ -1458,13 +1459,18 @@ static void crypt_alloc_req_skcipher(struct crypt_config *cc, skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]); - /* -* Use REQ_MAY_BACKLOG so a cipher driver internally backlogs -* requests if driver request queue is full. -*/ - skcipher_request_set_callback(ctx->r.req, - CRYPTO_TFM_REQ_MAY_BACKLOG, - kcryptd_async_done, dmreq_of_req(cc, ctx->r.req)); + if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags)) + /* make sure we zero important fields of the request */ + skcipher_request_set_callback(ctx->r.req, + 0, NULL, NULL); + else + /* +* Use REQ_MAY_BACKLOG so a cipher driver internally backlogs +* requests if driver request queue is full. +*/ + skcipher_request_set_callback(ctx->r.req, + CRYPTO_TFM_REQ_MAY_BACKLOG, + kcryptd_async_done, dmreq_of_req(cc, ctx->r.req)); } static void crypt_alloc_req_aead(struct crypt_config *cc, @@ -1566,7 +1572,8 @@ static blk_status_t crypt_convert(struct crypt_config *cc, atomic_dec(&ctx->cc_pending); ctx->cc_sector += sector_step; tag_offset++; - cond_resched(); + if (!test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags)) + cond_resched(); continue; /* * There was a data integrity error. @@ -1892,6 +1899,11 @@ static void kcryptd_crypt_write_io_submit(struct dm_crypt_io *io, int async) clone->bi_iter.bi_sector = cc->start + io->sector; + if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags)) { + generic_make_request(clone); + return; + } + if (likely(!async) && test_bit(DM_CRYPT_NO_OFFLOAD, &cc->flags)) { generic_make_request(clone); return; @@ -2031,12 +2043,26 @@ static void kcryptd_crypt(struct work_struct *work) kcryptd_crypt_write_convert(io); } +static void kcryptd_crypt_tasklet(unsigned long work) +{ + kcryptd_crypt((struct work_struct *)work); +} + static void kcryptd_queue_crypt(struct dm_crypt_io *io) { struct crypt_config *cc = io->cc; - INIT_WORK(&io->work, kcryptd_crypt); - queue_work(cc->crypt_queue, &io->work); + if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags)) { + if (in_irq()) { + /* Crypto API will fail hard in hard IRQ context */ + tasklet_init(&io->tasklet, kcryptd_crypt_tasklet, (unsigned long)&io->work); + tasklet_schedule(&io->tasklet); + } else + kcryptd_crypt(&io->work); + } else { + INIT_WORK(&io->work, kcryptd_crypt); + queue_work(cc->crypt_queue, &io->work); + } } static void crypt_free_tfms_aead(struct crypt_config *cc) @@ -2838,7 +2864,7 @@ static int crypt_ctr_optional(struct dm_target *ti, unsigned int argc, char **ar struct crypt_config *cc = ti->private; struct dm_arg_set as;
Re: [RFC PATCH 0/1] dm-crypt excessive overhead
On Fri, Jun 19, 2020 at 7:39 PM Mikulas Patocka wrote: > > > > On Fri, 19 Jun 2020, Mike Snitzer wrote: > > > On Fri, Jun 19 2020 at 12:41pm -0400, > > Ignat Korchagin wrote: > > > > > This is a follow up from the long-forgotten [1], but with some more > > > convincing > > > evidence. Consider the following script: > > > > > > [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html > > > [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/ > > > > > > Ignat Korchagin (1): > > > Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target > > > > > > drivers/md/dm-crypt.c | 55 +-- > > > 1 file changed, 43 insertions(+), 12 deletions(-) > > > > > > -- > > > 2.20.1 > > > > > > > Hi, > > > > I saw [2] and have been expecting something from cloudflare ever since. > > Nice to see this submission. > > > > There is useful context in your 0th patch header. I'll likely merge > > parts of this patch header with the more terse 1/1 header (reality is > > there only needed to be a single patch submission). > > > > Will review and stage accordingly if all looks fine to me. Mikulas, > > please have a look too. > > > > Thanks, > > Mike > > + if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags)) { > + if (in_irq()) { > + /* Crypto API will fail hard in hard IRQ context */ > + tasklet_init(&io->tasklet, kcryptd_crypt_tasklet, > (unsigned long)&io->work); > + tasklet_schedule(&io->tasklet); > + } else > + kcryptd_crypt(&io->work); > + } else { > + INIT_WORK(&io->work, kcryptd_crypt); > + queue_work(cc->crypt_queue, &io->work); > + } > > I'm looking at this and I'd like to know why does the crypto API fail in > hard-irq context and why does it work in tasklet context. What's the exact > reason behind this? This comes from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/crypto/skcipher.c?id=5e857ce6eae7ca21b2055cca4885545e29228fe2#n433 And, I believe, it is there for the same reason kcryptd was introduced in 2005 in dm-crypt: "...because it would be very unwise to do decryption in an interrupt context..." (that is, when other interrupts are disabled). In tasklet however we can still service other interrupts even if we process a large block. > > > Mikulas
Re: [RFC PATCH 0/1] dm-crypt excessive overhead
Yes, it should. I got one when I was testing the first iteration (without the tasklet) of the patch on an NVME? disk. On Sat, Jun 20, 2020 at 8:36 PM Mikulas Patocka wrote: > > > > On Sat, 20 Jun 2020, Herbert Xu wrote: > > > On Fri, Jun 19, 2020 at 02:39:39PM -0400, Mikulas Patocka wrote: > > > > > > I'm looking at this and I'd like to know why does the crypto API fail in > > > hard-irq context and why does it work in tasklet context. What's the exact > > > reason behind this? > > > > You're not supposed to do any real work in IRQ handlers. All > > the substantial work should be postponed to softirq context. > > I see. > > BTW - should it also warn if it is running in a process context with > interrupts disabled? > > Mikulas > > > Why do you need to do work in hard IRQ context? > > > > Cheers, > > -- > > Email: Herbert Xu > > Home Page: http://gondor.apana.org.au/~herbert/ > > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt > > >
Re: [dm-devel] [RFC PATCH 0/1] dm-crypt excessive overhead
On Mon, Jun 22, 2020 at 1:45 AM Damien Le Moal wrote: > > On 2020/06/20 1:56, Mike Snitzer wrote: > > On Fri, Jun 19 2020 at 12:41pm -0400, > > Ignat Korchagin wrote: > > > >> This is a follow up from the long-forgotten [1], but with some more > >> convincing > >> evidence. Consider the following script: > >> > >> #!/bin/bash -e > >> > >> # create 4G ramdisk > >> sudo modprobe brd rd_nr=1 rd_size=4194304 > >> > >> # create a dm-crypt device with NULL cipher on top of /dev/ram0 > >> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | sudo > >> dmsetup create eram0 > >> > >> # create a dm-crypt device with NULL cipher and custom force_inline flag > >> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 > >> force_inline' | sudo dmsetup create inline-eram0 > >> > >> # read all data from /dev/ram0 > >> sudo dd if=/dev/ram0 bs=4k iflag=direct | sha256sum > >> > >> # read the same data from /dev/mapper/eram0 > >> sudo dd if=/dev/mapper/eram0 bs=4k iflag=direct | sha256sum > >> > >> # read the same data from /dev/mapper/inline-eram0 > >> sudo dd if=/dev/mapper/inline-eram0 bs=4k iflag=direct | sha256sum > >> > >> This script creates a ramdisk (to eliminate hardware bias in the > >> benchmark) and > >> two dm-crypt instances on top. Both dm-crypt instances use the NULL cipher > >> to eliminate potentially expensive crypto bias (the NULL cipher just uses > >> memcpy > >> for "encyption"). The first instance is the current dm-crypt > >> implementation from > >> 5.8-rc1, the second is the dm-crypt instance with a custom new flag > >> enabled from > >> the patch attached to this thread. On my VM (Debian in VirtualBox with 4 > >> cores > >> on 2.8 GHz Quad-Core Intel Core i7) I get the following output (formatted > >> for > >> better readability): > >> > >> # plain ram0 > >> 1048576+0 records in > >> 1048576+0 records out > >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.2305 s, 202 MB/s > >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca - > >> > >> # eram0 (current dm-crypt) > >> 1048576+0 records in > >> 1048576+0 records out > >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 53.2212 s, 80.7 MB/s > >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca - > >> > >> # inline-eram0 (patched dm-crypt) > >> 1048576+0 records in > >> 1048576+0 records out > >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.3472 s, 201 MB/s > >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca - > >> > >> As we can see, current dm-crypt implementation creates a significant IO > >> performance overhead (at least on small IO block sizes) for both latency > >> and > >> throughput. We suspect offloading IO request processing into workqueues and > >> async threads is more harmful these days with the modern fast storage. I > >> also > >> did some digging into the dm-crypt git history and much of this async > >> processing > >> is not needed anymore, because the reasons it was added are mostly gone > >> from the > >> kernel. More details can be found in [2] (see "Git archeology" section). > >> > >> We have been running the attached patch on different hardware generations > >> in > >> more than 200 datacentres on both SATA SSDs and NVME SSDs and so far were > >> very > >> happy with the performance benefits. > >> > >> [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html > >> [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/ > >> > >> Ignat Korchagin (1): > >> Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target > >> > >> drivers/md/dm-crypt.c | 55 +-- > >> 1 file changed, 43 insertions(+), 12 deletions(-) > >> > >> -- > >> 2.20.1 > >> > > > > Hi, > > > > I saw [2] and have been expecting something from cloudflare ever since. > > Nice to see this submission. > > > > There is useful context in your 0th patch header. I'll likely merge > > parts of this patch header with the more terse 1/1 header (reality is > > there only needed to be a single patch submission). > > > > Will review
Re: [RFC PATCH 0/1] dm-crypt excessive overhead
On Wed, Jun 24, 2020 at 6:22 AM Mike Snitzer wrote: > > On Wed, Jun 24 2020 at 12:54am -0400, > Damien Le Moal wrote: > > > On 2020/06/24 0:23, Mike Snitzer wrote: > > > On Tue, Jun 23 2020 at 11:07am -0400, > > > Ignat Korchagin wrote: > > > > > >> Do you think it may be better to break it in two flags: one for read > > >> path and one for write? So, depending on the needs and workflow these > > >> could be enabled independently? > > > > > > If there is a need to split, then sure. But I think Damien had a hard > > > requirement that writes had to be inlined but that reads didn't _need_ > > > to be for his dm-zoned usecase. Damien may not yet have assessed the > > > performance implications, of not have reads inlined, as much as you > > > have. > > > > We did do performance testing :) > > The results are mixed and performance differences between inline vs > > workqueues > > depend on the workload (IO size, IO queue depth and number of drives being > > used > > mostly). In many cases, inlining everything does really improve performance > > as > > Ignat reported. > > > > In our testing, we used hard drives and so focused mostly on throughput > > rather > > than command latency. The added workqueue context switch overhead and crypto > > work latency compared to typical HDD IO times is small, and significant > > only if > > the backend storage as short IO times. > > > > In the case of HDDs, especially for large IO sizes, inlining crypto work > > does > > not shine as it prevents an efficient use of CPU resources. This is > > especially > > true with reads on a large system with many drives connected to a single > > HBA: > > the softirq context decryption work does not lend itself well to using other > > CPUs that did not receive the HBA IRQ signaling command completions. The > > test > > results clearly show much higher throughputs using dm-crypt as is. > > > > On the other hand, inlining crypto work significantly improves workloads of > > small random IOs, even for a large number of disks: removing the overhead of > > context switches allows faster completions, allowing sending more requests > > to > > the drives more quickly, keeping them busy. > > > > For SMR, the inlining of write requests is *mandatory* to preserve the > > issuer > > write sequence, but encryption work being done in the issuer context > > (writes to > > SMR drives can only be O_DIRECT writes), efficient CPU resource usage can be > > achieved by simply using multiple writer thread/processes, working on > > different > > zones of different disks. This is a very reasonable model for SMR as writes > > into > > a single zone have to be done under mutual exclusion to ensure > > sequentiality. > > > > For reads, SMR drives are essentially exactly the same as regular disks, so > > as-is or inline are both OK. Based on our performance results, allowing the > > user > > to have the choice of inlining or not reads based on the target workload > > would > > be great. > > > > Of note is that zone append writes (emulated in SCSI, native with NVMe) are > > not > > subject to the sequential write constraint, so they can also be executed > > either > > inline or asynchronously. > > > > > So let's see how Damien's work goes and if he trully doesn't need/want > > > reads to be inlined then 2 flags can be created. > > > > For SMR, I do not need inline reads, but I do want the user to have the > > possibility of using this setup as that can provide better performance for > > some > > workloads. I think that splitting the inline flag in 2 is exactly what we > > want: > > > > 1) For SMR, the write-inline flag can be automatically turned on when the > > target > > device is created if the backend device used is a host-managed zoned drive > > (scsi > > or NVMe ZNS). For reads, it would be the user choice, based on the target > > workload. > > 2) For regular block devices, write-inline only, read-inline only or both > > would > > be the user choice, to optimize for their target workload. > > > > With the split into 2 flags, my SMR support patch becomes very simple. > > OK, thanks for all the context. Was a fun read ;) > > SO let's run with splitting into 2 flags. Ignat would you be up to > tweaking your patch to provide that and post a v2? > > An added bonus would be to consolidate your 0/1 and 1/1 patch headers, > and add in the additional answers you provided in this thread to help > others understand the patch (mainly some more detail about why tasklet > is used). Yes, will do > Thanks, > Mike >
Re: [dm-crypt] [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
On Wed, Jun 24, 2020 at 6:04 AM Eric Biggers wrote: > > On Fri, Jun 19, 2020 at 05:41:32PM +0100, Ignat Korchagin wrote: > > Sometimes extra thread offloading imposed by dm-crypt hurts IO latency. > > This is > > especially visible on busy systems with many processes/threads. Moreover, > > most > > Crypto API implementaions are async, that is they offload crypto operations > > on > > their own, so this dm-crypt offloading is excessive. > > This really should say "some Crypto API implementations are async" instead of > "most Crypto API implementations are async". The most accurate would probably be: most hardware-accelerated Crypto API implementations are async > Notably, the AES-NI implementation of AES-XTS is synchronous if you call it > in a > context where SIMD instructions are usable. It's only asynchronous when SIMD > is > not usable. (This seems to have been missed in your blog post.) No, it was not. This is exactly why we made xts-proxy Crypto API module as a second patch. But it seems now it does not make a big difference if a used Crypto API implementation is synchronous as well (based on some benchmarks outlined in the cover letter to this patch). I think the v2 of this patch will not require a synchronous Crypto API. This is probably a right thing to do, as the "inline" flag should control the way how dm-crypt itself handles requests, not how Crypto API handles requests. If a user wants to ensure a particular synchronous Crypto API implementation, they can already reconfigure dm-crypt and specify the implementation with a "capi:" prefix in the the dm table description. > > This adds a new flag, which directs dm-crypt not to offload crypto > > operations > > and process everything inline. For cases, where crypto operations cannot > > happen > > inline (hard interrupt context, for example the read path of the NVME > > driver), > > we offload the work to a tasklet rather than a workqueue. > > This patch both removes some dm-crypt specific queueing, and changes > decryption > to use softIRQ context instead of a workqueue. It would be useful to know how > much of a difference the workqueue => softIRQ change makes by itself. Such a > change could be useful for fscrypt as well. (fscrypt uses a workqueue for > decryption, but besides that doesn't use any other queueing.) > > > @@ -127,7 +128,7 @@ struct iv_elephant_private { > > * and encrypts / decrypts at the same time. > > */ > > enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID, > > - DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD }; > > + DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD, DM_CRYPT_FORCE_INLINE = > > (sizeof(unsigned long) * 8 - 1) }; > > Assigning a specific enum value isn't necessary. Yes, this is a leftover from our "internal" patch which I wanted to make "future proof" in case future iterations of dm-crypt will add some flags to avoid flag collisions. Will remove in v2. > > > @@ -1458,13 +1459,18 @@ static void crypt_alloc_req_skcipher(struct > > crypt_config *cc, > > > > skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]); > > > > - /* > > - * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs > > - * requests if driver request queue is full. > > - */ > > - skcipher_request_set_callback(ctx->r.req, > > - CRYPTO_TFM_REQ_MAY_BACKLOG, > > - kcryptd_async_done, dmreq_of_req(cc, ctx->r.req)); > > + if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags)) > > + /* make sure we zero important fields of the request */ > > + skcipher_request_set_callback(ctx->r.req, > > + 0, NULL, NULL); > > + else > > + /* > > + * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs > > + * requests if driver request queue is full. > > + */ > > + skcipher_request_set_callback(ctx->r.req, > > + CRYPTO_TFM_REQ_MAY_BACKLOG, > > + kcryptd_async_done, dmreq_of_req(cc, ctx->r.req)); > > } > > This looks wrong. Unless type=0 and mask=CRYPTO_ALG_ASYNC are passed to > crypto_alloc_skcipher(), the skcipher implementation can still be > asynchronous, > in which case providing a callback is required. > > Do you intend that the "force_inline" option forces the use of a synchronous > skcipher (alongside the other things it does)? Or should it still allow > asynchronous ones? As mentioned above, I don't think we should require synchronous crypto with the "force_inline" flag anymore. Although we may remove CRYPTO_TFM_REQ_MAY_BACKLOG with the inline flag. > We may not actually have a choice in that matter, since xts-aes-aesni has the > CRYPTO_ALG_ASYNC bit set (as I mentioned) despite being synchronous in most > cases; thus, the crypto API won't give you it if you ask for a synchronous > cipher. So I think you still need to allow async skciphers? That means a > callback is still always required. > > - Eric
Re: [PATCH v2 5/8] net: af_can: do not leave a dangling sk pointer in can_create()
On Tue, Oct 8, 2024 at 3:38 AM Vincent MAILHOL wrote: > > Hi Ignat, > > Thanks for the patch. > > On Tue. 8 Oct. 2024 at 06:37, Ignat Korchagin wrote: > > On error can_create() frees the allocated sk object, but sock_init_data() > > has already attached it to the provided sock object. This will leave a > > dangling sk pointer in the sock object and may cause use-after-free later. > > I was about to suggest that this should be backported to stable, but > after reading the cover letter, I now understand that this patch is > more an improvement to avoid false positives on kmemleak & Cie. Maybe > the description could be a bit more nuanced? After patch 1/8 of this > series, it seems that the use-after-free is not possible anymore. If we go with Kuniyuki's suggestion in the cover email to replace the explicit NULL with DEBUG_NET_WARN_ON_ONCE(sock->sk) in __sock_create() then use-after-free would be possible again except we will easily catch it. But I will review the description, when I respin the patches to net-next as requested by Jakub. > > Signed-off-by: Ignat Korchagin > > See above comment as notwithstanding. This said: > > Reviewed-by: Vincent Mailhol Thank you. > > > --- > > net/can/af_can.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/net/can/af_can.c b/net/can/af_can.c > > index 707576eeeb58..01f3fbb3b67d 100644 > > --- a/net/can/af_can.c > > +++ b/net/can/af_can.c > > @@ -171,6 +171,7 @@ static int can_create(struct net *net, struct socket > > *sock, int protocol, > > /* release sk on errors */ > > sock_orphan(sk); > > sock_put(sk); > > + sock->sk = NULL; > > } > > > > errout: > > -- > > 2.39.5 > > > >
[PATCH v2 8/8] inet6: do not leave a dangling sk pointer in inet6_create()
sock_init_data() attaches the allocated sk pointer to the provided sock object. If inet6_create() fails later, the sk object is released, but the sock object retains the dangling sk pointer, which may cause use-after-free later. Clear the sock sk pointer on error. Signed-off-by: Ignat Korchagin --- net/ipv6/af_inet6.c | 22 ++ 1 file changed, 10 insertions(+), 12 deletions(-) diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c index ba69b86f1c7d..f60ec8b0f8ea 100644 --- a/net/ipv6/af_inet6.c +++ b/net/ipv6/af_inet6.c @@ -252,31 +252,29 @@ static int inet6_create(struct net *net, struct socket *sock, int protocol, */ inet->inet_sport = htons(inet->inet_num); err = sk->sk_prot->hash(sk); - if (err) { - sk_common_release(sk); - goto out; - } + if (err) + goto out_sk_release; } if (sk->sk_prot->init) { err = sk->sk_prot->init(sk); - if (err) { - sk_common_release(sk); - goto out; - } + if (err) + goto out_sk_release; } if (!kern) { err = BPF_CGROUP_RUN_PROG_INET_SOCK(sk); - if (err) { - sk_common_release(sk); - goto out; - } + if (err) + goto out_sk_release; } out: return err; out_rcu_unlock: rcu_read_unlock(); goto out; +out_sk_release: + sk_common_release(sk); + sock->sk = NULL; + goto out; } static int __inet6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len, -- 2.39.5
[PATCH v2 1/8] net: explicitly clear the sk pointer, when pf->create fails
We have recently noticed the exact same KASAN splat as in commit 6cd4a78d962b ("net: do not leave a dangling sk pointer, when socket creation fails"). The problem is that commit did not fully address the problem, as some pf->create implementations do not use sk_common_release in their error paths. For example, we can use the same reproducer as in the above commit, but changing ping to arping. arping uses AF_PACKET socket and if packet_create fails, it will just sk_free the allocated sk object. While we could chase all the pf->create implementations and make sure they NULL the freed sk object on error from the socket, we can't guarantee future protocols will not make the same mistake. So it is easier to just explicitly NULL the sk pointer upon return from pf->create in __sock_create. We do know that pf->create always releases the allocated sk object on error, so if the pointer is not NULL, it is definitely dangling. Fixes: 6cd4a78d962b ("net: do not leave a dangling sk pointer, when socket creation fails") Signed-off-by: Ignat Korchagin Cc: sta...@vger.kernel.org --- net/core/sock.c | 3 --- net/socket.c| 7 ++- 2 files changed, 6 insertions(+), 4 deletions(-) diff --git a/net/core/sock.c b/net/core/sock.c index 039be95c40cf..e6e04081949c 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -3819,9 +3819,6 @@ void sk_common_release(struct sock *sk) sk->sk_prot->unhash(sk); - if (sk->sk_socket) - sk->sk_socket->sk = NULL; - /* * In this point socket cannot receive new packets, but it is possible * that some packets are in flight because some CPU runs receiver and diff --git a/net/socket.c b/net/socket.c index 601ad74930ef..042451f01c65 100644 --- a/net/socket.c +++ b/net/socket.c @@ -1574,8 +1574,13 @@ int __sock_create(struct net *net, int family, int type, int protocol, rcu_read_unlock(); err = pf->create(net, sock, protocol, kern); - if (err < 0) + if (err < 0) { + /* ->create should release the allocated sock->sk object on error +* but it may leave the dangling pointer +*/ + sock->sk = NULL; goto out_module_put; + } /* * Now to bump the refcnt of the [loadable] module that owns this -- 2.39.5
[PATCH v2 0/8] do not leave dangling sk pointers in pf->create functions
Some protocol family create() implementations have an error path after allocating the sk object and calling sock_init_data(). sock_init_data() attaches the allocated sk object to the sock object, provided by the caller. If the create() implementation errors out after calling sock_init_data(), it releases the allocated sk object, but the caller ends up having a dangling sk pointer in its sock object on return. Subsequent manipulations on this sock object may try to access the sk pointer, because it is not NULL thus creating a use-after-free scenario. While the first patch in the series should be enough to handle this scenario Eric Dumazet suggested that it would be a good idea to refactor the code for the af_packet implementation to avoid the error path, which leaves a dangling pointer, because it may be better for some tools like kmemleak. I went a bit further and tried to actually fix all the implementations, which could potentially leave a dangling sk pointer. Changes in V2: * reverted the change introduced in 6cd4a78d962b ("net: do not leave a dangling sk pointer, when socket creation fails") * added optional commits to all pf->create implementaions to clear the sk pointer on error after sock_init_data() Ignat Korchagin (8): net: explicitly clear the sk pointer, when pf->create fails af_packet: avoid erroring out after sock_init_data() in packet_create() Bluetooth: L2CAP: do not leave dangling sk pointer on error in l2cap_sock_create() Bluetooth: RFCOMM: avoid leaving dangling sk pointer in rfcomm_sock_alloc() net: af_can: do not leave a dangling sk pointer in can_create() net: ieee802154: do not leave a dangling sk pointer in ieee802154_create() net: inet: do not leave a dangling sk pointer in inet_create() inet6: do not leave a dangling sk pointer in inet6_create() net/bluetooth/l2cap_sock.c | 1 + net/bluetooth/rfcomm/sock.c | 10 +- net/can/af_can.c| 1 + net/core/sock.c | 3 --- net/ieee802154/socket.c | 12 +++- net/ipv4/af_inet.c | 22 ++ net/ipv6/af_inet6.c | 22 ++ net/packet/af_packet.c | 12 ++-- net/socket.c| 7 ++- 9 files changed, 46 insertions(+), 44 deletions(-) -- 2.39.5
[PATCH v2 2/8] af_packet: avoid erroring out after sock_init_data() in packet_create()
After sock_init_data() the allocated sk object is attached to the provided sock object. On error, packet_create() frees the sk object leaving the dangling pointer in the sock object on return. Some other code may try to use this pointer and cause use-after-free. Suggested-by: Eric Dumazet Signed-off-by: Ignat Korchagin --- net/packet/af_packet.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index a705ec214254..97774bd4b6cb 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -3421,17 +3421,17 @@ static int packet_create(struct net *net, struct socket *sock, int protocol, if (sock->type == SOCK_PACKET) sock->ops = &packet_ops_spkt; + po = pkt_sk(sk); + err = packet_alloc_pending(po); + if (err) + goto out_sk_free; + sock_init_data(sock, sk); - po = pkt_sk(sk); init_completion(&po->skb_completion); sk->sk_family = PF_PACKET; po->num = proto; - err = packet_alloc_pending(po); - if (err) - goto out2; - packet_cached_dev_reset(po); sk->sk_destruct = packet_sock_destruct; @@ -3463,7 +3463,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol, sock_prot_inuse_add(net, &packet_proto, 1); return 0; -out2: +out_sk_free: sk_free(sk); out: return err; -- 2.39.5
[PATCH v2 4/8] Bluetooth: RFCOMM: avoid leaving dangling sk pointer in rfcomm_sock_alloc()
bt_sock_alloc() attaches allocated sk object to the provided sock object. If rfcomm_dlc_alloc() fails, we release the sk object, but leave the dangling pointer in the sock object, which may cause use-after-free. Fix this by swapping calls to bt_sock_alloc() and rfcomm_dlc_alloc(). Signed-off-by: Ignat Korchagin --- net/bluetooth/rfcomm/sock.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/net/bluetooth/rfcomm/sock.c b/net/bluetooth/rfcomm/sock.c index 37d63d768afb..0d0c4311da57 100644 --- a/net/bluetooth/rfcomm/sock.c +++ b/net/bluetooth/rfcomm/sock.c @@ -274,13 +274,13 @@ static struct sock *rfcomm_sock_alloc(struct net *net, struct socket *sock, struct rfcomm_dlc *d; struct sock *sk; - sk = bt_sock_alloc(net, sock, &rfcomm_proto, proto, prio, kern); - if (!sk) + d = rfcomm_dlc_alloc(prio); + if (!d) return NULL; - d = rfcomm_dlc_alloc(prio); - if (!d) { - sk_free(sk); + sk = bt_sock_alloc(net, sock, &rfcomm_proto, proto, prio, kern); + if (!sk) { + rfcomm_dlc_free(d); return NULL; } -- 2.39.5
[PATCH v2 3/8] Bluetooth: L2CAP: do not leave dangling sk pointer on error in l2cap_sock_create()
bt_sock_alloc() allocates the sk object and attaches it to the provided sock object. On error l2cap_sock_alloc() frees the sk object, but the dangling pointer is still attached to the sock object, which may create use-after-free in other code. Signed-off-by: Ignat Korchagin --- net/bluetooth/l2cap_sock.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c index ba437c6f6ee5..18e89e764f3b 100644 --- a/net/bluetooth/l2cap_sock.c +++ b/net/bluetooth/l2cap_sock.c @@ -1886,6 +1886,7 @@ static struct sock *l2cap_sock_alloc(struct net *net, struct socket *sock, chan = l2cap_chan_create(); if (!chan) { sk_free(sk); + sock->sk = NULL; return NULL; } -- 2.39.5
[PATCH v2 7/8] net: inet: do not leave a dangling sk pointer in inet_create()
sock_init_data() attaches the allocated sk object to the provided sock object. If inet_create() fails later, the sk object is freed, but the sock object retains the dangling pointer, which may create use-after-free later. Clear the sk pointer in the sock object on error. Signed-off-by: Ignat Korchagin --- net/ipv4/af_inet.c | 22 ++ 1 file changed, 10 insertions(+), 12 deletions(-) diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index b24d74616637..8095e82de808 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -376,32 +376,30 @@ static int inet_create(struct net *net, struct socket *sock, int protocol, inet->inet_sport = htons(inet->inet_num); /* Add to protocol hash chains. */ err = sk->sk_prot->hash(sk); - if (err) { - sk_common_release(sk); - goto out; - } + if (err) + goto out_sk_release; } if (sk->sk_prot->init) { err = sk->sk_prot->init(sk); - if (err) { - sk_common_release(sk); - goto out; - } + if (err) + goto out_sk_release; } if (!kern) { err = BPF_CGROUP_RUN_PROG_INET_SOCK(sk); - if (err) { - sk_common_release(sk); - goto out; - } + if (err) + goto out_sk_release; } out: return err; out_rcu_unlock: rcu_read_unlock(); goto out; +out_sk_release: + sk_common_release(sk); + sock->sk = NULL; + goto out; } -- 2.39.5
[PATCH v2 5/8] net: af_can: do not leave a dangling sk pointer in can_create()
On error can_create() frees the allocated sk object, but sock_init_data() has already attached it to the provided sock object. This will leave a dangling sk pointer in the sock object and may cause use-after-free later. Signed-off-by: Ignat Korchagin --- net/can/af_can.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/can/af_can.c b/net/can/af_can.c index 707576eeeb58..01f3fbb3b67d 100644 --- a/net/can/af_can.c +++ b/net/can/af_can.c @@ -171,6 +171,7 @@ static int can_create(struct net *net, struct socket *sock, int protocol, /* release sk on errors */ sock_orphan(sk); sock_put(sk); + sock->sk = NULL; } errout: -- 2.39.5
[PATCH v2 6/8] net: ieee802154: do not leave a dangling sk pointer in ieee802154_create()
sock_init_data() attaches the allocated sk object to the provided sock object. If ieee802154_create() fails later, the allocated sk object is freed, but the dangling pointer remains in the provided sock object, which may allow use-after-free. Clear the sk pointer in the sock object on error. Signed-off-by: Ignat Korchagin --- net/ieee802154/socket.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/net/ieee802154/socket.c b/net/ieee802154/socket.c index 990a83455dcf..18d267921bb5 100644 --- a/net/ieee802154/socket.c +++ b/net/ieee802154/socket.c @@ -1043,19 +1043,21 @@ static int ieee802154_create(struct net *net, struct socket *sock, if (sk->sk_prot->hash) { rc = sk->sk_prot->hash(sk); - if (rc) { - sk_common_release(sk); - goto out; - } + if (rc) + goto out_sk_release; } if (sk->sk_prot->init) { rc = sk->sk_prot->init(sk); if (rc) - sk_common_release(sk); + goto out_sk_release; } out: return rc; +out_sk_release: + sk_common_release(sk); + sock->sk = NULL; + goto out; } static const struct net_proto_family ieee802154_family_ops = { -- 2.39.5
[PATCH net-next v3 3/9] Bluetooth: RFCOMM: avoid leaving dangling sk pointer in rfcomm_sock_alloc()
bt_sock_alloc() attaches allocated sk object to the provided sock object. If rfcomm_dlc_alloc() fails, we release the sk object, but leave the dangling pointer in the sock object, which may cause use-after-free. Fix this by swapping calls to bt_sock_alloc() and rfcomm_dlc_alloc(). Signed-off-by: Ignat Korchagin --- net/bluetooth/rfcomm/sock.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/net/bluetooth/rfcomm/sock.c b/net/bluetooth/rfcomm/sock.c index f48250e3f2e1..355e1a1698f5 100644 --- a/net/bluetooth/rfcomm/sock.c +++ b/net/bluetooth/rfcomm/sock.c @@ -274,13 +274,13 @@ static struct sock *rfcomm_sock_alloc(struct net *net, struct socket *sock, struct rfcomm_dlc *d; struct sock *sk; - sk = bt_sock_alloc(net, sock, &rfcomm_proto, proto, prio, kern); - if (!sk) + d = rfcomm_dlc_alloc(prio); + if (!d) return NULL; - d = rfcomm_dlc_alloc(prio); - if (!d) { - sk_free(sk); + sk = bt_sock_alloc(net, sock, &rfcomm_proto, proto, prio, kern); + if (!sk) { + rfcomm_dlc_free(d); return NULL; } -- 2.39.5
[PATCH net-next v3 2/9] Bluetooth: L2CAP: do not leave dangling sk pointer on error in l2cap_sock_create()
bt_sock_alloc() allocates the sk object and attaches it to the provided sock object. On error l2cap_sock_alloc() frees the sk object, but the dangling pointer is still attached to the sock object, which may create use-after-free in other code. Signed-off-by: Ignat Korchagin --- net/bluetooth/l2cap_sock.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c index ba437c6f6ee5..18e89e764f3b 100644 --- a/net/bluetooth/l2cap_sock.c +++ b/net/bluetooth/l2cap_sock.c @@ -1886,6 +1886,7 @@ static struct sock *l2cap_sock_alloc(struct net *net, struct socket *sock, chan = l2cap_chan_create(); if (!chan) { sk_free(sk); + sock->sk = NULL; return NULL; } -- 2.39.5
[PATCH net-next v3 1/9] af_packet: avoid erroring out after sock_init_data() in packet_create()
After sock_init_data() the allocated sk object is attached to the provided sock object. On error, packet_create() frees the sk object leaving the dangling pointer in the sock object on return. Some other code may try to use this pointer and cause use-after-free. Suggested-by: Eric Dumazet Signed-off-by: Ignat Korchagin --- net/packet/af_packet.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index f8942062f776..99ae27d1e4dc 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -3421,17 +3421,17 @@ static int packet_create(struct net *net, struct socket *sock, int protocol, if (sock->type == SOCK_PACKET) sock->ops = &packet_ops_spkt; + po = pkt_sk(sk); + err = packet_alloc_pending(po); + if (err) + goto out_sk_free; + sock_init_data(sock, sk); - po = pkt_sk(sk); init_completion(&po->skb_completion); sk->sk_family = PF_PACKET; po->num = proto; - err = packet_alloc_pending(po); - if (err) - goto out2; - packet_cached_dev_reset(po); sk->sk_destruct = packet_sock_destruct; @@ -3463,7 +3463,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol, sock_prot_inuse_add(net, &packet_proto, 1); return 0; -out2: +out_sk_free: sk_free(sk); out: return err; -- 2.39.5
[PATCH net-next v3 4/9] net: af_can: do not leave a dangling sk pointer in can_create()
On error can_create() frees the allocated sk object, but sock_init_data() has already attached it to the provided sock object. This will leave a dangling sk pointer in the sock object and may cause use-after-free later. Signed-off-by: Ignat Korchagin Reviewed-by: Vincent Mailhol --- net/can/af_can.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/can/af_can.c b/net/can/af_can.c index 707576eeeb58..01f3fbb3b67d 100644 --- a/net/can/af_can.c +++ b/net/can/af_can.c @@ -171,6 +171,7 @@ static int can_create(struct net *net, struct socket *sock, int protocol, /* release sk on errors */ sock_orphan(sk); sock_put(sk); + sock->sk = NULL; } errout: -- 2.39.5
[PATCH net-next v3 7/9] net: inet6: do not leave a dangling sk pointer in inet6_create()
sock_init_data() attaches the allocated sk pointer to the provided sock object. If inet6_create() fails later, the sk object is released, but the sock object retains the dangling sk pointer, which may cause use-after-free later. Clear the sock sk pointer on error. Signed-off-by: Ignat Korchagin --- net/ipv6/af_inet6.c | 22 ++ 1 file changed, 10 insertions(+), 12 deletions(-) diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c index ba69b86f1c7d..f60ec8b0f8ea 100644 --- a/net/ipv6/af_inet6.c +++ b/net/ipv6/af_inet6.c @@ -252,31 +252,29 @@ static int inet6_create(struct net *net, struct socket *sock, int protocol, */ inet->inet_sport = htons(inet->inet_num); err = sk->sk_prot->hash(sk); - if (err) { - sk_common_release(sk); - goto out; - } + if (err) + goto out_sk_release; } if (sk->sk_prot->init) { err = sk->sk_prot->init(sk); - if (err) { - sk_common_release(sk); - goto out; - } + if (err) + goto out_sk_release; } if (!kern) { err = BPF_CGROUP_RUN_PROG_INET_SOCK(sk); - if (err) { - sk_common_release(sk); - goto out; - } + if (err) + goto out_sk_release; } out: return err; out_rcu_unlock: rcu_read_unlock(); goto out; +out_sk_release: + sk_common_release(sk); + sock->sk = NULL; + goto out; } static int __inet6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len, -- 2.39.5
[PATCH net-next v3 8/9] net: warn, if pf->create does not clear sock->sk on error
All pf->create implementations have been fixed now to clear sock->sk on error, when they deallocate the allocated sk object. Put a warning in place to make sure we don't break this promise in the future. Suggested-by: Kuniyuki Iwashima Signed-off-by: Ignat Korchagin --- net/socket.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/socket.c b/net/socket.c index 24b404299015..9a8e4452b9b2 100644 --- a/net/socket.c +++ b/net/socket.c @@ -1576,9 +1576,9 @@ int __sock_create(struct net *net, int family, int type, int protocol, err = pf->create(net, sock, protocol, kern); if (err < 0) { /* ->create should release the allocated sock->sk object on error -* but it may leave the dangling pointer +* and make sure sock->sk is set to NULL to avoid use-after-free */ - sock->sk = NULL; + DEBUG_NET_WARN_ON_ONCE(sock->sk); goto out_module_put; } -- 2.39.5
[PATCH net-next v3 6/9] net: inet: do not leave a dangling sk pointer in inet_create()
sock_init_data() attaches the allocated sk object to the provided sock object. If inet_create() fails later, the sk object is freed, but the sock object retains the dangling pointer, which may create use-after-free later. Clear the sk pointer in the sock object on error. Signed-off-by: Ignat Korchagin --- net/ipv4/af_inet.c | 22 ++ 1 file changed, 10 insertions(+), 12 deletions(-) diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index b24d74616637..8095e82de808 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -376,32 +376,30 @@ static int inet_create(struct net *net, struct socket *sock, int protocol, inet->inet_sport = htons(inet->inet_num); /* Add to protocol hash chains. */ err = sk->sk_prot->hash(sk); - if (err) { - sk_common_release(sk); - goto out; - } + if (err) + goto out_sk_release; } if (sk->sk_prot->init) { err = sk->sk_prot->init(sk); - if (err) { - sk_common_release(sk); - goto out; - } + if (err) + goto out_sk_release; } if (!kern) { err = BPF_CGROUP_RUN_PROG_INET_SOCK(sk); - if (err) { - sk_common_release(sk); - goto out; - } + if (err) + goto out_sk_release; } out: return err; out_rcu_unlock: rcu_read_unlock(); goto out; +out_sk_release: + sk_common_release(sk); + sock->sk = NULL; + goto out; } -- 2.39.5
[PATCH net-next v3 5/9] net: ieee802154: do not leave a dangling sk pointer in ieee802154_create()
sock_init_data() attaches the allocated sk object to the provided sock object. If ieee802154_create() fails later, the allocated sk object is freed, but the dangling pointer remains in the provided sock object, which may allow use-after-free. Clear the sk pointer in the sock object on error. Signed-off-by: Ignat Korchagin Reviewed-by: Miquel Raynal --- net/ieee802154/socket.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/net/ieee802154/socket.c b/net/ieee802154/socket.c index 990a83455dcf..18d267921bb5 100644 --- a/net/ieee802154/socket.c +++ b/net/ieee802154/socket.c @@ -1043,19 +1043,21 @@ static int ieee802154_create(struct net *net, struct socket *sock, if (sk->sk_prot->hash) { rc = sk->sk_prot->hash(sk); - if (rc) { - sk_common_release(sk); - goto out; - } + if (rc) + goto out_sk_release; } if (sk->sk_prot->init) { rc = sk->sk_prot->init(sk); if (rc) - sk_common_release(sk); + goto out_sk_release; } out: return rc; +out_sk_release: + sk_common_release(sk); + sock->sk = NULL; + goto out; } static const struct net_proto_family ieee802154_family_ops = { -- 2.39.5
[PATCH net-next v3 9/9] Revert "net: do not leave a dangling sk pointer, when socket creation fails"
This reverts commit 6cd4a78d962bebbaf8beb7d2ead3f34120e3f7b2. inet/inet6->create() implementations have been fixed to explicitly NULL the allocated sk object on error. A warning was put in place to make sure any future changes will not leave a dangling pointer in pf->create() implementations. So this code is now redundant. Suggested-by: Kuniyuki Iwashima Signed-off-by: Ignat Korchagin --- net/core/sock.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/net/core/sock.c b/net/core/sock.c index 083d438d8b6f..a9391cb796a2 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -3830,9 +3830,6 @@ void sk_common_release(struct sock *sk) sk->sk_prot->unhash(sk); - if (sk->sk_socket) - sk->sk_socket->sk = NULL; - /* * In this point socket cannot receive new packets, but it is possible * that some packets are in flight because some CPU runs receiver and -- 2.39.5
[PATCH net-next v3 0/9] do not leave dangling sk pointers in pf->create functions
Some protocol family create() implementations have an error path after allocating the sk object and calling sock_init_data(). sock_init_data() attaches the allocated sk object to the sock object, provided by the caller. If the create() implementation errors out after calling sock_init_data(), it releases the allocated sk object, but the caller ends up having a dangling sk pointer in its sock object on return. Subsequent manipulations on this sock object may try to access the sk pointer, because it is not NULL thus creating a use-after-free scenario. We have implemented a stable hotfix in commit 631083143315 ("net: explicitly clear the sk pointer, when pf->create fails"), but this series aims to fix it properly by going through each of the pf->create() implementations and making sure they all don't return a sock object with a dangling pointer on error. Changes in V3: * retargeted the series to net-next * dropped the hotfix patch, which was merged into net already * replaced the hotfix code with DEBUG_NET_WARN_ON_ONCE() to catch future violations Changes in V2: * reverted the change introduced in 6cd4a78d962b ("net: do not leave a dangling sk pointer, when socket creation fails") * added optional commits to all pf->create implementaions to clear the sk pointer on error after sock_init_data() Ignat Korchagin (9): af_packet: avoid erroring out after sock_init_data() in packet_create() Bluetooth: L2CAP: do not leave dangling sk pointer on error in l2cap_sock_create() Bluetooth: RFCOMM: avoid leaving dangling sk pointer in rfcomm_sock_alloc() net: af_can: do not leave a dangling sk pointer in can_create() net: ieee802154: do not leave a dangling sk pointer in ieee802154_create() net: inet: do not leave a dangling sk pointer in inet_create() net: inet6: do not leave a dangling sk pointer in inet6_create() net: warn, if pf->create does not clear sock->sk on error Revert "net: do not leave a dangling sk pointer, when socket creation fails" net/bluetooth/l2cap_sock.c | 1 + net/bluetooth/rfcomm/sock.c | 10 +- net/can/af_can.c| 1 + net/core/sock.c | 3 --- net/ieee802154/socket.c | 12 +++- net/ipv4/af_inet.c | 22 ++ net/ipv6/af_inet6.c | 22 ++ net/packet/af_packet.c | 12 ++-- net/socket.c| 4 ++-- 9 files changed, 42 insertions(+), 45 deletions(-) -- 2.39.5