from:"Ignat Korchagin"

[PATCH] dm crypt: defer the decryption to a tasklet, when being called with interrupts disabled

2021-01-13 Thread Ignat Korchagin

On some specific hardware on early boot we occasionally get

[ 1193.920255][T0] BUG: sleeping function called from invalid context at 
mm/mempool.c:381
[ 1193.936616][T0] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 
0, name: swapper/69
[ 1193.953233][T0] no locks held by swapper/69/0.
[ 1193.965871][T0] irq event stamp: 575062
[ 1193.977724][T0] hardirqs last  enabled at (575061): [] 
tick_nohz_idle_exit+0xe2/0x3e0
[ 1194.002762][T0] hardirqs last disabled at (575062): [] 
flush_smp_call_function_from_idle+0x4f/0x80
[ 1194.029035][T0] softirqs last  enabled at (575050): [] 
asm_call_irq_on_stack+0x12/0x20
[ 1194.054227][T0] softirqs last disabled at (575043): [] 
asm_call_irq_on_stack+0x12/0x20
[ 1194.079389][T0] CPU: 69 PID: 0 Comm: swapper/69 Not tainted 
5.10.6-cloudflare-kasan-2021.1.4-dev #1
[ 1194.104103][T0] Hardware name: NULL R162-Z12-CD/MZ12-HD4-CD, BIOS R10 
06/04/2020
[ 1194.119591][T0] Call Trace:
[ 1194.130233][T0]  dump_stack+0x9a/0xcc
[ 1194.141617][T0]  ___might_sleep.cold+0x180/0x1b0
[ 1194.153825][T0]  mempool_alloc+0x16b/0x300
[ 1194.165313][T0]  ? remove_element+0x160/0x160
[ 1194.176961][T0]  ? blk_mq_end_request+0x4b/0x490
[ 1194.188778][T0]  crypt_convert+0x27f6/0x45f0 [dm_crypt]
[ 1194.201024][T0]  ? rcu_read_lock_sched_held+0x3f/0x70
[ 1194.212906][T0]  ? module_assert_mutex_or_preempt+0x3e/0x70
[ 1194.225318][T0]  ? __module_address.part.0+0x1b/0x3a0
[ 1194.237212][T0]  ? is_kernel_percpu_address+0x5b/0x190
[ 1194.249238][T0]  ? crypt_iv_tcw_ctr+0x4a0/0x4a0 [dm_crypt]
[ 1194.261593][T0]  ? is_module_address+0x25/0x40
[ 1194.272905][T0]  ? static_obj+0x8a/0xc0
[ 1194.283582][T0]  ? lockdep_init_map_waits+0x26a/0x700
[ 1194.295570][T0]  ? __raw_spin_lock_init+0x39/0x110
[ 1194.307330][T0]  kcryptd_crypt_read_convert+0x31c/0x560 [dm_crypt]
[ 1194.320496][T0]  ? kcryptd_queue_crypt+0x1be/0x380 [dm_crypt]
[ 1194.333203][T0]  blk_update_request+0x6d7/0x1500
[ 1194.344841][T0]  ? blk_mq_trigger_softirq+0x190/0x190
[ 1194.356831][T0]  blk_mq_end_request+0x4b/0x490
[ 1194.367994][T0]  ? blk_mq_trigger_softirq+0x190/0x190
[ 1194.379693][T0]  flush_smp_call_function_queue+0x24b/0x560
[ 1194.391847][T0]  flush_smp_call_function_from_idle+0x59/0x80
[ 1194.403969][T0]  do_idle+0x287/0x450
[ 1194.413891][T0]  ? arch_cpu_idle_exit+0x40/0x40
[ 1194.424716][T0]  ? lockdep_hardirqs_on_prepare+0x286/0x3f0
[ 1194.436399][T0]  ? _raw_spin_unlock_irqrestore+0x39/0x40
[ 1194.447759][T0]  cpu_startup_entry+0x19/0x20
[ 1194.458038][T0]  secondary_startup_64_no_verify+0xb0/0xbb

IO completion can be queued to a different CPU by the block subsystem as a "call
single function/data". The CPU may run these routines from the idle task, but it
does so with interrupts disabled.

It is not a good idea to do decryption with irqs disabled even in an idle task
context, so just defer it to a tasklet as with requests from hard irqs.

Fixes: 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd 
workqueues")
Cc:  # v5.9+
Signed-off-by: Ignat Korchagin 
---
 drivers/md/dm-crypt.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 53791138d78b..c56b01bfdb60 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -2091,8 +2091,11 @@ static void kcryptd_queue_crypt(struct dm_crypt_io *io)
 
if ((bio_data_dir(io->base_bio) == READ && 
test_bit(DM_CRYPT_NO_READ_WORKQUEUE, &cc->flags)) ||
(bio_data_dir(io->base_bio) == WRITE && 
test_bit(DM_CRYPT_NO_WRITE_WORKQUEUE, &cc->flags))) {
-   if (in_irq()) {
-   /* Crypto API's "skcipher_walk_first() refuses to work 
in hard IRQ context */
+   /* in_irq(): Crypto API's "skcipher_walk_first() refuses to 
work in hard IRQ context.
+* irqs_disabled(): the kernel may run some IO completion from 
the idle thread, but
+  it is being executed with irqs disabled.
+*/
+   if (in_irq() || irqs_disabled()) {
tasklet_init(&io->tasklet, kcryptd_crypt_tasklet, 
(unsigned long)&io->work);
tasklet_schedule(&io->tasklet);
return;
-- 
2.20.1

Re: [PATCH 5.10 045/152] dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq

2021-01-18 Thread Ignat Korchagin

On Mon, Jan 18, 2021 at 10:44 PM Pavel Machek  wrote:
>
>
> > Fix this by allocating crypto requests with GFP_ATOMIC mask in
> > interrupt context.
> ...
>
> This one is wrong.
>
>
> > +++ b/drivers/md/dm-crypt.c
> > @@ -1454,13 +1454,16 @@ static int crypt_convert_block_skcipher(
> > - if (!ctx->r.req)
> > - ctx->r.req = mempool_alloc(&cc->req_pool, GFP_NOIO);
> > + if (!ctx->r.req) {
> > + ctx->r.req = mempool_alloc(&cc->req_pool, in_interrupt() ? 
> > GFP_ATOMIC : GFP_NOIO);
>
> Good so far. Ugly but good.
>
> > -static void crypt_alloc_req_aead(struct crypt_config *cc,
> > +static int crypt_alloc_req_aead(struct crypt_config *cc,
> >struct convert_context *ctx)
> >  {
> > - if (!ctx->r.req_aead)
> > - ctx->r.req_aead = mempool_alloc(&cc->req_pool, GFP_NOIO);
> > + if (!ctx->r.req) {
> > + ctx->r.req = mempool_alloc(&cc->req_pool, in_interrupt() ? 
> > GFP_ATOMIC : GFP_NOIO);
> > + if (!ctx->r.req)
> > + return -ENOMEM;
> > + }
>
> But this one can't be good. We are now allocating different field in
> the structure!

Good catch! Sorry for the copy-paste. It is actually not a big deal,
because this is not a structure, but a union:
as long as the mempool was initialized with the correct size, it
should be no different.

Nevertheless, I'll send the patch to fix the typo.

Regards,
Ignat

> Pavel
>
> --
> DENX Software Engineering GmbH,  Managing Director:Wolfgang Denk
> HRB 165235 Munich, Office: Kirchenstr.5, D-82194Groebenzell, Germany
>

[PATCH] dm crypt: fix invalid copy paste in crypt_alloc_req_aead

2021-01-19 Thread Ignat Korchagin

In commit d68b295 ("dm crypt: use GFP_ATOMIC when allocating crypto requests
from softirq") I wrongly copy pasted crypto request allocation code from
crypt_alloc_req_skcipher to crypt_alloc_req_aead. It is OK from runtime
perspective as both simple encryption request pointer and AEAD request pointer
are part of a union, but may confuse code reviewers.

Fixes: d68b295 ("dm crypt: use GFP_ATOMIC when allocating crypto requests from 
softirq")
Cc: sta...@vger.kernel.org # v5.9+
Reported-by: Pavel Machek 
Signed-off-by: Ignat Korchagin 
---
 drivers/md/dm-crypt.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 8c874710f0bc..5a55617a08e6 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1481,9 +1481,9 @@ static int crypt_alloc_req_skcipher(struct crypt_config 
*cc,
 static int crypt_alloc_req_aead(struct crypt_config *cc,
 struct convert_context *ctx)
 {
-   if (!ctx->r.req) {
-   ctx->r.req = mempool_alloc(&cc->req_pool, in_interrupt() ? 
GFP_ATOMIC : GFP_NOIO);
-   if (!ctx->r.req)
+   if (!ctx->r.req_aead) {
+   ctx->r.req_aead = mempool_alloc(&cc->req_pool, in_interrupt() ? 
GFP_ATOMIC : GFP_NOIO);
+   if (!ctx->r.req_aead)
return -ENOMEM;
}
 
-- 
2.20.1

[PATCH] dm crypt: do not call bio_endio() from the dm-crypt tasklet

2021-01-09 Thread Ignat Korchagin

 fb fb fb fb fb fb fb fb fb fb fb 
fb fb fb fb
[  398.370160][   C74] 
==

When the decryption fully completes in the tasklet, dm-crypt will call
bio_endio(), which in turn will call clone_endio() from dm.c core code. That
function frees the resources associated with the bio, including per bio private
structures. For dm-crypt it will free the current struct dm_crypt_io, which
contains our tasklet object, causing use-after-free, when the tasklet is being
dequeued by the kernel.

To avoid this, do not call bio_endio() from the current tasklet context, but
delay its execution to the dm-crypt IO workqueue.

Fixes: 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd 
workqueues")
Cc:  # v5.9+
Signed-off-by: Ignat Korchagin 
---
 drivers/md/dm-crypt.c | 24 +++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 53791138d78b..90f109d8a4a7 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1691,6 +1691,12 @@ static void crypt_inc_pending(struct dm_crypt_io *io)
atomic_inc(&io->io_pending);
 }
 
+static void kcryptd_io_bio_endio(struct work_struct *work)
+{
+   struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work);
+   bio_endio(io->base_bio);
+}
+
 /*
  * One of the bios was finished. Check for completion of
  * the whole request and correctly clean up the buffer.
@@ -1713,7 +1719,23 @@ static void crypt_dec_pending(struct dm_crypt_io *io)
kfree(io->integrity_metadata);
 
base_bio->bi_status = error;
-   bio_endio(base_bio);
+
+   /*
+* If we are running this function from our tasklet,
+* we can't call bio_endio() here, because it will call
+* clone_endio() from dm.c, which in turn will
+* free the current struct dm_crypt_io structure with
+* our tasklet. In this case we need to delay bio_endio()
+* execution to after the tasklet is done and dequeued.
+*/
+   if (tasklet_trylock(&io->tasklet)) {
+   tasklet_unlock(&io->tasklet);
+   bio_endio(base_bio);
+   return;
+   }
+
+   INIT_WORK(&io->work, kcryptd_io_bio_endio);
+   queue_work(cc->io_queue, &io->work);
 }
 
 /*
-- 
2.20.1

[PATCH 1/2] dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq

2020-12-29 Thread Ignat Korchagin

/0xe50
[  210.194313][C0]  ? blk_queue_enter+0x6d0/0x6d0
[  210.195372][C0]  ? __bio_add_page+0x246/0x3d0
[  210.196418][C0]  ? bio_iov_iter_get_pages+0x7dd/0xbe0
[  210.197611][C0]  ? submit_bio+0xe2/0x460
[  210.198481][C0]  ? submit_bio_noacct+0xe50/0xe50
[  210.199496][C0]  ? free_unref_page_commit.constprop.0+0x130/0x330
[  210.200825][C0]  ? __blkdev_direct_IO_simple+0x43b/0x7e0
[  210.202050][C0]  ? bd_link_disk_holder+0x690/0x690
[  210.203239][C0]  ? put_pages_list+0x210/0x210
[  210.204341][C0]  ? scan_shadow_nodes+0xb0/0xb0
[  210.205472][C0]  ? _raw_write_lock_irqsave+0xe0/0xe0
[  210.206698][C0]  ? bd_may_claim+0xc0/0xc0
[  210.207715][C0]  ? zero_user_segments.constprop.0+0x2e0/0x2e0
[  210.209092][C0]  ? blkdev_direct_IO+0xd16/0x1020
[  210.210200][C0]  ? pagevec_lookup_range_tag+0x28/0x60
[  210.211416][C0]  ? __filemap_fdatawait_range+0xc4/0x1f0
[  210.212669][C0]  ? page_cache_next_miss+0x1e0/0x1e0
[  210.213842][C0]  ? generic_file_buffered_read+0x520/0x9e0
[  210.215128][C0]  ? delete_from_page_cache_batch+0x850/0x850
[  210.216470][C0]  ? bd_abort_claiming+0xd0/0xd0
[  210.217531][C0]  ? file_remove_privs+0x74/0x430
[  210.218589][C0]  ? filemap_check_errors+0x50/0xe0
[  210.219705][C0]  ? generic_file_direct_write+0x1a3/0x480
[  210.220979][C0]  ? __generic_file_write_iter+0x1d9/0x530
[  210.38][C0]  ? blkdev_write_iter+0x20d/0x3e0
[  210.223328][C0]  ? bd_unlink_disk_holder+0x360/0x360
[  210.224464][C0]  ? new_sync_write+0x37b/0x620
[  210.225511][C0]  ? new_sync_read+0x610/0x610
[  210.226539][C0]  ? _cond_resched+0x17/0x80
[  210.227539][C0]  ? inode_security+0x58/0x100
[  210.228582][C0]  ? security_file_permission+0x54/0x450
[  210.229796][C0]  ? vfs_write+0x524/0x770
[  210.230758][C0]  ? __x64_sys_pwrite64+0x197/0x1f0
[  210.231890][C0]  ? vfs_write+0x770/0x770
[  210.232869][C0]  ? do_syscall_64+0x33/0x40
[  210.233839][C0]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix this by allocating crypto requests with GFP_ATOMIC mask in interrupt context

Fixes: 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd 
workqueues")
Reported-by: Maciej S. Szmigiero 
Cc:  # v5.9+
Signed-off-by: Ignat Korchagin 
---
 drivers/md/dm-crypt.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 5f9f9b3a226d..777b5c71a2f7 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1460,7 +1460,7 @@ static void crypt_alloc_req_skcipher(struct crypt_config 
*cc,
unsigned key_index = ctx->cc_sector & (cc->tfms_count - 1);
 
if (!ctx->r.req)
-   ctx->r.req = mempool_alloc(&cc->req_pool, GFP_NOIO);
+   ctx->r.req = mempool_alloc(&cc->req_pool, in_interrupt() ? 
GFP_ATOMIC : GFP_NOIO);
 
skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);
 
@@ -1477,7 +1477,7 @@ static void crypt_alloc_req_aead(struct crypt_config *cc,
 struct convert_context *ctx)
 {
if (!ctx->r.req_aead)
-   ctx->r.req_aead = mempool_alloc(&cc->req_pool, GFP_NOIO);
+   ctx->r.req_aead = mempool_alloc(&cc->req_pool, in_interrupt() ? 
GFP_ATOMIC : GFP_NOIO);
 
aead_request_set_tfm(ctx->r.req_aead, cc->cipher_tfm.tfms_aead[0]);
 
-- 
2.20.1

[PATCH 2/2] dm crypt: do not wait for backlogged crypto request completion in softirq

2020-12-29 Thread Ignat Korchagin

Commit 39d42fa96ba1b7d2544db3f8ed5da8fb0d5cb877 made it possible for some code
paths in dm-crypt to be executed in softirq context, when the underlying driver
processes IO requests in interrupt/softirq context.

When Crypto API backlogs a crypto request, dm-crypt uses wait_for_completion to
avoid sending further requests to an already overloaded crypto driver. However,
if the code is executing in softirq context, we might get the following
stacktrace:

[  210.235213][C0] BUG: scheduling while atomic: fio/2602/0x0102
[  210.236701][C0] Modules linked in:
[  210.237566][C0] CPU: 0 PID: 2602 Comm: fio Tainted: GW 
5.10.0+ #50
[  210.239292][C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS 0.0.0 02/06/2015
[  210.241233][C0] Call Trace:
[  210.241946][C0]  
[  210.242561][C0]  dump_stack+0x7d/0xa3
[  210.243466][C0]  __schedule_bug.cold+0xb3/0xc2
[  210.244539][C0]  __schedule+0x156f/0x20d0
[  210.245518][C0]  ? io_schedule_timeout+0x140/0x140
[  210.246660][C0]  schedule+0xd0/0x270
[  210.247541][C0]  schedule_timeout+0x1fb/0x280
[  210.248586][C0]  ? usleep_range+0x150/0x150
[  210.249624][C0]  ? unpoison_range+0x3a/0x60
[  210.250632][C0]  ? kasan_kmalloc.constprop.0+0x82/0xa0
[  210.251949][C0]  ? unpoison_range+0x3a/0x60
[  210.252958][C0]  ? __prepare_to_swait+0xa7/0x190
[  210.254067][C0]  do_wait_for_common+0x2ab/0x370
[  210.255158][C0]  ? usleep_range+0x150/0x150
[  210.256192][C0]  ? bit_wait_io_timeout+0x160/0x160
[  210.257358][C0]  ? blk_update_request+0x757/0x1150
[  210.258582][C0]  ? _raw_spin_lock_irq+0x82/0xd0
[  210.259674][C0]  ? _raw_read_unlock_irqrestore+0x30/0x30
[  210.260917][C0]  wait_for_completion+0x4c/0x90
[  210.261971][C0]  crypt_convert+0x19a6/0x4c00
[  210.263033][C0]  ? _raw_spin_lock_irqsave+0x87/0xe0
[  210.264193][C0]  ? kasan_set_track+0x1c/0x30
[  210.265191][C0]  ? crypt_iv_tcw_ctr+0x4a0/0x4a0
[  210.266283][C0]  ? kmem_cache_free+0x104/0x470
[  210.267363][C0]  ? crypt_endio+0x91/0x180
[  210.268327][C0]  kcryptd_crypt_read_convert+0x30e/0x420
[  210.269565][C0]  blk_update_request+0x757/0x1150
[  210.270563][C0]  blk_mq_end_request+0x4b/0x480
[  210.271680][C0]  blk_done_softirq+0x21d/0x340
[  210.272775][C0]  ? _raw_spin_lock+0x81/0xd0
[  210.273847][C0]  ? blk_mq_stop_hw_queue+0x30/0x30
[  210.275031][C0]  ? _raw_read_lock_irq+0x40/0x40
[  210.276182][C0]  __do_softirq+0x190/0x611
[  210.277203][C0]  ? handle_edge_irq+0x221/0xb60
[  210.278340][C0]  asm_call_irq_on_stack+0x12/0x20
[  210.279514][C0]  
[  210.280164][C0]  do_softirq_own_stack+0x37/0x40
[  210.281281][C0]  irq_exit_rcu+0x110/0x1b0
[  210.282286][C0]  common_interrupt+0x74/0x120
[  210.283376][C0]  asm_common_interrupt+0x1e/0x40
[  210.284496][C0] RIP: 0010:_aesni_enc1+0x65/0xb0

Fix this by making crypt_convert function reentrant from the point of a single
bio and make dm-crypt defer further bio processing to a workqueue, if Crypto API
backlogs a request in interrupt context.

Fixes: 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd workq
ueues")
Cc:  # v5.9+

Signed-off-by: Ignat Korchagin 
---
 drivers/md/dm-crypt.c | 102 +++---
 1 file changed, 97 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 777b5c71a2f7..6df907bd6c7c 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1529,13 +1529,19 @@ static void crypt_free_req(struct crypt_config *cc, 
void *req, struct bio *base_
  * Encrypt / decrypt data from one bio to another one (can be the same one)
  */
 static blk_status_t crypt_convert(struct crypt_config *cc,
-struct convert_context *ctx, bool atomic)
+struct convert_context *ctx, bool atomic, bool 
reset_pending)
 {
unsigned int tag_offset = 0;
unsigned int sector_step = cc->sector_size >> SECTOR_SHIFT;
int r;
 
-   atomic_set(&ctx->cc_pending, 1);
+   /*
+* if reset_pending is set we are dealing with the bio for the first 
time,
+* else we're continuing to work on the previous bio, so don't mess with
+* the cc_pending counter
+*/
+   if (reset_pending)
+   atomic_set(&ctx->cc_pending, 1);
 
while (ctx->iter_in.bi_size && ctx->iter_out.bi_size) {
 
@@ -1553,7 +1559,24 @@ static blk_status_t crypt_convert(struct crypt_config 
*cc,
 * but the driver request queue is full, let's wait.
 */
case -EBUSY:
-   wait_for_completion(&ctx->restart);
+   if (in_interrupt()) {
+   if (try_wait_for_completion(&ctx->restart)) {
+

Re: [PATCH 1/2] dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq

2020-12-30 Thread Ignat Korchagin

On Wed, Dec 30, 2020 at 7:36 AM Hillf Danton  wrote:
>
> On Tue, 29 Dec 2020 22:57:13 +
> >
> > Fixes: 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd 
> > workqueues")
>
> Looks like a seperate fix to this commit is needed if what can be found
> at (Subject: [patch 00/12] UBS: Cleanup in_interupt/in_irq/in_atomic() usage)
> https://lore.kernel.org/lkml/20201014145215.518912...@linutronix.de/

I think the above request should be satisfied by device mapper core
code itself rather than individual DM module implementations, as the
execution context in the module is dependent on the underlying block
driver: some block drivers complete requests in task contexts and some
in interrupt - but the underlying block drivers should be transparent
to the modules.

The device mapper core code can pass context information to the
modules if we are to avoid in_*irq() marcos in the code.

> is correct.
>
> > Reported-by: Maciej S. Szmigiero 
> > Cc:  # v5.9+
> > Signed-off-by: Ignat Korchagin 
> > ---
> >  drivers/md/dm-crypt.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> > index 5f9f9b3a226d..777b5c71a2f7 100644
> > --- a/drivers/md/dm-crypt.c
> > +++ b/drivers/md/dm-crypt.c
> > @@ -1460,7 +1460,7 @@ static void crypt_alloc_req_skcipher(struct 
> > crypt_config *cc,
> >   unsigned key_index = ctx->cc_sector & (cc->tfms_count - 1);
> >
> >   if (!ctx->r.req)
> > - ctx->r.req = mempool_alloc(&cc->req_pool, GFP_NOIO);
> > + ctx->r.req = mempool_alloc(&cc->req_pool, in_interrupt() ? 
> > GFP_ATOMIC : GFP_NOIO);
> >
> >   skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);
> >
> > @@ -1477,7 +1477,7 @@ static void crypt_alloc_req_aead(struct crypt_config 
> > *cc,
> >struct convert_context *ctx)
> >  {
> >   if (!ctx->r.req_aead)
> > - ctx->r.req_aead = mempool_alloc(&cc->req_pool, GFP_NOIO);
> > + ctx->r.req_aead = mempool_alloc(&cc->req_pool, in_interrupt() 
> > ? GFP_ATOMIC : GFP_NOIO);
> >
> >   aead_request_set_tfm(ctx->r.req_aead, cc->cipher_tfm.tfms_aead[0]);
> >
> > --
> > 2.20.1

[PATCH v2 1/2] dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq

2020-12-30 Thread Ignat Korchagin

/0xe50
[  210.194313][C0]  ? blk_queue_enter+0x6d0/0x6d0
[  210.195372][C0]  ? __bio_add_page+0x246/0x3d0
[  210.196418][C0]  ? bio_iov_iter_get_pages+0x7dd/0xbe0
[  210.197611][C0]  ? submit_bio+0xe2/0x460
[  210.198481][C0]  ? submit_bio_noacct+0xe50/0xe50
[  210.199496][C0]  ? free_unref_page_commit.constprop.0+0x130/0x330
[  210.200825][C0]  ? __blkdev_direct_IO_simple+0x43b/0x7e0
[  210.202050][C0]  ? bd_link_disk_holder+0x690/0x690
[  210.203239][C0]  ? put_pages_list+0x210/0x210
[  210.204341][C0]  ? scan_shadow_nodes+0xb0/0xb0
[  210.205472][C0]  ? _raw_write_lock_irqsave+0xe0/0xe0
[  210.206698][C0]  ? bd_may_claim+0xc0/0xc0
[  210.207715][C0]  ? zero_user_segments.constprop.0+0x2e0/0x2e0
[  210.209092][C0]  ? blkdev_direct_IO+0xd16/0x1020
[  210.210200][C0]  ? pagevec_lookup_range_tag+0x28/0x60
[  210.211416][C0]  ? __filemap_fdatawait_range+0xc4/0x1f0
[  210.212669][C0]  ? page_cache_next_miss+0x1e0/0x1e0
[  210.213842][C0]  ? generic_file_buffered_read+0x520/0x9e0
[  210.215128][C0]  ? delete_from_page_cache_batch+0x850/0x850
[  210.216470][C0]  ? bd_abort_claiming+0xd0/0xd0
[  210.217531][C0]  ? file_remove_privs+0x74/0x430
[  210.218589][C0]  ? filemap_check_errors+0x50/0xe0
[  210.219705][C0]  ? generic_file_direct_write+0x1a3/0x480
[  210.220979][C0]  ? __generic_file_write_iter+0x1d9/0x530
[  210.38][C0]  ? blkdev_write_iter+0x20d/0x3e0
[  210.223328][C0]  ? bd_unlink_disk_holder+0x360/0x360
[  210.224464][C0]  ? new_sync_write+0x37b/0x620
[  210.225511][C0]  ? new_sync_read+0x610/0x610
[  210.226539][C0]  ? _cond_resched+0x17/0x80
[  210.227539][C0]  ? inode_security+0x58/0x100
[  210.228582][C0]  ? security_file_permission+0x54/0x450
[  210.229796][C0]  ? vfs_write+0x524/0x770
[  210.230758][C0]  ? __x64_sys_pwrite64+0x197/0x1f0
[  210.231890][C0]  ? vfs_write+0x770/0x770
[  210.232869][C0]  ? do_syscall_64+0x33/0x40
[  210.233839][C0]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix this by allocating crypto requests with GFP_ATOMIC mask in interrupt context

Fixes: 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd 
workqueues")
Reported-by: Maciej S. Szmigiero 
Cc:  # v5.9+
Signed-off-by: Ignat Korchagin 
---
 drivers/md/dm-crypt.c | 33 +++--
 1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 53791138d78b..e4fd690c70e1 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1454,13 +1454,16 @@ static int crypt_convert_block_skcipher(struct 
crypt_config *cc,
 static void kcryptd_async_done(struct crypto_async_request *async_req,
   int error);
 
-static void crypt_alloc_req_skcipher(struct crypt_config *cc,
+static int crypt_alloc_req_skcipher(struct crypt_config *cc,
 struct convert_context *ctx)
 {
unsigned key_index = ctx->cc_sector & (cc->tfms_count - 1);
 
-   if (!ctx->r.req)
-   ctx->r.req = mempool_alloc(&cc->req_pool, GFP_NOIO);
+   if (!ctx->r.req) {
+   ctx->r.req = mempool_alloc(&cc->req_pool, in_interrupt() ? 
GFP_ATOMIC : GFP_NOIO);
+   if (!ctx->r.req)
+   return -ENOMEM;
+   }
 
skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);
 
@@ -1471,13 +1474,18 @@ static void crypt_alloc_req_skcipher(struct 
crypt_config *cc,
skcipher_request_set_callback(ctx->r.req,
CRYPTO_TFM_REQ_MAY_BACKLOG,
kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
+
+   return 0;
 }
 
-static void crypt_alloc_req_aead(struct crypt_config *cc,
+static int crypt_alloc_req_aead(struct crypt_config *cc,
 struct convert_context *ctx)
 {
-   if (!ctx->r.req_aead)
-   ctx->r.req_aead = mempool_alloc(&cc->req_pool, GFP_NOIO);
+   if (!ctx->r.req) {
+   ctx->r.req = mempool_alloc(&cc->req_pool, in_interrupt() ? 
GFP_ATOMIC : GFP_NOIO);
+   if (!ctx->r.req)
+   return -ENOMEM;
+   }
 
aead_request_set_tfm(ctx->r.req_aead, cc->cipher_tfm.tfms_aead[0]);
 
@@ -1488,15 +1496,17 @@ static void crypt_alloc_req_aead(struct crypt_config 
*cc,
aead_request_set_callback(ctx->r.req_aead,
CRYPTO_TFM_REQ_MAY_BACKLOG,
kcryptd_async_done, dmreq_of_req(cc, ctx->r.req_aead));
+
+   return 0;
 }
 
-static void crypt_alloc_req(struct crypt_config *cc,
+static int crypt_alloc_req(struct crypt_config *cc,
struct convert_context *ctx)
 {
if (crypt_integrity_aead(cc))
-   crypt_alloc_req_aead(cc, ctx);
+   return crypt_alloc_req_aead(cc, ctx);
else
-   cr

[PATCH v2 2/2] dm crypt: do not wait for backlogged crypto request completion in softirq

2020-12-30 Thread Ignat Korchagin

Commit 39d42fa96ba1b7d2544db3f8ed5da8fb0d5cb877 made it possible for some code
paths in dm-crypt to be executed in softirq context, when the underlying driver
processes IO requests in interrupt/softirq context.

When Crypto API backlogs a crypto request, dm-crypt uses wait_for_completion to
avoid sending further requests to an already overloaded crypto driver. However,
if the code is executing in softirq context, we might get the following
stacktrace:

[  210.235213][C0] BUG: scheduling while atomic: fio/2602/0x0102
[  210.236701][C0] Modules linked in:
[  210.237566][C0] CPU: 0 PID: 2602 Comm: fio Tainted: GW 
5.10.0+ #50
[  210.239292][C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS 0.0.0 02/06/2015
[  210.241233][C0] Call Trace:
[  210.241946][C0]  
[  210.242561][C0]  dump_stack+0x7d/0xa3
[  210.243466][C0]  __schedule_bug.cold+0xb3/0xc2
[  210.244539][C0]  __schedule+0x156f/0x20d0
[  210.245518][C0]  ? io_schedule_timeout+0x140/0x140
[  210.246660][C0]  schedule+0xd0/0x270
[  210.247541][C0]  schedule_timeout+0x1fb/0x280
[  210.248586][C0]  ? usleep_range+0x150/0x150
[  210.249624][C0]  ? unpoison_range+0x3a/0x60
[  210.250632][C0]  ? kasan_kmalloc.constprop.0+0x82/0xa0
[  210.251949][C0]  ? unpoison_range+0x3a/0x60
[  210.252958][C0]  ? __prepare_to_swait+0xa7/0x190
[  210.254067][C0]  do_wait_for_common+0x2ab/0x370
[  210.255158][C0]  ? usleep_range+0x150/0x150
[  210.256192][C0]  ? bit_wait_io_timeout+0x160/0x160
[  210.257358][C0]  ? blk_update_request+0x757/0x1150
[  210.258582][C0]  ? _raw_spin_lock_irq+0x82/0xd0
[  210.259674][C0]  ? _raw_read_unlock_irqrestore+0x30/0x30
[  210.260917][C0]  wait_for_completion+0x4c/0x90
[  210.261971][C0]  crypt_convert+0x19a6/0x4c00
[  210.263033][C0]  ? _raw_spin_lock_irqsave+0x87/0xe0
[  210.264193][C0]  ? kasan_set_track+0x1c/0x30
[  210.265191][C0]  ? crypt_iv_tcw_ctr+0x4a0/0x4a0
[  210.266283][C0]  ? kmem_cache_free+0x104/0x470
[  210.267363][C0]  ? crypt_endio+0x91/0x180
[  210.268327][C0]  kcryptd_crypt_read_convert+0x30e/0x420
[  210.269565][C0]  blk_update_request+0x757/0x1150
[  210.270563][C0]  blk_mq_end_request+0x4b/0x480
[  210.271680][C0]  blk_done_softirq+0x21d/0x340
[  210.272775][C0]  ? _raw_spin_lock+0x81/0xd0
[  210.273847][C0]  ? blk_mq_stop_hw_queue+0x30/0x30
[  210.275031][C0]  ? _raw_read_lock_irq+0x40/0x40
[  210.276182][C0]  __do_softirq+0x190/0x611
[  210.277203][C0]  ? handle_edge_irq+0x221/0xb60
[  210.278340][C0]  asm_call_irq_on_stack+0x12/0x20
[  210.279514][C0]  
[  210.280164][C0]  do_softirq_own_stack+0x37/0x40
[  210.281281][C0]  irq_exit_rcu+0x110/0x1b0
[  210.282286][C0]  common_interrupt+0x74/0x120
[  210.283376][C0]  asm_common_interrupt+0x1e/0x40
[  210.284496][C0] RIP: 0010:_aesni_enc1+0x65/0xb0

Fix this by making crypt_convert function reentrant from the point of a single
bio and make dm-crypt defer further bio processing to a workqueue, if Crypto API
backlogs a request in interrupt context.

Fixes: 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd workq
ueues")
Cc:  # v5.9+

Signed-off-by: Ignat Korchagin 
---
 drivers/md/dm-crypt.c | 102 +++---
 1 file changed, 97 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index e4fd690c70e1..0d939fbe51af 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1539,13 +1539,19 @@ static void crypt_free_req(struct crypt_config *cc, 
void *req, struct bio *base_
  * Encrypt / decrypt data from one bio to another one (can be the same one)
  */
 static blk_status_t crypt_convert(struct crypt_config *cc,
-struct convert_context *ctx, bool atomic)
+struct convert_context *ctx, bool atomic, bool 
reset_pending)
 {
unsigned int tag_offset = 0;
unsigned int sector_step = cc->sector_size >> SECTOR_SHIFT;
int r;
 
-   atomic_set(&ctx->cc_pending, 1);
+   /*
+* if reset_pending is set we are dealing with the bio for the first 
time,
+* else we're continuing to work on the previous bio, so don't mess with
+* the cc_pending counter
+*/
+   if (reset_pending)
+   atomic_set(&ctx->cc_pending, 1);
 
while (ctx->iter_in.bi_size && ctx->iter_out.bi_size) {
 
@@ -1566,7 +1572,24 @@ static blk_status_t crypt_convert(struct crypt_config 
*cc,
 * but the driver request queue is full, let's wait.
 */
case -EBUSY:
-   wait_for_completion(&ctx->restart);
+   if (in_interrupt()) {
+   if (try_wait_for_completion(&ctx->restart)) {
+

[PATCH v2 0/2] dm crypt: some fixes to support dm-crypt running in softirq context

2020-12-30 Thread Ignat Korchagin

Changes from v1:
0001: Handle memory allocation failure for GFP_ATOMIC

Ignat Korchagin (2):
  dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq
  dm crypt: do not wait for backlogged crypto request completion in
softirq

 drivers/md/dm-crypt.c | 135 +-
 1 file changed, 120 insertions(+), 15 deletions(-)

-- 
2.20.1

Re: [PATCH 1/2] dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq

2020-12-30 Thread Ignat Korchagin

Yes, good call!

Reposted with allocation failure handling.

Thanks,
Ignat

On Wed, Dec 30, 2020 at 6:11 PM Mikulas Patocka  wrote:
>
> Hi
>
> This patch doesn't handle allocation failure gracefully.
>
> Mikulas
>
>
>
> On Tue, 29 Dec 2020, Ignat Korchagin wrote:
>
> > Commit 39d42fa96ba1b7d2544db3f8ed5da8fb0d5cb877 made it possible for some 
> > code
> > paths in dm-crypt to be executed in softirq context, when the underlying 
> > driver
> > processes IO requests in interrupt/softirq context.
> >
> > In this case sometimes when allocating a new crypto request we may get a
> > stacktrace like below:
> >
> > [  210.103008][C0] BUG: sleeping function called from invalid context 
> > at mm/mempool.c:381
> > [  210.104746][C0] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, 
> > pid: 2602, name: fio
> > [  210.106599][C0] CPU: 0 PID: 2602 Comm: fio Tainted: GW   
> >   5.10.0+ #50
> > [  210.108331][C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 
> > 1996), BIOS 0.0.0 02/06/2015
> > [  210.110212][C0] Call Trace:
> > [  210.110921][C0]  
> > [  210.111527][C0]  dump_stack+0x7d/0xa3
> > [  210.112411][C0]  ___might_sleep.cold+0x122/0x151
> > [  210.113527][C0]  mempool_alloc+0x16b/0x2f0
> > [  210.114524][C0]  ? __queue_work+0x515/0xde0
> > [  210.115553][C0]  ? mempool_resize+0x700/0x700
> > [  210.116586][C0]  ? crypt_endio+0x91/0x180
> > [  210.117479][C0]  ? blk_update_request+0x757/0x1150
> > [  210.118513][C0]  ? blk_mq_end_request+0x4b/0x480
> > [  210.119572][C0]  ? blk_done_softirq+0x21d/0x340
> > [  210.120628][C0]  ? __do_softirq+0x190/0x611
> > [  210.121626][C0]  crypt_convert+0x29f9/0x4c00
> > [  210.122668][C0]  ? _raw_spin_lock_irqsave+0x87/0xe0
> > [  210.123824][C0]  ? kasan_set_track+0x1c/0x30
> > [  210.124858][C0]  ? crypt_iv_tcw_ctr+0x4a0/0x4a0
> > [  210.125930][C0]  ? kmem_cache_free+0x104/0x470
> > [  210.126973][C0]  ? crypt_endio+0x91/0x180
> > [  210.127947][C0]  kcryptd_crypt_read_convert+0x30e/0x420
> > [  210.129165][C0]  blk_update_request+0x757/0x1150
> > [  210.130231][C0]  blk_mq_end_request+0x4b/0x480
> > [  210.131294][C0]  blk_done_softirq+0x21d/0x340
> > [  210.132332][C0]  ? _raw_spin_lock+0x81/0xd0
> > [  210.133289][C0]  ? blk_mq_stop_hw_queue+0x30/0x30
> > [  210.134399][C0]  ? _raw_read_lock_irq+0x40/0x40
> > [  210.135458][C0]  __do_softirq+0x190/0x611
> > [  210.136409][C0]  ? handle_edge_irq+0x221/0xb60
> > [  210.137447][C0]  asm_call_irq_on_stack+0x12/0x20
> > [  210.138507][C0]  
> > [  210.139118][C0]  do_softirq_own_stack+0x37/0x40
> > [  210.140191][C0]  irq_exit_rcu+0x110/0x1b0
> > [  210.141151][C0]  common_interrupt+0x74/0x120
> > [  210.142171][C0]  asm_common_interrupt+0x1e/0x40
> > [  210.143206][C0] RIP: 0010:_aesni_enc1+0x65/0xb0
> > [  210.144313][C0] Code: 38 dc c2 41 0f 28 52 d0 66 0f 38 dc c2 41 0f 
> > 28 52 e0 66 0f 38 dc c2 41 0f 28 52 f0 66 0f 38 dc c2 41 0f 28 12 66 0f 38 
> > dc c2 <41> 0f 28 52 10 66 0f 38 dc c2 41 0f 28 52 20 66 0f 38 dc c2 41 0f
> > [  210.148542][C0] RSP: 0018:88810dbe6db0 EFLAGS: 0286
> > [  210.149842][C0] RAX: 9a90cdc0 RBX:  RCX: 
> > 0200
> > [  210.151576][C0] RDX: 888101e5f240 RSI: 888101e5f240 RDI: 
> > 888b5020
> > [  210.153339][C0] RBP: 88810dbe6e20 R08:  R09: 
> > 0020
> > [  210.155063][C0] R10: 888b5090 R11: 111021b7cdcc R12: 
> > 9e87cd40
> > [  210.156791][C0] R13: 888b5210 R14: 888101e5f0d8 R15: 
> > 
> > [  210.158497][C0]  ? aesni_set_key+0x1e0/0x1e0
> > [  210.159523][C0]  ? aesni_enc+0xf/0x20
> > [  210.160408][C0]  ? glue_xts_req_128bit+0x181/0x6f0
> > [  210.161571][C0]  ? aesni_set_key+0x1e0/0x1e0
> > [  210.162560][C0]  ? glue_ctr_req_128bit+0x630/0x630
> > [  210.163706][C0]  ? kasan_save_stack+0x37/0x50
> > [  210.164761][C0]  ? kasan_save_stack+0x20/0x50
> > [  210.165786][C0]  ? get_page_from_freelist+0x2052/0x36a0
> > [  210.167024][C0]  ? __blkdev_direct_IO_simple+0x43b/0x7e0
> > [  210.168288][C0]  ? blkdev_direct_IO+0xd16/0x1020
> > [  210.169396][C0]  ? generic_file_direct_write+0x1a3/0x480
> > [  210.170648][C0]  ? __generic_file_write_iter+0x1d9/0x530
> > [  210.171882][C0]  ? bl

Re: dm-crypt with no_read_workqueue and no_write_workqueue + btrfs scrub = BUG()

2020-12-23 Thread Ignat Korchagin

On Wed, Dec 23, 2020 at 3:37 PM Maciej S. Szmigiero
 wrote:
>
> On 14.12.2020 19:11, Maciej S. Szmigiero wrote:
> > Hi,
> >
> > I hit a reproducible BUG() when scrubbing a btrfs fs on top of
> > a dm-crypt device with no_read_workqueue and no_write_workqueue
> > flags enabled.
>
> Still happens on the current torvalds/master.
>
> Due to this bug it is not possible to use btrfs on top of
> a dm-crypt device with no_read_workqueue and no_write_workqueue
> flags enabled.
>
> @Ignat:
> Can you have a look at this as the person who added these flags?

I've been looking into this for the last couple of days because of
other reports [1].
Just finished testing a possible solution. Will submit soon.

> It looks like to me that the skcipher API might not be safe to
> call from a softirq context, after all.

It is less about skcipher API and more about how dm-crypt uses it as
well as some assumptions that it is always running in context which
can sleep.

> Maciej

Ignat

[1]: https://github.com/cloudflare/linux/issues/1

Re: dm-crypt with no_read_workqueue and no_write_workqueue + btrfs scrub = BUG()

2020-12-23 Thread Ignat Korchagin

On Wed, Dec 23, 2020 at 9:20 PM Maciej S. Szmigiero
 wrote:
>
> On 23.12.2020 22:09, Ignat Korchagin wrote:
> (..)
> > I've been looking into this for the last couple of days because of
> > other reports [1].
> > Just finished testing a possible solution. Will submit soon.
>
> Thanks for looking into it.
>
> By the way, on a bare metal I am actually hitting a different problem
> (scheduling while atomic) when scrubbing a btrfs filesystem, just as one
> of your users from that GitHub report had [1].

That is because dm-crypt calls "wait_for_completion" in rare cases
when Crypto API (cryptd) backlogs the decryption request. I've
reproduced that one as well (even with no FS).
We never hit these problems in the original testing probably due to
the fact we mostly used xtsproxy custom crypto module, which is
totally synchronous.
I did test it later with standard crypto, but did not encounter these
problems as well most likely because it is also depends which storage
driver underneath we are using: most of them do not submit read
requests to dm-crypt in irq/softirq context in the first place

> I've pasted that backtrace in my original Dec 14 message.
>
> Thanks,
> Maciej
>
> [1]: https://github.com/cloudflare/linux/issues/1#issuecomment-736734243

Regards,
Ignat

Re: dm-crypt with no_read_workqueue and no_write_workqueue + btrfs scrub = BUG()

2020-12-24 Thread Ignat Korchagin

On Wed, Dec 23, 2020 at 8:57 PM Herbert Xu  wrote:
>
> On Wed, Dec 23, 2020 at 04:37:34PM +0100, Maciej S. Szmigiero wrote:
> >
> > It looks like to me that the skcipher API might not be safe to
> > call from a softirq context, after all.
>
> skcipher is safe to use in a softirq.  The problem is only in
> dm-crypt where it tries to allocate memory with GFP_NOIO.

Hm.. After eliminating the GFP_NOIO (as well as some other sleeping
paths) from dm-crypt softirq code I still hit an occasional crash in
my extreme setup (QEMU with 1 CPU and cryptd_max_cpu_qlen set to 1)
(decoded with stacktrace_decode.sh):

[   89.324723] BUG: kernel NULL pointer dereference, address: 0008
[   89.325713] #PF: supervisor write access in kernel mode
[   89.326460] #PF: error_code(0x0002) - not-present page
[   89.327211] PGD 0 P4D 0
[   89.327589] Oops: 0002 [#1] PREEMPT SMP PTI
[   89.328200] CPU: 0 PID: 21 Comm: kworker/0:1 Not tainted 5.10.0+ #79
[   89.329109] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 0.0.0 02/06/2015
[   89.330284] Workqueue: cryptd cryptd_queue_worker
[   89.330999] RIP: 0010:crypto_dequeue_request
(/cfsetup_build/./include/linux/list.h:112
/cfsetup_build/./include/linux/list.h:135
/cfsetup_build/./include/linux/list.h:146
/cfsetup_build/crypto/algapi.c:957)
[ 89.331757] Code: e9 c9 d0 a8 48 c7 c7 f9 c9 d0 a8 e8 c2 88 fe ff 4c
8b 23 48 c7 c6 e9 c9 d0 a8 48 c7 c7 f9 c9 d0 a8 49 8b 14 24 49 8b 44
24 08 <48> 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 49 89 04 24
48
All code

   0: e9 c9 d0 a8 48jmpq   0x48a8d0ce
   5: c7 c7 f9 c9 d0 a8mov$0xa8d0c9f9,%edi
   b: e8 c2 88 fe ffcallq  0xfffe88d2
  10: 4c 8b 23  mov(%rbx),%r12
  13: 48 c7 c6 e9 c9 d0 a8 mov$0xa8d0c9e9,%rsi
  1a: 48 c7 c7 f9 c9 d0 a8 mov$0xa8d0c9f9,%rdi
  21: 49 8b 14 24  mov(%r12),%rdx
  25: 49 8b 44 24 08mov0x8(%r12),%rax
  2a:* 48 89 42 08  mov%rax,0x8(%rdx) <-- trapping instruction
  2e: 48 89 10  mov%rdx,(%rax)
  31: 48 b8 00 01 00 00 00 movabs $0xdead0100,%rax
  38: 00 ad de
  3b: 49 89 04 24  mov%rax,(%r12)
  3f: 48rex.W

Code starting with the faulting instruction
===
   0: 48 89 42 08  mov%rax,0x8(%rdx)
   4: 48 89 10  mov%rdx,(%rax)
   7: 48 b8 00 01 00 00 00 movabs $0xdead0100,%rax
   e: 00 ad de
  11: 49 89 04 24  mov%rax,(%r12)
  15: 48rex.W
[   89.334414] RSP: 0018:ba64c00bbe68 EFLAGS: 00010246
[   89.335165] RAX:  RBX: 9b9d6fc28d88 RCX: 
[   89.336182] RDX:  RSI: a8d0c9e9 RDI: a8d0c9f9
[   89.337204] RBP:  R08: a906e708 R09: 0058
[   89.338208] R10: a9068720 R11: fc00 R12: 9b9a43797478
[   89.339216] R13: 0020 R14: 9b9d6fc28e00 R15: 
[   89.340231] FS:  () GS:9b9d6fc0()
knlGS:
[   89.341376] CS:  0010 DS:  ES:  CR0: 80050033
[   89.342207] CR2: 0008 CR3: 00014cd76002 CR4: 00170ef0
[   89.343238] Call Trace:
[   89.343609] cryptd_queue_worker (/cfsetup_build/crypto/cryptd.c:172)
[   89.344218] process_one_work
(/cfsetup_build/./arch/x86/include/asm/preempt.h:26
/cfsetup_build/kernel/workqueue.c:2284)
[   89.344821] ? rescuer_thread (/cfsetup_build/kernel/workqueue.c:2364)
[   89.345399] worker_thread
(/cfsetup_build/./include/linux/list.h:282
/cfsetup_build/kernel/workqueue.c:2422)
[   89.345923] ? rescuer_thread (/cfsetup_build/kernel/workqueue.c:2364)
[   89.346504] kthread (/cfsetup_build/kernel/kthread.c:292)
[   89.346986] ? kthread_create_worker_on_cpu
(/cfsetup_build/kernel/kthread.c:245)
[   89.347713] ret_from_fork (/cfsetup_build/arch/x86/entry/entry_64.S:302)
[   89.348255] Modules linked in:
[   89.348708] CR2: 0008
[   89.349197] ---[ end trace b7e9618b4122ed3b ]---
[   89.349863] RIP: 0010:crypto_dequeue_request
(/cfsetup_build/./include/linux/list.h:112
/cfsetup_build/./include/linux/list.h:135
/cfsetup_build/./include/linux/list.h:146
/cfsetup_build/crypto/algapi.c:957)
[ 89.350606] Code: e9 c9 d0 a8 48 c7 c7 f9 c9 d0 a8 e8 c2 88 fe ff 4c
8b 23 48 c7 c6 e9 c9 d0 a8 48 c7 c7 f9 c9 d0 a8 49 8b 14 24 49 8b 44
24 08 <48> 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 49 89 04 24
48
All code

   0: e9 c9 d0 a8 48jmpq   0x48a8d0ce
   5: c7 c7 f9 c9 d0 a8mov$0xa8d0c9f9,%edi
   b: e8 c2 88 fe ffcallq  0xfffe88d2
  10: 4c 8b 23  mov(%rbx),%r12
  13: 48 c7 c6 e9 c9 d0 a8 mov$0xa8d0c9e9,%rsi
  1a: 48 c7 c7 f9 c9 d0 a8 mov$0xa8d0c9f9,%rdi
  21: 49 8b 14 24  mov(%r12),%rdx
  25: 49 8b 44 24 08mov0x8(%r12),%rax
  2a:* 48 89 42 08  mov%rax,0x8(%rdx) <--

Re: [PATCH v2 1/2] dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq

2021-01-01 Thread Ignat Korchagin

On Fri, Jan 1, 2021 at 10:00 AM Mikulas Patocka  wrote:
>
>
>
> On Wed, 30 Dec 2020, Ignat Korchagin wrote:
>
> > diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> > index 53791138d78b..e4fd690c70e1 100644
> > --- a/drivers/md/dm-crypt.c
> > +++ b/drivers/md/dm-crypt.c
> > @@ -1539,7 +1549,10 @@ static blk_status_t crypt_convert(struct 
> > crypt_config *cc,
> >
> >   while (ctx->iter_in.bi_size && ctx->iter_out.bi_size) {
> >
> > - crypt_alloc_req(cc, ctx);
> > + r = crypt_alloc_req(cc, ctx);
> > + if (r)
> > + return BLK_STS_RESOURCE;
> > +
> >   atomic_inc(&ctx->cc_pending);
> >
> >   if (crypt_integrity_aead(cc))
> > --
> > 2.20.1
>
> I'm not quite convinced that returning BLK_STS_RESOURCE will help. The
> block layer will convert this value back to -ENOMEM and return it to the
> caller, resulting in an I/O error.
>
> Note that GFP_ATOMIC allocations may fail anytime and you must handle
> allocation failure gracefully - i.e. process the request without any
> error.
>
> An acceptable solution would be to punt the request to a workqueue and do
> GFP_NOIO allocation from the workqueue. Or add the request to some list
> and process the list when some other request completes.

We can do the workqueue, if that's the desired behaviour. The second patch
from this patchset already adds the code for the request to be retried from the
workqueue if crypt_convert returns BLK_STS_DEV_RESOURCE, so all is needed
is to change returning BLK_STS_RESOURCE to BLK_STS_DEV_RESOURCE
here. Does that sound reasonable?

> You should write a test that simulates allocation failure and verify that
> the kernel handles it gracefully without any I/O error.

I already have the test for the second patch in the set, but will
adapt it to make sure the
allocation failure codepath is covered as well.

> Mikulas
>

[PATCH v3 1/2] dm crypt: do not wait for backlogged crypto request completion in softirq

2021-01-04 Thread Ignat Korchagin

Commit 39d42fa96ba1b7d2544db3f8ed5da8fb0d5cb877 made it possible for some code
paths in dm-crypt to be executed in softirq context, when the underlying driver
processes IO requests in interrupt/softirq context.

When Crypto API backlogs a crypto request, dm-crypt uses wait_for_completion to
avoid sending further requests to an already overloaded crypto driver. However,
if the code is executing in softirq context, we might get the following
stacktrace:

[  210.235213][C0] BUG: scheduling while atomic: fio/2602/0x0102
[  210.236701][C0] Modules linked in:
[  210.237566][C0] CPU: 0 PID: 2602 Comm: fio Tainted: GW 
5.10.0+ #50
[  210.239292][C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS 0.0.0 02/06/2015
[  210.241233][C0] Call Trace:
[  210.241946][C0]  
[  210.242561][C0]  dump_stack+0x7d/0xa3
[  210.243466][C0]  __schedule_bug.cold+0xb3/0xc2
[  210.244539][C0]  __schedule+0x156f/0x20d0
[  210.245518][C0]  ? io_schedule_timeout+0x140/0x140
[  210.246660][C0]  schedule+0xd0/0x270
[  210.247541][C0]  schedule_timeout+0x1fb/0x280
[  210.248586][C0]  ? usleep_range+0x150/0x150
[  210.249624][C0]  ? unpoison_range+0x3a/0x60
[  210.250632][C0]  ? kasan_kmalloc.constprop.0+0x82/0xa0
[  210.251949][C0]  ? unpoison_range+0x3a/0x60
[  210.252958][C0]  ? __prepare_to_swait+0xa7/0x190
[  210.254067][C0]  do_wait_for_common+0x2ab/0x370
[  210.255158][C0]  ? usleep_range+0x150/0x150
[  210.256192][C0]  ? bit_wait_io_timeout+0x160/0x160
[  210.257358][C0]  ? blk_update_request+0x757/0x1150
[  210.258582][C0]  ? _raw_spin_lock_irq+0x82/0xd0
[  210.259674][C0]  ? _raw_read_unlock_irqrestore+0x30/0x30
[  210.260917][C0]  wait_for_completion+0x4c/0x90
[  210.261971][C0]  crypt_convert+0x19a6/0x4c00
[  210.263033][C0]  ? _raw_spin_lock_irqsave+0x87/0xe0
[  210.264193][C0]  ? kasan_set_track+0x1c/0x30
[  210.265191][C0]  ? crypt_iv_tcw_ctr+0x4a0/0x4a0
[  210.266283][C0]  ? kmem_cache_free+0x104/0x470
[  210.267363][C0]  ? crypt_endio+0x91/0x180
[  210.268327][C0]  kcryptd_crypt_read_convert+0x30e/0x420
[  210.269565][C0]  blk_update_request+0x757/0x1150
[  210.270563][C0]  blk_mq_end_request+0x4b/0x480
[  210.271680][C0]  blk_done_softirq+0x21d/0x340
[  210.272775][C0]  ? _raw_spin_lock+0x81/0xd0
[  210.273847][C0]  ? blk_mq_stop_hw_queue+0x30/0x30
[  210.275031][C0]  ? _raw_read_lock_irq+0x40/0x40
[  210.276182][C0]  __do_softirq+0x190/0x611
[  210.277203][C0]  ? handle_edge_irq+0x221/0xb60
[  210.278340][C0]  asm_call_irq_on_stack+0x12/0x20
[  210.279514][C0]  
[  210.280164][C0]  do_softirq_own_stack+0x37/0x40
[  210.281281][C0]  irq_exit_rcu+0x110/0x1b0
[  210.282286][C0]  common_interrupt+0x74/0x120
[  210.283376][C0]  asm_common_interrupt+0x1e/0x40
[  210.284496][C0] RIP: 0010:_aesni_enc1+0x65/0xb0

Fix this by making crypt_convert function reentrant from the point of a single
bio and make dm-crypt defer further bio processing to a workqueue, if Crypto API
backlogs a request in interrupt context.

Fixes: 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd workq
ueues")
Cc:  # v5.9+

Signed-off-by: Ignat Korchagin 
---
 drivers/md/dm-crypt.c | 103 --
 1 file changed, 98 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 392337f16ecf..a3326dadfd4d 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1529,13 +1529,19 @@ static void crypt_free_req(struct crypt_config *cc, 
void *req, struct bio *base_
  * Encrypt / decrypt data from one bio to another one (can be the same one)
  */
 static blk_status_t crypt_convert(struct crypt_config *cc,
-struct convert_context *ctx, bool atomic)
+struct convert_context *ctx, bool atomic, bool 
reset_pending)
 {
unsigned int tag_offset = 0;
unsigned int sector_step = cc->sector_size >> SECTOR_SHIFT;
int r;
 
-   atomic_set(&ctx->cc_pending, 1);
+   /*
+* if reset_pending is set we are dealing with the bio for the first 
time,
+* else we're continuing to work on the previous bio, so don't mess with
+* the cc_pending counter
+*/
+   if (reset_pending)
+   atomic_set(&ctx->cc_pending, 1);
 
while (ctx->iter_in.bi_size && ctx->iter_out.bi_size) {
 
@@ -1553,7 +1559,25 @@ static blk_status_t crypt_convert(struct crypt_config 
*cc,
 * but the driver request queue is full, let's wait.
 */
case -EBUSY:
-   wait_for_completion(&ctx->restart);
+   if (in_interrupt()) {
+   if (try_wait_for_completion(&ctx->restart)) {
+

[PATCH v3 0/2] dm crypt: some fixes to support dm-crypt running in softirq context

2021-01-04 Thread Ignat Korchagin

Changes from v1:
  * 0001: handle memory allocation failure for GFP_ATOMIC

Changes from v2:
  * reordered patches
  * 0002: crypt_convert will be retried from a workqueue, when a crypto request
allocation fails with GFP_ATOMIC instead of just returning an IO error to
the user

Ignat Korchagin (2):
  dm crypt: do not wait for backlogged crypto request completion in
softirq
  dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq

 drivers/md/dm-crypt.c | 138 +-
 1 file changed, 123 insertions(+), 15 deletions(-)

-- 
2.20.1

[PATCH v3 2/2] dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq

2021-01-04 Thread Ignat Korchagin

/0xe50
[  210.194313][C0]  ? blk_queue_enter+0x6d0/0x6d0
[  210.195372][C0]  ? __bio_add_page+0x246/0x3d0
[  210.196418][C0]  ? bio_iov_iter_get_pages+0x7dd/0xbe0
[  210.197611][C0]  ? submit_bio+0xe2/0x460
[  210.198481][C0]  ? submit_bio_noacct+0xe50/0xe50
[  210.199496][C0]  ? free_unref_page_commit.constprop.0+0x130/0x330
[  210.200825][C0]  ? __blkdev_direct_IO_simple+0x43b/0x7e0
[  210.202050][C0]  ? bd_link_disk_holder+0x690/0x690
[  210.203239][C0]  ? put_pages_list+0x210/0x210
[  210.204341][C0]  ? scan_shadow_nodes+0xb0/0xb0
[  210.205472][C0]  ? _raw_write_lock_irqsave+0xe0/0xe0
[  210.206698][C0]  ? bd_may_claim+0xc0/0xc0
[  210.207715][C0]  ? zero_user_segments.constprop.0+0x2e0/0x2e0
[  210.209092][C0]  ? blkdev_direct_IO+0xd16/0x1020
[  210.210200][C0]  ? pagevec_lookup_range_tag+0x28/0x60
[  210.211416][C0]  ? __filemap_fdatawait_range+0xc4/0x1f0
[  210.212669][C0]  ? page_cache_next_miss+0x1e0/0x1e0
[  210.213842][C0]  ? generic_file_buffered_read+0x520/0x9e0
[  210.215128][C0]  ? delete_from_page_cache_batch+0x850/0x850
[  210.216470][C0]  ? bd_abort_claiming+0xd0/0xd0
[  210.217531][C0]  ? file_remove_privs+0x74/0x430
[  210.218589][C0]  ? filemap_check_errors+0x50/0xe0
[  210.219705][C0]  ? generic_file_direct_write+0x1a3/0x480
[  210.220979][C0]  ? __generic_file_write_iter+0x1d9/0x530
[  210.38][C0]  ? blkdev_write_iter+0x20d/0x3e0
[  210.223328][C0]  ? bd_unlink_disk_holder+0x360/0x360
[  210.224464][C0]  ? new_sync_write+0x37b/0x620
[  210.225511][C0]  ? new_sync_read+0x610/0x610
[  210.226539][C0]  ? _cond_resched+0x17/0x80
[  210.227539][C0]  ? inode_security+0x58/0x100
[  210.228582][C0]  ? security_file_permission+0x54/0x450
[  210.229796][C0]  ? vfs_write+0x524/0x770
[  210.230758][C0]  ? __x64_sys_pwrite64+0x197/0x1f0
[  210.231890][C0]  ? vfs_write+0x770/0x770
[  210.232869][C0]  ? do_syscall_64+0x33/0x40
[  210.233839][C0]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix this by allocating crypto requests with GFP_ATOMIC mask in interrupt context

Fixes: 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd 
workqueues")
Reported-by: Maciej S. Szmigiero 
Cc:  # v5.9+
Signed-off-by: Ignat Korchagin 
---
 drivers/md/dm-crypt.c | 35 +--
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index a3326dadfd4d..1f471dd75144 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1454,13 +1454,16 @@ static int crypt_convert_block_skcipher(struct 
crypt_config *cc,
 static void kcryptd_async_done(struct crypto_async_request *async_req,
   int error);
 
-static void crypt_alloc_req_skcipher(struct crypt_config *cc,
+static int crypt_alloc_req_skcipher(struct crypt_config *cc,
 struct convert_context *ctx)
 {
unsigned key_index = ctx->cc_sector & (cc->tfms_count - 1);
 
-   if (!ctx->r.req)
-   ctx->r.req = mempool_alloc(&cc->req_pool, GFP_NOIO);
+   if (!ctx->r.req) {
+   ctx->r.req = mempool_alloc(&cc->req_pool, in_interrupt() ? 
GFP_ATOMIC : GFP_NOIO);
+   if (!ctx->r.req)
+   return -ENOMEM;
+   }
 
skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);
 
@@ -1471,13 +1474,18 @@ static void crypt_alloc_req_skcipher(struct 
crypt_config *cc,
skcipher_request_set_callback(ctx->r.req,
CRYPTO_TFM_REQ_MAY_BACKLOG,
kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
+
+   return 0;
 }
 
-static void crypt_alloc_req_aead(struct crypt_config *cc,
+static int crypt_alloc_req_aead(struct crypt_config *cc,
 struct convert_context *ctx)
 {
-   if (!ctx->r.req_aead)
-   ctx->r.req_aead = mempool_alloc(&cc->req_pool, GFP_NOIO);
+   if (!ctx->r.req) {
+   ctx->r.req = mempool_alloc(&cc->req_pool, in_interrupt() ? 
GFP_ATOMIC : GFP_NOIO);
+   if (!ctx->r.req)
+   return -ENOMEM;
+   }
 
aead_request_set_tfm(ctx->r.req_aead, cc->cipher_tfm.tfms_aead[0]);
 
@@ -1488,15 +1496,17 @@ static void crypt_alloc_req_aead(struct crypt_config 
*cc,
aead_request_set_callback(ctx->r.req_aead,
CRYPTO_TFM_REQ_MAY_BACKLOG,
kcryptd_async_done, dmreq_of_req(cc, ctx->r.req_aead));
+
+   return 0;
 }
 
-static void crypt_alloc_req(struct crypt_config *cc,
+static int crypt_alloc_req(struct crypt_config *cc,
struct convert_context *ctx)
 {
if (crypt_integrity_aead(cc))
-   crypt_alloc_req_aead(cc, ctx);
+   return crypt_alloc_req_aead(cc, ctx);
else
-   cr

Re: Linux 4.19 and GCC 9

2019-06-24 Thread Ignat Korchagin

Hi Greg,

> > For us it seems applying the following 4 mainline patches makes 4.19.x
> > branch perf compile with GCC-9:
> >
> > 4d0f16d059ddb91424480d88473f7392f24aebdc: perf ui helpline: Use
> > strlcpy() as a shorter form of strncpy() + explicit set nul
> > b6313899f4ed2e76b8375cf8069556f5b94fbff0: perf help: Remove needless
> > use of strncpy()
> > 5192bde7d98c99f2cd80225649e3c2e7493722f7: perf header: Fix unchecked
> > usage of strncpy()
> > 97acec7df172cd1e450f81f5e293c0aa145a2797: perf data: Fix 'strncat may
> > truncate' build failure with recent gcc
> >
> > I also checked that 4.19.49 compiles fine with GCC 9, although with a
> > lot of warnings, mostly from objtool, like "warning: objtool:
> > sock_register()+0xd: sibling call from callable instruction with
> > modified stack frame". But it's a start.
> >
> > Can we apply the above-mentioned patches, please?

> I'll look into these after the next round of kernels are released.  I

Did you by any chance forget to queue these patches? :) (the build is
still broken for GCC 9.1)

> guess I'll go find a distro that has gcc9 on it to actually test
> things...

BTW, Arch already has GCC 9.1 package, so no need to compile your own
anymore for testing:
https://www.archlinux.org/packages/core/x86_64/gcc/

Regards,
Ignat

Re: Linux 4.19 and GCC 9

2019-06-10 Thread Ignat Korchagin

Hi Greg,

For us it seems applying the following 4 mainline patches makes 4.19.x
branch perf compile with GCC-9:

4d0f16d059ddb91424480d88473f7392f24aebdc: perf ui helpline: Use
strlcpy() as a shorter form of strncpy() + explicit set nul
b6313899f4ed2e76b8375cf8069556f5b94fbff0: perf help: Remove needless
use of strncpy()
5192bde7d98c99f2cd80225649e3c2e7493722f7: perf header: Fix unchecked
usage of strncpy()
97acec7df172cd1e450f81f5e293c0aa145a2797: perf data: Fix 'strncat may
truncate' build failure with recent gcc

I also checked that 4.19.49 compiles fine with GCC 9, although with a
lot of warnings, mostly from objtool, like "warning: objtool:
sock_register()+0xd: sibling call from callable instruction with
modified stack frame". But it's a start.

Can we apply the above-mentioned patches, please?

Regards,
Ignat


On Mon, Jun 10, 2019 at 8:45 AM Greg KH  wrote:
>
> On Mon, Jun 10, 2019 at 12:21:51AM -0700, Ivan Babrou wrote:
> > Looks like 4.19.49 received some patches for GCC 9+, but unfortunately
> > perf still doesn't want to compile:
> >
> > [07:15:32]In file included from /usr/include/string.h:635,
> > [07:15:32] from util/debug.h:7,
> > [07:15:32] from builtin-help.c:15:
> > [07:15:32]In function 'strncpy',
> > [07:15:32] inlined from 'add_man_viewer' at builtin-help.c:192:2,
> > [07:15:32] inlined from 'perf_help_config' at builtin-help.c:284:3:
> > [07:15:32]/usr/include/x86_64-linux-gnu/bits/string3.h:126:10: error:
> > '__builtin_strncpy' output truncated before terminating nul copying as
> > many bytes from a string as its length [-Werror=stringop-truncation]
> > [07:15:32] 126 | return __builtin___strncpy_chk (__dest, __src, __len,
> > __bos (__dest));
> > [07:15:32] | ^~
> > [07:15:32]builtin-help.c: In function 'perf_help_config':
> > [07:15:32]builtin-help.c:187:15: note: length computed here
> > [07:15:32] 187 | size_t len = strlen(name);
> > [07:15:32] | ^~~~
> > [07:15:32]cc1: all warnings being treated as errors
>
>
> Any chance in finding a patch in Linus's tree that resolves this?  I
> don't have gcc9 on my systems here yet to test this.
>
> thanks,
>
> greg k-h
>
> --
> You received this message because you are subscribed to the Google Groups 
> "kernel-team" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kernel-team+unsubscr...@cloudflare.com.
> To view this discussion on the web visit 
> https://groups.google.com/a/cloudflare.com/d/msgid/kernel-team/20190610074510.GA24746%40kroah.com.

Re: Linux 4.19 and GCC 9

2019-06-10 Thread Ignat Korchagin

On Mon, Jun 10, 2019 at 3:49 PM Greg KH  wrote:
>
> >
> > I typically compile a bare-bones GCC for those things, it is quite quick.
>
> Pointers to how to do that is appreciated.  It's been years since I had
> to build gcc "from scratch".

This is how we do it, but we use it for some other projects as well,
so need ligcc and c++ support. I suspect for kernel-only there may be
a more lightweight approach (for example, by dropping c++):

Env: Debian Stretch (we run in a simple official docker container with
build-essential and make installed) - but probably should work on any
distro
Assuming the sources are extracted into $(BUILDDIR)/gcc-$(VERSION)

cd $(BUILDDIR)/gcc-$(VERSION)
./contrib/download_prerequisites
cd ..
mkdir gcc-build
cd gcc-build
../gcc-$(VERSION)/configure --enable-languages=c,c++
--build=x86_64-linux-gnu --disable-multilib
make -j
sudo make install (or install into alternative dir and point Linux
build system there)

Regards,
Ignat

[PATCH v2 1/3] um/kconfig: introduce CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS

2020-07-04 Thread Ignat Korchagin

For statically linked UML build it is important to take into account the
standard C-library implementation. Some implementations, notably glibc have
caveats: even when linked statically, the final program might require some
runtime dependencies, if certain functions are used within the code.

Consider the following program:
int main(void)
{
getpwent();
return 0;
}

Compiling this program and linking statically with glibc produces the following
warning from the linker:
/usr/sbin/ld: /tmp/ccuthw1o.o: in function `main':
test.c:(.text+0x5): warning: Using 'getpwent' in statically linked
applications requires at runtime the shared libraries from the glibc version
used for linking

We will use the flag to detect such C-library implementation build time and
possibly disable static linking for UML to avoid producing a binary with
unexpected behaviour and dependencies.

Signed-off-by: Ignat Korchagin 
---
 init/Kconfig   | 6 ++
 scripts/cc-can-link.sh | 5 +++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index a46aa8f3174d..717a3ada765e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -57,6 +57,12 @@ config CC_CAN_LINK_STATIC
default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) -static 
$(m64-flag)) if 64BIT
default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) -static 
$(m32-flag))
 
+config CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS
+   bool
+   depends on UML && CC_CAN_LINK_STATIC
+   default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) -static 
-Xlinker --fatal-warnings $(m64-flag)) if 64BIT
+   default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) -static 
-Xlinker --fatal-warnings $(m32-flag))
+
 config CC_HAS_ASM_GOTO
def_bool $(success,$(srctree)/scripts/gcc-goto.sh $(CC))
 
diff --git a/scripts/cc-can-link.sh b/scripts/cc-can-link.sh
index 6efcead31989..e5011a46103e 100755
--- a/scripts/cc-can-link.sh
+++ b/scripts/cc-can-link.sh
@@ -2,10 +2,11 @@
 # SPDX-License-Identifier: GPL-2.0
 
 cat << "END" | $@ -x c - -o /dev/null >/dev/null 2>&1
-#include 
+#include 
+#include 
 int main(void)
 {
-   printf("");
+   getpwent();
return 0;
 }
 END
-- 
2.20.1

[PATCH v2 3/3] um: allow static linking for non-glibc implementations

2020-07-04 Thread Ignat Korchagin

It is possible to produce a statically linked UML binary with UML_NET_VECTOR,
UML_NET_VDE and UML_NET_PCAP options enabled using alternative libc
implementations, which do not rely on NSS, such as musl.

Allow static linking in this case.

Signed-off-by: Ignat Korchagin 
---
 arch/um/Kconfig | 2 +-
 arch/um/drivers/Kconfig | 3 ---
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index 9318dc6d1a0c..af7ed63f9c74 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -67,7 +67,7 @@ config FORBID_STATIC_LINK
 
 config STATIC_LINK
bool "Force a static link"
-   depends on !FORBID_STATIC_LINK
+   depends on CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS || (!UML_NET_VECTOR && 
!UML_NET_VDE && !UML_NET_PCAP)
help
  This option gives you the ability to force a static link of UML.
  Normally, UML is linked as a shared binary.  This is inconvenient for
diff --git a/arch/um/drivers/Kconfig b/arch/um/drivers/Kconfig
index 9160ead56e33..72d417055782 100644
--- a/arch/um/drivers/Kconfig
+++ b/arch/um/drivers/Kconfig
@@ -234,7 +234,6 @@ config UML_NET_DAEMON
 config UML_NET_VECTOR
bool "Vector I/O high performance network devices"
depends on UML_NET
-   select FORBID_STATIC_LINK
help
This User-Mode Linux network driver uses multi-message send
and receive functions. The host running the UML guest must have
@@ -246,7 +245,6 @@ config UML_NET_VECTOR
 config UML_NET_VDE
bool "VDE transport (obsolete)"
depends on UML_NET
-   select FORBID_STATIC_LINK
help
This User-Mode Linux network transport allows one or more running
UMLs on a single host to communicate with each other and also
@@ -294,7 +292,6 @@ config UML_NET_MCAST
 config UML_NET_PCAP
bool "pcap transport (obsolete)"
depends on UML_NET
-   select FORBID_STATIC_LINK
help
The pcap transport makes a pcap packet stream on the host look
like an ethernet device inside UML.  This is useful for making
-- 
2.20.1

[PATCH v2 2/3] um: some fixes to build UML with musl

2020-07-04 Thread Ignat Korchagin

musl toolchain and headers are a bit more strict. These fixes enable building
UML with musl as well as seem not to break on glibc.

Signed-off-by: Ignat Korchagin 
---
 arch/um/drivers/daemon_user.c |  1 +
 arch/um/drivers/pcap_user.c   | 12 ++--
 arch/um/drivers/slip_user.c   |  2 +-
 arch/um/drivers/vector_user.c |  4 +---
 arch/um/os-Linux/util.c   |  2 +-
 arch/x86/um/user-offsets.c|  2 +-
 6 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/arch/um/drivers/daemon_user.c b/arch/um/drivers/daemon_user.c
index 3695821d06a2..785baedc3555 100644
--- a/arch/um/drivers/daemon_user.c
+++ b/arch/um/drivers/daemon_user.c
@@ -7,6 +7,7 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/arch/um/drivers/pcap_user.c b/arch/um/drivers/pcap_user.c
index bbd20638788a..52ddda3e3b10 100644
--- a/arch/um/drivers/pcap_user.c
+++ b/arch/um/drivers/pcap_user.c
@@ -32,7 +32,7 @@ static int pcap_user_init(void *data, void *dev)
return 0;
 }
 
-static int pcap_open(void *data)
+static int pcap_user_open(void *data)
 {
struct pcap_data *pri = data;
__u32 netmask;
@@ -44,14 +44,14 @@ static int pcap_open(void *data)
if (pri->filter != NULL) {
err = dev_netmask(pri->dev, &netmask);
if (err < 0) {
-   printk(UM_KERN_ERR "pcap_open : dev_netmask failed\n");
+   printk(UM_KERN_ERR "pcap_user_open : dev_netmask 
failed\n");
return -EIO;
}
 
pri->compiled = uml_kmalloc(sizeof(struct bpf_program),
UM_GFP_KERNEL);
if (pri->compiled == NULL) {
-   printk(UM_KERN_ERR "pcap_open : kmalloc failed\n");
+   printk(UM_KERN_ERR "pcap_user_open : kmalloc failed\n");
return -ENOMEM;
}
 
@@ -59,14 +59,14 @@ static int pcap_open(void *data)
   (struct bpf_program *) pri->compiled,
   pri->filter, pri->optimize, netmask);
if (err < 0) {
-   printk(UM_KERN_ERR "pcap_open : pcap_compile failed - "
+   printk(UM_KERN_ERR "pcap_user_open : pcap_compile 
failed - "
   "'%s'\n", pcap_geterr(pri->pcap));
goto out;
}
 
err = pcap_setfilter(pri->pcap, pri->compiled);
if (err < 0) {
-   printk(UM_KERN_ERR "pcap_open : pcap_setfilter "
+   printk(UM_KERN_ERR "pcap_user_open : pcap_setfilter "
   "failed - '%s'\n", pcap_geterr(pri->pcap));
goto out;
}
@@ -127,7 +127,7 @@ int pcap_user_read(int fd, void *buffer, int len, struct 
pcap_data *pri)
 
 const struct net_user_info pcap_user_info = {
.init   = pcap_user_init,
-   .open   = pcap_open,
+   .open   = pcap_user_open,
.close  = NULL,
.remove = pcap_remove,
.add_address= NULL,
diff --git a/arch/um/drivers/slip_user.c b/arch/um/drivers/slip_user.c
index 8016d32b6809..482a19c5105c 100644
--- a/arch/um/drivers/slip_user.c
+++ b/arch/um/drivers/slip_user.c
@@ -9,7 +9,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
diff --git a/arch/um/drivers/vector_user.c b/arch/um/drivers/vector_user.c
index c4a0f26b2824..45d4164ad355 100644
--- a/arch/um/drivers/vector_user.c
+++ b/arch/um/drivers/vector_user.c
@@ -18,9 +18,7 @@
 #include 
 #include 
 #include 
-#include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -332,7 +330,7 @@ static struct vector_fds *user_init_unix_fds(struct arglist 
*ifspec, int id)
}
switch (id) {
case ID_BESS:
-   if (connect(fd, remote_addr, sizeof(struct sockaddr_un)) < 0) {
+   if (connect(fd, (const struct sockaddr *) remote_addr, 
sizeof(struct sockaddr_un)) < 0) {
printk(UM_KERN_ERR "bess open:cannot connect to %s %i", 
remote_addr->sun_path, -errno);
goto unix_cleanup;
}
diff --git a/arch/um/os-Linux/util.c b/arch/um/os-Linux/util.c
index ecf2f390fad2..07327425d06e 100644
--- a/arch/um/os-Linux/util.c
+++ b/arch/um/os-Linux/util.c
@@ -10,7 +10,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
diff --git a/arch/x86/um/user-offsets.c b/arch/x86/um/user-offsets.c
index c51dd8363d25..bae61554abcc 100644
--- a/arch/x86/um/user-offsets.c
+++ b/arch/x86/um/user-offsets.c
@@ -2,7 +2,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #define __FRAME_OFFSETS
-- 
2.20.1

[PATCH v2 0/3] um: allow static linking for non-glibc libc implementations

2020-07-04 Thread Ignat Korchagin

This is a continuation of [1]. Since I was able to produce a working UML binary
with UML_NET_VECTOR linked with musl with the changes included in the patches
here. I was compiling on Arch Linux, so hopefully all the latest versions of
the compiler, libraries and binutils.

I also tested allyesconfig with both musl and glibc. The compilation succeeds
with both, however both binaries (glibc one being dynamically linked) segfault
on start. This is probably of some incompatible config option/module being
included and not related to musl/glibc.

[1]: 
https://patchwork.ozlabs.org/project/linux-um/patch/20200624212319.403689-1-ig...@cloudflare.com/

Ignat Korchagin (3):
  um/kconfig: introduce CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS
  um: some fixes to build UML with musl
  um: allow static linking for non-glibc implementations

 arch/um/Kconfig   |  2 +-
 arch/um/drivers/Kconfig   |  3 ---
 arch/um/drivers/daemon_user.c |  1 +
 arch/um/drivers/pcap_user.c   | 12 ++--
 arch/um/drivers/slip_user.c   |  2 +-
 arch/um/drivers/vector_user.c |  4 +---
 arch/um/os-Linux/util.c   |  2 +-
 arch/x86/um/user-offsets.c|  2 +-
 init/Kconfig  |  6 ++
 scripts/cc-can-link.sh|  5 +++--
 10 files changed, 21 insertions(+), 18 deletions(-)

-- 
2.20.1

Re: [PATCH v2] dm crypt: add flags to optionally bypass dm-crypt workqueues

2020-06-30 Thread Ignat Korchagin

On Tue, Jun 30, 2020 at 3:51 AM Damien Le Moal  wrote:
>
> On 2020/06/27 6:03, Ignat Korchagin wrote:
> > This is a follow up from [1]. Consider the following script:
> >
> > sudo modprobe brd rd_nr=1 rd_size=4194304
> >
> > echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | \
> > sudo dmsetup create eram0
> >
> > echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 
> > no_write_workqueue' | \
> > sudo dmsetup create eram0-inline-write
> >
> > echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 
> > no_read_workqueue' | \
> > sudo dmsetup create eram0-inline-read
> >
> > devices="/dev/ram0 /dev/mapper/eram0 /dev/mapper/eram0-inline-read "
> > devices+="/dev/mapper/eram0-inline-write"
> >
> > for dev in $devices; do
> >   echo "reading from $dev"
> >   sudo fio --filename=$dev --readwrite=read --bs=4k --direct=1 \
> >   --loops=100 --runtime=3m --name=plain | grep READ
> > done
> >
> > for dev in $devices; do
> >   echo "writing to $dev"
> >   sudo fio --filename=$dev --readwrite=write --bs=4k --direct=1 \
> >   --loops=100 --runtime=3m --name=plain | grep WRITE
> > done
> >
> > This script creates a ramdisk (to eliminate hardware bias in the benchmark) 
> > and
> > three dm-crypt instances on top. All dm-crypt instances use the NULL cipher
> > to eliminate potentially expensive crypto bias (the NULL cipher just uses 
> > memcpy
> > for "encyption"). The first instance is the current dm-crypt implementation 
> > from
> > 5.8-rc2, the two others have new optional flags enabled, which bypass 
> > kcryptd
> > workqueues for reads and writes respectively and write sorting for writes. 
> > On
> > my VM (Debian in VirtualBox with 4 cores on 2.8 GHz Quad-Core Intel Core 
> > i7) I
> > get the following output (formatted for better readability):
> >
> > reading from /dev/ram0
> >READ: bw=508MiB/s (533MB/s), 508MiB/s-508MiB/s (533MB/s-533MB/s), 
> > io=89.3GiB (95.9GB), run=18-18msec
> >
> > reading from /dev/mapper/eram0
> >READ: bw=80.6MiB/s (84.5MB/s), 80.6MiB/s-80.6MiB/s (84.5MB/s-84.5MB/s), 
> > io=14.2GiB (15.2GB), run=18-18msec
> >
> > reading from /dev/mapper/eram0-inline-read
> >READ: bw=295MiB/s (309MB/s), 295MiB/s-295MiB/s (309MB/s-309MB/s), 
> > io=51.8GiB (55.6GB), run=18-18msec
> >
> > reading from /dev/mapper/eram0-inline-write
> >READ: bw=114MiB/s (120MB/s), 114MiB/s-114MiB/s (120MB/s-120MB/s), 
> > io=20.1GiB (21.5GB), run=18-18msec
> >
> > writing to /dev/ram0
> >   WRITE: bw=516MiB/s (541MB/s), 516MiB/s-516MiB/s (541MB/s-541MB/s), 
> > io=90.7GiB (97.4GB), run=180001-180001msec
> >
> > writing to /dev/mapper/eram0
> >   WRITE: bw=40.4MiB/s (42.4MB/s), 40.4MiB/s-40.4MiB/s (42.4MB/s-42.4MB/s), 
> > io=7271MiB (7624MB), run=180001-180001msec
> >
> > writing to /dev/mapper/eram0-inline-read
> >   WRITE: bw=38.9MiB/s (40.8MB/s), 38.9MiB/s-38.9MiB/s (40.8MB/s-40.8MB/s), 
> > io=7000MiB (7340MB), run=180001-180001msec
> >
> > writing to /dev/mapper/eram0-inline-write
> >   WRITE: bw=277MiB/s (290MB/s), 277MiB/s-277MiB/s (290MB/s-290MB/s), 
> > io=48.6GiB (52.2GB), run=18-18msec
> >
> > Current dm-crypt implementation creates a significant IO performance 
> > overhead
> > (at least on small IO block sizes) for both latency and throughput. We 
> > suspect
> > offloading IO request processing into workqueues and async threads is more
> > harmful these days with the modern fast storage. I also did some digging 
> > into
> > the dm-crypt git history and much of this async processing is not needed
> > anymore, because the reasons it was added are mostly gone from the kernel. 
> > More
> > details can be found in [2] (see "Git archeology" section).
> >
> > This change adds no_(read|write)_workqueue flags separately for read and 
> > write
> > BIOs, which direct dm-crypt not to offload crypto operations into kcryptd
> > workqueues and process everything inline. In addition, writes are not 
> > buffered
> > to be sorted in the dm-crypt red-black tree, but dispatched immediately. For
> > cases, where crypto operations cannot happen inline (hard interrupt context,
> > for example the read path of some NVME drivers), we offload the work to a
> > tasklet rather than a workqueue.
> >
> > These flags ensure inline BIO processing in the

Re: [dm-crypt] [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target

2020-06-24 Thread Ignat Korchagin

On Wed, Jun 24, 2020 at 5:24 PM Eric Biggers  wrote:
>
> On Wed, Jun 24, 2020 at 09:24:07AM +0100, Ignat Korchagin wrote:
> > On Wed, Jun 24, 2020 at 6:04 AM Eric Biggers  wrote:
> > >
> > > On Fri, Jun 19, 2020 at 05:41:32PM +0100, Ignat Korchagin wrote:
> > > > Sometimes extra thread offloading imposed by dm-crypt hurts IO latency. 
> > > > This is
> > > > especially visible on busy systems with many processes/threads. 
> > > > Moreover, most
> > > > Crypto API implementaions are async, that is they offload crypto 
> > > > operations on
> > > > their own, so this dm-crypt offloading is excessive.
> > >
> > > This really should say "some Crypto API implementations are async" 
> > > instead of
> > > "most Crypto API implementations are async".
> >
> > The most accurate would probably be: most hardware-accelerated Crypto
> > API implementations are async
> >
> > > Notably, the AES-NI implementation of AES-XTS is synchronous if you call 
> > > it in a
> > > context where SIMD instructions are usable.  It's only asynchronous when 
> > > SIMD is
> > > not usable.  (This seems to have been missed in your blog post.)
> >
> > No, it was not. This is exactly why we made xts-proxy Crypto API
> > module as a second patch. But it seems now it does not make a big
> > difference if a used Crypto API implementation is synchronous as well
> > (based on some benchmarks outlined in the cover letter to this patch).
> > I think the v2 of this patch will not require a synchronous Crypto
> > API. This is probably a right thing to do, as the "inline" flag should
> > control the way how dm-crypt itself handles requests, not how Crypto
> > API handles requests. If a user wants to ensure a particular
> > synchronous Crypto API implementation, they can already reconfigure
> > dm-crypt and specify the implementation with a "capi:" prefix in the
> > the dm table description.
>
> I think you're missing the point.  Although xts-aes-aesni has the
> CRYPTO_ALG_ASYNC bit set, the actual implementation processes the request
> synchronously if SIMD instructions are currently usable.  That's always the 
> case
> in dm-crypt, as far as I can tell.  This algorithm has the ASYNC flag only
> because it's not synchronous when called in hardIRQ context.
>
> That's why your "xts-proxy" doesn't make a difference, and why it's misleading
> to suggest that the crypto API is doing its own queueing when you're primarily
> talking about xts-aes-aesni.  The crypto API definitely can do its own 
> queueing,
> mainly with hardware drivers.  But it doesn't in this common and relevant 
> case.

I think we're talking about the same things but from different points
of view. I would like to clarify that the whole post and this change
does not have the intention to focus on aesni (or any x86-specific
crypto optimizations). In fact it is quite the opposite: we want to
optimize the generic dm-crypt regardless of which crypto is used
(that's why I just used a NULL cipher in the cover letter). We also
have some arm64 machines [1] and I bet they would benefit here as
well. The important point my post tries to make is that the original
workqueue offloading in dm-crypt was added because the Crypto API was
synchronous back in the day and, exactly as you say, you may not be
able to use some hw-accelerated crypto in hard IRQ context as well as
doing non-hw crypto in hard IRQ context is a bad idea. Now, most
Crypto API are smart enough to figure out on their own if they should
process the request inline or offload it to a workqueue, so the
workarounds in the dm-crypt itself most likely are not needed. Though,
the generic Crypto API "cipher walk" function does refuse to "walk"
the buffers in hard IRQ context, so the "tasklet" functionality is
still required.

But from the dm-crypt perspective - it should not take into account if
a particular xts-aes-aesni implementation is MOSTLY synchronous -
those are details of the implementation of this particular cipher
dm-crypt has no visibility into. So it would be right to say in my
opinion if the cipher has the CRYPTO_ALG_ASYNC flag set - it can
offload the crypto request to a workqueue at any time. How often does
it do it - that's another story and probably should be reviewed
elsewhere, if it does it too often.

Ignat

[1]: https://blog.cloudflare.com/arm-takes-wing/

> - Eric

[RFC PATCH] Revert "um: Make CONFIG_STATIC_LINK actually static"

2020-06-24 Thread Ignat Korchagin

This reverts commit 3363179385629c1804ea846f4e72608c2201a81e.

This change is too restrictive. I've been running UML statically linked kernel
 with UML_NET_VECTOR networking in a docker "FROM: scratch" container just fine.
As long as we don't reference network peers by hostname and use only IP
addresses, NSS is not needed, so not used. In other words, it is possible to
have statically linked UML and UML_NET_VECTOR (and other networking types) and
use it, although with some restrictions, so let's not disable it.

Additionally, it should be at least theoretically possible to use another libc
(like musl, bionic etc) for static linking. I was able with some hacks to
compile UML against musl, although the executable segfaults for now. But this
option prevents even the research to be done.

Signed-off-by: Ignat Korchagin 
---
 arch/um/Kconfig | 8 +---
 arch/um/drivers/Kconfig | 3 ---
 2 files changed, 1 insertion(+), 10 deletions(-)

diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index 96ab7026b037..817a4c838a06 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -62,12 +62,9 @@ config NR_CPUS
 
 source "arch/$(HEADER_ARCH)/um/Kconfig"
 
-config FORBID_STATIC_LINK
-   bool
-
 config STATIC_LINK
bool "Force a static link"
-   depends on !FORBID_STATIC_LINK
+   default n
help
  This option gives you the ability to force a static link of UML.
  Normally, UML is linked as a shared binary.  This is inconvenient for
@@ -76,9 +73,6 @@ config STATIC_LINK
  Additionally, this option enables using higher memory spaces (up to
  2.75G) for UML.
 
- NOTE: This option is incompatible with some networking features which
- depend on features that require being dynamically loaded (like NSS).
-
 config LD_SCRIPT_STATIC
bool
default y
diff --git a/arch/um/drivers/Kconfig b/arch/um/drivers/Kconfig
index 9160ead56e33..72d417055782 100644
--- a/arch/um/drivers/Kconfig
+++ b/arch/um/drivers/Kconfig
@@ -234,7 +234,6 @@ config UML_NET_DAEMON
 config UML_NET_VECTOR
bool "Vector I/O high performance network devices"
depends on UML_NET
-   select FORBID_STATIC_LINK
help
This User-Mode Linux network driver uses multi-message send
and receive functions. The host running the UML guest must have
@@ -246,7 +245,6 @@ config UML_NET_VECTOR
 config UML_NET_VDE
bool "VDE transport (obsolete)"
depends on UML_NET
-   select FORBID_STATIC_LINK
help
This User-Mode Linux network transport allows one or more running
UMLs on a single host to communicate with each other and also
@@ -294,7 +292,6 @@ config UML_NET_MCAST
 config UML_NET_PCAP
bool "pcap transport (obsolete)"
depends on UML_NET
-   select FORBID_STATIC_LINK
help
The pcap transport makes a pcap packet stream on the host look
like an ethernet device inside UML.  This is useful for making
-- 
2.20.1

Re: [RFC PATCH 0/1] dm-crypt excessive overhead

2020-06-23 Thread Ignat Korchagin

Do you think it may be better to break it in two flags: one for read
path and one for write? So, depending on the needs and workflow these
could be enabled independently?

Regards,
Ignat

On Tue, Jun 23, 2020 at 4:01 PM Mike Snitzer  wrote:
>
> On Sun, Jun 21 2020 at  8:45pm -0400,
> Damien Le Moal  wrote:
>
> > On 2020/06/20 1:56, Mike Snitzer wrote:
> > > On Fri, Jun 19 2020 at 12:41pm -0400,
> > > Ignat Korchagin  wrote:
> > >
> > >> This is a follow up from the long-forgotten [1], but with some more 
> > >> convincing
> > >> evidence. Consider the following script:
> > >>
> > >> #!/bin/bash -e
> > >>
> > >> # create 4G ramdisk
> > >> sudo modprobe brd rd_nr=1 rd_size=4194304
> > >>
> > >> # create a dm-crypt device with NULL cipher on top of /dev/ram0
> > >> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | sudo 
> > >> dmsetup create eram0
> > >>
> > >> # create a dm-crypt device with NULL cipher and custom force_inline flag
> > >> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 
> > >> force_inline' | sudo dmsetup create inline-eram0
> > >>
> > >> # read all data from /dev/ram0
> > >> sudo dd if=/dev/ram0 bs=4k iflag=direct | sha256sum
> > >>
> > >> # read the same data from /dev/mapper/eram0
> > >> sudo dd if=/dev/mapper/eram0 bs=4k iflag=direct | sha256sum
> > >>
> > >> # read the same data from /dev/mapper/inline-eram0
> > >> sudo dd if=/dev/mapper/inline-eram0 bs=4k iflag=direct | sha256sum
> > >>
> > >> This script creates a ramdisk (to eliminate hardware bias in the 
> > >> benchmark) and
> > >> two dm-crypt instances on top. Both dm-crypt instances use the NULL 
> > >> cipher
> > >> to eliminate potentially expensive crypto bias (the NULL cipher just 
> > >> uses memcpy
> > >> for "encyption"). The first instance is the current dm-crypt 
> > >> implementation from
> > >> 5.8-rc1, the second is the dm-crypt instance with a custom new flag 
> > >> enabled from
> > >> the patch attached to this thread. On my VM (Debian in VirtualBox with 4 
> > >> cores
> > >> on 2.8 GHz Quad-Core Intel Core i7) I get the following output 
> > >> (formatted for
> > >> better readability):
> > >>
> > >> # plain ram0
> > >> 1048576+0 records in
> > >> 1048576+0 records out
> > >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.2305 s, 202 MB/s
> > >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> > >>
> > >> # eram0 (current dm-crypt)
> > >> 1048576+0 records in
> > >> 1048576+0 records out
> > >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 53.2212 s, 80.7 MB/s
> > >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> > >>
> > >> # inline-eram0 (patched dm-crypt)
> > >> 1048576+0 records in
> > >> 1048576+0 records out
> > >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.3472 s, 201 MB/s
> > >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> > >>
> > >> As we can see, current dm-crypt implementation creates a significant IO
> > >> performance overhead (at least on small IO block sizes) for both latency 
> > >> and
> > >> throughput. We suspect offloading IO request processing into workqueues 
> > >> and
> > >> async threads is more harmful these days with the modern fast storage. I 
> > >> also
> > >> did some digging into the dm-crypt git history and much of this async 
> > >> processing
> > >> is not needed anymore, because the reasons it was added are mostly gone 
> > >> from the
> > >> kernel. More details can be found in [2] (see "Git archeology" section).
> > >>
> > >> We have been running the attached patch on different hardware 
> > >> generations in
> > >> more than 200 datacentres on both SATA SSDs and NVME SSDs and so far 
> > >> were very
> > >> happy with the performance benefits.
> > >>
> > >> [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html
> > >> [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/
> > >>
> > >> Ignat Korchagin

Re: [RFC PATCH 0/1] dm-crypt excessive overhead

2020-06-23 Thread Ignat Korchagin

On Tue, Jun 23, 2020 at 4:34 PM Mike Snitzer  wrote:
>
> On Fri, Jun 19 2020 at  9:23pm -0400,
> Herbert Xu  wrote:
>
> > On Fri, Jun 19, 2020 at 02:39:39PM -0400, Mikulas Patocka wrote:
> > >
> > > I'm looking at this and I'd like to know why does the crypto API fail in
> > > hard-irq context and why does it work in tasklet context. What's the exact
> > > reason behind this?
> >
> > You're not supposed to do any real work in IRQ handlers.  All
> > the substantial work should be postponed to softirq context.
> >
> > Why do you need to do work in hard IRQ context?
>
> Ignat, think you may have missed Herbert's question?
>
> My understanding is that you're doing work in hard IRQ context (via
> tasklet) precisely to avoid overhead of putting to a workqueue?  Did
> you try using a workqueue and it didn't work adequately?  If so, do you
> have a handle on why that is?  E.g. was it due to increased latency? or
> IO completion occurring on different cpus that submitted (are you
> leaning heavily on blk-mq's ability to pin IO completion to same cpu as
> IO was submitted?)
>
> I'm fishing here but I'd just like to tease out the details for _why_
> you need to do work from hard IRQ via tasklet so that I can potentially
> defend it if needed.

I may be misunderstanding the terminology, but tasklets execute in
soft IRQ, don't they? What we care about is to execute the decryption
as fast as possible, but we can't do it in a hard IRQ context (that
is, the interrupt context where other interrupts are disabled). As far
as I understand, tasklets are executed right after the hard IRQ
context, but with interrupts enabled - which is the first safe-ish
place to do more lengthy processing without the risk of missing an
interrupt.

Workqueues instead of tasklets - is the way how it is mostly
implemented now. But that creates additional IO latency, most probably
due to the fact that we're introducing CPU scheduling latency into the
overall read IO latency. This is confirmed by the fact that our busier
production systems have much worse and more important - spiky and
unstable p99 read latency, which somewhat correlates to high CPU
scheduling latency reported by metrics.

So, by inlining crypto or using a tasklet we're effectively
prioritising IO encryption/decryption. What we want to avoid is mixing
unpredicted additional latency from an unrelated subsystem (CPU
scheduler), because our expectation is that the total latency should
be real HW io latency + crypto operation latency (which is usually
quite stable).

I hope this makes sense.

>
> Thanks,
> Mike
>

Re: [PATCH v2 2/3] um: some fixes to build UML with musl

2020-07-14 Thread Ignat Korchagin

On Tue, Jul 14, 2020 at 9:40 AM Anton Ivanov
 wrote:
>
>
> On 04/07/2020 09:52, Ignat Korchagin wrote:
> > musl toolchain and headers are a bit more strict. These fixes enable 
> > building
> > UML with musl as well as seem not to break on glibc.
> >
> > Signed-off-by: Ignat Korchagin 
> > ---
> >   arch/um/drivers/daemon_user.c |  1 +
> >   arch/um/drivers/pcap_user.c   | 12 ++--
> >   arch/um/drivers/slip_user.c   |  2 +-
> >   arch/um/drivers/vector_user.c |  4 +---
> >   arch/um/os-Linux/util.c   |  2 +-
> >   arch/x86/um/user-offsets.c|  2 +-
> >   6 files changed, 11 insertions(+), 12 deletions(-)
> >
> > diff --git a/arch/um/drivers/daemon_user.c b/arch/um/drivers/daemon_user.c
> > index 3695821d06a2..785baedc3555 100644
> > --- a/arch/um/drivers/daemon_user.c
> > +++ b/arch/um/drivers/daemon_user.c
> > @@ -7,6 +7,7 @@
> >*/
> >
> >   #include 
> > +#include 
> >   #include 
> >   #include 
> >   #include 
> > diff --git a/arch/um/drivers/pcap_user.c b/arch/um/drivers/pcap_user.c
> > index bbd20638788a..52ddda3e3b10 100644
> > --- a/arch/um/drivers/pcap_user.c
> > +++ b/arch/um/drivers/pcap_user.c
> > @@ -32,7 +32,7 @@ static int pcap_user_init(void *data, void *dev)
> >   return 0;
> >   }
> >
> > -static int pcap_open(void *data)
> > +static int pcap_user_open(void *data)
>
> This change in the function name was introduced on purpose to avoid name 
> clash in some version of libpcap which export pcap_open

Yes

>
>
> >   {
> >   struct pcap_data *pri = data;
> >   __u32 netmask;
> > @@ -44,14 +44,14 @@ static int pcap_open(void *data)
> >   if (pri->filter != NULL) {
> >   err = dev_netmask(pri->dev, &netmask);
> >   if (err < 0) {
> > - printk(UM_KERN_ERR "pcap_open : dev_netmask 
> > failed\n");
> > + printk(UM_KERN_ERR "pcap_user_open : dev_netmask 
> > failed\n");
> >   return -EIO;
> >   }
> >
> >   pri->compiled = uml_kmalloc(sizeof(struct bpf_program),
> >   UM_GFP_KERNEL);
> >   if (pri->compiled == NULL) {
> > - printk(UM_KERN_ERR "pcap_open : kmalloc failed\n");
> > + printk(UM_KERN_ERR "pcap_user_open : kmalloc 
> > failed\n");
> >   return -ENOMEM;
> >   }
> >
> > @@ -59,14 +59,14 @@ static int pcap_open(void *data)
> >  (struct bpf_program *) pri->compiled,
> >  pri->filter, pri->optimize, netmask);
> >   if (err < 0) {
> > - printk(UM_KERN_ERR "pcap_open : pcap_compile failed - 
> > "
> > + printk(UM_KERN_ERR "pcap_user_open : pcap_compile 
> > failed - "
> >  "'%s'\n", pcap_geterr(pri->pcap));
> >   goto out;
> >   }
> >
> >   err = pcap_setfilter(pri->pcap, pri->compiled);
> >   if (err < 0) {
> > - printk(UM_KERN_ERR "pcap_open : pcap_setfilter "
> > + printk(UM_KERN_ERR "pcap_user_open : pcap_setfilter "
> >  "failed - '%s'\n", pcap_geterr(pri->pcap));
> >   goto out;
> >   }
> > @@ -127,7 +127,7 @@ int pcap_user_read(int fd, void *buffer, int len, 
> > struct pcap_data *pri)
> >
> >   const struct net_user_info pcap_user_info = {
> >   .init   = pcap_user_init,
> > - .open   = pcap_open,
> > + .open   = pcap_user_open,
> >   .close  = NULL,
> >   .remove = pcap_remove,
> >   .add_address= NULL,
> > diff --git a/arch/um/drivers/slip_user.c b/arch/um/drivers/slip_user.c
> > index 8016d32b6809..482a19c5105c 100644
> > --- a/arch/um/drivers/slip_user.c
> > +++ b/arch/um/drivers/slip_user.c
> > @@ -9,7 +9,7 @@
> >   #include 
> >   #include 
> >   #include 
> > -#include 
> > +#include 
> >   #include 
> >   #include 
> >   #include 
> > diff --git a/arch/um/drivers/vector_user.c b/arch/um/drivers/vector_user.c
> > index c4a0f26b2824..45d4164ad355 100644
> > ---

Re: [PATCH v2 3/3] um: allow static linking for non-glibc implementations

2020-07-15 Thread Ignat Korchagin

On Wed, Jul 15, 2020 at 9:44 AM Brendan Higgins
 wrote:
>
> On Sat, Jul 4, 2020 at 1:52 AM Ignat Korchagin  wrote:
> >
> > It is possible to produce a statically linked UML binary with 
> > UML_NET_VECTOR,
> > UML_NET_VDE and UML_NET_PCAP options enabled using alternative libc
> > implementations, which do not rely on NSS, such as musl.
> >
> > Allow static linking in this case.
> >
> > Signed-off-by: Ignat Korchagin 
>
> One minor issue below. Other than that:
>
> Reviewed-by: Brendan Higgins 
>
> > ---
> >  arch/um/Kconfig | 2 +-
> >  arch/um/drivers/Kconfig | 3 ---
> >  2 files changed, 1 insertion(+), 4 deletions(-)
> >
> > diff --git a/arch/um/Kconfig b/arch/um/Kconfig
> > index 9318dc6d1a0c..af7ed63f9c74 100644
> > --- a/arch/um/Kconfig
> > +++ b/arch/um/Kconfig
> > @@ -67,7 +67,7 @@ config FORBID_STATIC_LINK
>
> Doesn't look like FORBID_STATIC_LINK is used anymore, so you should
> probably drop it as well.

Right, good catch! I will repost the series with this adjusted as well
cc Masahiro Yamada as mentioned for +1 on the changes to can-link.sh
script.
Seems I also need to rebase now to accommodate changes in
b816b3db15f68690ee72a4a414624f8e82942b25 ("kbuild: fix
CONFIG_CC_CAN_LINK(_STATIC) for cross-compilation with Clang")

> With the preceding changes, in this patchset, you can revert my patch
> like you did in the RFC - or not, your choice. I am not offended by
> people reverting my commits. I just don't like it when people break
> allyesconfig. :-)
>
> >  config STATIC_LINK
> > bool "Force a static link"
> > -   depends on !FORBID_STATIC_LINK
> > +   depends on CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS || (!UML_NET_VECTOR 
> > && !UML_NET_VDE && !UML_NET_PCAP)
> > help
> >   This option gives you the ability to force a static link of UML.
> >   Normally, UML is linked as a shared binary.  This is inconvenient 
> > for
> > diff --git a/arch/um/drivers/Kconfig b/arch/um/drivers/Kconfig
> > index 9160ead56e33..72d417055782 100644
> > --- a/arch/um/drivers/Kconfig
> > +++ b/arch/um/drivers/Kconfig
> > @@ -234,7 +234,6 @@ config UML_NET_DAEMON
> >  config UML_NET_VECTOR
> > bool "Vector I/O high performance network devices"
> > depends on UML_NET
> > -   select FORBID_STATIC_LINK
> > help
> > This User-Mode Linux network driver uses multi-message send
> > and receive functions. The host running the UML guest must have
> > @@ -246,7 +245,6 @@ config UML_NET_VECTOR
> >  config UML_NET_VDE
> > bool "VDE transport (obsolete)"
> > depends on UML_NET
> > -   select FORBID_STATIC_LINK
> > help
> > This User-Mode Linux network transport allows one or more running
> > UMLs on a single host to communicate with each other and also
> > @@ -294,7 +292,6 @@ config UML_NET_MCAST
> >  config UML_NET_PCAP
> > bool "pcap transport (obsolete)"
> > depends on UML_NET
> > -   select FORBID_STATIC_LINK
> > help
> > The pcap transport makes a pcap packet stream on the host look
> > like an ethernet device inside UML.  This is useful for making
> > --
> > 2.20.1
> >

[PATCH v3 0/3] um: allow static linking for non-glibc libc implementations

2020-07-15 Thread Ignat Korchagin

Changes from v2:
  * rebased to accomodate b816b3db15f68690ee72a4a414624f8e82942b25 ("kbuild: 
fix CONFIG_CC_CAN_LINK(_STATIC) for cross-compilation with Clang")
  * removed unused FORBID_STATIC_LINK config option
  * cc-ed Masahiro Yamada  for can-link.sh approve

This is a continuation of [1]. Since I was able to produce a working UML binary
with UML_NET_VECTOR linked with musl with the changes included in the patches
here. I was compiling on Arch Linux, so hopefully all the latest versions of
the compiler, libraries and binutils.

I also tested allyesconfig with both musl and glibc. The compilation succeeds
with both, however both binaries (glibc one being dynamically linked) segfault
on start. This is probably of some incompatible config option/module being
included and not related to musl/glibc.

[1]: 
https://patchwork.ozlabs.org/project/linux-um/patch/20200624212319.403689-1-ig...@cloudflare.com/

Ignat Korchagin (3):
  um/kconfig: introduce CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS
  um: some fixes to build UML with musl
  um: allow static linking for non-glibc implementations

 arch/um/Kconfig   |  5 +
 arch/um/drivers/Kconfig   |  3 ---
 arch/um/drivers/daemon_user.c |  1 +
 arch/um/drivers/pcap_user.c   | 12 ++--
 arch/um/drivers/slip_user.c   |  2 +-
 arch/um/drivers/vector_user.c |  4 +---
 arch/um/os-Linux/util.c   |  2 +-
 arch/x86/um/user-offsets.c|  2 +-
 init/Kconfig  |  6 ++
 scripts/cc-can-link.sh|  5 +++--
 10 files changed, 21 insertions(+), 21 deletions(-)

-- 
2.20.1

[PATCH v3 2/3] um: some fixes to build UML with musl

2020-07-15 Thread Ignat Korchagin

musl toolchain and headers are a bit more strict. These fixes enable building
UML with musl as well as seem not to break on glibc.

Signed-off-by: Ignat Korchagin 
Tested-by: Brendan Higgins 
---
 arch/um/drivers/daemon_user.c |  1 +
 arch/um/drivers/pcap_user.c   | 12 ++--
 arch/um/drivers/slip_user.c   |  2 +-
 arch/um/drivers/vector_user.c |  4 +---
 arch/um/os-Linux/util.c   |  2 +-
 arch/x86/um/user-offsets.c|  2 +-
 6 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/arch/um/drivers/daemon_user.c b/arch/um/drivers/daemon_user.c
index 3695821d06a2..785baedc3555 100644
--- a/arch/um/drivers/daemon_user.c
+++ b/arch/um/drivers/daemon_user.c
@@ -7,6 +7,7 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/arch/um/drivers/pcap_user.c b/arch/um/drivers/pcap_user.c
index bbd20638788a..52ddda3e3b10 100644
--- a/arch/um/drivers/pcap_user.c
+++ b/arch/um/drivers/pcap_user.c
@@ -32,7 +32,7 @@ static int pcap_user_init(void *data, void *dev)
return 0;
 }
 
-static int pcap_open(void *data)
+static int pcap_user_open(void *data)
 {
struct pcap_data *pri = data;
__u32 netmask;
@@ -44,14 +44,14 @@ static int pcap_open(void *data)
if (pri->filter != NULL) {
err = dev_netmask(pri->dev, &netmask);
if (err < 0) {
-   printk(UM_KERN_ERR "pcap_open : dev_netmask failed\n");
+   printk(UM_KERN_ERR "pcap_user_open : dev_netmask 
failed\n");
return -EIO;
}
 
pri->compiled = uml_kmalloc(sizeof(struct bpf_program),
UM_GFP_KERNEL);
if (pri->compiled == NULL) {
-   printk(UM_KERN_ERR "pcap_open : kmalloc failed\n");
+   printk(UM_KERN_ERR "pcap_user_open : kmalloc failed\n");
return -ENOMEM;
}
 
@@ -59,14 +59,14 @@ static int pcap_open(void *data)
   (struct bpf_program *) pri->compiled,
   pri->filter, pri->optimize, netmask);
if (err < 0) {
-   printk(UM_KERN_ERR "pcap_open : pcap_compile failed - "
+   printk(UM_KERN_ERR "pcap_user_open : pcap_compile 
failed - "
   "'%s'\n", pcap_geterr(pri->pcap));
goto out;
}
 
err = pcap_setfilter(pri->pcap, pri->compiled);
if (err < 0) {
-   printk(UM_KERN_ERR "pcap_open : pcap_setfilter "
+   printk(UM_KERN_ERR "pcap_user_open : pcap_setfilter "
   "failed - '%s'\n", pcap_geterr(pri->pcap));
goto out;
}
@@ -127,7 +127,7 @@ int pcap_user_read(int fd, void *buffer, int len, struct 
pcap_data *pri)
 
 const struct net_user_info pcap_user_info = {
.init   = pcap_user_init,
-   .open   = pcap_open,
+   .open   = pcap_user_open,
.close  = NULL,
.remove = pcap_remove,
.add_address= NULL,
diff --git a/arch/um/drivers/slip_user.c b/arch/um/drivers/slip_user.c
index 8016d32b6809..482a19c5105c 100644
--- a/arch/um/drivers/slip_user.c
+++ b/arch/um/drivers/slip_user.c
@@ -9,7 +9,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
diff --git a/arch/um/drivers/vector_user.c b/arch/um/drivers/vector_user.c
index c4a0f26b2824..45d4164ad355 100644
--- a/arch/um/drivers/vector_user.c
+++ b/arch/um/drivers/vector_user.c
@@ -18,9 +18,7 @@
 #include 
 #include 
 #include 
-#include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -332,7 +330,7 @@ static struct vector_fds *user_init_unix_fds(struct arglist 
*ifspec, int id)
}
switch (id) {
case ID_BESS:
-   if (connect(fd, remote_addr, sizeof(struct sockaddr_un)) < 0) {
+   if (connect(fd, (const struct sockaddr *) remote_addr, 
sizeof(struct sockaddr_un)) < 0) {
printk(UM_KERN_ERR "bess open:cannot connect to %s %i", 
remote_addr->sun_path, -errno);
goto unix_cleanup;
}
diff --git a/arch/um/os-Linux/util.c b/arch/um/os-Linux/util.c
index ecf2f390fad2..07327425d06e 100644
--- a/arch/um/os-Linux/util.c
+++ b/arch/um/os-Linux/util.c
@@ -10,7 +10,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
diff --git a/arch/x86/um/user-offsets.c b/arch/x86/um/user-offsets.c
index c51dd8363d25..bae61554abcc 100644
--- a/arch/x86/um/user-offsets.c
+++ b/arch/x86/um/user-offsets.c
@@ -2,7 +2,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #define __FRAME_OFFSETS
-- 
2.20.1

[PATCH v3 3/3] um: allow static linking for non-glibc implementations

2020-07-15 Thread Ignat Korchagin

It is possible to produce a statically linked UML binary with UML_NET_VECTOR,
UML_NET_VDE and UML_NET_PCAP options enabled using alternative libc
implementations, which do not rely on NSS, such as musl.

Allow static linking in this case.

Signed-off-by: Ignat Korchagin 
Reviewed-by: Brendan Higgins 
Tested-by: Brendan Higgins 
---
 arch/um/Kconfig | 5 +
 arch/um/drivers/Kconfig | 3 ---
 2 files changed, 1 insertion(+), 7 deletions(-)

diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index 9318dc6d1a0c..beb98b3b9f75 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -62,12 +62,9 @@ config NR_CPUS
 
 source "arch/$(HEADER_ARCH)/um/Kconfig"
 
-config FORBID_STATIC_LINK
-   bool
-
 config STATIC_LINK
bool "Force a static link"
-   depends on !FORBID_STATIC_LINK
+   depends on CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS || (!UML_NET_VECTOR && 
!UML_NET_VDE && !UML_NET_PCAP)
help
  This option gives you the ability to force a static link of UML.
  Normally, UML is linked as a shared binary.  This is inconvenient for
diff --git a/arch/um/drivers/Kconfig b/arch/um/drivers/Kconfig
index 9160ead56e33..72d417055782 100644
--- a/arch/um/drivers/Kconfig
+++ b/arch/um/drivers/Kconfig
@@ -234,7 +234,6 @@ config UML_NET_DAEMON
 config UML_NET_VECTOR
bool "Vector I/O high performance network devices"
depends on UML_NET
-   select FORBID_STATIC_LINK
help
This User-Mode Linux network driver uses multi-message send
and receive functions. The host running the UML guest must have
@@ -246,7 +245,6 @@ config UML_NET_VECTOR
 config UML_NET_VDE
bool "VDE transport (obsolete)"
depends on UML_NET
-   select FORBID_STATIC_LINK
help
This User-Mode Linux network transport allows one or more running
UMLs on a single host to communicate with each other and also
@@ -294,7 +292,6 @@ config UML_NET_MCAST
 config UML_NET_PCAP
bool "pcap transport (obsolete)"
depends on UML_NET
-   select FORBID_STATIC_LINK
help
The pcap transport makes a pcap packet stream on the host look
like an ethernet device inside UML.  This is useful for making
-- 
2.20.1

[PATCH v3 1/3] um/kconfig: introduce CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS

2020-07-15 Thread Ignat Korchagin

For statically linked UML build it is important to take into account the
standard C-library implementation. Some implementations, notably glibc have
caveats: even when linked statically, the final program might require some
runtime dependencies, if certain functions are used within the code.

Consider the following program:
int main(void)
{
getpwent();
return 0;
}

Compiling this program and linking statically with glibc produces the following
warning from the linker:
/usr/sbin/ld: /tmp/ccuthw1o.o: in function `main':
test.c:(.text+0x5): warning: Using 'getpwent' in statically linked
applications requires at runtime the shared libraries from the glibc version
used for linking

We will use the flag to detect such C-library implementation build time and
possibly disable static linking for UML to avoid producing a binary with
unexpected behaviour and dependencies.

Signed-off-by: Ignat Korchagin 
Reviewed-by: Brendan Higgins 
Tested-by: Brendan Higgins 
---
 init/Kconfig   | 6 ++
 scripts/cc-can-link.sh | 5 +++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 0498af567f70..0a1ec56c9f33 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -57,6 +57,12 @@ config CC_CAN_LINK_STATIC
default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) 
$(CLANG_FLAGS) $(m64-flag) -static) if 64BIT
default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) 
$(CLANG_FLAGS) $(m32-flag) -static)
 
+config CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS
+   bool
+   depends on UML && CC_CAN_LINK_STATIC
+   default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) 
$(CLANG_FLAGS) $(m64-flag) -static -Xlinker --fatal-warnings) if 64BIT
+   default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) 
$(CLANG_FLAGS) $(m32-flag) -static -Xlinker --fatal-warnings)
+
 config CC_HAS_ASM_GOTO
def_bool $(success,$(srctree)/scripts/gcc-goto.sh $(CC))
 
diff --git a/scripts/cc-can-link.sh b/scripts/cc-can-link.sh
index 6efcead31989..e5011a46103e 100755
--- a/scripts/cc-can-link.sh
+++ b/scripts/cc-can-link.sh
@@ -2,10 +2,11 @@
 # SPDX-License-Identifier: GPL-2.0
 
 cat << "END" | $@ -x c - -o /dev/null >/dev/null 2>&1
-#include 
+#include 
+#include 
 int main(void)
 {
-   printf("");
+   getpwent();
return 0;
 }
 END
-- 
2.20.1

[PATCH v4 2/3] um: some fixes to build UML with musl

2020-07-19 Thread Ignat Korchagin

musl toolchain and headers are a bit more strict. These fixes enable building
UML with musl as well as seem not to break on glibc.

Signed-off-by: Ignat Korchagin 
Tested-by: Brendan Higgins 
---
 arch/um/drivers/daemon_user.c |  1 +
 arch/um/drivers/pcap_user.c   | 12 ++--
 arch/um/drivers/slip_user.c   |  2 +-
 arch/um/drivers/vector_user.c |  4 +---
 arch/um/os-Linux/util.c   |  2 +-
 arch/x86/um/user-offsets.c|  2 +-
 6 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/arch/um/drivers/daemon_user.c b/arch/um/drivers/daemon_user.c
index 3695821d06a2..785baedc3555 100644
--- a/arch/um/drivers/daemon_user.c
+++ b/arch/um/drivers/daemon_user.c
@@ -7,6 +7,7 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/arch/um/drivers/pcap_user.c b/arch/um/drivers/pcap_user.c
index bbd20638788a..52ddda3e3b10 100644
--- a/arch/um/drivers/pcap_user.c
+++ b/arch/um/drivers/pcap_user.c
@@ -32,7 +32,7 @@ static int pcap_user_init(void *data, void *dev)
return 0;
 }
 
-static int pcap_open(void *data)
+static int pcap_user_open(void *data)
 {
struct pcap_data *pri = data;
__u32 netmask;
@@ -44,14 +44,14 @@ static int pcap_open(void *data)
if (pri->filter != NULL) {
err = dev_netmask(pri->dev, &netmask);
if (err < 0) {
-   printk(UM_KERN_ERR "pcap_open : dev_netmask failed\n");
+   printk(UM_KERN_ERR "pcap_user_open : dev_netmask 
failed\n");
return -EIO;
}
 
pri->compiled = uml_kmalloc(sizeof(struct bpf_program),
UM_GFP_KERNEL);
if (pri->compiled == NULL) {
-   printk(UM_KERN_ERR "pcap_open : kmalloc failed\n");
+   printk(UM_KERN_ERR "pcap_user_open : kmalloc failed\n");
return -ENOMEM;
}
 
@@ -59,14 +59,14 @@ static int pcap_open(void *data)
   (struct bpf_program *) pri->compiled,
   pri->filter, pri->optimize, netmask);
if (err < 0) {
-   printk(UM_KERN_ERR "pcap_open : pcap_compile failed - "
+   printk(UM_KERN_ERR "pcap_user_open : pcap_compile 
failed - "
   "'%s'\n", pcap_geterr(pri->pcap));
goto out;
}
 
err = pcap_setfilter(pri->pcap, pri->compiled);
if (err < 0) {
-   printk(UM_KERN_ERR "pcap_open : pcap_setfilter "
+   printk(UM_KERN_ERR "pcap_user_open : pcap_setfilter "
   "failed - '%s'\n", pcap_geterr(pri->pcap));
goto out;
}
@@ -127,7 +127,7 @@ int pcap_user_read(int fd, void *buffer, int len, struct 
pcap_data *pri)
 
 const struct net_user_info pcap_user_info = {
.init   = pcap_user_init,
-   .open   = pcap_open,
+   .open   = pcap_user_open,
.close  = NULL,
.remove = pcap_remove,
.add_address= NULL,
diff --git a/arch/um/drivers/slip_user.c b/arch/um/drivers/slip_user.c
index 8016d32b6809..482a19c5105c 100644
--- a/arch/um/drivers/slip_user.c
+++ b/arch/um/drivers/slip_user.c
@@ -9,7 +9,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
diff --git a/arch/um/drivers/vector_user.c b/arch/um/drivers/vector_user.c
index c4a0f26b2824..45d4164ad355 100644
--- a/arch/um/drivers/vector_user.c
+++ b/arch/um/drivers/vector_user.c
@@ -18,9 +18,7 @@
 #include 
 #include 
 #include 
-#include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -332,7 +330,7 @@ static struct vector_fds *user_init_unix_fds(struct arglist 
*ifspec, int id)
}
switch (id) {
case ID_BESS:
-   if (connect(fd, remote_addr, sizeof(struct sockaddr_un)) < 0) {
+   if (connect(fd, (const struct sockaddr *) remote_addr, 
sizeof(struct sockaddr_un)) < 0) {
printk(UM_KERN_ERR "bess open:cannot connect to %s %i", 
remote_addr->sun_path, -errno);
goto unix_cleanup;
}
diff --git a/arch/um/os-Linux/util.c b/arch/um/os-Linux/util.c
index ecf2f390fad2..07327425d06e 100644
--- a/arch/um/os-Linux/util.c
+++ b/arch/um/os-Linux/util.c
@@ -10,7 +10,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
diff --git a/arch/x86/um/user-offsets.c b/arch/x86/um/user-offsets.c
index c51dd8363d25..bae61554abcc 100644
--- a/arch/x86/um/user-offsets.c
+++ b/arch/x86/um/user-offsets.c
@@ -2,7 +2,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #define __FRAME_OFFSETS
-- 
2.20.1

[PATCH v4 3/3] um: allow static linking for non-glibc implementations

2020-07-19 Thread Ignat Korchagin

It is possible to produce a statically linked UML binary with UML_NET_VECTOR,
UML_NET_VDE and UML_NET_PCAP options enabled using alternative libc
implementations, which do not rely on NSS, such as musl.

Allow static linking in this case.

Signed-off-by: Ignat Korchagin 
Reviewed-by: Brendan Higgins 
Tested-by: Brendan Higgins 
---
 arch/um/Kconfig | 6 +++---
 arch/um/drivers/Kconfig | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index 9318dc6d1a0c..d888b859c167 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -62,12 +62,12 @@ config NR_CPUS
 
 source "arch/$(HEADER_ARCH)/um/Kconfig"
 
-config FORBID_STATIC_LINK
-   bool
+config MAY_HAVE_RUNTIME_DEPS
+bool
 
 config STATIC_LINK
bool "Force a static link"
-   depends on !FORBID_STATIC_LINK
+   depends on CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS || !MAY_HAVE_RUNTIME_DEPS
help
  This option gives you the ability to force a static link of UML.
  Normally, UML is linked as a shared binary.  This is inconvenient for
diff --git a/arch/um/drivers/Kconfig b/arch/um/drivers/Kconfig
index 9160ead56e33..2e7b8e0e7194 100644
--- a/arch/um/drivers/Kconfig
+++ b/arch/um/drivers/Kconfig
@@ -234,7 +234,7 @@ config UML_NET_DAEMON
 config UML_NET_VECTOR
bool "Vector I/O high performance network devices"
depends on UML_NET
-   select FORBID_STATIC_LINK
+   select MAY_HAVE_RUNTIME_DEPS
help
This User-Mode Linux network driver uses multi-message send
and receive functions. The host running the UML guest must have
@@ -246,7 +246,7 @@ config UML_NET_VECTOR
 config UML_NET_VDE
bool "VDE transport (obsolete)"
depends on UML_NET
-   select FORBID_STATIC_LINK
+   select MAY_HAVE_RUNTIME_DEPS
help
This User-Mode Linux network transport allows one or more running
UMLs on a single host to communicate with each other and also
@@ -294,7 +294,7 @@ config UML_NET_MCAST
 config UML_NET_PCAP
bool "pcap transport (obsolete)"
depends on UML_NET
-   select FORBID_STATIC_LINK
+   select MAY_HAVE_RUNTIME_DEPS
help
The pcap transport makes a pcap packet stream on the host look
like an ethernet device inside UML.  This is useful for making
-- 
2.20.1

[PATCH v4 1/3] um/kconfig: introduce CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS

2020-07-19 Thread Ignat Korchagin

For statically linked UML build it is important to take into account the
standard C-library implementation. Some implementations, notably glibc have
caveats: even when linked statically, the final program might require some
runtime dependencies, if certain functions are used within the code.

Consider the following program:
int main(void)
{
getpwent();
return 0;
}

Compiling this program and linking statically with glibc produces the following
warning from the linker:
/usr/sbin/ld: /tmp/ccuthw1o.o: in function `main':
test.c:(.text+0x5): warning: Using 'getpwent' in statically linked
applications requires at runtime the shared libraries from the glibc version
used for linking

We will use the flag to detect such C-library implementation build time and
possibly disable static linking for UML to avoid producing a binary with
unexpected behaviour and dependencies.

Signed-off-by: Ignat Korchagin 
Reviewed-by: Brendan Higgins 
Tested-by: Brendan Higgins 
---
 init/Kconfig   | 6 ++
 scripts/cc-can-link.sh | 5 +++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 0498af567f70..0a1ec56c9f33 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -57,6 +57,12 @@ config CC_CAN_LINK_STATIC
default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) 
$(CLANG_FLAGS) $(m64-flag) -static) if 64BIT
default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) 
$(CLANG_FLAGS) $(m32-flag) -static)
 
+config CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS
+   bool
+   depends on UML && CC_CAN_LINK_STATIC
+   default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) 
$(CLANG_FLAGS) $(m64-flag) -static -Xlinker --fatal-warnings) if 64BIT
+   default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) 
$(CLANG_FLAGS) $(m32-flag) -static -Xlinker --fatal-warnings)
+
 config CC_HAS_ASM_GOTO
def_bool $(success,$(srctree)/scripts/gcc-goto.sh $(CC))
 
diff --git a/scripts/cc-can-link.sh b/scripts/cc-can-link.sh
index 6efcead31989..e5011a46103e 100755
--- a/scripts/cc-can-link.sh
+++ b/scripts/cc-can-link.sh
@@ -2,10 +2,11 @@
 # SPDX-License-Identifier: GPL-2.0
 
 cat << "END" | $@ -x c - -o /dev/null >/dev/null 2>&1
-#include 
+#include 
+#include 
 int main(void)
 {
-   printf("");
+   getpwent();
return 0;
 }
 END
-- 
2.20.1

[PATCH v4 0/3] um: allow static linking for non-glibc libc implementations

2020-07-19 Thread Ignat Korchagin

Changes from v3:
  * restored FORBID_STATIC_LINK and renamed it to MAY_HAVE_RUNTIME_DEPS, so we
don't have to maintain a static list of options, which may produce runtime
dependencies

This is a continuation of [1]. Since I was able to produce a working UML binary
with UML_NET_VECTOR linked with musl with the changes included in the patches
here. I was compiling on Arch Linux, so hopefully all the latest versions of
the compiler, libraries and binutils.

I also tested allyesconfig with both musl and glibc. The compilation succeeds
with both, however both binaries (glibc one being dynamically linked) segfault
on start. This is probably of some incompatible config option/module being
included and not related to musl/glibc.

[1]: 
https://patchwork.ozlabs.org/project/linux-um/patch/20200624212319.403689-1-ig...@cloudflare.com/

Ignat Korchagin (3):
  um/kconfig: introduce CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS
  um: some fixes to build UML with musl
  um: allow static linking for non-glibc implementations

 arch/um/Kconfig   |  6 +++---
 arch/um/drivers/Kconfig   |  6 +++---
 arch/um/drivers/daemon_user.c |  1 +
 arch/um/drivers/pcap_user.c   | 12 ++--
 arch/um/drivers/slip_user.c   |  2 +-
 arch/um/drivers/vector_user.c |  4 +---
 arch/um/os-Linux/util.c   |  2 +-
 arch/x86/um/user-offsets.c|  2 +-
 init/Kconfig  |  6 ++
 scripts/cc-can-link.sh|  5 +++--
 10 files changed, 26 insertions(+), 20 deletions(-)

-- 
2.20.1

Re: [PATCH v3 3/3] um: allow static linking for non-glibc implementations

2020-07-19 Thread Ignat Korchagin

On Thu, Jul 16, 2020 at 10:10 AM Johannes Berg
 wrote:
>
> On Wed, 2020-07-15 at 21:11 +0100, Ignat Korchagin wrote:
> > It is possible to produce a statically linked UML binary with 
> > UML_NET_VECTOR,
> > UML_NET_VDE and UML_NET_PCAP options enabled using alternative libc
> > implementations, which do not rely on NSS, such as musl.
> >
> > Allow static linking in this case.
> >
> > Signed-off-by: Ignat Korchagin 
> > Reviewed-by: Brendan Higgins 
> > Tested-by: Brendan Higgins 
> > ---
> >  arch/um/Kconfig | 5 +
> >  arch/um/drivers/Kconfig | 3 ---
> >  2 files changed, 1 insertion(+), 7 deletions(-)
> >
> > diff --git a/arch/um/Kconfig b/arch/um/Kconfig
> > index 9318dc6d1a0c..beb98b3b9f75 100644
> > --- a/arch/um/Kconfig
> > +++ b/arch/um/Kconfig
> > @@ -62,12 +62,9 @@ config NR_CPUS
> >
> >  source "arch/$(HEADER_ARCH)/um/Kconfig"
> >
> > -config FORBID_STATIC_LINK
> > - bool
> > -
> >  config STATIC_LINK
> >   bool "Force a static link"
> > - depends on !FORBID_STATIC_LINK
> > + depends on CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS || (!UML_NET_VECTOR && 
> > !UML_NET_VDE && !UML_NET_PCAP)
>
> Come to think of it, in a way "FORBID_STATIC_LINK" was nicer because
> there didn't need to be a single list of "what has dynamic dependencies"
> like here the list of UML_NET_VECTOR, UML_NET_VDE, UML_NET_PCAP.
>
> Maybe it could be
>
> config MAY_HAVE_NON_STATIC_RUNTIME_DEPS
> bool
>
> config STATIC_LINK
> ...
> depends on !MAY_HAVE_NON_STATIC_RUNTIME_DEPS || 
> CC_CAN_LINK_STATIC_NO_RUNTIME_DEPS
>
>
> and then UML_NET_VECTOR et al can
>
> select MAY_HAVE_NON_STATIC_RUNTIME_DEPS
>
> so that the knowledge is still distributed to the corresponding options?

Yes, makes sense. I've shortened it to MAY_HAVE_RUNTIME_DEPS and
reposted the series.

> johannes
>

[PATCH v3] dm crypt: add flags to optionally bypass dm-crypt workqueues

2020-07-06 Thread Ignat Korchagin

Changes from v2:
  * dropped nobacklog boolean - ciphers are OK to backlog requests
  * moved some conditionals inline dropping the extra local variables
  * renamed "noresched" -> "atomic"

This is a follow up from [1]. Consider the following script:

sudo modprobe brd rd_nr=1 rd_size=4194304

echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | \
sudo dmsetup create eram0

echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 
no_write_workqueue' | \
sudo dmsetup create eram0-inline-write

echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 
no_read_workqueue' | \
sudo dmsetup create eram0-inline-read

devices="/dev/ram0 /dev/mapper/eram0 /dev/mapper/eram0-inline-read "
devices+="/dev/mapper/eram0-inline-write"

for dev in $devices; do
  echo "reading from $dev"
  sudo fio --filename=$dev --readwrite=read --bs=4k --direct=1 \
  --loops=100 --runtime=3m --name=plain | grep READ
done

for dev in $devices; do
  echo "writing to $dev"
  sudo fio --filename=$dev --readwrite=write --bs=4k --direct=1 \
  --loops=100 --runtime=3m --name=plain | grep WRITE
done

This script creates a ramdisk (to eliminate hardware bias in the benchmark) and
three dm-crypt instances on top. All dm-crypt instances use the NULL cipher
to eliminate potentially expensive crypto bias (the NULL cipher just uses memcpy
for "encyption"). The first instance is the current dm-crypt implementation from
5.8-rc2, the two others have new optional flags enabled, which bypass kcryptd
workqueues for reads and writes respectively and write sorting for writes. On
my VM (Debian in VirtualBox with 4 cores on 2.8 GHz Quad-Core Intel Core i7) I
get the following output (formatted for better readability):

reading from /dev/ram0
   READ: bw=508MiB/s (533MB/s), 508MiB/s-508MiB/s (533MB/s-533MB/s), io=89.3GiB 
(95.9GB), run=18-18msec

reading from /dev/mapper/eram0
   READ: bw=80.6MiB/s (84.5MB/s), 80.6MiB/s-80.6MiB/s (84.5MB/s-84.5MB/s), 
io=14.2GiB (15.2GB), run=18-18msec

reading from /dev/mapper/eram0-inline-read
   READ: bw=295MiB/s (309MB/s), 295MiB/s-295MiB/s (309MB/s-309MB/s), io=51.8GiB 
(55.6GB), run=18-18msec

reading from /dev/mapper/eram0-inline-write
   READ: bw=114MiB/s (120MB/s), 114MiB/s-114MiB/s (120MB/s-120MB/s), io=20.1GiB 
(21.5GB), run=18-18msec

writing to /dev/ram0
  WRITE: bw=516MiB/s (541MB/s), 516MiB/s-516MiB/s (541MB/s-541MB/s), io=90.7GiB 
(97.4GB), run=180001-180001msec

writing to /dev/mapper/eram0
  WRITE: bw=40.4MiB/s (42.4MB/s), 40.4MiB/s-40.4MiB/s (42.4MB/s-42.4MB/s), 
io=7271MiB (7624MB), run=180001-180001msec

writing to /dev/mapper/eram0-inline-read
  WRITE: bw=38.9MiB/s (40.8MB/s), 38.9MiB/s-38.9MiB/s (40.8MB/s-40.8MB/s), 
io=7000MiB (7340MB), run=180001-180001msec

writing to /dev/mapper/eram0-inline-write
  WRITE: bw=277MiB/s (290MB/s), 277MiB/s-277MiB/s (290MB/s-290MB/s), io=48.6GiB 
(52.2GB), run=18-18msec

Current dm-crypt implementation creates a significant IO performance overhead
(at least on small IO block sizes) for both latency and throughput. We suspect
offloading IO request processing into workqueues and async threads is more
harmful these days with the modern fast storage. I also did some digging into
the dm-crypt git history and much of this async processing is not needed
anymore, because the reasons it was added are mostly gone from the kernel. More
details can be found in [2] (see "Git archeology" section).

This change adds no_(read|write)_workqueue flags separately for read and write
BIOs, which direct dm-crypt not to offload crypto operations into kcryptd
workqueues and process everything inline. In addition, writes are not buffered
to be sorted in the dm-crypt red-black tree, but dispatched immediately. For
cases, where crypto operations cannot happen inline (hard interrupt context,
for example the read path of some NVME drivers), we offload the work to a
tasklet rather than a workqueue.

These flags ensure inline BIO processing in the dm-crypt module only. It is
worth noting that some Crypto API implementations may offload encryption into
their own workqueues, which are independent of the dm-crypt and its
configuration. However upon enabling no_(read|write)_workqueue flags dm-crypt
will instruct Crypto API not to backlog crypto requests.

[1]: https://www.spinics.net/lists/dm-crypt/msg07516.html
[2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/

Signed-off-by: Ignat Korchagin 
---
 drivers/md/dm-crypt.c | 50 ---
 1 file changed, 42 insertions(+), 8 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 000ddfab5ba0..7536ecb2c95d 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -69,6 +69,7 @@ struct dm_crypt_io {
u8 *integrity_metadata;
bool integrity_metadata_from_pool;
struct work_struc

Re: [dm-devel] [PATCH v2] dm crypt: add flags to optionally bypass dm-crypt workqueues

2020-07-06 Thread Ignat Korchagin

On Mon, Jul 6, 2020 at 3:28 PM Bob Liu  wrote:
>
> Hi Ignat,
>
> On 6/27/20 5:03 AM, Ignat Korchagin wrote:
> > This is a follow up from [1]. Consider the following script:
> >
> > sudo modprobe brd rd_nr=1 rd_size=4194304
> >
>
> Did you test null_blk device? I didn't get result as expected using null_blk.

I've modified the script in the patch description to use a null_blk
device instead of /dev/ram0, but I created the block device with
queue_mode=0. My results are as follows:

reading from /dev/nullb0
   READ: bw=527MiB/s (552MB/s), 527MiB/s-527MiB/s (552MB/s-552MB/s),
io=92.6GiB (99.4GB), run=18-18msec
reading from /dev/mapper/eram0
   READ: bw=112MiB/s (117MB/s), 112MiB/s-112MiB/s (117MB/s-117MB/s),
io=19.7GiB (21.1GB), run=180001-180001msec
reading from /dev/mapper/eram0-inline-read
   READ: bw=288MiB/s (302MB/s), 288MiB/s-288MiB/s (302MB/s-302MB/s),
io=50.6GiB (54.3GB), run=18-18msec
reading from /dev/mapper/eram0-inline-write
   READ: bw=123MiB/s (129MB/s), 123MiB/s-123MiB/s (129MB/s-129MB/s),
io=21.6GiB (23.1GB), run=18-18msec
writing to /dev/nullb0
  WRITE: bw=510MiB/s (535MB/s), 510MiB/s-510MiB/s (535MB/s-535MB/s),
io=89.7GiB (96.3GB), run=180001-180001msec
writing to /dev/mapper/eram0
  WRITE: bw=45.9MiB/s (48.1MB/s), 45.9MiB/s-45.9MiB/s
(48.1MB/s-48.1MB/s), io=8264MiB (8666MB), run=180001-180001msec
writing to /dev/mapper/eram0-inline-read
  WRITE: bw=42.5MiB/s (44.6MB/s), 42.5MiB/s-42.5MiB/s
(44.6MB/s-44.6MB/s), io=7658MiB (8030MB), run=180001-180001msec
writing to /dev/mapper/eram0-inline-write
  WRITE: bw=259MiB/s (272MB/s), 259MiB/s-259MiB/s (272MB/s-272MB/s),
io=45.6GiB (48.0GB), run=18-18msec

Still see the same improvement. Can you retry your test with a
queue_mode=0 null_blk device?

Thanks,
Ignat

> 1.
> # fio --filename=/dev/nullb0 --readwrite=readwrite --bs=4k --direct=1 --loops 
>=100 --name=plain
> Run status group 0 (all jobs):
>   READ: bw=390MiB/s (409MB/s), 390MiB/s-390MiB/s (409MB/s-409MB/s), 
> io=10.6GiB (11.3GB), run=27744-27744msec
>   WRITE: bw=390MiB/s (409MB/s), 390MiB/s-390MiB/s (409MB/s-409MB/s), 
> io=10.6GiB (11.3GB), run=27744-27744msec
>
> 2.
> Create enctrypted-ram0 based on null_blk(without this patch):
> # cryptsetup open --header crypthdr.img /dev/nullb0 encrypted-ram0
> # fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k 
> --direct=1 --loops=100 --name=crypt
> Run status group 0 (all jobs):
>   READ: bw=180MiB/s (188MB/s), 180MiB/s-180MiB/s (188MB/s-188MB/s), 
> io=4534MiB (4754MB), run=25246-25246msec
>   WRITE: bw=179MiB/s (188MB/s), 179MiB/s-179MiB/s (188MB/s-188MB/s), 
> io=4528MiB (4748MB), run=25246-25246msec
>
> 3.
> Create enctrypted-ram0 based on null_blk(with this patch):
> # cryptsetup open --header crypthdr.img /dev/nullb0 encrypted-ram0
> # fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k  
> --direct=1 --loops=100 --name=crypt.patched
> Run status group 0 (all jobs):
>   READ: bw=149MiB/s (156MB/s), 149MiB/s-149MiB/s (156MB/s-156MB/s), 
> io=4128MiB (4329MB), run=27753-27753msec
>   WRITE: bw=149MiB/s (156MB/s), 149MiB/s-149MiB/s (156MB/s-156MB/s), 
> io=4124MiB (4324MB), run=27753-27753msec
>
> Looks like the result is worse after this patch, or I may miss something..
>
> Regards,
> Bob
>
>
> > echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | \
> > sudo dmsetup create eram0
> >
> > echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 
> > no_write_workqueue' | \
> > sudo dmsetup create eram0-inline-write
> >
> > echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 
> > no_read_workqueue' | \
> > sudo dmsetup create eram0-inline-read
> >
> > devices="/dev/ram0 /dev/mapper/eram0 /dev/mapper/eram0-inline-read "
> > devices+="/dev/mapper/eram0-inline-write"
> >
> > for dev in $devices; do
> >   echo "reading from $dev"
> >   sudo fio --filename=$dev --readwrite=read --bs=4k --direct=1 \
> >   --loops=100 --runtime=3m --name=plain | grep READ
> > done
> >
> > for dev in $devices; do
> >   echo "writing to $dev"
> >   sudo fio --filename=$dev --readwrite=write --bs=4k --direct=1 \
> >   --loops=100 --runtime=3m --name=plain | grep WRITE
> > done
> >
> > This script creates a ramdisk (to eliminate hardware bias in the benchmark) 
> > and
> > three dm-crypt instances on top. All dm-crypt instances use the NULL cipher
> > to eliminate potentially expensive crypto bias (the NULL cipher just uses 
> > memcpy
> > for "encyption"). The first instance is the current dm-crypt implementation 
&g

Re: [RFC PATCH] Revert "um: Make CONFIG_STATIC_LINK actually static"

2020-06-30 Thread Ignat Korchagin

On Tue, Jun 30, 2020 at 11:47 PM Brendan Higgins
 wrote:
>
> On Wed, Jun 24, 2020 at 2:23 PM Ignat Korchagin  wrote:
> >
> > This reverts commit 3363179385629c1804ea846f4e72608c2201a81e.
> >
> > This change is too restrictive. I've been running UML statically linked 
> > kernel
> >  with UML_NET_VECTOR networking in a docker "FROM: scratch" container just 
> > fine.
> > As long as we don't reference network peers by hostname and use only IP
> > addresses, NSS is not needed, so not used. In other words, it is possible to
> > have statically linked UML and UML_NET_VECTOR (and other networking types) 
> > and
> > use it, although with some restrictions, so let's not disable it.
> >
> > Additionally, it should be at least theoretically possible to use another 
> > libc
> > (like musl, bionic etc) for static linking. I was able with some hacks to
> > compile UML against musl, although the executable segfaults for now. But 
> > this
> > option prevents even the research to be done.
>
> The reason that we removed support for static linking when these
> networking options are enabled is because gcc doesn't support loading
> NSS when statically linked, which consequently breaks allyesconfig for
> UML under gcc. That is still the case with your revert.

Yes, sure. But I'm not only "researching", but using UML as a "router"
in one of my dev setups. 3363179385629c1804ea846f4e72608c2201a81e
mentions UML_NET_VECTOR incompatibility (and some other networking
options), which I had enabled and actually my whole dev networking is
based around UML_NET_VECTOR: I have two interfaces - one in raw mode
and one doing ipsec. All this was running in an empty "FROM: scratch"
container and obviously linked statically.

If the static linking breaks some other config options in allyesconfig
- that's another story, but I wanted to point out that config options
mentioned in the commit message worked just fine (if not trying to
resolve hostnames). In other words: if you don't resolve - glibc will
not try to load NSS. glibc-nss is a known problem and I would assume
most people trying to do static linking are aware of this - that is,
if they choose this path they are willing to live with the
consequences. That's why completely disabling the possibility sounds
too restrictive for me.

> I fully support the goal behind what you are trying to do. However, I
> do not want to see this patch accepted unless it is accompanied by an
> alternative change that still allows UML to compile under
> allyesconfig.

If I succeed linking it to musl (or other non-glibc lib), would that be enough?

> You said that in the current state, researching a solution is
> possible? Can you not research a solution with your patch applied to
> your own branch?


As a side note: I tried to revert this patch and statically link 5.7
UML with glibc, but the binary still segfaults on start, so I would
assume it is not related to my previous attempts linking with musl.

Regards,
Ignat

>
> Cheers

Re: ipmi_msghandler crashes in 4.19

2019-01-30 Thread Ignat Korchagin

We're rolling out 4.19.18 across the fleet. Hopefully, we'll not see
it anymore, but if we do, we'll let you know.

Regards,
Ignat

On Tue, Jan 29, 2019 at 10:29 AM Greg KH  wrote:
>
> On Tue, Jan 15, 2019 at 10:36:42AM -0800, Ivan Babrou wrote:
> > Hey,
> >
> > We've upgraded some machines from 4.14 to 4.19 and started seeing rare
> > crashes like these:
> >
> > [75855.909507] BUG: unable to handle kernel NULL pointer dereference
> > at 0d00
> > [75855.925667] PGD 0 P4D 0
> > [75855.936359] Oops:  [#1] SMP PTI
> > [75855.947951] CPU: 0 PID: 10 Comm: ksoftirqd/0 Tainted: G   O
> >  4.19.13-cloudflare-2019.1.4 #2019.1.4
> > [75855.966028] Hardware name: Quanta Cloud Technology Inc. QuantaPlex
> > T42S-2U(LBG-4) -/T42S-2U MB (Lewisburg-4), BIOS 3A11.Q10 06/29/2018
> > [75855.994246] RIP: 0010:__srcu_read_unlock+0xe/0x20
> > [75856.006851] Code: 01 48 63 c8 65 48 ff 04 ca f0 83 44 24 fc 00 c3
> > 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f0 83 44 24 fc 00
> > 48 63 f6 <48> 8b 87 e8 0c 00 00 65 48 ff 44 f0 10 c3 0f 1f 40 00 0f 1f
> > 44 00
> > [75856.041551] RSP: 0018:ba00cc66fd48 EFLAGS: 00010286
> > [75856.054564] RAX:  RBX:  RCX: 
> > 
> > [75856.069449] RDX: 0001 RSI:  RDI: 
> > 0018
> > [75856.084168] RBP: a28276abb200 R08: a29119772540 R09: 
> > 
> > [75856.098756] R10: 000c1425 R11: a29120a201c8 R12: 
> > a29118d57e08
> > [75856.113422] R13: dead0200 R14: dead0100 R15: 
> > a27dcbafa400
> > [75856.127798] FS:  () GS:a29120a0()
> > knlGS:
> > [75856.138973] perf: interrupt took too long (7735 > 7677), lowering
> > kernel.perf_event_max_sample_rate to 25000
> > [75856.143083] CS:  0010 DS:  ES:  CR0: 80050033
> > [75856.172956] CR2: 0d00 CR3: 00187ca0a005 CR4: 
> > 007606f0
> > [75856.187116] DR0:  DR1:  DR2: 
> > 
> > [75856.201312] DR3:  DR6: fffe0ff0 DR7: 
> > 0400
> > [75856.215274] PKRU: 5554
> > [75856.224621] Call Trace:
> > [75856.230942] perf: interrupt took too long (9748 > 9668), lowering
> > kernel.perf_event_max_sample_rate to 2
> > [75856.233560]  deliver_response+0x88/0xd0 [ipmi_msghandler]
> > [75856.261744]  deliver_local_response+0xe/0x30 [ipmi_msghandler]
> > [75856.273937]  handle_one_recv_msg+0x164/0xbf0 [ipmi_msghandler]
> > [75856.285962]  ? __switch_to_asm+0x34/0x70
> > [75856.295957]  ? __switch_to_asm+0x40/0x70
> > [75856.306011]  ? __switch_to_asm+0x34/0x70
> > [75856.315872]  ? __switch_to_asm+0x40/0x70
> > [75856.325562]  ? __switch_to_asm+0x34/0x70
> > [75856.325565]  ? __switch_to_asm+0x40/0x70
> > [75856.325567]  ? __switch_to_asm+0x34/0x70
> > [75856.325569]  ? __switch_to_asm+0x40/0x70
> > [75856.325578]  handle_new_recv_msgs+0x16d/0x1e0 [ipmi_msghandler]
> > [75856.325583]  ? __switch_to_asm+0x34/0x70
> > [75856.381815]  tasklet_action_common.isra.21+0x4e/0xf0
> > [75856.381823]  __do_softirq+0xd8/0x2d2
> > [75856.399498]  ? sort_range+0x20/0x20
> > [75856.399506]  run_ksoftirqd+0x1a/0x20
> > [75856.415184]  smpboot_thread_fn+0xc5/0x160
> > [75856.415190]  kthread+0x113/0x130
> > [75856.430502]  ? kthread_create_worker_on_cpu+0x70/0x70
> > [75856.430512]  ret_from_fork+0x35/0x40
> > [75856.446793] Modules linked in: xt_connlimit nf_conncount xt_bpf
> > xt_hashlimit cls_flow cls_u32 sch_htb sch_fq md_mod dm_crypt
> > algif_skcipher af_alg dm_mod dax ip6table_nat nf_nat_ipv6
> > ip6table_mangle ip6table_security ip6table_raw ip6table_filter
> > ip6_tables xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_TPROXY
> > nf_tproxy_ipv6 nf_tproxy_ipv4 xt_connmark iptable_mangle xt_owner
> > xt_CT xt_socket nf_socket_ipv4 nf_socket_ipv6 iptable_raw
> > nfnetlink_log xt_NFLOG xt_tcpudp xt_comment xt_conntrack nf_conntrack
> > nf_defrag_ipv6 nf_defrag_ipv4 xt_mark xt_multiport xt_set
> > iptable_filter bpfilter ip_set_hash_netport ip_set_hash_net
> > ip_set_hash_ip ip_set nfnetlink 8021q garp mrp stp llc skx_edac
> > x86_pkg_temp_thermal kvm_intel kvm irqbypass crc32_pclmul crc32c_intel
> > ipmi_ssif pcbc aesni_intel aes_x86_64 crypto_simd sfc(O)
> > [75856.446862]  cryptd glue_helper mdio ipmi_si xhci_pci i40e tpm_crb
> > ioatdma ipmi_devintf xhci_hcd dca ipmi_msghandler tpm_tis tpm_tis_core
> > tpm efivarfs ip_tables x_tables
> > [75856.569103] CR2: 0d00
> > [75856.569124] ---[ end trace 604e13a0789ee766 ]---
> >
> > [117620.868720] general protection fault:  [#1] SMP PTI
> > [117620.911871] CPU: 0 PID: 10 Comm: ksoftirqd/0 Tainted: G
> > O  4.19.0-cloudflare-2018.10.3 #1
> > [117620.937885] Hardware name: Quanta Computer Inc QuantaPlex
> > T41S-2U/S2S-MB, BIOS S2S_3B10.03 06/21/2018
> > [117620.963750] RIP: 0010:__srcu_read_unlock+0xe/0x20
> > [117620.984950] Code: 01 48 63 c8 65 48 ff 04 ca f0 83 44 24 fc

[PATCH v2] dm crypt: add flags to optionally bypass dm-crypt workqueues

2020-06-26 Thread Ignat Korchagin

This is a follow up from [1]. Consider the following script:

sudo modprobe brd rd_nr=1 rd_size=4194304

echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | \
sudo dmsetup create eram0

echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 
no_write_workqueue' | \
sudo dmsetup create eram0-inline-write

echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 
no_read_workqueue' | \
sudo dmsetup create eram0-inline-read

devices="/dev/ram0 /dev/mapper/eram0 /dev/mapper/eram0-inline-read "
devices+="/dev/mapper/eram0-inline-write"

for dev in $devices; do
  echo "reading from $dev"
  sudo fio --filename=$dev --readwrite=read --bs=4k --direct=1 \
  --loops=100 --runtime=3m --name=plain | grep READ
done

for dev in $devices; do
  echo "writing to $dev"
  sudo fio --filename=$dev --readwrite=write --bs=4k --direct=1 \
  --loops=100 --runtime=3m --name=plain | grep WRITE
done

This script creates a ramdisk (to eliminate hardware bias in the benchmark) and
three dm-crypt instances on top. All dm-crypt instances use the NULL cipher
to eliminate potentially expensive crypto bias (the NULL cipher just uses memcpy
for "encyption"). The first instance is the current dm-crypt implementation from
5.8-rc2, the two others have new optional flags enabled, which bypass kcryptd
workqueues for reads and writes respectively and write sorting for writes. On
my VM (Debian in VirtualBox with 4 cores on 2.8 GHz Quad-Core Intel Core i7) I
get the following output (formatted for better readability):

reading from /dev/ram0
   READ: bw=508MiB/s (533MB/s), 508MiB/s-508MiB/s (533MB/s-533MB/s), io=89.3GiB 
(95.9GB), run=18-18msec

reading from /dev/mapper/eram0
   READ: bw=80.6MiB/s (84.5MB/s), 80.6MiB/s-80.6MiB/s (84.5MB/s-84.5MB/s), 
io=14.2GiB (15.2GB), run=18-18msec

reading from /dev/mapper/eram0-inline-read
   READ: bw=295MiB/s (309MB/s), 295MiB/s-295MiB/s (309MB/s-309MB/s), io=51.8GiB 
(55.6GB), run=18-18msec

reading from /dev/mapper/eram0-inline-write
   READ: bw=114MiB/s (120MB/s), 114MiB/s-114MiB/s (120MB/s-120MB/s), io=20.1GiB 
(21.5GB), run=18-18msec

writing to /dev/ram0
  WRITE: bw=516MiB/s (541MB/s), 516MiB/s-516MiB/s (541MB/s-541MB/s), io=90.7GiB 
(97.4GB), run=180001-180001msec

writing to /dev/mapper/eram0
  WRITE: bw=40.4MiB/s (42.4MB/s), 40.4MiB/s-40.4MiB/s (42.4MB/s-42.4MB/s), 
io=7271MiB (7624MB), run=180001-180001msec

writing to /dev/mapper/eram0-inline-read
  WRITE: bw=38.9MiB/s (40.8MB/s), 38.9MiB/s-38.9MiB/s (40.8MB/s-40.8MB/s), 
io=7000MiB (7340MB), run=180001-180001msec

writing to /dev/mapper/eram0-inline-write
  WRITE: bw=277MiB/s (290MB/s), 277MiB/s-277MiB/s (290MB/s-290MB/s), io=48.6GiB 
(52.2GB), run=18-18msec

Current dm-crypt implementation creates a significant IO performance overhead
(at least on small IO block sizes) for both latency and throughput. We suspect
offloading IO request processing into workqueues and async threads is more
harmful these days with the modern fast storage. I also did some digging into
the dm-crypt git history and much of this async processing is not needed
anymore, because the reasons it was added are mostly gone from the kernel. More
details can be found in [2] (see "Git archeology" section).

This change adds no_(read|write)_workqueue flags separately for read and write
BIOs, which direct dm-crypt not to offload crypto operations into kcryptd
workqueues and process everything inline. In addition, writes are not buffered
to be sorted in the dm-crypt red-black tree, but dispatched immediately. For
cases, where crypto operations cannot happen inline (hard interrupt context,
for example the read path of some NVME drivers), we offload the work to a
tasklet rather than a workqueue.

These flags ensure inline BIO processing in the dm-crypt module only. It is
worth noting that some Crypto API implementations may offload encryption into
their own workqueues, which are independent of the dm-crypt and its
configuration. However upon enabling no_(read|write)_workqueue flags dm-crypt
will instruct Crypto API not to backlog crypto requests.

[1]: https://www.spinics.net/lists/dm-crypt/msg07516.html
[2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/

Signed-off-by: Ignat Korchagin 
---
 drivers/md/dm-crypt.c | 68 +--
 1 file changed, 52 insertions(+), 16 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 000ddfab5ba0..6924eb49b1df 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -69,6 +69,7 @@ struct dm_crypt_io {
u8 *integrity_metadata;
bool integrity_metadata_from_pool;
struct work_struct work;
+   struct tasklet_struct tasklet;
 
struct convert_context ctx;
 
@@ -127,7 +128,8 @@ struct iv_elephant_private {
  * and encrypts / decrypts at the same time.
  */
 enum flags { D

[RFC PATCH 0/1] dm-crypt excessive overhead

2020-06-19 Thread Ignat Korchagin

This is a follow up from the long-forgotten [1], but with some more convincing
evidence. Consider the following script:

#!/bin/bash -e

# create 4G ramdisk
sudo modprobe brd rd_nr=1 rd_size=4194304

# create a dm-crypt device with NULL cipher on top of /dev/ram0
echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | sudo dmsetup 
create eram0

# create a dm-crypt device with NULL cipher and custom force_inline flag
echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 force_inline' | 
sudo dmsetup create inline-eram0

# read all data from /dev/ram0
sudo dd if=/dev/ram0 bs=4k iflag=direct | sha256sum

# read the same data from /dev/mapper/eram0
sudo dd if=/dev/mapper/eram0 bs=4k iflag=direct | sha256sum

# read the same data from /dev/mapper/inline-eram0
sudo dd if=/dev/mapper/inline-eram0 bs=4k iflag=direct | sha256sum

This script creates a ramdisk (to eliminate hardware bias in the benchmark) and
two dm-crypt instances on top. Both dm-crypt instances use the NULL cipher
to eliminate potentially expensive crypto bias (the NULL cipher just uses memcpy
for "encyption"). The first instance is the current dm-crypt implementation from
5.8-rc1, the second is the dm-crypt instance with a custom new flag enabled from
the patch attached to this thread. On my VM (Debian in VirtualBox with 4 cores
on 2.8 GHz Quad-Core Intel Core i7) I get the following output (formatted for
better readability):

# plain ram0
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.2305 s, 202 MB/s
8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -

# eram0 (current dm-crypt)
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 53.2212 s, 80.7 MB/s
8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -

# inline-eram0 (patched dm-crypt)
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.3472 s, 201 MB/s
8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -

As we can see, current dm-crypt implementation creates a significant IO
performance overhead (at least on small IO block sizes) for both latency and
throughput. We suspect offloading IO request processing into workqueues and
async threads is more harmful these days with the modern fast storage. I also
did some digging into the dm-crypt git history and much of this async processing
is not needed anymore, because the reasons it was added are mostly gone from the
kernel. More details can be found in [2] (see "Git archeology" section).

We have been running the attached patch on different hardware generations in
more than 200 datacentres on both SATA SSDs and NVME SSDs and so far were very
happy with the performance benefits.

[1]: https://www.spinics.net/lists/dm-crypt/msg07516.html
[2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/

Ignat Korchagin (1):
  Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target

 drivers/md/dm-crypt.c | 55 +--
 1 file changed, 43 insertions(+), 12 deletions(-)

-- 
2.20.1

[RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target

2020-06-19 Thread Ignat Korchagin

Sometimes extra thread offloading imposed by dm-crypt hurts IO latency. This is
especially visible on busy systems with many processes/threads. Moreover, most
Crypto API implementaions are async, that is they offload crypto operations on
their own, so this dm-crypt offloading is excessive.

This adds a new flag, which directs dm-crypt not to offload crypto operations
and process everything inline. For cases, where crypto operations cannot happen
inline (hard interrupt context, for example the read path of the NVME driver),
we offload the work to a tasklet rather than a workqueue.

Signed-off-by: Ignat Korchagin 
---
 drivers/md/dm-crypt.c | 55 +--
 1 file changed, 43 insertions(+), 12 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 000ddfab5ba0..5a9bac4fdffb 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -69,6 +69,7 @@ struct dm_crypt_io {
u8 *integrity_metadata;
bool integrity_metadata_from_pool;
struct work_struct work;
+   struct tasklet_struct tasklet;
 
struct convert_context ctx;
 
@@ -127,7 +128,7 @@ struct iv_elephant_private {
  * and encrypts / decrypts at the same time.
  */
 enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID,
-DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD };
+DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD, DM_CRYPT_FORCE_INLINE = 
(sizeof(unsigned long) * 8 - 1) };
 
 enum cipher_flags {
CRYPT_MODE_INTEGRITY_AEAD,  /* Use authenticated mode for cihper */
@@ -1458,13 +1459,18 @@ static void crypt_alloc_req_skcipher(struct 
crypt_config *cc,
 
skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);
 
-   /*
-* Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
-* requests if driver request queue is full.
-*/
-   skcipher_request_set_callback(ctx->r.req,
-   CRYPTO_TFM_REQ_MAY_BACKLOG,
-   kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
+   if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags))
+   /* make sure we zero important fields of the request */
+   skcipher_request_set_callback(ctx->r.req,
+   0, NULL, NULL);
+   else
+   /*
+* Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
+* requests if driver request queue is full.
+*/
+   skcipher_request_set_callback(ctx->r.req,
+   CRYPTO_TFM_REQ_MAY_BACKLOG,
+   kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
 }
 
 static void crypt_alloc_req_aead(struct crypt_config *cc,
@@ -1566,7 +1572,8 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
atomic_dec(&ctx->cc_pending);
ctx->cc_sector += sector_step;
tag_offset++;
-   cond_resched();
+   if (!test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags))
+   cond_resched();
continue;
/*
 * There was a data integrity error.
@@ -1892,6 +1899,11 @@ static void kcryptd_crypt_write_io_submit(struct 
dm_crypt_io *io, int async)
 
clone->bi_iter.bi_sector = cc->start + io->sector;
 
+   if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags)) {
+   generic_make_request(clone);
+   return;
+   }
+
if (likely(!async) && test_bit(DM_CRYPT_NO_OFFLOAD, &cc->flags)) {
generic_make_request(clone);
return;
@@ -2031,12 +2043,26 @@ static void kcryptd_crypt(struct work_struct *work)
kcryptd_crypt_write_convert(io);
 }
 
+static void kcryptd_crypt_tasklet(unsigned long work)
+{
+   kcryptd_crypt((struct work_struct *)work);
+}
+
 static void kcryptd_queue_crypt(struct dm_crypt_io *io)
 {
struct crypt_config *cc = io->cc;
 
-   INIT_WORK(&io->work, kcryptd_crypt);
-   queue_work(cc->crypt_queue, &io->work);
+   if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags)) {
+   if (in_irq()) {
+   /* Crypto API will fail hard in hard IRQ context */
+   tasklet_init(&io->tasklet, kcryptd_crypt_tasklet, 
(unsigned long)&io->work);
+   tasklet_schedule(&io->tasklet);
+   } else
+   kcryptd_crypt(&io->work);
+   } else {
+   INIT_WORK(&io->work, kcryptd_crypt);
+   queue_work(cc->crypt_queue, &io->work);
+   }
 }
 
 static void crypt_free_tfms_aead(struct crypt_config *cc)
@@ -2838,7 +2864,7 @@ static int crypt_ctr_optional(struct dm_target *ti, 
unsigned int argc, char **ar
struct crypt_config *cc = ti->private;
struct dm_arg_set as;

Re: [RFC PATCH 0/1] dm-crypt excessive overhead

2020-06-19 Thread Ignat Korchagin

On Fri, Jun 19, 2020 at 7:39 PM Mikulas Patocka  wrote:
>
>
>
> On Fri, 19 Jun 2020, Mike Snitzer wrote:
>
> > On Fri, Jun 19 2020 at 12:41pm -0400,
> > Ignat Korchagin  wrote:
> >
> > > This is a follow up from the long-forgotten [1], but with some more 
> > > convincing
> > > evidence. Consider the following script:
> > >
> > > [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html
> > > [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/
> > >
> > > Ignat Korchagin (1):
> > >   Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
> > >
> > >  drivers/md/dm-crypt.c | 55 +--
> > >  1 file changed, 43 insertions(+), 12 deletions(-)
> > >
> > > --
> > > 2.20.1
> > >
> >
> > Hi,
> >
> > I saw [2] and have been expecting something from cloudflare ever since.
> > Nice to see this submission.
> >
> > There is useful context in your 0th patch header.  I'll likely merge
> > parts of this patch header with the more terse 1/1 header (reality is
> > there only needed to be a single patch submission).
> >
> > Will review and stage accordingly if all looks fine to me.  Mikulas,
> > please have a look too.
> >
> > Thanks,
> > Mike
>
> +   if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags)) {
> +   if (in_irq()) {
> +   /* Crypto API will fail hard in hard IRQ context */
> +   tasklet_init(&io->tasklet, kcryptd_crypt_tasklet, 
> (unsigned long)&io->work);
> +   tasklet_schedule(&io->tasklet);
> +   } else
> +   kcryptd_crypt(&io->work);
> +   } else {
> +   INIT_WORK(&io->work, kcryptd_crypt);
> +   queue_work(cc->crypt_queue, &io->work);
> +   }
>
> I'm looking at this and I'd like to know why does the crypto API fail in
> hard-irq context and why does it work in tasklet context. What's the exact
> reason behind this?

This comes from
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/crypto/skcipher.c?id=5e857ce6eae7ca21b2055cca4885545e29228fe2#n433
And, I believe, it is there for the same reason kcryptd was introduced
in 2005 in dm-crypt:
"...because it would be very unwise to do decryption in an interrupt
context..." (that is, when other interrupts are disabled). In tasklet
however we can still service other interrupts even if we process a
large block.

>
>
> Mikulas

Re: [RFC PATCH 0/1] dm-crypt excessive overhead

2020-06-20 Thread Ignat Korchagin

Yes, it should.

I got one when I was testing the first iteration (without the tasklet)
of the patch on an NVME? disk.

On Sat, Jun 20, 2020 at 8:36 PM Mikulas Patocka  wrote:
>
>
>
> On Sat, 20 Jun 2020, Herbert Xu wrote:
>
> > On Fri, Jun 19, 2020 at 02:39:39PM -0400, Mikulas Patocka wrote:
> > >
> > > I'm looking at this and I'd like to know why does the crypto API fail in
> > > hard-irq context and why does it work in tasklet context. What's the exact
> > > reason behind this?
> >
> > You're not supposed to do any real work in IRQ handlers.  All
> > the substantial work should be postponed to softirq context.
>
> I see.
>
> BTW - should it also warn if it is running in a process context with
> interrupts disabled?
>
> Mikulas
>
> > Why do you need to do work in hard IRQ context?
> >
> > Cheers,
> > --
> > Email: Herbert Xu 
> > Home Page: http://gondor.apana.org.au/~herbert/
> > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> >
>

Re: [dm-devel] [RFC PATCH 0/1] dm-crypt excessive overhead

2020-06-22 Thread Ignat Korchagin

On Mon, Jun 22, 2020 at 1:45 AM Damien Le Moal  wrote:
>
> On 2020/06/20 1:56, Mike Snitzer wrote:
> > On Fri, Jun 19 2020 at 12:41pm -0400,
> > Ignat Korchagin  wrote:
> >
> >> This is a follow up from the long-forgotten [1], but with some more 
> >> convincing
> >> evidence. Consider the following script:
> >>
> >> #!/bin/bash -e
> >>
> >> # create 4G ramdisk
> >> sudo modprobe brd rd_nr=1 rd_size=4194304
> >>
> >> # create a dm-crypt device with NULL cipher on top of /dev/ram0
> >> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | sudo 
> >> dmsetup create eram0
> >>
> >> # create a dm-crypt device with NULL cipher and custom force_inline flag
> >> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 
> >> force_inline' | sudo dmsetup create inline-eram0
> >>
> >> # read all data from /dev/ram0
> >> sudo dd if=/dev/ram0 bs=4k iflag=direct | sha256sum
> >>
> >> # read the same data from /dev/mapper/eram0
> >> sudo dd if=/dev/mapper/eram0 bs=4k iflag=direct | sha256sum
> >>
> >> # read the same data from /dev/mapper/inline-eram0
> >> sudo dd if=/dev/mapper/inline-eram0 bs=4k iflag=direct | sha256sum
> >>
> >> This script creates a ramdisk (to eliminate hardware bias in the 
> >> benchmark) and
> >> two dm-crypt instances on top. Both dm-crypt instances use the NULL cipher
> >> to eliminate potentially expensive crypto bias (the NULL cipher just uses 
> >> memcpy
> >> for "encyption"). The first instance is the current dm-crypt 
> >> implementation from
> >> 5.8-rc1, the second is the dm-crypt instance with a custom new flag 
> >> enabled from
> >> the patch attached to this thread. On my VM (Debian in VirtualBox with 4 
> >> cores
> >> on 2.8 GHz Quad-Core Intel Core i7) I get the following output (formatted 
> >> for
> >> better readability):
> >>
> >> # plain ram0
> >> 1048576+0 records in
> >> 1048576+0 records out
> >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.2305 s, 202 MB/s
> >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> >>
> >> # eram0 (current dm-crypt)
> >> 1048576+0 records in
> >> 1048576+0 records out
> >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 53.2212 s, 80.7 MB/s
> >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> >>
> >> # inline-eram0 (patched dm-crypt)
> >> 1048576+0 records in
> >> 1048576+0 records out
> >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.3472 s, 201 MB/s
> >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> >>
> >> As we can see, current dm-crypt implementation creates a significant IO
> >> performance overhead (at least on small IO block sizes) for both latency 
> >> and
> >> throughput. We suspect offloading IO request processing into workqueues and
> >> async threads is more harmful these days with the modern fast storage. I 
> >> also
> >> did some digging into the dm-crypt git history and much of this async 
> >> processing
> >> is not needed anymore, because the reasons it was added are mostly gone 
> >> from the
> >> kernel. More details can be found in [2] (see "Git archeology" section).
> >>
> >> We have been running the attached patch on different hardware generations 
> >> in
> >> more than 200 datacentres on both SATA SSDs and NVME SSDs and so far were 
> >> very
> >> happy with the performance benefits.
> >>
> >> [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html
> >> [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/
> >>
> >> Ignat Korchagin (1):
> >>   Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
> >>
> >>  drivers/md/dm-crypt.c | 55 +--
> >>  1 file changed, 43 insertions(+), 12 deletions(-)
> >>
> >> --
> >> 2.20.1
> >>
> >
> > Hi,
> >
> > I saw [2] and have been expecting something from cloudflare ever since.
> > Nice to see this submission.
> >
> > There is useful context in your 0th patch header.  I'll likely merge
> > parts of this patch header with the more terse 1/1 header (reality is
> > there only needed to be a single patch submission).
> >
> > Will review

Re: [RFC PATCH 0/1] dm-crypt excessive overhead

2020-06-24 Thread Ignat Korchagin

On Wed, Jun 24, 2020 at 6:22 AM Mike Snitzer  wrote:
>
> On Wed, Jun 24 2020 at 12:54am -0400,
> Damien Le Moal  wrote:
>
> > On 2020/06/24 0:23, Mike Snitzer wrote:
> > > On Tue, Jun 23 2020 at 11:07am -0400,
> > > Ignat Korchagin  wrote:
> > >
> > >> Do you think it may be better to break it in two flags: one for read
> > >> path and one for write? So, depending on the needs and workflow these
> > >> could be enabled independently?
> > >
> > > If there is a need to split, then sure.  But I think Damien had a hard
> > > requirement that writes had to be inlined but that reads didn't _need_
> > > to be for his dm-zoned usecase.  Damien may not yet have assessed the
> > > performance implications, of not have reads inlined, as much as you
> > > have.
> >
> > We did do performance testing :)
> > The results are mixed and performance differences between inline vs 
> > workqueues
> > depend on the workload (IO size, IO queue depth and number of drives being 
> > used
> > mostly). In many cases, inlining everything does really improve performance 
> > as
> > Ignat reported.
> >
> > In our testing, we used hard drives and so focused mostly on throughput 
> > rather
> > than command latency. The added workqueue context switch overhead and crypto
> > work latency compared to typical HDD IO times is small, and significant 
> > only if
> > the backend storage as short IO times.
> >
> > In the case of HDDs, especially for large IO sizes, inlining crypto work 
> > does
> > not shine as it prevents an efficient use of CPU resources. This is 
> > especially
> > true with reads on a large system with many drives connected to a single 
> > HBA:
> > the softirq context decryption work does not lend itself well to using other
> > CPUs that did not receive the HBA IRQ signaling command completions. The 
> > test
> > results clearly show much higher throughputs using dm-crypt as is.
> >
> > On the other hand, inlining crypto work significantly improves workloads of
> > small random IOs, even for a large number of disks: removing the overhead of
> > context switches allows faster completions, allowing sending more requests 
> > to
> > the drives more quickly, keeping them busy.
> >
> > For SMR, the inlining of write requests is *mandatory* to preserve the 
> > issuer
> > write sequence, but encryption work being done in the issuer context 
> > (writes to
> > SMR drives can only be O_DIRECT writes), efficient CPU resource usage can be
> > achieved by simply using multiple writer thread/processes, working on 
> > different
> > zones of different disks. This is a very reasonable model for SMR as writes 
> > into
> > a single zone have to be done under mutual exclusion to ensure 
> > sequentiality.
> >
> > For reads, SMR drives are essentially exactly the same as regular disks, so
> > as-is or inline are both OK. Based on our performance results, allowing the 
> > user
> > to have the choice of inlining or not reads based on the target workload 
> > would
> > be great.
> >
> > Of note is that zone append writes (emulated in SCSI, native with NVMe) are 
> > not
> > subject to the sequential write constraint, so they can also be executed 
> > either
> > inline or asynchronously.
> >
> > > So let's see how Damien's work goes and if he trully doesn't need/want
> > > reads to be inlined then 2 flags can be created.
> >
> > For SMR, I do not need inline reads, but I do want the user to have the
> > possibility of using this setup as that can provide better performance for 
> > some
> > workloads. I think that splitting the inline flag in 2 is exactly what we 
> > want:
> >
> > 1) For SMR, the write-inline flag can be automatically turned on when the 
> > target
> > device is created if the backend device used is a host-managed zoned drive 
> > (scsi
> > or NVMe ZNS). For reads, it would be the user choice, based on the target 
> > workload.
> > 2) For regular block devices, write-inline only, read-inline only or both 
> > would
> > be the user choice, to optimize for their target workload.
> >
> > With the split into 2 flags, my SMR support patch becomes very simple.
>
> OK, thanks for all the context.  Was a fun read ;)
>
> SO let's run with splitting into 2 flags.  Ignat would you be up to
> tweaking your patch to provide that and post a v2?
>
> An added bonus would be to consolidate your 0/1 and 1/1 patch headers,
> and add in the additional answers you provided in this thread to help
> others understand the patch (mainly some more detail about why tasklet
> is used).

Yes, will do

> Thanks,
> Mike
>

Re: [dm-crypt] [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target

2020-06-24 Thread Ignat Korchagin

On Wed, Jun 24, 2020 at 6:04 AM Eric Biggers  wrote:
>
> On Fri, Jun 19, 2020 at 05:41:32PM +0100, Ignat Korchagin wrote:
> > Sometimes extra thread offloading imposed by dm-crypt hurts IO latency. 
> > This is
> > especially visible on busy systems with many processes/threads. Moreover, 
> > most
> > Crypto API implementaions are async, that is they offload crypto operations 
> > on
> > their own, so this dm-crypt offloading is excessive.
>
> This really should say "some Crypto API implementations are async" instead of
> "most Crypto API implementations are async".

The most accurate would probably be: most hardware-accelerated Crypto
API implementations are async

> Notably, the AES-NI implementation of AES-XTS is synchronous if you call it 
> in a
> context where SIMD instructions are usable.  It's only asynchronous when SIMD 
> is
> not usable.  (This seems to have been missed in your blog post.)

No, it was not. This is exactly why we made xts-proxy Crypto API
module as a second patch. But it seems now it does not make a big
difference if a used Crypto API implementation is synchronous as well
(based on some benchmarks outlined in the cover letter to this patch).
I think the v2 of this patch will not require a synchronous Crypto
API. This is probably a right thing to do, as the "inline" flag should
control the way how dm-crypt itself handles requests, not how Crypto
API handles requests. If a user wants to ensure a particular
synchronous Crypto API implementation, they can already reconfigure
dm-crypt and specify the implementation with a "capi:" prefix in the
the dm table description.

> > This adds a new flag, which directs dm-crypt not to offload crypto 
> > operations
> > and process everything inline. For cases, where crypto operations cannot 
> > happen
> > inline (hard interrupt context, for example the read path of the NVME 
> > driver),
> > we offload the work to a tasklet rather than a workqueue.
>
> This patch both removes some dm-crypt specific queueing, and changes 
> decryption
> to use softIRQ context instead of a workqueue.  It would be useful to know how
> much of a difference the workqueue => softIRQ change makes by itself.  Such a
> change could be useful for fscrypt as well.  (fscrypt uses a workqueue for
> decryption, but besides that doesn't use any other queueing.)
>
> > @@ -127,7 +128,7 @@ struct iv_elephant_private {
> >   * and encrypts / decrypts at the same time.
> >   */
> >  enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID,
> > -  DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD };
> > +  DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD, DM_CRYPT_FORCE_INLINE = 
> > (sizeof(unsigned long) * 8 - 1) };
>
> Assigning a specific enum value isn't necessary.

Yes, this is a leftover from our "internal" patch which I wanted to
make "future proof" in case future iterations of dm-crypt will add
some flags to avoid flag collisions. Will remove in v2.

>
> > @@ -1458,13 +1459,18 @@ static void crypt_alloc_req_skcipher(struct 
> > crypt_config *cc,
> >
> >   skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);
> >
> > - /*
> > -  * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
> > -  * requests if driver request queue is full.
> > -  */
> > - skcipher_request_set_callback(ctx->r.req,
> > - CRYPTO_TFM_REQ_MAY_BACKLOG,
> > - kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
> > + if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags))
> > + /* make sure we zero important fields of the request */
> > + skcipher_request_set_callback(ctx->r.req,
> > + 0, NULL, NULL);
> > + else
> > + /*
> > +  * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
> > +  * requests if driver request queue is full.
> > +  */
> > + skcipher_request_set_callback(ctx->r.req,
> > + CRYPTO_TFM_REQ_MAY_BACKLOG,
> > + kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
> >  }
>
> This looks wrong.  Unless type=0 and mask=CRYPTO_ALG_ASYNC are passed to
> crypto_alloc_skcipher(), the skcipher implementation can still be 
> asynchronous,
> in which case providing a callback is required.
>
> Do you intend that the "force_inline" option forces the use of a synchronous
> skcipher (alongside the other things it does)?  Or should it still allow
> asynchronous ones?

As mentioned above, I don't think we should require synchronous crypto
with the "force_inline" flag anymore. Although we may remove
CRYPTO_TFM_REQ_MAY_BACKLOG with the inline flag.

> We may not actually have a choice in that matter, since xts-aes-aesni has the
> CRYPTO_ALG_ASYNC bit set (as I mentioned) despite being synchronous in most
> cases; thus, the crypto API won't give you it if you ask for a synchronous
> cipher.  So I think you still need to allow async skciphers?  That means a
> callback is still always required.
>
> - Eric

Re: [PATCH v2 5/8] net: af_can: do not leave a dangling sk pointer in can_create()

2024-10-08 Thread Ignat Korchagin

On Tue, Oct 8, 2024 at 3:38 AM Vincent MAILHOL
 wrote:
>
> Hi Ignat,
>
> Thanks for the patch.
>
> On Tue. 8 Oct. 2024 at 06:37, Ignat Korchagin  wrote:
> > On error can_create() frees the allocated sk object, but sock_init_data()
> > has already attached it to the provided sock object. This will leave a
> > dangling sk pointer in the sock object and may cause use-after-free later.
>
> I was about to suggest that this should be backported to stable, but
> after reading the cover letter, I now understand that this patch is
> more an improvement to avoid false positives on kmemleak & Cie. Maybe
> the description could be a bit more nuanced? After patch 1/8 of this
> series, it seems that the use-after-free is not possible anymore.

If we go with Kuniyuki's suggestion in the cover email to replace the
explicit NULL with DEBUG_NET_WARN_ON_ONCE(sock->sk) in __sock_create()
then use-after-free would be possible again except we will easily
catch it. But I will review the description, when I respin the patches
to net-next as requested by Jakub.

> > Signed-off-by: Ignat Korchagin 
>
> See above comment as notwithstanding. This said:
>
> Reviewed-by: Vincent Mailhol 

Thank you.

>
> > ---
> >  net/can/af_can.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/net/can/af_can.c b/net/can/af_can.c
> > index 707576eeeb58..01f3fbb3b67d 100644
> > --- a/net/can/af_can.c
> > +++ b/net/can/af_can.c
> > @@ -171,6 +171,7 @@ static int can_create(struct net *net, struct socket 
> > *sock, int protocol,
> > /* release sk on errors */
> > sock_orphan(sk);
> > sock_put(sk);
> > +   sock->sk = NULL;
> > }
> >
> >   errout:
> > --
> > 2.39.5
> >
> >

[PATCH v2 8/8] inet6: do not leave a dangling sk pointer in inet6_create()

2024-10-07 Thread Ignat Korchagin

sock_init_data() attaches the allocated sk pointer to the provided sock
object. If inet6_create() fails later, the sk object is released, but the
sock object retains the dangling sk pointer, which may cause use-after-free
later.

Clear the sock sk pointer on error.

Signed-off-by: Ignat Korchagin 
---
 net/ipv6/af_inet6.c | 22 ++
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index ba69b86f1c7d..f60ec8b0f8ea 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -252,31 +252,29 @@ static int inet6_create(struct net *net, struct socket 
*sock, int protocol,
 */
inet->inet_sport = htons(inet->inet_num);
err = sk->sk_prot->hash(sk);
-   if (err) {
-   sk_common_release(sk);
-   goto out;
-   }
+   if (err)
+   goto out_sk_release;
}
if (sk->sk_prot->init) {
err = sk->sk_prot->init(sk);
-   if (err) {
-   sk_common_release(sk);
-   goto out;
-   }
+   if (err)
+   goto out_sk_release;
}
 
if (!kern) {
err = BPF_CGROUP_RUN_PROG_INET_SOCK(sk);
-   if (err) {
-   sk_common_release(sk);
-   goto out;
-   }
+   if (err)
+   goto out_sk_release;
}
 out:
return err;
 out_rcu_unlock:
rcu_read_unlock();
goto out;
+out_sk_release:
+   sk_common_release(sk);
+   sock->sk = NULL;
+   goto out;
 }
 
 static int __inet6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
-- 
2.39.5

[PATCH v2 1/8] net: explicitly clear the sk pointer, when pf->create fails

2024-10-07 Thread Ignat Korchagin

We have recently noticed the exact same KASAN splat as in commit
6cd4a78d962b ("net: do not leave a dangling sk pointer, when socket
creation fails"). The problem is that commit did not fully address the
problem, as some pf->create implementations do not use sk_common_release
in their error paths.

For example, we can use the same reproducer as in the above commit, but
changing ping to arping. arping uses AF_PACKET socket and if packet_create
fails, it will just sk_free the allocated sk object.

While we could chase all the pf->create implementations and make sure they
NULL the freed sk object on error from the socket, we can't guarantee
future protocols will not make the same mistake.

So it is easier to just explicitly NULL the sk pointer upon return from
pf->create in __sock_create. We do know that pf->create always releases the
allocated sk object on error, so if the pointer is not NULL, it is
definitely dangling.

Fixes: 6cd4a78d962b ("net: do not leave a dangling sk pointer, when socket 
creation fails")
Signed-off-by: Ignat Korchagin 
Cc: sta...@vger.kernel.org
---
 net/core/sock.c | 3 ---
 net/socket.c| 7 ++-
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 039be95c40cf..e6e04081949c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3819,9 +3819,6 @@ void sk_common_release(struct sock *sk)
 
sk->sk_prot->unhash(sk);
 
-   if (sk->sk_socket)
-   sk->sk_socket->sk = NULL;
-
/*
 * In this point socket cannot receive new packets, but it is possible
 * that some packets are in flight because some CPU runs receiver and
diff --git a/net/socket.c b/net/socket.c
index 601ad74930ef..042451f01c65 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1574,8 +1574,13 @@ int __sock_create(struct net *net, int family, int type, 
int protocol,
rcu_read_unlock();
 
err = pf->create(net, sock, protocol, kern);
-   if (err < 0)
+   if (err < 0) {
+   /* ->create should release the allocated sock->sk object on 
error
+* but it may leave the dangling pointer
+*/
+   sock->sk = NULL;
goto out_module_put;
+   }
 
/*
 * Now to bump the refcnt of the [loadable] module that owns this
-- 
2.39.5

[PATCH v2 0/8] do not leave dangling sk pointers in pf->create functions

2024-10-07 Thread Ignat Korchagin

Some protocol family create() implementations have an error path after
allocating the sk object and calling sock_init_data(). sock_init_data()
attaches the allocated sk object to the sock object, provided by the
caller.

If the create() implementation errors out after calling sock_init_data(),
it releases the allocated sk object, but the caller ends up having a
dangling sk pointer in its sock object on return. Subsequent manipulations
on this sock object may try to access the sk pointer, because it is not
NULL thus creating a use-after-free scenario.

While the first patch in the series should be enough to handle this
scenario Eric Dumazet suggested that it would be a good idea to refactor
the code for the af_packet implementation to avoid the error path, which
leaves a dangling pointer, because it may be better for some tools like
kmemleak. I went a bit further and tried to actually fix all the
implementations, which could potentially leave a dangling sk pointer.

Changes in V2:
  * reverted the change introduced in 6cd4a78d962b ("net: do not leave a
dangling sk pointer, when socket creation fails")
  * added optional commits to all pf->create implementaions to clear the
sk pointer on error after sock_init_data()

Ignat Korchagin (8):
  net: explicitly clear the sk pointer, when pf->create fails
  af_packet: avoid erroring out after sock_init_data() in
packet_create()
  Bluetooth: L2CAP: do not leave dangling sk pointer on error in
l2cap_sock_create()
  Bluetooth: RFCOMM: avoid leaving dangling sk pointer in
rfcomm_sock_alloc()
  net: af_can: do not leave a dangling sk pointer in can_create()
  net: ieee802154: do not leave a dangling sk pointer in
ieee802154_create()
  net: inet: do not leave a dangling sk pointer in inet_create()
  inet6: do not leave a dangling sk pointer in inet6_create()

 net/bluetooth/l2cap_sock.c  |  1 +
 net/bluetooth/rfcomm/sock.c | 10 +-
 net/can/af_can.c|  1 +
 net/core/sock.c |  3 ---
 net/ieee802154/socket.c | 12 +++-
 net/ipv4/af_inet.c  | 22 ++
 net/ipv6/af_inet6.c | 22 ++
 net/packet/af_packet.c  | 12 ++--
 net/socket.c|  7 ++-
 9 files changed, 46 insertions(+), 44 deletions(-)

-- 
2.39.5

[PATCH v2 2/8] af_packet: avoid erroring out after sock_init_data() in packet_create()

2024-10-07 Thread Ignat Korchagin

After sock_init_data() the allocated sk object is attached to the provided
sock object. On error, packet_create() frees the sk object leaving the
dangling pointer in the sock object on return. Some other code may try
to use this pointer and cause use-after-free.

Suggested-by: Eric Dumazet 
Signed-off-by: Ignat Korchagin 
---
 net/packet/af_packet.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index a705ec214254..97774bd4b6cb 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -3421,17 +3421,17 @@ static int packet_create(struct net *net, struct socket 
*sock, int protocol,
if (sock->type == SOCK_PACKET)
sock->ops = &packet_ops_spkt;
 
+   po = pkt_sk(sk);
+   err = packet_alloc_pending(po);
+   if (err)
+   goto out_sk_free;
+
sock_init_data(sock, sk);
 
-   po = pkt_sk(sk);
init_completion(&po->skb_completion);
sk->sk_family = PF_PACKET;
po->num = proto;
 
-   err = packet_alloc_pending(po);
-   if (err)
-   goto out2;
-
packet_cached_dev_reset(po);
 
sk->sk_destruct = packet_sock_destruct;
@@ -3463,7 +3463,7 @@ static int packet_create(struct net *net, struct socket 
*sock, int protocol,
sock_prot_inuse_add(net, &packet_proto, 1);
 
return 0;
-out2:
+out_sk_free:
sk_free(sk);
 out:
return err;
-- 
2.39.5

[PATCH v2 4/8] Bluetooth: RFCOMM: avoid leaving dangling sk pointer in rfcomm_sock_alloc()

2024-10-07 Thread Ignat Korchagin

bt_sock_alloc() attaches allocated sk object to the provided sock object.
If rfcomm_dlc_alloc() fails, we release the sk object, but leave the
dangling pointer in the sock object, which may cause use-after-free.

Fix this by swapping calls to bt_sock_alloc() and rfcomm_dlc_alloc().

Signed-off-by: Ignat Korchagin 
---
 net/bluetooth/rfcomm/sock.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/bluetooth/rfcomm/sock.c b/net/bluetooth/rfcomm/sock.c
index 37d63d768afb..0d0c4311da57 100644
--- a/net/bluetooth/rfcomm/sock.c
+++ b/net/bluetooth/rfcomm/sock.c
@@ -274,13 +274,13 @@ static struct sock *rfcomm_sock_alloc(struct net *net, 
struct socket *sock,
struct rfcomm_dlc *d;
struct sock *sk;
 
-   sk = bt_sock_alloc(net, sock, &rfcomm_proto, proto, prio, kern);
-   if (!sk)
+   d = rfcomm_dlc_alloc(prio);
+   if (!d)
return NULL;
 
-   d = rfcomm_dlc_alloc(prio);
-   if (!d) {
-   sk_free(sk);
+   sk = bt_sock_alloc(net, sock, &rfcomm_proto, proto, prio, kern);
+   if (!sk) {
+   rfcomm_dlc_free(d);
return NULL;
}
 
-- 
2.39.5

[PATCH v2 3/8] Bluetooth: L2CAP: do not leave dangling sk pointer on error in l2cap_sock_create()

2024-10-07 Thread Ignat Korchagin

bt_sock_alloc() allocates the sk object and attaches it to the provided
sock object. On error l2cap_sock_alloc() frees the sk object, but the
dangling pointer is still attached to the sock object, which may create
use-after-free in other code.

Signed-off-by: Ignat Korchagin 
---
 net/bluetooth/l2cap_sock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
index ba437c6f6ee5..18e89e764f3b 100644
--- a/net/bluetooth/l2cap_sock.c
+++ b/net/bluetooth/l2cap_sock.c
@@ -1886,6 +1886,7 @@ static struct sock *l2cap_sock_alloc(struct net *net, 
struct socket *sock,
chan = l2cap_chan_create();
if (!chan) {
sk_free(sk);
+   sock->sk = NULL;
return NULL;
}
 
-- 
2.39.5

[PATCH v2 7/8] net: inet: do not leave a dangling sk pointer in inet_create()

2024-10-07 Thread Ignat Korchagin

sock_init_data() attaches the allocated sk object to the provided sock
object. If inet_create() fails later, the sk object is freed, but the
sock object retains the dangling pointer, which may create use-after-free
later.

Clear the sk pointer in the sock object on error.

Signed-off-by: Ignat Korchagin 
---
 net/ipv4/af_inet.c | 22 ++
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index b24d74616637..8095e82de808 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -376,32 +376,30 @@ static int inet_create(struct net *net, struct socket 
*sock, int protocol,
inet->inet_sport = htons(inet->inet_num);
/* Add to protocol hash chains. */
err = sk->sk_prot->hash(sk);
-   if (err) {
-   sk_common_release(sk);
-   goto out;
-   }
+   if (err)
+   goto out_sk_release;
}
 
if (sk->sk_prot->init) {
err = sk->sk_prot->init(sk);
-   if (err) {
-   sk_common_release(sk);
-   goto out;
-   }
+   if (err)
+   goto out_sk_release;
}
 
if (!kern) {
err = BPF_CGROUP_RUN_PROG_INET_SOCK(sk);
-   if (err) {
-   sk_common_release(sk);
-   goto out;
-   }
+   if (err)
+   goto out_sk_release;
}
 out:
return err;
 out_rcu_unlock:
rcu_read_unlock();
goto out;
+out_sk_release:
+   sk_common_release(sk);
+   sock->sk = NULL;
+   goto out;
 }
 
 
-- 
2.39.5

[PATCH v2 5/8] net: af_can: do not leave a dangling sk pointer in can_create()

2024-10-07 Thread Ignat Korchagin

On error can_create() frees the allocated sk object, but sock_init_data()
has already attached it to the provided sock object. This will leave a
dangling sk pointer in the sock object and may cause use-after-free later.

Signed-off-by: Ignat Korchagin 
---
 net/can/af_can.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/can/af_can.c b/net/can/af_can.c
index 707576eeeb58..01f3fbb3b67d 100644
--- a/net/can/af_can.c
+++ b/net/can/af_can.c
@@ -171,6 +171,7 @@ static int can_create(struct net *net, struct socket *sock, 
int protocol,
/* release sk on errors */
sock_orphan(sk);
sock_put(sk);
+   sock->sk = NULL;
}
 
  errout:
-- 
2.39.5

[PATCH v2 6/8] net: ieee802154: do not leave a dangling sk pointer in ieee802154_create()

2024-10-07 Thread Ignat Korchagin

sock_init_data() attaches the allocated sk object to the provided sock
object. If ieee802154_create() fails later, the allocated sk object is
freed, but the dangling pointer remains in the provided sock object, which
may allow use-after-free.

Clear the sk pointer in the sock object on error.

Signed-off-by: Ignat Korchagin 
---
 net/ieee802154/socket.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/ieee802154/socket.c b/net/ieee802154/socket.c
index 990a83455dcf..18d267921bb5 100644
--- a/net/ieee802154/socket.c
+++ b/net/ieee802154/socket.c
@@ -1043,19 +1043,21 @@ static int ieee802154_create(struct net *net, struct 
socket *sock,
 
if (sk->sk_prot->hash) {
rc = sk->sk_prot->hash(sk);
-   if (rc) {
-   sk_common_release(sk);
-   goto out;
-   }
+   if (rc)
+   goto out_sk_release;
}
 
if (sk->sk_prot->init) {
rc = sk->sk_prot->init(sk);
if (rc)
-   sk_common_release(sk);
+   goto out_sk_release;
}
 out:
return rc;
+out_sk_release:
+   sk_common_release(sk);
+   sock->sk = NULL;
+   goto out;
 }
 
 static const struct net_proto_family ieee802154_family_ops = {
-- 
2.39.5

[PATCH net-next v3 3/9] Bluetooth: RFCOMM: avoid leaving dangling sk pointer in rfcomm_sock_alloc()

2024-10-14 Thread Ignat Korchagin

bt_sock_alloc() attaches allocated sk object to the provided sock object.
If rfcomm_dlc_alloc() fails, we release the sk object, but leave the
dangling pointer in the sock object, which may cause use-after-free.

Fix this by swapping calls to bt_sock_alloc() and rfcomm_dlc_alloc().

Signed-off-by: Ignat Korchagin 
---
 net/bluetooth/rfcomm/sock.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/bluetooth/rfcomm/sock.c b/net/bluetooth/rfcomm/sock.c
index f48250e3f2e1..355e1a1698f5 100644
--- a/net/bluetooth/rfcomm/sock.c
+++ b/net/bluetooth/rfcomm/sock.c
@@ -274,13 +274,13 @@ static struct sock *rfcomm_sock_alloc(struct net *net, 
struct socket *sock,
struct rfcomm_dlc *d;
struct sock *sk;
 
-   sk = bt_sock_alloc(net, sock, &rfcomm_proto, proto, prio, kern);
-   if (!sk)
+   d = rfcomm_dlc_alloc(prio);
+   if (!d)
return NULL;
 
-   d = rfcomm_dlc_alloc(prio);
-   if (!d) {
-   sk_free(sk);
+   sk = bt_sock_alloc(net, sock, &rfcomm_proto, proto, prio, kern);
+   if (!sk) {
+   rfcomm_dlc_free(d);
return NULL;
}
 
-- 
2.39.5

[PATCH net-next v3 2/9] Bluetooth: L2CAP: do not leave dangling sk pointer on error in l2cap_sock_create()

2024-10-14 Thread Ignat Korchagin

bt_sock_alloc() allocates the sk object and attaches it to the provided
sock object. On error l2cap_sock_alloc() frees the sk object, but the
dangling pointer is still attached to the sock object, which may create
use-after-free in other code.

Signed-off-by: Ignat Korchagin 
---
 net/bluetooth/l2cap_sock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
index ba437c6f6ee5..18e89e764f3b 100644
--- a/net/bluetooth/l2cap_sock.c
+++ b/net/bluetooth/l2cap_sock.c
@@ -1886,6 +1886,7 @@ static struct sock *l2cap_sock_alloc(struct net *net, 
struct socket *sock,
chan = l2cap_chan_create();
if (!chan) {
sk_free(sk);
+   sock->sk = NULL;
return NULL;
}
 
-- 
2.39.5

[PATCH net-next v3 1/9] af_packet: avoid erroring out after sock_init_data() in packet_create()

2024-10-14 Thread Ignat Korchagin

After sock_init_data() the allocated sk object is attached to the provided
sock object. On error, packet_create() frees the sk object leaving the
dangling pointer in the sock object on return. Some other code may try
to use this pointer and cause use-after-free.

Suggested-by: Eric Dumazet 
Signed-off-by: Ignat Korchagin 
---
 net/packet/af_packet.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index f8942062f776..99ae27d1e4dc 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -3421,17 +3421,17 @@ static int packet_create(struct net *net, struct socket 
*sock, int protocol,
if (sock->type == SOCK_PACKET)
sock->ops = &packet_ops_spkt;
 
+   po = pkt_sk(sk);
+   err = packet_alloc_pending(po);
+   if (err)
+   goto out_sk_free;
+
sock_init_data(sock, sk);
 
-   po = pkt_sk(sk);
init_completion(&po->skb_completion);
sk->sk_family = PF_PACKET;
po->num = proto;
 
-   err = packet_alloc_pending(po);
-   if (err)
-   goto out2;
-
packet_cached_dev_reset(po);
 
sk->sk_destruct = packet_sock_destruct;
@@ -3463,7 +3463,7 @@ static int packet_create(struct net *net, struct socket 
*sock, int protocol,
sock_prot_inuse_add(net, &packet_proto, 1);
 
return 0;
-out2:
+out_sk_free:
sk_free(sk);
 out:
return err;
-- 
2.39.5

[PATCH net-next v3 4/9] net: af_can: do not leave a dangling sk pointer in can_create()

2024-10-14 Thread Ignat Korchagin

On error can_create() frees the allocated sk object, but sock_init_data()
has already attached it to the provided sock object. This will leave a
dangling sk pointer in the sock object and may cause use-after-free later.

Signed-off-by: Ignat Korchagin 
Reviewed-by: Vincent Mailhol 
---
 net/can/af_can.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/can/af_can.c b/net/can/af_can.c
index 707576eeeb58..01f3fbb3b67d 100644
--- a/net/can/af_can.c
+++ b/net/can/af_can.c
@@ -171,6 +171,7 @@ static int can_create(struct net *net, struct socket *sock, 
int protocol,
/* release sk on errors */
sock_orphan(sk);
sock_put(sk);
+   sock->sk = NULL;
}
 
  errout:
-- 
2.39.5

[PATCH net-next v3 7/9] net: inet6: do not leave a dangling sk pointer in inet6_create()

2024-10-14 Thread Ignat Korchagin

sock_init_data() attaches the allocated sk pointer to the provided sock
object. If inet6_create() fails later, the sk object is released, but the
sock object retains the dangling sk pointer, which may cause use-after-free
later.

Clear the sock sk pointer on error.

Signed-off-by: Ignat Korchagin 
---
 net/ipv6/af_inet6.c | 22 ++
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index ba69b86f1c7d..f60ec8b0f8ea 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -252,31 +252,29 @@ static int inet6_create(struct net *net, struct socket 
*sock, int protocol,
 */
inet->inet_sport = htons(inet->inet_num);
err = sk->sk_prot->hash(sk);
-   if (err) {
-   sk_common_release(sk);
-   goto out;
-   }
+   if (err)
+   goto out_sk_release;
}
if (sk->sk_prot->init) {
err = sk->sk_prot->init(sk);
-   if (err) {
-   sk_common_release(sk);
-   goto out;
-   }
+   if (err)
+   goto out_sk_release;
}
 
if (!kern) {
err = BPF_CGROUP_RUN_PROG_INET_SOCK(sk);
-   if (err) {
-   sk_common_release(sk);
-   goto out;
-   }
+   if (err)
+   goto out_sk_release;
}
 out:
return err;
 out_rcu_unlock:
rcu_read_unlock();
goto out;
+out_sk_release:
+   sk_common_release(sk);
+   sock->sk = NULL;
+   goto out;
 }
 
 static int __inet6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
-- 
2.39.5

[PATCH net-next v3 8/9] net: warn, if pf->create does not clear sock->sk on error

2024-10-14 Thread Ignat Korchagin

All pf->create implementations have been fixed now to clear sock->sk on
error, when they deallocate the allocated sk object.

Put a warning in place to make sure we don't break this promise in the
future.

Suggested-by: Kuniyuki Iwashima 
Signed-off-by: Ignat Korchagin 
---
 net/socket.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/socket.c b/net/socket.c
index 24b404299015..9a8e4452b9b2 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1576,9 +1576,9 @@ int __sock_create(struct net *net, int family, int type, 
int protocol,
err = pf->create(net, sock, protocol, kern);
if (err < 0) {
/* ->create should release the allocated sock->sk object on 
error
-* but it may leave the dangling pointer
+* and make sure sock->sk is set to NULL to avoid use-after-free
 */
-   sock->sk = NULL;
+   DEBUG_NET_WARN_ON_ONCE(sock->sk);
goto out_module_put;
}
 
-- 
2.39.5

[PATCH net-next v3 6/9] net: inet: do not leave a dangling sk pointer in inet_create()

2024-10-14 Thread Ignat Korchagin

sock_init_data() attaches the allocated sk object to the provided sock
object. If inet_create() fails later, the sk object is freed, but the
sock object retains the dangling pointer, which may create use-after-free
later.

Clear the sk pointer in the sock object on error.

Signed-off-by: Ignat Korchagin 
---
 net/ipv4/af_inet.c | 22 ++
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index b24d74616637..8095e82de808 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -376,32 +376,30 @@ static int inet_create(struct net *net, struct socket 
*sock, int protocol,
inet->inet_sport = htons(inet->inet_num);
/* Add to protocol hash chains. */
err = sk->sk_prot->hash(sk);
-   if (err) {
-   sk_common_release(sk);
-   goto out;
-   }
+   if (err)
+   goto out_sk_release;
}
 
if (sk->sk_prot->init) {
err = sk->sk_prot->init(sk);
-   if (err) {
-   sk_common_release(sk);
-   goto out;
-   }
+   if (err)
+   goto out_sk_release;
}
 
if (!kern) {
err = BPF_CGROUP_RUN_PROG_INET_SOCK(sk);
-   if (err) {
-   sk_common_release(sk);
-   goto out;
-   }
+   if (err)
+   goto out_sk_release;
}
 out:
return err;
 out_rcu_unlock:
rcu_read_unlock();
goto out;
+out_sk_release:
+   sk_common_release(sk);
+   sock->sk = NULL;
+   goto out;
 }
 
 
-- 
2.39.5

[PATCH net-next v3 5/9] net: ieee802154: do not leave a dangling sk pointer in ieee802154_create()

2024-10-14 Thread Ignat Korchagin

sock_init_data() attaches the allocated sk object to the provided sock
object. If ieee802154_create() fails later, the allocated sk object is
freed, but the dangling pointer remains in the provided sock object, which
may allow use-after-free.

Clear the sk pointer in the sock object on error.

Signed-off-by: Ignat Korchagin 
Reviewed-by: Miquel Raynal 
---
 net/ieee802154/socket.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/ieee802154/socket.c b/net/ieee802154/socket.c
index 990a83455dcf..18d267921bb5 100644
--- a/net/ieee802154/socket.c
+++ b/net/ieee802154/socket.c
@@ -1043,19 +1043,21 @@ static int ieee802154_create(struct net *net, struct 
socket *sock,
 
if (sk->sk_prot->hash) {
rc = sk->sk_prot->hash(sk);
-   if (rc) {
-   sk_common_release(sk);
-   goto out;
-   }
+   if (rc)
+   goto out_sk_release;
}
 
if (sk->sk_prot->init) {
rc = sk->sk_prot->init(sk);
if (rc)
-   sk_common_release(sk);
+   goto out_sk_release;
}
 out:
return rc;
+out_sk_release:
+   sk_common_release(sk);
+   sock->sk = NULL;
+   goto out;
 }
 
 static const struct net_proto_family ieee802154_family_ops = {
-- 
2.39.5

[PATCH net-next v3 9/9] Revert "net: do not leave a dangling sk pointer, when socket creation fails"

2024-10-14 Thread Ignat Korchagin

This reverts commit 6cd4a78d962bebbaf8beb7d2ead3f34120e3f7b2.

inet/inet6->create() implementations have been fixed to explicitly NULL the
allocated sk object on error.

A warning was put in place to make sure any future changes will not leave
a dangling pointer in pf->create() implementations.

So this code is now redundant.

Suggested-by: Kuniyuki Iwashima 
Signed-off-by: Ignat Korchagin 
---
 net/core/sock.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 083d438d8b6f..a9391cb796a2 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3830,9 +3830,6 @@ void sk_common_release(struct sock *sk)
 
sk->sk_prot->unhash(sk);
 
-   if (sk->sk_socket)
-   sk->sk_socket->sk = NULL;
-
/*
 * In this point socket cannot receive new packets, but it is possible
 * that some packets are in flight because some CPU runs receiver and
-- 
2.39.5

[PATCH net-next v3 0/9] do not leave dangling sk pointers in pf->create functions

2024-10-14 Thread Ignat Korchagin

Some protocol family create() implementations have an error path after
allocating the sk object and calling sock_init_data(). sock_init_data()
attaches the allocated sk object to the sock object, provided by the
caller.

If the create() implementation errors out after calling sock_init_data(),
it releases the allocated sk object, but the caller ends up having a
dangling sk pointer in its sock object on return. Subsequent manipulations
on this sock object may try to access the sk pointer, because it is not
NULL thus creating a use-after-free scenario.

We have implemented a stable hotfix in commit 631083143315
("net: explicitly clear the sk pointer, when pf->create fails"), but this
series aims to fix it properly by going through each of the pf->create()
implementations and making sure they all don't return a sock object with
a dangling pointer on error.

Changes in V3:
  * retargeted the series to net-next
  * dropped the hotfix patch, which was merged into net already
  * replaced the hotfix code with DEBUG_NET_WARN_ON_ONCE() to catch future
violations

Changes in V2:
  * reverted the change introduced in 6cd4a78d962b ("net: do not leave a
dangling sk pointer, when socket creation fails")
  * added optional commits to all pf->create implementaions to clear the
sk pointer on error after sock_init_data()

Ignat Korchagin (9):
  af_packet: avoid erroring out after sock_init_data() in
packet_create()
  Bluetooth: L2CAP: do not leave dangling sk pointer on error in
l2cap_sock_create()
  Bluetooth: RFCOMM: avoid leaving dangling sk pointer in
rfcomm_sock_alloc()
  net: af_can: do not leave a dangling sk pointer in can_create()
  net: ieee802154: do not leave a dangling sk pointer in
ieee802154_create()
  net: inet: do not leave a dangling sk pointer in inet_create()
  net: inet6: do not leave a dangling sk pointer in inet6_create()
  net: warn, if pf->create does not clear sock->sk on error
  Revert "net: do not leave a dangling sk pointer, when socket creation
fails"

 net/bluetooth/l2cap_sock.c  |  1 +
 net/bluetooth/rfcomm/sock.c | 10 +-
 net/can/af_can.c|  1 +
 net/core/sock.c |  3 ---
 net/ieee802154/socket.c | 12 +++-
 net/ipv4/af_inet.c  | 22 ++
 net/ipv6/af_inet6.c | 22 ++
 net/packet/af_packet.c  | 12 ++--
 net/socket.c|  4 ++--
 9 files changed, 42 insertions(+), 45 deletions(-)

-- 
2.39.5

73 matches

Mail list logo