On Fri, Feb 7, 2025 at 10:31 AM Josh Poimboeuf <jpoim...@kernel.org> wrote: > > On Mon, Jan 27, 2025 at 02:35:26PM +0800, Yafang Shao wrote: > > The atomic replace livepatch mechanism was introduced to handle scenarios > > where we want to unload a specific livepatch without unloading others. > > However, its current implementation has significant shortcomings, making > > it less than ideal in practice. Below are the key downsides: > > > > - It is expensive > > > > During testing with frequent replacements of an old livepatch, random RCU > > warnings were observed: > > > > [19578271.779605] rcu_tasks_wait_gp: rcu_tasks grace period 642409 is > > 10024 jiffies old. > > [19578390.073790] rcu_tasks_wait_gp: rcu_tasks grace period 642417 is > > 10185 jiffies old. > > [19578423.034065] rcu_tasks_wait_gp: rcu_tasks grace period 642421 is > > 10150 jiffies old. > > [19578564.144591] rcu_tasks_wait_gp: rcu_tasks grace period 642449 is > > 10174 jiffies old. > > [19578601.064614] rcu_tasks_wait_gp: rcu_tasks grace period 642453 is > > 10168 jiffies old. > > [19578663.920123] rcu_tasks_wait_gp: rcu_tasks grace period 642469 is > > 10167 jiffies old. > > [19578872.990496] rcu_tasks_wait_gp: rcu_tasks grace period 642529 is > > 10215 jiffies old. > > [19578903.190292] rcu_tasks_wait_gp: rcu_tasks grace period 642529 is > > 40415 jiffies old. > > [19579017.965500] rcu_tasks_wait_gp: rcu_tasks grace period 642577 is > > 10174 jiffies old. > > [19579033.981425] rcu_tasks_wait_gp: rcu_tasks grace period 642581 is > > 10143 jiffies old. > > [19579153.092599] rcu_tasks_wait_gp: rcu_tasks grace period 642625 is > > 10188 jiffies old. > > > > This indicates that atomic replacement can cause performance issues, > > particularly with RCU synchronization under frequent use. > > Why does this happen?
It occurs during the KLP transition. It seems like the KLP transition is taking too long. [20329703.332453] livepatch: enabling patch 'livepatch_61_release6' [20329703.340417] livepatch: 'livepatch_61_release6': starting patching transition [20329715.314215] rcu_tasks_wait_gp: rcu_tasks grace period 1109765 is 10166 jiffies old. [20329737.126207] rcu_tasks_wait_gp: rcu_tasks grace period 1109769 is 10219 jiffies old. [20329752.018236] rcu_tasks_wait_gp: rcu_tasks grace period 1109773 is 10199 jiffies old. [20329754.848036] livepatch: 'livepatch_61_release6': patching complete > > > - Potential Risks During Replacement > > > > One known issue involves replacing livepatched versions of critical > > functions such as do_exit(). During the replacement process, a panic > > might occur, as highlighted in [0]. Other potential risks may also arise > > due to inconsistencies or race conditions during transitions. > > That needs to be fixed. > > > - Temporary Loss of Patching > > > > During the replacement process, the old patch is set to a NOP > > (no-operation) > > before the new patch is fully applied. This creates a window where the > > function temporarily reverts to its original, unpatched state. If the old > > patch fixed a critical issue (e.g., one that prevented a system panic), > > the > > system could become vulnerable to that issue during the transition. > > Are you saying that atomic replace is not atomic? If so, this sounds > like another bug. >From my understanding, there’s a window where the original function is not patched. klp_enable_patch + klp_init_patch + if (patch->replace) klp_add_nops(patch); <<<< set all old patches to nop + __klp_enable_patch + klp_patch_object + klp_patch_func + ops = klp_find_ops(func->old_func); + if (ops) // add the new patch to the func_stack list list_add_rcu(&func->stack_node, &ops->func_stack); klp_ftrace_handler + func = list_first_or_null_rcu(&ops->func_stack, struct klp_func + if (func->nop) goto unlock; + ftrace_regs_set_instruction_pointer(fregs, (unsigned long)func->new_func); Before the new atomic replace patch is added to the func_stack list, the old patch is already set to nop. If klp_ftrace_handler() is triggered at this point, it will effectively do nothing—in other words, it will execute the original function. I might be wrong. -- Regards Yafang