On Mon, Jan 27, 2025 at 02:35:26PM +0800, Yafang Shao wrote:
> The atomic replace livepatch mechanism was introduced to handle scenarios
> where we want to unload a specific livepatch without unloading others.
> However, its current implementation has significant shortcomings, making
> it less than ideal in practice. Below are the key downsides:
> 
> - It is expensive
> 
>   During testing with frequent replacements of an old livepatch, random RCU
>   warnings were observed:
> 
>   [19578271.779605] rcu_tasks_wait_gp: rcu_tasks grace period 642409 is 10024 jiffies old.
>   [19578390.073790] rcu_tasks_wait_gp: rcu_tasks grace period 642417 is 10185 jiffies old.
>   [19578423.034065] rcu_tasks_wait_gp: rcu_tasks grace period 642421 is 10150 jiffies old.
>   [19578564.144591] rcu_tasks_wait_gp: rcu_tasks grace period 642449 is 10174 jiffies old.
>   [19578601.064614] rcu_tasks_wait_gp: rcu_tasks grace period 642453 is 10168 jiffies old.
>   [19578663.920123] rcu_tasks_wait_gp: rcu_tasks grace period 642469 is 10167 jiffies old.
>   [19578872.990496] rcu_tasks_wait_gp: rcu_tasks grace period 642529 is 10215 jiffies old.
>   [19578903.190292] rcu_tasks_wait_gp: rcu_tasks grace period 642529 is 40415 jiffies old.
>   [19579017.965500] rcu_tasks_wait_gp: rcu_tasks grace period 642577 is 10174 jiffies old.
>   [19579033.981425] rcu_tasks_wait_gp: rcu_tasks grace period 642581 is 10143 jiffies old.
>   [19579153.092599] rcu_tasks_wait_gp: rcu_tasks grace period 642625 is 10188 jiffies old.
>   
>   This indicates that frequent atomic replacement can cause performance
>   issues, in particular by stressing RCU-tasks synchronization.

Why does this happen?

> - Potential Risks During Replacement 
> 
>   One known issue involves replacing livepatched versions of critical
>   functions such as do_exit(). During the replacement process, a panic
>   might occur, as highlighted in [0]. Other potential risks may also arise
>   due to inconsistencies or race conditions during transitions.

That needs to be fixed.

> - Temporary Loss of Patching 
> 
>   During the replacement process, the old patch is set to a NOP (no-operation)
>   before the new patch is fully applied. This creates a window where the
>   function temporarily reverts to its original, unpatched state. If the old
>   patch fixed a critical issue (e.g., one that prevented a system panic), the
>   system could become vulnerable to that issue during the transition.
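> 
>   For context, the replace semantics are requested by the patch module
>   itself. A minimal sketch of such a module, patterned after the in-tree
>   samples/livepatch/livepatch-sample.c (the patched function and its
>   replacement body are illustrative only), looks roughly like this:
> 
>     #include <linux/module.h>
>     #include <linux/kernel.h>
>     #include <linux/livepatch.h>
>     #include <linux/seq_file.h>
> 
>     /* Illustrative replacement for cmdline_proc_show(). */
>     static int livepatch_cmdline_proc_show(struct seq_file *m, void *v)
>     {
>             seq_printf(m, "%s\n", "this has been live patched");
>             return 0;
>     }
> 
>     static struct klp_func funcs[] = {
>             {
>                     .old_name = "cmdline_proc_show",
>                     .new_func = livepatch_cmdline_proc_show,
>             }, { }
>     };
> 
>     static struct klp_object objs[] = {
>             {
>                     /* NULL name means the symbol lives in vmlinux */
>                     .funcs = funcs,
>             }, { }
>     };
> 
>     static struct klp_patch patch = {
>             .mod = THIS_MODULE,
>             .objs = objs,
>             /* replace all previously enabled livepatches */
>             .replace = true,
>     };
> 
>     static int livepatch_init(void)
>     {
>             return klp_enable_patch(&patch);
>     }
> 
>     static void livepatch_exit(void)
>     {
>     }
> 
>     module_init(livepatch_init);
>     module_exit(livepatch_exit);
>     MODULE_LICENSE("GPL");
>     MODULE_INFO(livepatch, "Y");
> 
>   With .replace set, functions that were patched only by the previous
>   cumulative patch are handled through implicit nop entries in the new
>   patch, which is the transition this point is concerned about.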

Are you saying that atomic replace is not atomic?  If so, this sounds
like another bug.

-- 
Josh
