On 02/01/2026 at 15:09, Ryan Roberts wrote:
On 02/01/2026 13:39, Jason A. Donenfeld wrote:
Hi Ryan,

On Fri, Jan 2, 2026 at 2:12 PM Ryan Roberts wrote:
context. Given the function is just a handful of operations and doesn't

How many? What's this looking like in terms of assembly?

25 instructions on arm64:

31 instructions on powerpc:

00000000 <prandom_u32_state>:
   0:   7c 69 1b 78     mr      r9,r3
   4:   80 63 00 00     lwz     r3,0(r3)
   8:   80 89 00 08     lwz     r4,8(r9)
   c:   81 69 00 04     lwz     r11,4(r9)
  10:   80 a9 00 0c     lwz     r5,12(r9)
  14:   54 67 30 32     slwi    r7,r3,6
  18:   7c e7 1a 78     xor     r7,r7,r3
  1c:   55 66 10 3a     slwi    r6,r11,2
  20:   54 88 68 24     slwi    r8,r4,13
  24:   54 63 90 18     rlwinm  r3,r3,18,0,12
  28:   7d 6b 32 78     xor     r11,r11,r6
  2c:   7d 08 22 78     xor     r8,r8,r4
  30:   54 aa 18 38     slwi    r10,r5,3
  34:   54 e7 9b 7e     srwi    r7,r7,13
  38:   7c e7 1a 78     xor     r7,r7,r3
  3c:   51 66 2e fe     rlwimi  r6,r11,5,27,31
  40:   54 84 38 28     rlwinm  r4,r4,7,0,20
  44:   7d 4a 2a 78     xor     r10,r10,r5
  48:   55 08 5d 7e     srwi    r8,r8,21
  4c:   7d 08 22 78     xor     r8,r8,r4
  50:   7c e3 32 78     xor     r3,r7,r6
  54:   54 a5 68 16     rlwinm  r5,r5,13,0,11
  58:   55 4a a3 3e     srwi    r10,r10,12
  5c:   7d 4a 2a 78     xor     r10,r10,r5
  60:   7c 63 42 78     xor     r3,r3,r8
  64:   90 e9 00 00     stw     r7,0(r9)
  68:   90 c9 00 04     stw     r6,4(r9)
  6c:   91 09 00 08     stw     r8,8(r9)
  70:   91 49 00 0c     stw     r10,12(r9)
  74:   7c 63 52 78     xor     r3,r3,r10
  78:   4e 80 00 20     blr

Among those, 8 instructions (four lwz, four stw) read or write the state on the stack. They of course disappear once the function is inlined.


It'd also be
nice to have some brief analysis of other call sites to have
confirmation this isn't blowing up other users.

I compiled defconfig before and after this patch on arm64 and compared the text
sizes:

$ ./scripts/bloat-o-meter -t vmlinux.before vmlinux.after
add/remove: 3/4 grow/shrink: 4/1 up/down: 836/-128 (708)
Function                                     old     new   delta
prandom_seed_full_state                      364     932    +568
pick_next_task_fair                         1940    2036     +96
bpf_user_rnd_u32                             104     196     +92
prandom_bytes_state                          204     260     +56
e843419@0f2b_00012d69_e34                      -       8      +8
e843419@0db7_00010ec3_23ec                     -       8      +8
e843419@02cb_00003767_25c                      -       8      +8
bpf_prog_select_runtime                      448     444      -4
e843419@0aa3_0000cfd1_1580                     8       -      -8
e843419@0aa2_0000cfba_147c                     8       -      -8
e843419@075f_00008d8c_184                      8       -      -8
prandom_u32_state                            100       -    -100
Total: Before=19078072, After=19078780, chg +0.00%

So 708 bytes more after inlining. The main cost is prandom_seed_full_state(),
which calls prandom_u32_state() 10 times (via prandom_warmup()). I expect we
could turn that into a loop to reduce ~450 bytes overall.

With the following change, the growth of prandom_seed_full_state() remains reasonable, and performance-wise it is a lot better because it avoids reading and writing the state via the stack:

diff --git a/lib/random32.c b/lib/random32.c
index 24e7acd9343f6..28a5b109c9018 100644
--- a/lib/random32.c
+++ b/lib/random32.c
@@ -94,17 +94,11 @@ EXPORT_SYMBOL(prandom_bytes_state);

 static void prandom_warmup(struct rnd_state *state)
 {
+       int i;
+
        /* Calling RNG ten times to satisfy recurrence condition */
-       prandom_u32_state(state);
-       prandom_u32_state(state);
-       prandom_u32_state(state);
-       prandom_u32_state(state);
-       prandom_u32_state(state);
-       prandom_u32_state(state);
-       prandom_u32_state(state);
-       prandom_u32_state(state);
-       prandom_u32_state(state);
-       prandom_u32_state(state);
+       for (i = 0; i < 10; i++)
+               prandom_u32_state(state);
 }

 void prandom_seed_full_state(struct rnd_state __percpu *pcpu_state)

The loop is:

 248:   38 e0 00 0a     li      r7,10
 24c:   7c e9 03 a6     mtctr   r7
 250:   55 05 30 32     slwi    r5,r8,6
 254:   55 46 68 24     slwi    r6,r10,13
 258:   55 27 18 38     slwi    r7,r9,3
 25c:   7c a5 42 78     xor     r5,r5,r8
 260:   7c c6 52 78     xor     r6,r6,r10
 264:   7c e7 4a 78     xor     r7,r7,r9
 268:   54 8b 10 3a     slwi    r11,r4,2
 26c:   7d 60 22 78     xor     r0,r11,r4
 270:   54 a5 9b 7e     srwi    r5,r5,13
 274:   55 08 90 18     rlwinm  r8,r8,18,0,12
 278:   54 c6 5d 7e     srwi    r6,r6,21
 27c:   55 4a 38 28     rlwinm  r10,r10,7,0,20
 280:   54 e7 a3 3e     srwi    r7,r7,12
 284:   55 29 68 16     rlwinm  r9,r9,13,0,11
 288:   7d 64 5b 78     mr      r4,r11
 28c:   7c a8 42 78     xor     r8,r5,r8
 290:   7c ca 52 78     xor     r10,r6,r10
 294:   7c e9 4a 78     xor     r9,r7,r9
 298:   50 04 2e fe     rlwimi  r4,r0,5,27,31
 29c:   42 00 ff b4     bdnz    250 <prandom_seed_full_state+0x7c>

This replaces the 10 calls to prandom_u32_state():

  fc:   91 3f 00 0c     stw     r9,12(r31)
 100:   7f e3 fb 78     mr      r3,r31
 104:   48 00 00 01     bl      104 <prandom_seed_full_state+0x88>
                        104: R_PPC_REL24        prandom_u32_state
 108:   7f e3 fb 78     mr      r3,r31
 10c:   48 00 00 01     bl      10c <prandom_seed_full_state+0x90>
                        10c: R_PPC_REL24        prandom_u32_state
 110:   7f e3 fb 78     mr      r3,r31
 114:   48 00 00 01     bl      114 <prandom_seed_full_state+0x98>
                        114: R_PPC_REL24        prandom_u32_state
 118:   7f e3 fb 78     mr      r3,r31
 11c:   48 00 00 01     bl      11c <prandom_seed_full_state+0xa0>
                        11c: R_PPC_REL24        prandom_u32_state
 120:   7f e3 fb 78     mr      r3,r31
 124:   48 00 00 01     bl      124 <prandom_seed_full_state+0xa8>
                        124: R_PPC_REL24        prandom_u32_state
 128:   7f e3 fb 78     mr      r3,r31
 12c:   48 00 00 01     bl      12c <prandom_seed_full_state+0xb0>
                        12c: R_PPC_REL24        prandom_u32_state
 130:   7f e3 fb 78     mr      r3,r31
 134:   48 00 00 01     bl      134 <prandom_seed_full_state+0xb8>
                        134: R_PPC_REL24        prandom_u32_state
 138:   7f e3 fb 78     mr      r3,r31
 13c:   48 00 00 01     bl      13c <prandom_seed_full_state+0xc0>
                        13c: R_PPC_REL24        prandom_u32_state
 140:   7f e3 fb 78     mr      r3,r31
 144:   48 00 00 01     bl      144 <prandom_seed_full_state+0xc8>
                        144: R_PPC_REL24        prandom_u32_state
 148:   80 01 00 24     lwz     r0,36(r1)
 14c:   7f e3 fb 78     mr      r3,r31
 150:   83 e1 00 1c     lwz     r31,28(r1)
 154:   7c 08 03 a6     mtlr    r0
 158:   38 21 00 20     addi    r1,r1,32
 15c:   48 00 00 00     b       15c <prandom_seed_full_state+0xe0>
                        15c: R_PPC_REL24        prandom_u32_state


So roughly the same code size in number of instructions, with better performance.
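
For anyone who wants to experiment outside the kernel, here is a minimal userspace sketch of the same idea. It is not the code from lib/random32.c: the names rnd_state_sketch, prandom_u32_state_sketch(), warmup_unrolled(), warmup_loop() and warmup_forms_agree() are made up for illustration, and the shift and mask constants are what I believe matches the listing above (6/13/18, 2/27/2, 13/21/7, 3/12/13), so please double-check them against the tree. It only shows that the looped warm-up leaves the state exactly as ten back-to-back calls do.

```c
#include <stdint.h>

/* Hypothetical userspace stand-in for struct rnd_state: four 32-bit words. */
struct rnd_state_sketch {
	uint32_t s1, s2, s3, s4;
};

/* One generator component: a masked left shift XORed with a shifted XOR. */
#define TAUSWORTHE(s, a, b, c, d) \
	((uint32_t)((((s) & (c)) << (d)) ^ ((((s) << (a)) ^ (s)) >> (b))))

/*
 * static inline, so the compiler is free to keep the four state words in
 * registers across consecutive steps instead of reloading them each time.
 * Constants are assumed to match the shifts and masks seen in the listing.
 */
static inline uint32_t prandom_u32_state_sketch(struct rnd_state_sketch *st)
{
	st->s1 = TAUSWORTHE(st->s1,  6U, 13U, 4294967294U, 18U);
	st->s2 = TAUSWORTHE(st->s2,  2U, 27U, 4294967288U,  2U);
	st->s3 = TAUSWORTHE(st->s3, 13U, 21U, 4294967280U,  7U);
	st->s4 = TAUSWORTHE(st->s4,  3U, 12U, 4294967168U, 13U);
	return st->s1 ^ st->s2 ^ st->s3 ^ st->s4;
}

/* Warm-up written as ten back-to-back calls (the form before the change). */
static void warmup_unrolled(struct rnd_state_sketch *st)
{
	prandom_u32_state_sketch(st);
	prandom_u32_state_sketch(st);
	prandom_u32_state_sketch(st);
	prandom_u32_state_sketch(st);
	prandom_u32_state_sketch(st);
	prandom_u32_state_sketch(st);
	prandom_u32_state_sketch(st);
	prandom_u32_state_sketch(st);
	prandom_u32_state_sketch(st);
	prandom_u32_state_sketch(st);
}

/* Warm-up written as a counted loop (the form after the change). */
static void warmup_loop(struct rnd_state_sketch *st)
{
	int i;

	for (i = 0; i < 10; i++)
		prandom_u32_state_sketch(st);
}

/* Returns 1 if both warm-up forms leave the state identical, else 0. */
static int warmup_forms_agree(struct rnd_state_sketch seed)
{
	struct rnd_state_sketch a = seed, b = seed;

	warmup_unrolled(&a);
	warmup_loop(&b);
	return a.s1 == b.s1 && a.s2 == b.s2 && a.s3 == b.s3 && a.s4 == b.s4;
}
```

Building this with optimisation and disassembling warmup_loop() should make it possible to check whether a given compiler keeps the state in registers inside the loop, as in the powerpc output above; that will depend on compiler and flags.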

I'm not really sure if 708 is good or bad...

That's in the noise compared to the overall size of vmlinux, but if we change it to a loop we also reduce pressure on the cache.

Christophe
