On 02/01/2026 at 15:09, Ryan Roberts wrote:
On 02/01/2026 13:39, Jason A. Donenfeld wrote:
Hi Ryan,
On Fri, Jan 2, 2026 at 2:12 PM Ryan Roberts <[email protected]> wrote:
context. Given the function is just a handful of operations and doesn't
How many? What's this looking like in terms of assembly?
25 instructions on arm64:
31 instructions on powerpc:
00000000 <prandom_u32_state>:
0: 7c 69 1b 78 mr r9,r3
4: 80 63 00 00 lwz r3,0(r3)
8: 80 89 00 08 lwz r4,8(r9)
c: 81 69 00 04 lwz r11,4(r9)
10: 80 a9 00 0c lwz r5,12(r9)
14: 54 67 30 32 slwi r7,r3,6
18: 7c e7 1a 78 xor r7,r7,r3
1c: 55 66 10 3a slwi r6,r11,2
20: 54 88 68 24 slwi r8,r4,13
24: 54 63 90 18 rlwinm r3,r3,18,0,12
28: 7d 6b 32 78 xor r11,r11,r6
2c: 7d 08 22 78 xor r8,r8,r4
30: 54 aa 18 38 slwi r10,r5,3
34: 54 e7 9b 7e srwi r7,r7,13
38: 7c e7 1a 78 xor r7,r7,r3
3c: 51 66 2e fe rlwimi r6,r11,5,27,31
40: 54 84 38 28 rlwinm r4,r4,7,0,20
44: 7d 4a 2a 78 xor r10,r10,r5
48: 55 08 5d 7e srwi r8,r8,21
4c: 7d 08 22 78 xor r8,r8,r4
50: 7c e3 32 78 xor r3,r7,r6
54: 54 a5 68 16 rlwinm r5,r5,13,0,11
58: 55 4a a3 3e srwi r10,r10,12
5c: 7d 4a 2a 78 xor r10,r10,r5
60: 7c 63 42 78 xor r3,r3,r8
64: 90 e9 00 00 stw r7,0(r9)
68: 90 c9 00 04 stw r6,4(r9)
6c: 91 09 00 08 stw r8,8(r9)
70: 91 49 00 0c stw r10,12(r9)
74: 7c 63 52 78 xor r3,r3,r10
78: 4e 80 00 20 blr
Of those, 8 instructions only read and write the state on the stack.
They of course disappear once the function is inlined.
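For reference, the listing above is one step of the four-component
Tausworthe generator. A minimal user-space sketch follows, assuming the
constants used in lib/random32.c (they agree with the shift and mask values
visible in the disassembly). The names rnd_state_sketch, tausworthe() and
prandom_u32_state_sketch() are illustrative only and are not kernel API.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for the kernel's struct rnd_state. */
struct rnd_state_sketch {
	uint32_t s1, s2, s3, s4;
};

/* One Tausworthe component: ((s & c) << d) ^ (((s << a) ^ s) >> b) */
static inline uint32_t tausworthe(uint32_t s, unsigned int a, unsigned int b,
				  uint32_t c, unsigned int d)
{
	/* Shift counts must stay below the 32-bit word width. */
	assert(a < 32 && b < 32 && d < 32);
	return ((s & c) << d) ^ (((s << a) ^ s) >> b);
}

/* Four loads, the arithmetic, four stores, then the XOR of all words. */
static inline uint32_t prandom_u32_state_sketch(struct rnd_state_sketch *st)
{
	st->s1 = tausworthe(st->s1,  6, 13, 0xfffffffeU, 18);
	st->s2 = tausworthe(st->s2,  2, 27, 0xfffffff8U,  2);
	st->s3 = tausworthe(st->s3, 13, 21, 0xfffffff0U,  7);
	st->s4 = tausworthe(st->s4,  3, 12, 0xffffff80U, 13);

	return st->s1 ^ st->s2 ^ st->s3 ^ st->s4;
}
```

The four loads and four stores through st are the 8 lwz/stw instructions
counted above; the rest is shift/mask/xor arithmetic.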
It'd also be
nice to have some brief analysis of other call sites to have
confirmation this isn't blowing up other users.
I compiled defconfig before and after this patch on arm64 and compared the text
sizes:
$ ./scripts/bloat-o-meter -t vmlinux.before vmlinux.after
add/remove: 3/4 grow/shrink: 4/1 up/down: 836/-128 (708)
Function old new delta
prandom_seed_full_state 364 932 +568
pick_next_task_fair 1940 2036 +96
bpf_user_rnd_u32 104 196 +92
prandom_bytes_state 204 260 +56
e843419@0f2b_00012d69_e34 - 8 +8
e843419@0db7_00010ec3_23ec - 8 +8
e843419@02cb_00003767_25c - 8 +8
bpf_prog_select_runtime 448 444 -4
e843419@0aa3_0000cfd1_1580 8 - -8
e843419@0aa2_0000cfba_147c 8 - -8
e843419@075f_00008d8c_184 8 - -8
prandom_u32_state 100 - -100
Total: Before=19078072, After=19078780, chg +0.00%
So 708 bytes more after inlining. The main cost is prandom_seed_full_state(),
which calls prandom_u32_state() 10 times (via prandom_warmup()). I expect we
could turn that into a loop to reduce ~450 bytes overall.
With the following change, the increase in prandom_seed_full_state() remains
reasonable, and performance-wise it is a lot better, as it avoids reading and
writing the state via the stack:
diff --git a/lib/random32.c b/lib/random32.c
index 24e7acd9343f6..28a5b109c9018 100644
--- a/lib/random32.c
+++ b/lib/random32.c
@@ -94,17 +94,11 @@ EXPORT_SYMBOL(prandom_bytes_state);
static void prandom_warmup(struct rnd_state *state)
{
+ int i;
+
/* Calling RNG ten times to satisfy recurrence condition */
- prandom_u32_state(state);
- prandom_u32_state(state);
- prandom_u32_state(state);
- prandom_u32_state(state);
- prandom_u32_state(state);
- prandom_u32_state(state);
- prandom_u32_state(state);
- prandom_u32_state(state);
- prandom_u32_state(state);
- prandom_u32_state(state);
+ for (i = 0; i < 10; i++)
+ prandom_u32_state(state);
}
void prandom_seed_full_state(struct rnd_state __percpu *pcpu_state)
The loop is:
248: 38 e0 00 0a li r7,10
24c: 7c e9 03 a6 mtctr r7
250: 55 05 30 32 slwi r5,r8,6
254: 55 46 68 24 slwi r6,r10,13
258: 55 27 18 38 slwi r7,r9,3
25c: 7c a5 42 78 xor r5,r5,r8
260: 7c c6 52 78 xor r6,r6,r10
264: 7c e7 4a 78 xor r7,r7,r9
268: 54 8b 10 3a slwi r11,r4,2
26c: 7d 60 22 78 xor r0,r11,r4
270: 54 a5 9b 7e srwi r5,r5,13
274: 55 08 90 18 rlwinm r8,r8,18,0,12
278: 54 c6 5d 7e srwi r6,r6,21
27c: 55 4a 38 28 rlwinm r10,r10,7,0,20
280: 54 e7 a3 3e srwi r7,r7,12
284: 55 29 68 16 rlwinm r9,r9,13,0,11
288: 7d 64 5b 78 mr r4,r11
28c: 7c a8 42 78 xor r8,r5,r8
290: 7c ca 52 78 xor r10,r6,r10
294: 7c e9 4a 78 xor r9,r7,r9
298: 50 04 2e fe rlwimi r4,r0,5,27,31
29c: 42 00 ff b4 bdnz 250 <prandom_seed_full_state+0x7c>
This replaces the 10 calls to prandom_u32_state():
fc: 91 3f 00 0c stw r9,12(r31)
100: 7f e3 fb 78 mr r3,r31
104: 48 00 00 01 bl 104 <prandom_seed_full_state+0x88>
104: R_PPC_REL24 prandom_u32_state
108: 7f e3 fb 78 mr r3,r31
10c: 48 00 00 01 bl 10c <prandom_seed_full_state+0x90>
10c: R_PPC_REL24 prandom_u32_state
110: 7f e3 fb 78 mr r3,r31
114: 48 00 00 01 bl 114 <prandom_seed_full_state+0x98>
114: R_PPC_REL24 prandom_u32_state
118: 7f e3 fb 78 mr r3,r31
11c: 48 00 00 01 bl 11c <prandom_seed_full_state+0xa0>
11c: R_PPC_REL24 prandom_u32_state
120: 7f e3 fb 78 mr r3,r31
124: 48 00 00 01 bl 124 <prandom_seed_full_state+0xa8>
124: R_PPC_REL24 prandom_u32_state
128: 7f e3 fb 78 mr r3,r31
12c: 48 00 00 01 bl 12c <prandom_seed_full_state+0xb0>
12c: R_PPC_REL24 prandom_u32_state
130: 7f e3 fb 78 mr r3,r31
134: 48 00 00 01 bl 134 <prandom_seed_full_state+0xb8>
134: R_PPC_REL24 prandom_u32_state
138: 7f e3 fb 78 mr r3,r31
13c: 48 00 00 01 bl 13c <prandom_seed_full_state+0xc0>
13c: R_PPC_REL24 prandom_u32_state
140: 7f e3 fb 78 mr r3,r31
144: 48 00 00 01 bl 144 <prandom_seed_full_state+0xc8>
144: R_PPC_REL24 prandom_u32_state
148: 80 01 00 24 lwz r0,36(r1)
14c: 7f e3 fb 78 mr r3,r31
150: 83 e1 00 1c lwz r31,28(r1)
154: 7c 08 03 a6 mtlr r0
158: 38 21 00 20 addi r1,r1,32
15c: 48 00 00 00 b 15c <prandom_seed_full_state+0xe0>
15c: R_PPC_REL24 prandom_u32_state
So the size is approximately the same in number of instructions, with better
performance.
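The effect of the loop can be sketched in user-space C. Once the step is
inlined, the four state words can stay in registers across all ten
iterations and be written back once, which is what the bdnz loop above does.
This is a hand-written illustration of that transformation, not compiler
output; every name ending in _sketch is hypothetical, and the constants are
assumed to match lib/random32.c.

```c
#include <assert.h>
#include <stdint.h>

struct rnd_state_sketch {
	uint32_t s1, s2, s3, s4;
};

#define TAUSWORTHE_SKETCH(s, a, b, c, d) \
	((((s) & (c)) << (d)) ^ ((((s) << (a)) ^ (s)) >> (b)))

/* Out-of-line style: each call loads and stores all four state words. */
static uint32_t step_sketch(struct rnd_state_sketch *st)
{
	st->s1 = TAUSWORTHE_SKETCH(st->s1,  6U, 13U, 0xfffffffeU, 18U);
	st->s2 = TAUSWORTHE_SKETCH(st->s2,  2U, 27U, 0xfffffff8U,  2U);
	st->s3 = TAUSWORTHE_SKETCH(st->s3, 13U, 21U, 0xfffffff0U,  7U);
	st->s4 = TAUSWORTHE_SKETCH(st->s4,  3U, 12U, 0xffffff80U, 13U);
	return st->s1 ^ st->s2 ^ st->s3 ^ st->s4;
}

static void warmup_calls_sketch(struct rnd_state_sketch *st)
{
	for (int i = 0; i < 10; i++)
		step_sketch(st);
}

/* Inlined style: state lives in locals, one load and one store in total. */
static void warmup_inlined_sketch(struct rnd_state_sketch *st)
{
	uint32_t s1 = st->s1, s2 = st->s2, s3 = st->s3, s4 = st->s4;

	for (int i = 0; i < 10; i++) {
		s1 = TAUSWORTHE_SKETCH(s1,  6U, 13U, 0xfffffffeU, 18U);
		s2 = TAUSWORTHE_SKETCH(s2,  2U, 27U, 0xfffffff8U,  2U);
		s3 = TAUSWORTHE_SKETCH(s3, 13U, 21U, 0xfffffff0U,  7U);
		s4 = TAUSWORTHE_SKETCH(s4,  3U, 12U, 0xffffff80U, 13U);
	}

	st->s1 = s1;
	st->s2 = s2;
	st->s3 = s3;
	st->s4 = s4;
}

/* Aborts if the two warmup variants ever produce different state. */
static void check_warmup_sketch(void)
{
	struct rnd_state_sketch a = { 12345U, 67890U, 13579U, 24680U };
	struct rnd_state_sketch b = a;

	warmup_calls_sketch(&a);
	warmup_inlined_sketch(&b);
	assert(a.s1 == b.s1 && a.s2 == b.s2 && a.s3 == b.s3 && a.s4 == b.s4);
}
```

The return value is unused during warmup, so the inlined variant does not
need the final XOR at all, which is also why it is absent from the loop body
in the disassembly.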
I'm not really sure if 708 is good or bad...
That's in the noise compared to the overall size of vmlinux, but if we
change it to a loop we also reduce pressure on the cache.
Christophe