I had a chance to dig into this further and think I now know what's going on.
I ran the test on an AMD Seattle system (same as arm-arm-04) and noticed that the closures test (tests/refcount/closures) didn't finish in >150 minutes. Profiling the test with `perf` pointed to g_closure_ref() and g_closure_unref() as taking the bulk of the execution time. This confirms Emilio's suspicion in #20.

```
$ perf report
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3M of event 'cycles:uppp'
# Event count (approx.): 1497293182208
#
# Overhead  Command   Shared Object               Symbol
# ........  ........  ..........................  ............................................
#
    67.85%  closures  libgobject-2.0.so.0.5400.2  [.] g_closure_ref
    29.10%  closures  libgobject-2.0.so.0.5400.2  [.] g_closure_unref
     2.45%  closures  libgobject-2.0.so.0.5400.2  [.] g_closure_invoke
     0.22%  closures  closures                    [.] test_closure
     0.11%  closures  closures                    [.] g_closure_unref@plt
```

Copying the annotated profile of g_closure_ref(). The first column is the % of samples at each instruction as observed by `perf record`:

```
        │      INC_ASSIGN (closure, ref_count, &new_ref_count);
   0.20 │ 30:  ldr    w1, [x19]
   0.31 │      str    w1, [x29, #32]
   1.06 │      ldr    w0, [x29, #32]
   0.32 │      ldr    w0, [x29, #32]
   0.24 │      ldr    w2, [x29, #32]
        │      add    w0, w0, #0x1          <------- A
        │      bfxil  w2, w0, #0, #15
   0.13 │      str    w2, [x29, #32]
   2.35 │      ldr    w0, [x29, #32]
   0.46 │      ldr    w2, [x29, #32]
        │      and    w0, w0, #0x7fff
  54.73 │ 5c:  ldxr   w3, [x19]             <------- B
        │      cmp    w3, w1
   0.02 │    ↓ b.ne   70
   0.97 │      stlxr  w4, w2, [x19]
        │    ↑ cbnz   w4, 5c
  38.56 │ 70:  dmb    ish
        │    ↑ b.ne   30
        │      g_return_val_if_fail (new_ref_count > 1, NULL);
   0.00 │      cmp    w0, #0x1
        │    ↓ b.ls   c8
        │
        │      return closure;
```

The ldxr and the dmb instructions together account for ~93% of all samples. The generated code also looks quite bad; notice the redundant loads to w0/w2. It's as if the code were compiled with optimisations turned off, which is not the case (it's compiled with -O2).

The duplicate loads also open up a large window between incrementing the reference count [A] and attempting the compare-and-swap starting at [B]. Another thread performing the same accesses between A and B will cause the cache line to bounce between the two cores and the ldxr/stlxr sequence to fail, restarting the loop.

Aside: the x86 assembly for ref/unref also has duplicate loads, but x86 doesn't seem to mind them much.

On further inspection, I noticed the following definition of the GClosure structure:

```c
struct _GClosure
{
  /*< private >*/
  volatile guint ref_count : 15;
  /* meta_marshal is not used anymore but must be zero for historical
   * reasons as it was exposed in the G_CLOSURE_N_NOTIFIERS macro */
  volatile guint meta_marshal_nouse : 1;
  volatile guint n_guards : 1;
  volatile guint n_fnotifiers : 2;  /* finalization notifiers */
  volatile guint n_inotifiers : 8;  /* invalidation notifiers */
  volatile guint in_inotify : 1;
  volatile guint floating : 1;
  /*< protected >*/
  volatile guint derivative_flag : 1;
  /*< public >*/
  volatile guint in_marshal : 1;
  volatile guint is_invalid : 1;
  ...
}
```

Specifically, ref_count is defined as a volatile bit field. That, coupled with the implementation of the increment, explains the bad assembly sequence generated for g_closure_ref()/g_closure_unref(). AFAICT there is no need for ref_count (or the other bit fields in the structure) to be volatile, especially as they seem to be updated with gcc __sync builtins.

I have a few ideas on how to improve the code by re-factoring INC_ASSIGN/DEC_ASSIGN to use specialised versions of CHANGE_FIELD in gobject/gclosure.c; a rough sketch of the direction is below.
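To illustrate the idea (this is a minimal sketch, not the actual gclosure.c patch): read the whole bit-field word once into a local, splice the incremented count into a copy, and let a single compare-and-swap publish the update and detect races. The `FakeClosure` struct, `CLOSURE_REF_MASK` and `fake_closure_ref()` names below are made up for illustration; the only GLib calls assumed are g_atomic_int_get() and g_atomic_int_compare_and_exchange().

```c
#include <glib.h>

#define CLOSURE_REF_MASK 0x7fffu   /* low 15 bits of the word hold ref_count */

/* Simplified stand-in for the first bit-field word of GClosure; the name and
 * layout are assumptions for illustration, not the real structure. */
typedef struct {
  guint bits;   /* ref_count in bits 0..14, remaining flag bits above it */
} FakeClosure;

static guint
fake_closure_ref (FakeClosure *closure)
{
  guint old_bits, new_bits, new_ref_count;

  do
    {
      /* One atomic load of the whole word; with no volatile bit field the
       * compiler can keep these values in registers instead of repeatedly
       * re-reading a stack slot. */
      old_bits = g_atomic_int_get (&closure->bits);
      new_ref_count = (old_bits & CLOSURE_REF_MASK) + 1;
      /* Preserve the non-refcount flag bits and splice in the new count. */
      new_bits = (old_bits & ~CLOSURE_REF_MASK) | (new_ref_count & CLOSURE_REF_MASK);
    }
  while (!g_atomic_int_compare_and_exchange (&closure->bits, old_bits, new_bits));

  return new_ref_count;
}

int
main (void)
{
  FakeClosure c = { 1 };   /* ref_count starts at 1, all other bits clear */
  g_print ("new ref_count: %u\n", fake_closure_ref (&c));   /* expect 2 */
  return 0;
}
```

With no volatile bit field involved, the compiler should be able to drop the redundant loads and shrink the window between A and B; whether that alone is enough to avoid the cache-line ping-pong on the A57s would need measuring.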
I'll give those a shot and see if they lead to an improvement in performance. Happy to discuss any suggestions or to try patches and report back. It might also be worth raising/discussing these findings with upstream; I am not at all familiar with glib, and they might be able to help explain the code structure and have suggestions for improvement.

Since I hadn't dug into this enough... ;) I also ran a few iterations of the test on some other ARM64 platforms, Juno R2 and Juno R0 (the latter is the same as arm-arm-01). On the R2 the test always finished in a couple of minutes. On the Juno R0 it sometimes finished in a couple of minutes, but sometimes got tied up in knots and was still running after 15 minutes, at which point I terminated it. After some experiments restricting which cores the test ran on (using taskset), I concluded that the test took a long time when it ran on the A57s (incidentally the same cores as on the AMD Seattle system) but not when it ran on the A53s. So it seems the micro-architectural behaviour combined with the poor code sequence makes this especially painful, while other cores chug through it without breaking a sweat.