On Wed, May 20, 2020 at 7:09 PM Rich Felker <dal...@libc.org> wrote:
>
> On Wed, May 20, 2020 at 12:08:10PM -0400, Rich Felker wrote:
> > On Wed, May 20, 2020 at 04:41:29PM +0100, Szabolcs Nagy wrote:
> > > The 05/19/2020 22:31, Arnd Bergmann wrote:
> > > > On Tue, May 19, 2020 at 10:24 PM Adhemerval Zanella
> > > > <adhemerval.zane...@linaro.org> wrote:
> > > > > On 19/05/2020 16:54, Arnd Bergmann wrote:
> > > note: i could not reproduce it in qemu-system with these configs:
> > >
> > > qemu-system-aarch64 + arm64 kernel + compat vdso
> > > qemu-system-aarch64 + kvm accel (on cortex-a72) + 32bit arm kernel
> > > qemu-system-arm + cpu max + 32bit arm kernel
> > >
> > > so i think it's something specific to that user's setup
> > > (maybe rpi hw bug or gcc miscompiled the vdso or something
> > > with that particular linux, i built my own linux 5.6 because
> > > i did not know the exact kernel version where the bug was seen)
> > >
> > > i don't have access to rpi (or other cortex-a53 where i
> > > can install my own kernel) so this is as far as i got.
> >
> > If we have a binary of the kernel that's known to be failing on the
> > hardware, it would be useful to dump its vdso and examine the
> > disassembly to see if it was miscompiled.
>
> OK, OP posted it and I think we've solved this. See
> https://github.com/richfelker/musl-cross-make/issues/96#issuecomment-631604410

Thanks a lot everyone for figuring this out.

> And my analysis:
>
> <@dalias> see what i just found on the tracker
> <@dalias> patch_vdso/vdso_nullpatch_one in arch/arm/kernel/vdso.c patches out
> the time32 functions in this case
> <@dalias> but not the time64 one
> <@dalias> this looks like a real kernel bug that's not hw-specific except
> breaking on all hardware where the patching-out is needed
> <@dalias> we could possibly work around it by refusing to use the time64 vdso
> unless the time32 one is also present
> <@dalias> yep
> <@dalias> so i think we've solved this. the kernel thought it wasnt using
> vdso anymore because it patched it out
> <@dalias> but it forgot to patch out the time64 one
> <@dalias> so it stopped updating the data needed for vdso to work

As you mentioned in the issue tracker, the patching was meant as
an optimization. Missing it for clock_gettime64 was a mistake, but
that by itself should not have caused incorrect data to be returned.
I would assume that there is another bug that leads to clock_gettime64
not entering the syscall fallback path as it should, and instead
returning bogus data.

Here are some more things I found:

- From reading the linux-5.6 code that was tested, I see that a condition
  that leads to patching out the clock_gettime() vdso should also lead
  to clock_gettime64() falling back to the syscall after
  __arch_get_hw_counter() returns an error, but for some reason that
  does not happen. Presumably the presence of the patching meant that
  this code path was never much exercised. A missing 45939ce292b4
  ("ARM: 8957/1: VDSO: Match ARMv8 timer in cntvct_functional()")
  would explain the problem, if it happened on linux-5.6-rc7 or earlier.
  The fix was merged in the final v5.6 though.

- The patching may actually be counterproductive because it means that
  clock_gettime(CLOCK_*COARSE, ...) has to go through the system call
  when it could just return the time of the last timer tick regardless
  of the clocksource.

- We may get bitten by errata handling on 32-bit kernels running on
  64-bit hardware that has errata workarounds in arch/arm64 for compat
  mode but not in native arm kernels. ARM64_ERRATUM_1418040,
  ARM64_ERRATUM_858921 or SUN50I_ERRATUM_UNKNOWN1 are examples of
  workarounds that are not used on 32-bit kernels running on 64-bit
  hardware.

      Arnd
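
For illustration, the userspace workaround suggested in the IRC log above
(refusing to use the time64 vdso unless the time32 one is also present)
could be sketched roughly as follows. This is a minimal sketch, not musl's
actual code: vdso_lookup_fn, pick_clock_gettime64() and the two fake
resolvers are hypothetical names made up for this example, and the symbol
names __vdso_clock_gettime and __vdso_clock_gettime64 are assumed to be the
ones exported by the 32-bit arm vdso.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical resolver type: returns the address of a vdso symbol,
 * or NULL if the symbol is absent (for example, patched out). */
typedef void *(*vdso_lookup_fn)(const char *name);

/* Only hand back the time64 entry point when the time32 one is also
 * present; otherwise return NULL so the caller uses the syscall. */
static void *pick_clock_gettime64(vdso_lookup_fn lookup)
{
    void *cgt32 = lookup("__vdso_clock_gettime");
    void *cgt64 = lookup("__vdso_clock_gettime64");

    if (!cgt32)
        return NULL;
    return cgt64;
}

/* Fake resolvers, for illustration only. */
static int dummy32, dummy64;

/* Models a vdso where both entry points are present. */
static void *lookup_full(const char *name)
{
    if (!strcmp(name, "__vdso_clock_gettime"))
        return &dummy32;
    if (!strcmp(name, "__vdso_clock_gettime64"))
        return &dummy64;
    return NULL;
}

/* Models the buggy case: time32 patched out, time64 left behind. */
static void *lookup_patched(const char *name)
{
    if (!strcmp(name, "__vdso_clock_gettime64"))
        return &dummy64;
    return NULL;
}
```

With lookup_patched, pick_clock_gettime64() returns NULL, so a caller would
take the syscall path; with lookup_full it returns the time64 entry point.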