Hi Strager,
I think the solution is for dll's delaylib trampoline to save xmm1 on the stack before calling __delayLoadHelper2. I made a patch which does this, and it fixes the bug for my code.
Thanks very much for taking the time to track down the cause of this problem, and for creating a patch. :-)
See attached patch. I think my patch has two problems: 1. AVX/vmovupd/ymm might not be usable on the target machine, but saving just xmm isn't enough. Should we perform a CPUID check?
A CPUID check would only work if dlltool is run on the same machine, or the same type of machine, as the target machine. A safer solution would probably be to add a new command line option to select the extended trampoline. Then it is up to the user to select the correct trampoline type. To be really paranoid, if the new option is not enabled and dlltool is running on an x86_64 host, it could run a CPUID check and, if the extended registers are available, issue a warning message reminding the user of the possible problem.
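As a rough sketch of the host-side check and warning, assuming a GCC-compatible compiler: the function names and the warning text below are made up for illustration and are not existing dlltool code, and the option that would set `extended_trampoline_requested` does not exist yet.

```c
#include <stdio.h>

/* Hypothetical helper (not existing dlltool code): returns 1 if the
   compiler's runtime CPU detection reports AVX on this host, else 0.
   Non-x86 hosts, and compilers without the builtin, always report 0.  */
static int
host_supports_avx (void)
{
#if defined (__GNUC__) && (defined (__x86_64__) || defined (__i386__))
  __builtin_cpu_init ();
  return __builtin_cpu_supports ("avx") ? 1 : 0;
#else
  return 0;
#endif
}

/* Hypothetical policy: warn only when the user did not ask for the
   extended trampoline but the host has AVX.  Returns 1 if a warning
   was issued, 0 otherwise.  */
static int
maybe_warn_about_trampoline (int extended_trampoline_requested)
{
  if (!extended_trampoline_requested && host_supports_avx ())
    {
      fprintf (stderr,
               "warning: host supports AVX but the extended delay-load "
               "trampoline was not selected\n");
      return 1;
    }
  return 0;
}
```

This only describes the machine running dlltool, not the target, which is why it can be no more than a reminder to the user.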
2. We store unaligned with vmovupd. Storing aligned with vmovapd would be better. I haven't looked into how to align ymm registers when storing on the stack.
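For reference, aligning the save area is mostly a matter of rounding an address down to a 32-byte boundary (in a trampoline this would typically be done on the stack pointer after saving its original value). A minimal sketch of the arithmetic in C, with a made-up helper name:

```c
#include <stdint.h>

/* Illustrative only: round ADDR down to the nearest 32-byte boundary,
   which is the alignment an aligned 256-bit (ymm) store requires.  */
static uintptr_t
align_down_32 (uintptr_t addr)
{
  return addr & ~(uintptr_t) 31;
}
```

For example, `align_down_32 (0x101f)` yields `0x1000`, while an already aligned address such as `0x1020` is returned unchanged.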
"Better" as in better performance, yes ? I think that in this case safer is more important than faster, so sticking with unaligned moves should be OK.
I'd love to get this bug fixed so others don't spend two days debugging assembly code!
Would you be willing to work on the improvements suggested above and submit a revised patch? The catch here is that such a patch would need a copyright assignment from you before we could accept it. The links below should provide more details on this.

Cheers
Nick

https://www.gnu.org/prep/maintain/html_node/Legally-Significant.html#Legally-Significant
https://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob_plain;f=doc/Copyright/request-assign.future;hb=HEAD