When that option is enabled, STACK_BOUNDARY is set to 64. But when you look at arm_expand_prologue, it appears that very little effort is made to respect that alignment. Three specific cases I see are the IS_NESTED case of pushing ip_rtx, the lack of any check on the size of args_to_push, and the absence of any attempt to ensure that an even number of registers is saved. But there may well be other cases I haven't found.
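To make the arithmetic behind those three cases concrete, here is a minimal standalone sketch in C. It is not code from arm_expand_prologue and does not use any GCC internals; every name in it (count_saved_regs, padding_for, prologue_padding, and the two constants) is hypothetical. It assumes 4-byte registers, a byte count for the pushed arguments, and a save mask in which bit n stands for register rn. It only shows how much padding would be needed to keep the pushed total a multiple of 8 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants: a 64-bit stack boundary and 32-bit registers. */
#define STACK_ALIGN_BYTES 8u
#define WORD_BYTES 4u

/* Count the registers named in a save mask (bit n set = register rn saved). */
static unsigned count_saved_regs(uint32_t mask)
{
    unsigned n = 0;
    while (mask) {
        mask &= mask - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Bytes of padding needed to round total_bytes up to a multiple of align.
   align must be a power of two. */
static unsigned padding_for(unsigned total_bytes, unsigned align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (align - (total_bytes & (align - 1))) & (align - 1);
}

/* Padding needed after pushing the argument bytes, the saved registers,
   and an optional extra word (such as ip in a nested function). */
static unsigned prologue_padding(unsigned args_to_push_bytes,
                                 uint32_t save_mask,
                                 int pushes_extra_word)
{
    unsigned total = args_to_push_bytes
                     + WORD_BYTES * count_saved_regs(save_mask)
                     + (pushes_extra_word ? WORD_BYTES : 0u);
    return padding_for(total, STACK_ALIGN_BYTES);
}
```

Under these assumptions, saving two registers needs no padding, saving three needs 4 bytes, and an otherwise aligned frame that also pushes one extra word needs 4 bytes again. That is why an odd register count, an unchecked args_to_push size, or the extra ip push can each break the 64-bit boundary on their own.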
I'm not familiar enough with that machine's ABI to know how these should be changed. Does anybody who knows the ABI know how to fix this?