Kirill,

in an unrelated context I've stumbled across a change of yours from Aug 2014 (revision 213847) where you "extend" the ways of loading zeros into registers. I don't understand why this was done, and the patch submission mail doesn't give any reason either.

My point is that plain VEX-encoded vxorps/vxorpd/vpxor with 128-bit register operands ought to be sufficient to zero registers of any width, since these instructions zero the upper parts of the destination register. Hence, by using EVEX-encoded insns, it looks like all you do is grow the instruction length by one or two bytes (besides making the source somewhat more complicated to follow). At the very least the shorter variants should be used for -Os, IMO.
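
To make the size argument concrete, here is a small sketch that just compares encoding lengths. The byte sequences are hand-assembled for registers 0 and 8 only (they are my assumption, not assembler output, so please double-check them against gas), and the table and helper names are made up for illustration.

```python
# Hand-assembled encodings (assumed, not taken from assembler output).
# AT&T operand order; only registers 0 and 8 are covered here.
ENCODINGS = {
    "vxorps %xmm0, %xmm0, %xmm0 (VEX, 2-byte prefix)": bytes.fromhex("c5f857c0"),
    "vpxor  %xmm0, %xmm0, %xmm0 (VEX, 2-byte prefix)": bytes.fromhex("c5f9efc0"),
    "vpxord %zmm0, %zmm0, %zmm0 (EVEX)": bytes.fromhex("62f17d48efc0"),
    "vxorps %xmm8, %xmm8, %xmm8 (VEX, 3-byte prefix)": bytes.fromhex("c4413857c0"),
    "vpxord %zmm8, %zmm8, %zmm8 (EVEX)": bytes.fromhex("62513d48efc0"),
}


def encoded_lengths(encodings=ENCODINGS):
    """Map each instruction description to its encoded length in bytes."""
    return {insn: len(code) for insn, code in encodings.items()}


if __name__ == "__main__":
    for insn, length in encoded_lengths().items():
        print(f"{length} bytes: {insn}")
```

If those encodings are right, the EVEX forms cost two extra bytes where the 2-byte VEX prefix suffices and one extra byte where the 3-byte VEX prefix is needed.
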
Thanks for any insight,
Jan