On 08/09/15 09:26, James Greenhalgh wrote:
On Tue, Sep 08, 2015 at 09:21:08AM +0100, James Greenhalgh wrote:
On Mon, Sep 07, 2015 at 02:09:01PM +0100, Alan Lawrence wrote:
On 04/09/15 13:32, James Greenhalgh wrote:
In that case, these should be implemented as inline assembly blocks. As it
stands, the code generation for these intrinsics will be very poor with this
patch applied.

I'm going to hold off OKing this until I see a follow-up to fix the code
generation, either replacing those particular intrinsics with inline asm,
or doing the more comprehensive fix in the back-end.

Thanks,
James

In that case, here is the follow-up now ;). This fixes each of the following
functions to generate a single instruction followed by ret:
   * vld1_dup_f16, vld1q_dup_f16
   * vset_lane_f16, vsetq_lane_f16
   * vget_lane_f16, vgetq_lane_f16
   * For IN of type either float16x4_t or float16x8_t, and constant C:
       return (float16x4_t) {in[C], in[C], in[C], in[C]};
   * Similarly,
       return (float16x8_t) {in[C], in[C], in[C], in[C], in[C], in[C], in[C], in[C]};
(These correspond intuitively to what one might expect for "vdup_lane_f16",
"vdup_laneq_f16", "vdupq_lane_f16" and "vdupq_laneq_f16" intrinsics,
although such intrinsics do not actually exist.)
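
For anyone wanting to try the vector-literal form outside an AArch64
toolchain, here is a minimal sketch. It assumes the GCC/Clang vector
extension and uses 16-bit integer lanes as a stand-in for float16x4_t; the
names v4hi, LANE and dup_lane are made up for illustration and are not part
of the patch.

```c
#include <stdint.h>

/* Stand-in for float16x4_t: four 16-bit lanes via the GCC/Clang vector
   extension. On AArch64 the real type comes from arm_neon.h. */
typedef int16_t v4hi __attribute__((vector_size(8)));

/* Hypothetical constant lane index, playing the role of C above. */
#define LANE 2

/* Broadcast one lane of IN to every lane of the result, written as a
   vector literal in the same shape as the float16x4_t case. */
static v4hi dup_lane(v4hi in)
{
    return (v4hi){in[LANE], in[LANE], in[LANE], in[LANE]};
}
```

With the real float16x4_t type, this is the shape of function that the patch
is meant to turn into a single instruction followed by ret.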

This patch does not deal with equivalents to vdup_n_s16 and other intrinsics
that load immediates, rather than using elements of pre-existing vectors.

What is code generation like for these, then? If I remember correctly, it
was the vdup_n_f16 implementation that looked most objectionable before.

Ah, I see what you are saying here. You mean: if there were intrinsics
equivalent to vdup_n_s16 (which there are not), then this patch would not
handle them. I was confused as vld1_dup_f16 does not use an element of a
pre-existing vector, and may well load an immediate, but is handled by
your patch.

To be clear: we do not use the *immediate* case of this at all yet, as HFmode
constants are disabled in aarch64_float_const_representable_p. We would need
to do some mangling to express the floating-point value as a binary constant
in the assembler output (see the ARM backend). That is, we cannot output (say)
an HFmode load of 16.0, as the assembler would express 16.0 as a 32-bit float
constant; we would instead need to output a load of immediate 0x4C00. For now,
we push the constant out to the constant pool and use a load instruction
taking an address.
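
As a rough illustration of the mangling involved, the following sketch
computes the IEEE 754 binary16 bit pattern of a float. It is not code from
GCC or from the ARM backend; half_bits is a made-up helper, and it assumes
the input is zero or a value that is normal and exactly representable in
binary16 (no rounding, subnormals, infinities or NaNs).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the binary16 encoding of F as a 16-bit
   integer. Only zero and exactly representable normal values are handled. */
static uint16_t half_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);             /* read the binary32 bits safely */

    uint16_t sign = (uint16_t)((u >> 16) & 0x8000u);
    if ((u & 0x7fffffffu) == 0)
        return sign;                      /* +0.0 or -0.0 */

    /* Rebias the exponent from binary32 (bias 127) to binary16 (bias 15). */
    int exp = (int)((u >> 23) & 0xffu) - 127 + 15;
    assert(exp >= 1 && exp <= 30);        /* normal binary16 range only */
    assert((u & 0x1fffu) == 0);           /* no mantissa bits would be lost */

    /* Keep the top 10 of the 23 mantissa bits. */
    uint16_t frac = (uint16_t)((u >> 13) & 0x3ffu);
    return (uint16_t)(sign | (uint16_t)(exp << 10) | frac);
}
```

For example, half_bits(1.0f) gives 0x3c00, which is the kind of integer
immediate the assembler output would need in place of a floating-point
literal.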

--Alan
