> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Monday, October 25, 2021 10:54 AM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; Richard
> Earnshaw <richard.earns...@arm.com>; nd <n...@arm.com>; Marcus
> Shawcroft <marcus.shawcr...@arm.com>
> Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector constants
> and operations
> 
> Tamar Christina <tamar.christ...@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandif...@arm.com>
> >> Sent: Saturday, October 23, 2021 11:40 AM
> >> To: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>
> >> Cc: Tamar Christina <tamar.christ...@arm.com>; Richard Earnshaw
> >> <richard.earns...@arm.com>; nd <n...@arm.com>; Marcus Shawcroft
> >> <marcus.shawcr...@arm.com>
> >> Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector
> >> constants and operations
> >>
> >> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> >> I'm still a bit sceptical about treating the high-part cost as lower.
> >> >> ISTM that the subreg cases are the ones that are truly “free” and
> >> >> any others should have a normal cost.  So if CSE handled the
> >> >> subreg case itself (to model how the rtx would actually be
> >> >> generated) then
> >> >> aarch64 code would have to do less work.  I imagine that will be
> >> >> true for
> >> other targets as well.
> >> >
> >> > I guess the main problem is that CSE lacks context because it's not
> >> > until after combine that the high part becomes truly "free" when
> >> > pushed into a high operation.
> >>
> >> Yeah.  And the aarch64 code is just being asked to cost the operation
> >> it's given, which could for example come from an existing
> >> aarch64_simd_mov_from_<mode>high.  I think we should try to ensure
> >> that a aarch64_simd_mov_from_<mode>high followed by some
> >> arithmetic
> >> on the result is more expensive than the fused operation (when fusing
> >> is possible).
> >>
> >> An analogy might be: if the cost code is given:
> >>
> >>   (add (reg X) (reg Y))
> >>
> >> then, at some later point, the (reg X) might be replaced with a
> >> multiplication, in which case we'd have a MADD operation and the
> >> addition is effectively free.  Something similar would happen if (reg
> >> X) became a shift by a small amount on newer cores, although I guess
> >> then you could argue either that the cost of the add disappears or that
> >> the cost of the shift disappears.
> >>
> >> But we shouldn't count ADD as free on the basis that it could be
> >> combined with a multiplication or shift in future.  We have to cost
> >> what we're given.  I think the same thing applies to the high part.
> >>
> >> Here we're trying to prevent cse1 from replacing a DUP (lane) with a
> >> MOVI by saying that the DUP is strictly cheaper than the MOVI.
> >> I don't think that's really true though, and the cost tables in the
> >> patch say that DUP is more expensive (rather than less expensive) than
> >> MOVI.
> >
> > No we're not. The front end has already pushed the constant into each
> > operation that needs it which is the entire problem.
> 
> I think we're talking about different things here.  I'll come to the gimple 
> stuff
> below, but I was talking purely about the effect on the RTL optimisers.  What
> I meant above is that, in the cse1 dumps, the patch leads to changes like:
> 
>  (insn 20 19 21 2 (set (reg:V8QI 96 [ _8 ])
> -        (const_vector:V8QI [
> +        (vec_select:V8QI (reg:V16QI 116)
> +            (parallel:V16QI [
> +                    (const_int 8 [0x8])
> +                    (const_int 9 [0x9])
> +                    (const_int 10 [0xa])
> +                    (const_int 11 [0xb])
> +                    (const_int 12 [0xc])
> +                    (const_int 13 [0xd])
> +                    (const_int 14 [0xe])
> +                    (const_int 15 [0xf])
> +                ]))) "include/arm_neon.h":6477:22 1394
> {aarch64_simd_mov_from_v16qihigh}
> +     (expr_list:REG_EQUAL (const_vector:V8QI [
>                  (const_int 3 [0x3]) repeated x8
> -            ])) "include/arm_neon.h":6477:22 1160 {*aarch64_simd_movv8qi}
> -     (expr_list:REG_DEAD (reg:V16QI 117)
> -        (nil)))
> +            ])
> +        (expr_list:REG_DEAD (reg:V16QI 117)
> +            (nil))))
> 
> The pre-cse1 code is:
> 
> (insn 19 18 20 2 (set (reg:V16QI 117)
>         (const_vector:V16QI [
>                 (const_int 3 [0x3]) repeated x16
>             ])) "include/arm_neon.h":6477:22 1166 {*aarch64_simd_movv16qi}
>      (nil))
> (insn 20 19 21 2 (set (reg:V8QI 96 [ _8 ])
>         (vec_select:V8QI (reg:V16QI 117)
>             (parallel:V16QI [
>                     (const_int 8 [0x8])
>                     (const_int 9 [0x9])
>                     (const_int 10 [0xa])
>                     (const_int 11 [0xb])
>                     (const_int 12 [0xc])
>                     (const_int 13 [0xd])
>                     (const_int 14 [0xe])
>                     (const_int 15 [0xf])
>                 ]))) "include/arm_neon.h":6477:22 1394
> {aarch64_simd_mov_from_v16qihigh}
>      (nil))
> 
> That is, before the patch, we folded insn 19 into insn 20 to get:
> 
> (insn 20 19 21 2 (set (reg:V8QI 96 [ _8 ])
>         (const_vector:V8QI [
>                 (const_int 3 [0x3]) repeated x8
>             ])) "include/arm_neon.h":6477:22 1160 {*aarch64_simd_movv8qi}
>      (expr_list:REG_DEAD (reg:V16QI 117)
>         (nil)))
> 
> After the patch we reject that because:
> 
>   (set (reg:V8QI X) (const_vector:V8QI [3]))
> 
> is costed as a MOVI (cost 4) and the original
> aarch64_simd_mov_from_v16qihigh is costed as zero.  In other words, the
> patch makes the DUP (lane) in the “mov high” strictly cheaper than a
> constant move (MOVI).

Yes, this was done intentionally because, as we talked about a month ago,
there's no real way to cost this correctly.  The use of `X` there determines
whether it's cheaper to use the MOVI over the DUP.  The MOVI not only
prevents re-use of the value, it also prevents combining into high
operations.  All of which is impossible to tell currently with how CSE and
costing are done.

This is an unmodified compiler created from last night's trunk 
https://godbolt.org/z/1saTP4xWs
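
The intrinsics source behind that link is roughly of this shape (my own
reconstruction from the assembly below; the exact names and types are
illustrative rather than the literal Godbolt input):

#include <arm_neon.h>

/* Sketch of test0: widen a, multiply-accumulate b by the constant 3 into
   both halves, then narrow with addhn and store 8 bytes.  */
void test0 (uint8_t *a, uint8_t *b, uint8_t *c)
{
  uint8x16_t va = vld1q_u8 (a);
  uint8x16_t vb = vld1q_u8 (b);
  uint8x16_t cst = vdupq_n_u8 (3);
  uint16x8_t lo = vmlal_u8 (vmovl_u8 (vget_low_u8 (va)),
                            vget_low_u8 (vb), vget_low_u8 (cst));
  uint16x8_t hi = vmlal_u8 (vmovl_u8 (vget_high_u8 (va)),
                            vget_high_u8 (vb), vget_high_u8 (cst));
  vst1_u8 (c, vaddhn_u16 (lo, hi));
}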

While yes, it did fold the MOVI into the set, reg 19 wasn't dead, so the
constant is now materialized 3 times:

test0:
        ldr     q0, [x0]
        movi    v3.8b, 0x3  <<<< first
        ldr     q2, [x1]
        movi    v5.16b, 0x3 <<< second
        uxtl    v1.8h, v0.8b
        dup     d4, v2.d[1] <<< third
        uxtl2   v0.8h, v0.16b
        umlal   v1.8h, v2.8b, v5.8b
        umlal   v0.8h, v4.8b, v3.8b
        addhn   v0.8b, v1.8h, v0.8h
        str     d0, [x2]
        ret

whilst my patch generates

test0:
        movi    v2.16b, 0x3 <<< once
        ldr     q0, [x0]
        uxtl    v1.8h, v0.8b
        uxtl2   v0.8h, v0.16b
        ldr     q3, [x1]
        umlal   v1.8h, v3.8b, v2.8b
        umlal2  v0.8h, v3.16b, v2.16b
        addhn   v0.8b, v1.8h, v0.8h
        str     d0, [x2]
        ret

Yes, it's not perfect, and yes, you can end up with a DUP instead of two
MOVIs, but my argument is that it's still a step forward, as the perfect
solution doesn't seem to be possible at all with the way things are
currently set up.

> 
> Preventing this fold seems like a key part of being able to match the
> *l2 forms in the testcase, since otherwise the “mov high” disappears and isn't
> available for combining later.

Yes, and by preventing the folding, combine should in principle be able to
fold it back if it wasn't pushed into another instruction; but combine does
not attempt to touch constants and selects on their own.  If it did, this
"regression" would be fixed.

I'm not really sure what we're arguing about...  I did think about all three
possible cases when making this:

https://godbolt.org/z/hjWhWq1v1

Of the three cases the compiler currently only generates something good for
test2; both test1 and test0 are deficient.  The patch doesn't change test2,
significantly improves test0, and whether test1 is a regression is likely
uArch-specific.

On Arm Cortex CPUs it is not a regression, as a DUP on a SIMD scalar has the
same throughput and latency as a MOVI according to the Arm Software
Optimization Guides.
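
To make the trade-off concrete, here is a minimal sketch of the two
situations (mine, not from the patch or its testsuite): when an 8-byte
constant stands alone a single MOVI is ideal, but when it is the high half
of a 16-byte constant that is live anyway, reusing that value (or fusing
into a *2 high form) avoids materializing it twice.

#include <arm_neon.h>

/* Stand-alone 64-bit use: a single MOVI v0.8b, 0x3 is the best choice.  */
uint8x8_t movi_wins (uint8x8_t x)
{
  return vadd_u8 (x, vdup_n_u8 (3));
}

/* The 128-bit constant is live anyway; using its high half directly should
   let the multiply combine into UMULL2 instead of needing a second MOVI or
   a DUP of the high lane.  */
uint16x8_t high_reuse_wins (uint8x16_t x)
{
  uint8x16_t cst = vdupq_n_u8 (3);
  uint16x8_t lo = vmull_u8 (vget_low_u8 (x), vget_low_u8 (cst));
  uint16x8_t hi = vmull_u8 (vget_high_u8 (x), vget_high_u8 (cst));
  return vaddq_u16 (lo, hi);
}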

So to me this looks like an improvement overall.  And this is where we likely 
disagree?

> 
> > MOVI as I mentioned before is the one case where this is a toss up.
> > But there are far more constants that cannot be created with a movi.
> > A simple example is
> >
> > #include <arm_neon.h>
> >
> > int8x16_t square(int8x16_t full, int8x8_t small) {
> >     int8x16_t cst = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,15};
> >     int8x8_t low = vget_high_s8 (cst);
> >     int8x8_t res1 = vmul_s8 (small, low);
> >     return vaddq_s8 (vmulq_s8 (full, cst), vcombine_s8 (res1, res1));
> > }
> >
> > Where in Gimple we get
> >
> >   <bb 2> [local count: 1073741824]:
> >   _2 = __builtin_aarch64_get_highv16qi ({ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 0 });
> >   _4 = _2 * small_3(D);
> >   _6 = full_5(D) * { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 0 };
> >   _7 = __builtin_aarch64_combinev8qi (_4, _4);
> >   _8 = _6 + _7;
> >   return _8;
> >
> > Regardless of what happens to __builtin_aarch64_get_highv16qi nothing
> > will recreate the relationship with cst, whether
> > __builtin_aarch64_get_highv16qi is lowered or not, constant prop will still
> > push in constants.
> 
> Yeah, constants are (by design) free in gimple.  But that's OK in itself,
> because RTL optimisers have the job of removing any duplicates that end up
> requiring separate moves.  I think we both agree on that.
> 
> E.g. for:
> 
> #include <arm_neon.h>
> 
> void foo(int8x16_t *x) {
>   x[0] = vaddq_s8 (x[0], (int8x16_t) {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15});
>   x[1] = vaddq_s8 (x[1], (int8x16_t) {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15});
> }
> 
> the final gimple is:
> 
>   <bb 2> [local count: 1073741824]:
>   _1 = *x_4(D);
>   _5 = _1 + { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };
>   *x_4(D) = _5;
>   _2 = MEM[(int8x16_t *)x_4(D) + 16B];
>   _7 = _2 + { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };
>   MEM[(int8x16_t *)x_4(D) + 16B] = _7;
>   return;
> 
> but cse1 removes the duplicated constant even before the patch.

It doesn't for me; again, with an unmodified compiler:

https://godbolt.org/z/qnvf7496h 

and CSE1's final codegen is:

(insn 7 4 8 2 (set (reg:V16QI 99)
        (const_vector:V16QI [
                (const_int 0 [0])
                (const_int 1 [0x1])
                (const_int 2 [0x2])
                (const_int 3 [0x3])
                (const_int 4 [0x4])
                (const_int 5 [0x5])
                (const_int 6 [0x6])
                (const_int 7 [0x7])
                (const_int 8 [0x8])
                (const_int 9 [0x9])
                (const_int 10 [0xa])
                (const_int 11 [0xb])
                (const_int 12 [0xc])
                (const_int 13 [0xd])
                (const_int 15 [0xf])
                (const_int 0 [0])
            ]))

(insn 8 7 9 2 (set (reg:V8QI 92 [ _2 ])
        (const_vector:V8QI [
                (const_int 8 [0x8])
                (const_int 9 [0x9])
                (const_int 10 [0xa])
                (const_int 11 [0xb])
                (const_int 12 [0xc])
                (const_int 13 [0xd])
                (const_int 15 [0xf])
                (const_int 0 [0])
            ]))

(insn 11 10 12 2 (set (reg:V16QI 95 [ _7 ])
        (vec_concat:V16QI (vec_select:V8QI (reg:V16QI 95 [ _7 ])
                (parallel:V16QI [
                        (const_int 0 [0])
                        (const_int 1 [0x1])
                        (const_int 2 [0x2])
                        (const_int 3 [0x3])
                        (const_int 4 [0x4])
                        (const_int 5 [0x5])
                        (const_int 6 [0x6])
                        (const_int 7 [0x7])
                    ]))
            (reg:V8QI 93 [ _4 ])))

So again the same constant is represented twice, which is reflected in the
codegen.

> 
> > This codegen results in us rematerializing the constant twice.
> >
> > square:
> >         adrp    x0, .LC0
> >         ldr     d2, [x0, #:lo12:.LC0]
> >         adrp    x0, .LC1
> >         ldr     q3, [x0, #:lo12:.LC1]
> >         mul     v1.8b, v2.8b, v1.8b
> >         dup     d2, v1.d[0]
> >         ins     v2.d[1], v1.d[0]
> >         mla     v2.16b, v0.16b, v3.16b
> >         mov     v0.16b, v2.16b
> >         ret
> > .LC1:
> >         .byte   0
> >         .byte   1
> >         .byte   2
> >         .byte   3
> >         .byte   4
> >         .byte   5
> >         .byte   6
> >         .byte   7
> >         .byte   8
> >         .byte   9
> >         .byte   10
> >         .byte   11
> >         .byte   12
> >         .byte   13
> >         .byte   15
> >         .byte   0
> >
> > Regardless of whether it's pushed into a high operation or not, this codegen
> > is still far more expensive.
> 
> The problem is that here, the patch is preventing CSE from first folding the
> RTL equivalent of:
> 
>   _2 = __builtin_aarch64_get_highv16qi ({ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 0 });
> 
> to the RTL equivalent of:
> 
>   { 8, 9, 10, 11, 12, 13, 15, 0 }
> 
> Preventing the fold keeps two instances of the original (RTL) constant and so
> prompts CSE to remove the duplicate.
> 
> CSE does do the fold without the patch.  Before cse1 we have:
> 
> (insn 7 4 8 2 (set (reg:V16QI 99)
>         (const_vector:V16QI [
>                 (const_int 0 [0])
>                 (const_int 1 [0x1])
>                 (const_int 2 [0x2])
>                 (const_int 3 [0x3])
>                 (const_int 4 [0x4])
>                 (const_int 5 [0x5])
>                 (const_int 6 [0x6])
>                 (const_int 7 [0x7])
>                 (const_int 8 [0x8])
>                 (const_int 9 [0x9])
>                 (const_int 10 [0xa])
>                 (const_int 11 [0xb])
>                 (const_int 12 [0xc])
>                 (const_int 13 [0xd])
>                 (const_int 15 [0xf])
>                 (const_int 0 [0])
>             ])) "include/arm_neon.h":6449:11 1166 {*aarch64_simd_movv16qi}
>      (nil))
> (insn 8 7 9 2 (set (reg:V8QI 92 [ _2 ])
>         (vec_select:V8QI (reg:V16QI 99)
>             (parallel:V16QI [
>                     (const_int 8 [0x8])
>                     (const_int 9 [0x9])
>                     (const_int 10 [0xa])
>                     (const_int 11 [0xb])
>                     (const_int 12 [0xc])
>                     (const_int 13 [0xd])
>                     (const_int 14 [0xe])
>                     (const_int 15 [0xf])
>                 ]))) "include/arm_neon.h":6449:11 1394
> {aarch64_simd_mov_from_v16qihigh}
>      (nil))
> 
> then unpatched cse1 converts insn 8 to:
> 
> (insn 8 7 9 2 (set (reg:V8QI 92 [ _2 ])
>         (const_vector:V8QI [
>                 (const_int 8 [0x8])
>                 (const_int 9 [0x9])
>                 (const_int 10 [0xa])
>                 (const_int 11 [0xb])
>                 (const_int 12 [0xc])
>                 (const_int 13 [0xd])
>                 (const_int 15 [0xf])
>                 (const_int 0 [0])
>             ])) "include/arm_neon.h":6449:11 1160 {*aarch64_simd_movv8qi}
>      (expr_list:REG_DEAD (reg:V16QI 99)
>         (nil)))
> 

But again, the constant is not single-use, so CSE keeps multiple copies
live.  My whole argument is that CSE should not perform any folding before
combine, because it simply does not have enough information to do the right
thing.

It's still being folded, just not by CSE, which is too early.  CE2 still
folds it into:

(insn 8 7 9 2 (set (reg:V8QI 92 [ _2 ])
        (const_vector:V8QI [
                (const_int 8 [0x8])
                (const_int 9 [0x9])
                (const_int 10 [0xa])
                (const_int 11 [0xb])
                (const_int 12 [0xc])
                (const_int 13 [0xd])
                (const_int 15 [0xf])
                (const_int 0 [0])
            ]))

But at least the RTL optimizers until then knew about the relationship.

Instead of generating

square:
        adrp    x0, .LC0
        ldr     d2, [x0, #:lo12:.LC0]
        adrp    x0, .LC1
        ldr     q3, [x0, #:lo12:.LC1]
        mul     v1.8b, v2.8b, v1.8b
        dup     d2, v1.d[0]
        ins     v2.d[1], v1.d[0]
        mla     v2.16b, v0.16b, v3.16b
        mov     v0.16b, v2.16b
        ret
single:
        adrp    x0, .LC0
        ldr     d2, [x0, #:lo12:.LC0]
        mul     v1.8b, v2.8b, v1.8b
        dup     d2, v1.d[0]
        ins     v2.d[1], v1.d[0]
        add     v0.16b, v2.16b, v0.16b
        ret

my patch gives

square:
        adrp    x0, .LC0
        ldr     q3, [x0, #:lo12:.LC0]
        dup     d2, v3.d[1]
        mul     v1.8b, v2.8b, v1.8b
        dup     d2, v1.d[0]
        ins     v2.d[1], v1.d[0]
        mla     v2.16b, v0.16b, v3.16b
        mov     v0.16b, v2.16b
        ret
single:
        adrp    x0, .LC1
        ldr     d2, [x0, #:lo12:.LC1]
        mul     v1.8b, v2.8b, v1.8b
        dup     d2, v1.d[0]
        ins     v2.d[1], v1.d[0]
        add     v0.16b, v2.16b, v0.16b
        ret

which now just selects the high part instead of doing a full address
computation + load.

> so that there are no longer any duplicate constants (as far as the RTL code is
> concerned).  Instead we have one 16-byte constant and one 8-byte constant.
> 
> The patch prevents the fold on insn 8 by making the “mov high”
> strictly cheaper than the constant move, so we keep the “mov high”
> and its 16-byte input.  Keeping the “mov high” means that we do have a
> duplicate constant for CSE to remove.
> 
> What I meant…
> 
> >> Also, if I've understood correctly, it looks like we'd be relying on
> >> the vget_high of a constant remaining unfolded until RTL cse1.
> >> I think it's likely in future that we'd try to fold vget_high at the
> >> gimple level instead, since that could expose more optimisations of a
> >> different kind.  The gimple optimisers would then fold
> >> vget_high(constant) in a similar way to
> >> cse1 does now.
> >>
> >> So perhaps we should continue to allow the vget_high(constant) to be
> >> folded in cse1 and come up with some way of coping with the folded form.
> 
> …here was that, in future, the gimple optimisers might be able to fold the
> vget_high themselves.  For your example, we'd then have:
> 
>   _4 = { 8, 9, 10, 11, 12, 13, 15, 0 } * small_3(D);
>   _6 = full_5(D) * { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 0 };
>   _7 = __builtin_aarch64_combinev8qi (_4, _4);
>   _8 = _6 + _7;
>   return _8;
> 
> In this situation, we'd need to recreate the relationship between { 0, 1, 2, 
> 3, 4,
> 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 0 } and { 8, 9, 10, 11, 12, 13, 15, 0 }.  
> We can't
> ensure that the relationship is never lost.
> 
> The same thing would be true for vget_low.  So a constant like:
> 
>   cst = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 0 }
>   … vget_low* (cst) ..;
>   … vget_high* (cst) …;
> 
> could be folded to two smaller constants:
> 
>   … { 0, 1, 2, 3, 4, 5, 6, 7 } …;
>   … { 8, 9, 10, 11, 12, 13, 15, 0 } …;
> 
> We might then need to recreate the combined form, rather than relying on
> the combined form already existing.

Yes, but this is what confuses me.  My patch changes it so that CSE1, which
is run relatively early, is able to find the relationship between the two
constants.

CSE1 shouldn't do any folding; it doesn't have enough information to do so.
By doing the folding, CSE makes combine less efficient.

> 
> > CSE1 doesn't fold it, because for CSE the cost is too high to do so, which is
> > what this costing was attempting to fix.
> > CSE simply does not touch it. It leaves it as
> >
> > (insn 11 10 12 2 (set (reg:V16QI 95 [ _7 ])
> >         (vec_concat:V16QI (vec_select:V8QI (reg:V16QI 95 [ _7 ])
> >                 (parallel:V16QI [
> >                         (const_int 0 [0])
> >                         (const_int 1 [0x1])
> >                         (const_int 2 [0x2])
> >                         (const_int 3 [0x3])
> >                         (const_int 4 [0x4])
> >                         (const_int 5 [0x5])
> >                         (const_int 6 [0x6])
> >                         (const_int 7 [0x7])
> >                     ]))
> >             (reg:V8QI 93 [ _4 ]))) "":6506:10 1908
> > {aarch64_simd_move_hi_quad_v16qi}
> >      (nil))
> > (insn 12 11 13 2 (set (reg:V16QI 102)
> >         (const_vector:V16QI [
> >                 (const_int 0 [0])
> >                 (const_int 1 [0x1])
> >                 (const_int 2 [0x2])
> >                 (const_int 3 [0x3])
> >                 (const_int 4 [0x4])
> >                 (const_int 5 [0x5])
> >                 (const_int 6 [0x6])
> >                 (const_int 7 [0x7])
> >                 (const_int 8 [0x8])
> >                 (const_int 9 [0x9])
> >                 (const_int 10 [0xa])
> >                 (const_int 11 [0xb])
> >                 (const_int 12 [0xc])
> >                 (const_int 13 [0xd])
> >                 (const_int 15 [0xf])
> >                 (const_int 0 [0])
> >             ])) "":1466:14 1166 {*aarch64_simd_movv16qi}
> >      (nil))
> 
> I don't think that's true for the unpatched compiler.  Are you sure this isn't
> the “pre-CSE” part of the dump?  CSE is confusing (to me) in that it prints
> each function twice, once in unoptimised form and later in optimised form.
> 

Yes, I'm sure; see the Compiler Explorer links above.

> > And I don't see any way to fix this without having Gimple not push
> > constants in, which would lead to worse regressions.
> > I can change the patch to cost the high as a dup which fixes this codegen at
> > least and has you rematerialize movi.  If that's not acceptable I can drop
> > costing for High entirely then, it's not the main thing I am fixing.
> 
> Costing the high as a dup leaves us in the same situation as before the
> patch: the folded V8QI constant is cheaper than the unfolded mov high.

Yes, and the DUP will reflect that.  The argument that it's not the right
cost no longer holds any water in that case, in particular as I still
maintain that it is too early to do any constant folding in CSE1 for AArch64.

Whether it's folded or not doesn't make any difference to combine, which
will fold when combinations are possible with the folded version.

So I have yet to see any actual regression.

But again, if I can't get this one fixed, I'm happy to remove the costing
for the high-part select completely and just move on.  My aim here was to
fix element extracts, and it would have been nice to fix the other obviously
bad codegen as well.

Tamar
> 
> Thanks,
> Richard
