Re: Preparation for this weeks call

2024-07-24 Thread Thor Preimesberger via Gcc
Sure - we actually already emit JSON in optinfo-emit-json.cc, and there
are existing implementations of JSON objects and of pretty-printing/dumping
them. I got a hacky version of our current raw dump working with JSON
objects, but using the functions and data structures in tree-dump.* "as is"
would require further processing of the dump's output (and, I think, some
further modification) to make it possible to link related nodes. That seems
computationally expensive, so I'm currently re-implementing the function
dump_generic_nodes from tree-pretty-print.cc so that it emits JSON instead.
tree-dump.cc doesn't currently handle all the current tree codes in
dequeue_and_dump. Unfortunately, this approach does leave us with another
spot in the code base that needs to be kept in sync with the tree data
structure. (I didn't take a long look at tree-streamer* to see whether
anything there would be helpful, but it looks like we'd have to enumerate
over the tree codes anyway to get JSON output.)

No patch yet - I'll submit one once the JSON dumping is ready, along with
others that process its output as appropriate. I've been pushing to get as
much done as possible before Wednesday, so I'll push what I've done so far
to the git fork later today or tomorrow.

I want to emphasize that I started a bit late on this because my academic
term didn't end until about a month after the coding period began. We've
already extended the project to accommodate that, but that places the
midpoint review only three weeks into a project that is expected to take 12
weeks. We can still address the rate of progress if you feel apprehensive
about the pace of things.

Thor



On Mon, Jul 22, 2024 at 3:26 AM Richard Biener 
wrote:
>
> Hi,
>
> we're having our bi-weekly call this Wednesday; I'd like to see you write a
> summary of what you were doing for the first half of the project and post
> that to me and the GCC mailing list.  Please also send, if appropriate, a
> patch that shows what you have done so far.
>
> Thanks a lot,
> Richard.


Re: How to implement Native TLS for a specific platform?

2024-07-24 Thread Claudiu Zissulescu Ianculescu via Gcc
Hi Julian,

I hope you have Ulrich's document about TLS; if not, please search for it:
"ELF Handling for Thread-Local Storage" by Ulrich Drepper.

In ARC, I used unspec constructs to emit TLS addresses. If you want to
simplify things, in legitimize_tls_address you can implement only the most
general case, namely TLS_GLOBAL_DYNAMIC, and have all the other models fall
back to it. For TLS, you need to reserve a TLS register which will hold the
TLS pointer (in ARC it is the register whose number is held in the
arc_tp_regno variable, but in your case you can fix it). You can use the
TLS examples in GCC's DejaGnu test suite, compile the ARC backend, and
check the output assembly to see how it works.
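
To make the "fall back to global-dynamic" idea concrete, here is a rough
sketch (not the actual ARC or i386 code; UNSPEC_TLS_GD and gen_tls_gd_call
are hypothetical names standing in for the port's own unspec and call
expander) of what such a hook can look like:

/* Rough sketch only, under the assumptions above.  */
static rtx
mytarget_legitimize_tls_address (rtx addr, enum tls_model model ATTRIBUTE_UNUSED)
{
  /* Treat every model as global-dynamic for now; the other models are
     only optimizations on top of this sequence.  */
  rtx sym = gen_rtx_CONST (Pmode,
                           gen_rtx_UNSPEC (Pmode, gen_rtvec (1, addr),
                                           UNSPEC_TLS_GD));
  rtx ret = gen_reg_rtx (Pmode);
  /* gen_tls_gd_call is a hypothetical expander that emits the
     "add ...@tlsgd" + "bl @__tls_get_addr@plt" pair shown below and
     copies the result (returned in r0) into RET.  */
  emit_insn (gen_tls_gd_call (ret, sym));
  return ret;
}
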
In case of ARC, the global dynamic model generates two relocs:

   add   r0,pcl,@x@tlsgd        # R_ARC_TLS_GD_GOT x
   bl    @__tls_get_addr@plt    # R_ARC_S21_PCREL_PLT __tls_get_addr
   # Address of x in r0

where __tls_get_addr is a function provided by the OS which returns the
address of variable x in r0 (the return register). You should already have
the PLT reloc, and you need to implement the TLS_GD_GOT relocation against
the GOT. In the GOT you additionally need two relocations:
GOT[n]    R_ARC_TLS_DTPMOD x
GOT[n+1]  R_ARC_TLS_DTPOFF x

I hope this may clarify it a bit, cheers,
Claudiu

On Thu, Jul 18, 2024 at 12:43 PM Julian Waters via Gcc  wrote:
>
> I guess I'll just say what platform I want to implement this for, since the
> roundabout way of talking about it is probably confusing to everyone. It's
> Windows, and hopefully implementing TLS for it should be relatively easy,
> since there is only one TLS model on Windows.
>
> best regards,
> Julian
>
> On Thu, Jul 18, 2024 at 5:39 PM Julian Waters 
> wrote:
>
> > Hi Claudiu,
> >
> > Thanks for the tip, I've since looked at and drawn inspiration from
> > arc.cc. The main issue I have now is how to implement the code in
> > legitimize_tls_address under i386.cc and the corresponding i386.md machine
> > description file to get the following assembly for a TLS read (assuming
> > that local is the name of the thread-local variable, that the last mov
> > depends on the size of the variable (it would be movq for an 8-byte
> > variable), that rscratch refers to scratch registers, and that
> > rscratch1 holds the read TLS value at the end of the operation):
> >
> > movl _tls_index(%rip), %rscratch1
> > movq %gs:88, %rscratch2
> > movq (%rscratch2, %rscratch1, 8), %rscratch1
> > movl local@SECREL32(%rscratch1), %rscratch1
> >
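For reference (not part of the quoted mail), this is the source-level
access that such a sequence implements, assuming a plain GNU __thread
variable; on x86-64 Windows, %gs:88 is the TEB's ThreadLocalStoragePointer
and _tls_index holds the module's slot in that per-thread array:

/* Illustration only, under the assumptions above.  */
__thread int local;

int
read_local (void)
{
  /* Lowered to: load _tls_index, index the TLS array at %gs:88, then
     read local@SECREL32 from the module's TLS block.  */
  return local;
}
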
> > With some reference from the arc.cc code and another (unofficial) patch
> > for the platform that I want to implement TLS for, I've managed a half
> > finished implementation of TLS, but the final blocker so to speak is my
> > lack of understanding on how the RTL manipulating code in
> > legitimize_tls_address works. If you have any pointers on how to manipulate
> > RTL to get the assembly required as seen above, I would be very much
> > grateful :)
> >
> > best regards,
> > Julian
> >
> > On Tue, Jul 16, 2024 at 8:16 PM Claudiu Zissulescu Ianculescu <
> > claz...@gmail.com> wrote:
> >
> >> Hi Julian,
> >>
> >> You can check how we did it for ARC. In a nutshell, you need to define
> >> the HAVE_AS_TLS macro and legitimize the new TLS addresses and
> >> calls. Please have a look in arc.cc and search for TLS, and also use git
> >> blame to see the original patches. Of course, there are different ways
> >> to implement TLS; ARC has the simplest solution. Also, you need to
> >> hack the assembler, linker and the OS for a full implementation.
> >>
> >> Cheers,
> >> Claudiu
> >>
> >> On Tue, Jul 9, 2024 at 7:14 PM Julian Waters via Gcc 
> >> wrote:
> >> >
> >> > Hi all,
> >> >
> >> > I'm currently trying to implement Native TLS on a platform that gcc uses
> >> > emutls for at the moment, but I can't seem to figure out where and how
> >> to
> >> > implement it. I have a rough idea of the assembly required for TLS on
> >> this
> >> > platform, but I don't know where to plug it in to the compiler to make
> >> it
> >> > work. Could someone point me in the right direction for implementing TLS
> >> > for a platform that doesn't have it implemented at the moment?
> >> >
> >> > I'm aware that I am being vague as to which platform I want to
> >> implement it
> >> > for. It's a platform that is likely low priority in the eyes of most gcc
> >> > maintainers, so I'm deliberately avoiding mentioning what platform it
> >> is so
> >> > I don't get crickets for a reply :)
> >> >
> >> > best regards,
> >> > Julian
> >>
> >


why are these std::set iterators of different type when compiling with -D_GLIBCXX_DEBUG

2024-07-24 Thread Dennis Luehring via Gcc

using latest gcc/STL

-
#include 

using int_set1 = std::set>;
using int_set2 = std::set;

static_assert(std::is_same());
-


The two iterator types are the same when not using _GLIBCXX_DEBUG, but they
become different when the macro is defined - why is that?


Re: why are these std::set iterators of different type when compiling with -D_GLIBCXX_DEBUG

2024-07-24 Thread Jonathan Wakely via Gcc
On Wed, 24 Jul 2024 at 11:38, Dennis Luehring via Gcc  wrote:
>
> using latest gcc/STL
>
> -
> #include 
>
> using int_set1 = std::set>;
> using int_set2 = std::set;
>
> static_assert(std::is_same());
> -
>
>
> the two iterators are equal when not using _GLIBCXX_DEBUG but become
> different when using the define?

Your question belongs on either the libstdc++ or gcc-help mailing
list, not here. This list is for discussing development of GCC itself,
not help using it.

The standard says it's unspecified whether those types are the same,
so portable code should not assume they are/aren't the same. I don't
know for sure, but I assume somebody thought that making them
different was helpful to avoid non-portable code.
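
Since the template arguments in the quoted snippet were lost by the list
archive, here is an illustrative reconstruction (an assumption, not the
original code) in which the two sets differ only in their comparator:

#include <set>
#include <type_traits>

using set_a = std::set<int>;               // default comparator, std::less<int>
using set_b = std::set<int, std::less<>>;  // transparent comparator

// Without -D_GLIBCXX_DEBUG both iterators are the same underlying
// red-black-tree iterator type, so this holds; with -D_GLIBCXX_DEBUG the
// safe iterator also records the container type, so the assertion fails.
static_assert (std::is_same_v<set_a::iterator, set_b::iterator>);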


Re: why are these std::set iterators of different type when compiling with -D_GLIBCXX_DEBUG

2024-07-24 Thread Dennis Luehring via Gcc

Am 24.07.2024 um 12:41 schrieb Jonathan Wakely:

The standard says it's unspecified whether those types are the same,
so portable code should not assume they are/aren't the same. I don't
know for sure, but I assume somebody thought that making them
different was helpful to avoid non-portable code.


Thanks - I will try the other mailing list.


Re: Preparation for this weeks call

2024-07-24 Thread Richard Biener via Gcc
On Wed, Jul 24, 2024 at 9:59 AM Thor Preimesberger
 wrote:
>
> Sure - we actually already emit JSON in optinfo-emit-json.cc, and there are
> existing implementations of JSON objects and of pretty-printing/dumping
> them. I got a hacky version of our current raw dump working with JSON
> objects, but using the functions and data structures in tree-dump.* "as is"
> would require further processing of the dump's output (and, I think, some
> further modification) to make it possible to link related nodes. That seems
> computationally expensive, so I'm currently re-implementing the function
> dump_generic_nodes from tree-pretty-print.cc so that it emits JSON instead.
> tree-dump.cc doesn't currently handle all the current tree codes in
> dequeue_and_dump. Unfortunately, this approach does leave us with another
> spot in the code base that needs to be kept in sync with the tree data
> structure. (I didn't take a long look at tree-streamer* to see whether
> anything there would be helpful, but it looks like we'd have to enumerate
> over the tree codes anyway to get JSON output.)
>
> No patch yet - I'll submit one once the JSON dumping is ready, along with
> others that process its output as appropriate. I've been pushing to get as
> much done as possible before Wednesday, so I'll push what I've done so far
> to the git fork later today or tomorrow.

I'd like to actually see some of the progress here, even if it's
incomplete, and even if there are two versions (the tree-dump one and the
dump_generic_nodes one).

> I want to emphasize that I started a bit late on this because my academic
> term didn't end until about a month after the coding period began. We've
> already extended the project to accommodate that, but that places the
> midpoint review only three weeks into a project that is expected to take
> 12 weeks. We can still address the rate of progress if you feel
> apprehensive about the pace of things.

There are 8 weeks left until the final code submission, so we're a third of
the way into the project.  That would be 4 weeks, but I understand that
it's more like three on your side.  At this point there should be a clear
path forward, and thus the design should be set.  It's a bit unfortunate
that there's nothing written down in code or e-mail from your side that
shows progress there - I hope we can sort this out later today.

Richard.

>
> Thor
>
>
>
> On Mon, Jul 22, 2024 at 3:26 AM Richard Biener  
> wrote:
> >
> > Hi,
> >
> > we're having our bi-weekly call this Wednesday;  I'd like to see you write a
> > summary of what you were doing for the first half of the project and post
> > that to me and the GCC mailing list.  Please also send, if appropriate, a
> > patch that shows what you have done sofar.
> >
> > Thanks a lot,
> > Richard.


Re: Nonbootstrap build with Apple clang broken in gm2

2024-07-24 Thread Gaius Mulley via Gcc
FX Coudert  writes:

> Another quick m2-related question: I am seeing, in a build of GCC
> 14.1.0 on Linux, that flex is called when building with the modula-2
> front-end. It was not the case in previous builds, and the only
> difference is that I added m2 to the languages. Is that systematic? If
> so, the prerequisites page should be amended:
> https://gcc.gnu.org/install/prerequisites.html
>

Ah yes, it is indeed systematic.  Many thanks for spotting this - I have
pushed a change to git noting that GNU flex is required:

https://gcc.gnu.org/pipermail/gcc-cvs/2024-July/406305.html

regards,
Gaius


Re: Nonbootstrap build with Apple clang broken in gm2

2024-07-24 Thread FX Coudert via Gcc
> Ah yes indeed it is systematic.  Many thanks for spotting this - git
> pushed GNU flex required
> https://gcc.gnu.org/pipermail/gcc-cvs/2024-July/406305.html

Couldn't the generated files be committed to the tree, so that flex is not
needed (unless one modifies the sources)? This is what is done for the
other use of flex.

FX

Office Hours for the GNU Toolchain on 2024-07-25 at 11am EST5EDT.

2024-07-24 Thread Carlos O'Donell via Gcc
Office Hours for the GNU Toolchain on 2024-07-25 at 11am EST5EDT.

Agenda:
* https://gcc.gnu.org/wiki/OfficeHours#Next

Meeting Link:
* https://bbb.linuxfoundation.org/room/adm-xcb-for-sk6

--
Cheers,
Carlos.



[RISC-V] Combining vfmv and .vv instructions into a .vf instruction

2024-07-24 Thread Artemiy Volkov via Gcc
Hi Juzhe, Demin, Jeff,

This email is intended to continue the discussion started in
https://marc.info/?l=gcc-patches&m=170927452922009&w=2 about combining vfmv.v.f
and vfmxx.vv instructions into the scalar-vector form vfmxx.vf.

There was a mention on that thread of the potential usefulness of the 
late-combine
pass (added last month in
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=792f97b44ffc5e6a967292b3747fd835e99396e7)
in making this transformation. However, when I tried it out with my testcase at
https://godbolt.org/z/o8oPzo7qY, I found it unable to handle these complex
post-split1 patterns for broadcast and vfmacc:

(insn 129 128 130 3 (set (reg:RVVM4SF 168 [ _61 ])
(if_then_else:RVVM4SF (unspec:RVVMF8BI [
(const_vector:RVVMF8BI [
(const_int 1 [0x1]) repeated x16
])
(const_int 16 [0x10])
(const_int 2 [0x2]) repeated x2
(const_int 0 [0])
(reg:SI 66 vl)
(reg:SI 67 vtype)
] UNSPEC_VPREDICATE)
(vec_duplicate:RVVM4SF (mem:SF (reg:SI 143 [ ivtmp.21 ]) [1 
MEM[(float *)_145]+0 S4 A32]))
(unspec:RVVM4SF [
(reg:SI 0 zero)
] UNSPEC_VUNDEF))) "/app/example.c":19:53 4019 
{*pred_broadcastrvvm4sf_zvfh}
 (nil))
[ ... ]
(insn 131 130 34 3 (set (reg:RVVM4SF 139 [ D__lsm.10 ])
(if_then_else:RVVM4SF (unspec:RVVMF8BI [
(const_vector:RVVMF8BI [
(const_int 1 [0x1]) repeated x16
])
(const_int 16 [0x10])
(const_int 2 [0x2]) repeated x2
(const_int 0 [0])
(const_int 7 [0x7])
(reg:SI 66 vl)
(reg:SI 67 vtype)
(reg:SI 69 frm)
] UNSPEC_VPREDICATE)
(plus:RVVM4SF (mult:RVVM4SF (reg/v:RVVM4SF 135 [ row ])
(reg:RVVM4SF 168 [ _61 ]))
(reg:RVVM4SF 139 [ D__lsm.10 ]))
(unspec:RVVM4SF [
(reg:SI 0 zero)
] UNSPEC_VUNDEF))) "/app/example.c":19:36 15007 
{*pred_mul_addrvvm4sf_undef}
 (nil))

I'm no expert on this, but what's stopping us from adding some vector-scalar
split patterns alongside the vector-vector ones in autovec.md to fix this?
For instance, the addition of an fma4_scalar insn_and_split like this:

diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index d5793ac..bf54d71 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -1229,2 +1229,22 @@

+(define_insn_and_split "fma4_scalar"
+  [(set (match_operand:V_VLSF 0 "register_operand")
+(plus:V_VLSF
+ (mult:V_VLSF
+   (vec_duplicate:V_VLSF (match_operand:SF 1 
"direct_broadcast_operand"))
+   (match_operand:V_VLSF 2 "register_operand"))
+ (match_operand:V_VLSF 3 "register_operand")))]
+  "TARGET_VECTOR && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+  {
+rtx ops[] = {operands[0], operands[1], operands[2], operands[3],
+ operands[0]};
+riscv_vector::emit_vlmax_insn (code_for_pred_mul_scalar (PLUS, mode),
+  riscv_vector::TERNARY_OP_FRM_DYN, ops);
+DONE;
+  }
+  [(set_attr "type" "vector")])
+
 ;; -

does lead to vfmacc.vf instructions being emitted instead of vfmacc.vv's for the
testcase linked above.
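
For readers without the godbolt link, a typical source shape for this
transformation is a scalar-times-vector FMA loop like the one below
(illustrative only, not the exact testcase from the link); the scalar ends
up broadcast with vfmv.v.f and fed to vfmacc.vv unless the .vf form is used:

/* Illustrative testcase, not the one linked above.  */
void
saxpy (float *d, const float *a, float s, int n)
{
  for (int i = 0; i < n; i++)
    d[i] += s * a[i];   /* d[i] = fma (s, a[i], d[i]) */
}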

What do you think about this approach to implementing this optimization? Am
I missing anything important? Maybe split1 is too early to determine the
final instruction format (.vf vs. .vv) and we should instead strive to
recombine during late-combine2?

Also, is there anyone working on this optimization at the present moment?

Many thanks in advance,
Artemiy


Re: [RISC-V] Combining vfmv and .vv instructions into a .vf instruction

2024-07-24 Thread Jeff Law via Gcc




On 7/24/24 11:25 AM, Artemiy Volkov wrote:

Hi Juzhe, Demin, Jeff,

This email is intended to continue the discussion started in
https://marc.info/?l=gcc-patches&m=170927452922009&w=2 about combining vfmv.v.f
and vfmxx.vv instructions into the scalar-vector form vfmxx.vf.

There was a mention on that thread of the potential usefulness of the 
late-combine
pass (added last month in
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=792f97b44ffc5e6a967292b3747fd835e99396e7)
in making this transformation. However, when I tried it out with my testcase at
https://godbolt.org/z/o8oPzo7qY, I found it unable to handle these complex
post-split1 patterns for broadcast and vfmacc:

(insn 129 128 130 3 (set (reg:RVVM4SF 168 [ _61 ])
 (if_then_else:RVVM4SF (unspec:RVVMF8BI [
 (const_vector:RVVMF8BI [
 (const_int 1 [0x1]) repeated x16
 ])
 (const_int 16 [0x10])
 (const_int 2 [0x2]) repeated x2
 (const_int 0 [0])
 (reg:SI 66 vl)
 (reg:SI 67 vtype)
 ] UNSPEC_VPREDICATE)
 (vec_duplicate:RVVM4SF (mem:SF (reg:SI 143 [ ivtmp.21 ]) [1 
MEM[(float *)_145]+0 S4 A32]))
 (unspec:RVVM4SF [
 (reg:SI 0 zero)
 ] UNSPEC_VUNDEF))) "/app/example.c":19:53 4019 
{*pred_broadcastrvvm4sf_zvfh}
  (nil))
[ ... ]
(insn 131 130 34 3 (set (reg:RVVM4SF 139 [ D__lsm.10 ])
 (if_then_else:RVVM4SF (unspec:RVVMF8BI [
 (const_vector:RVVMF8BI [
 (const_int 1 [0x1]) repeated x16
 ])
 (const_int 16 [0x10])
 (const_int 2 [0x2]) repeated x2
 (const_int 0 [0])
 (const_int 7 [0x7])
 (reg:SI 66 vl)
 (reg:SI 67 vtype)
 (reg:SI 69 frm)
 ] UNSPEC_VPREDICATE)
 (plus:RVVM4SF (mult:RVVM4SF (reg/v:RVVM4SF 135 [ row ])
 (reg:RVVM4SF 168 [ _61 ]))
 (reg:RVVM4SF 139 [ D__lsm.10 ]))
 (unspec:RVVM4SF [
 (reg:SI 0 zero)
 ] UNSPEC_VUNDEF))) "/app/example.c":19:36 15007 
{*pred_mul_addrvvm4sf_undef}
  (nil))

I'm no expert on this, but what's stopping us from adding some vector-scalar
split patterns alongside vector-vector ones in autovec.md to fix this? For
instance, the addition of fma4_scalar insn_and_split like this:

diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index d5793ac..bf54d71 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -1229,2 +1229,22 @@

+(define_insn_and_split "fma4_scalar"
+  [(set (match_operand:V_VLSF 0 "register_operand")
+(plus:V_VLSF
+ (mult:V_VLSF
+   (vec_duplicate:V_VLSF (match_operand:SF 1 
"direct_broadcast_operand"))
+   (match_operand:V_VLSF 2 "register_operand"))
+ (match_operand:V_VLSF 3 "register_operand")))]
+  "TARGET_VECTOR && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+  {
+rtx ops[] = {operands[0], operands[1], operands[2], operands[3],
+ operands[0]};
+riscv_vector::emit_vlmax_insn (code_for_pred_mul_scalar (PLUS, mode),
+  riscv_vector::TERNARY_OP_FRM_DYN, ops);
+DONE;
+  }
+  [(set_attr "type" "vector")])
+
  ;; -

does lead to vfmacc.vf instructions being emitted instead of vfmacc.vv's for the
testcase linked above.

What do you think about this approach to implement this optimization? Am I
missing anything important? Maybe split1 is too early to determine the final
instruction format (.vf vs .vv) and we should strive to recombine during
late-combine2?

Also, is there anyone working on this optimization at the present moment?
Before jumping straight to a new combiner pattern (especially a 
define_insn_and_split), I would want to have a clearer understanding of 
the code before/after instruction combination as well as before/after 
register allocation and reloading.


When I was looking at the results of late-combine it did seem to fairly 
consistently work well for generating .vx and .vf forms rather than .vv, 
so odds are something simple is missing somewhere.




I'm not aware of anyone working on this, except perhaps Demin.  So 
there's ample room to work without stepping on anyone's toes.


jeff




Re: [RISC-V] Combining vfmv and .vv instructions into a .vf instruction

2024-07-24 Thread 钟居哲
I think Demin is working on it, and Robin is the reviewer for this stuff.



juzhe.zh...@rivai.ai
 
From: Artemiy Volkov
Date: 2024-07-25 01:25
To: juzhe.zh...@rivai.ai; demin@starfivetech.com; jeffreya...@gmail.com
CC: gcc@gcc.gnu.org
Subject: [RISC-V] Combining vfmv and .vv instructions into a .vf instruction
Hi Juzhe, Demin, Jeff,
 
This email is intended to continue the discussion started in
https://marc.info/?l=gcc-patches&m=170927452922009&w=2 about combining vfmv.v.f
and vfmxx.vv instructions into the scalar-vector form vfmxx.vf.
 
There was a mention on that thread of the potential usefulness of the 
late-combine
pass (added last month in
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=792f97b44ffc5e6a967292b3747fd835e99396e7)
in making this transformation. However, when I tried it out with my testcase at
https://godbolt.org/z/o8oPzo7qY, I found it unable to handle these complex
post-split1 patterns for broadcast and vfmacc:
 
(insn 129 128 130 3 (set (reg:RVVM4SF 168 [ _61 ])
(if_then_else:RVVM4SF (unspec:RVVMF8BI [
(const_vector:RVVMF8BI [
(const_int 1 [0x1]) repeated x16
])
(const_int 16 [0x10])
(const_int 2 [0x2]) repeated x2
(const_int 0 [0])
(reg:SI 66 vl)
(reg:SI 67 vtype)
] UNSPEC_VPREDICATE)
(vec_duplicate:RVVM4SF (mem:SF (reg:SI 143 [ ivtmp.21 ]) [1 
MEM[(float *)_145]+0 S4 A32]))
(unspec:RVVM4SF [
(reg:SI 0 zero)
] UNSPEC_VUNDEF))) "/app/example.c":19:53 4019 
{*pred_broadcastrvvm4sf_zvfh}
 (nil))
[ ... ]
(insn 131 130 34 3 (set (reg:RVVM4SF 139 [ D__lsm.10 ])
(if_then_else:RVVM4SF (unspec:RVVMF8BI [
(const_vector:RVVMF8BI [
(const_int 1 [0x1]) repeated x16
])
(const_int 16 [0x10])
(const_int 2 [0x2]) repeated x2
(const_int 0 [0])
(const_int 7 [0x7])
(reg:SI 66 vl)
(reg:SI 67 vtype)
(reg:SI 69 frm)
] UNSPEC_VPREDICATE)
(plus:RVVM4SF (mult:RVVM4SF (reg/v:RVVM4SF 135 [ row ])
(reg:RVVM4SF 168 [ _61 ]))
(reg:RVVM4SF 139 [ D__lsm.10 ]))
(unspec:RVVM4SF [
(reg:SI 0 zero)
] UNSPEC_VUNDEF))) "/app/example.c":19:36 15007 
{*pred_mul_addrvvm4sf_undef}
 (nil))
 
I'm no expert on this, but what's stopping us from adding some vector-scalar
split patterns alongside vector-vector ones in autovec.md to fix this? For
instance, the addition of fma4_scalar insn_and_split like this:
 
diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index d5793ac..bf54d71 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -1229,2 +1229,22 @@
 
+(define_insn_and_split "fma4_scalar"
+  [(set (match_operand:V_VLSF 0 "register_operand")
+(plus:V_VLSF
+ (mult:V_VLSF
+   (vec_duplicate:V_VLSF (match_operand:SF 1 
"direct_broadcast_operand"))
+   (match_operand:V_VLSF 2 "register_operand"))
+ (match_operand:V_VLSF 3 "register_operand")))]
+  "TARGET_VECTOR && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+  {
+rtx ops[] = {operands[0], operands[1], operands[2], operands[3],
+ operands[0]};
+riscv_vector::emit_vlmax_insn (code_for_pred_mul_scalar (PLUS, mode),
+  riscv_vector::TERNARY_OP_FRM_DYN, ops);
+DONE;
+  }
+  [(set_attr "type" "vector")])
+
;; -
 
does lead to vfmacc.vf instructions being emitted instead of vfmacc.vv's for the
testcase linked above.
 
What do you think about this approach to implement this optimization? Am I
missing anything important? Maybe split1 is too early to determine the final
instruction format (.vf vs .vv) and we should strive to recombine during
late-combine2?
 
Also, is there anyone working on this optimization at the present moment?
 
Many thanks in advance,
Artemiy
 


Re: Preparation for this weeks call

2024-07-24 Thread David Malcolm via Gcc
On Wed, 2024-07-24 at 02:59 -0500, Thor Preimesberger via Gcc wrote:
> Sure - we actually already emit json in optinfo-emit-json.cc,  and
> there
> are implementations of json and pretty-printing/dumping it out also.
> I got
> a hacky version of our current raw dump working with json objects,
> but
> using the functions and data structures in tree-dump.* "as is" would
> require further processing of the dump's output (and I think some
> further
> modification) in order to make linking related nodes possible - I
> think, at
> least. This seems expensive computationally, so I'm currently
> re-implementing the function dump_generic_nodes from tree-pretty-
> print.cc
> so that it emits JSON instead. 

FWIW if you're working with gcc/json.h I just pushed various
improvements; see:
  https://gcc.gnu.org/pipermail/gcc-patches/2024-July/658194.html

In particular:
  https://gcc.gnu.org/pipermail/gcc-patches/2024-July/658195.html
saves me a *lot* of typing in gdb when debugging json stuff.

Hope this is helpful
Dave



Re: Preparation for this weeks call

2024-07-24 Thread Thor Preimesberger via Gcc
Ah, thanks! I took a quick look today and I'll look at it more later.

I should have a patch ready in about two weeks, but I'm going to attach
some notes and a link to the git repository I'm working on, in case anyone
on the mailing list wants to take a look at it and provide feedback and/or
request things while I'm implementing the JSON dump portion of the project.
This is all very incomplete, so set your expectations appropriately.

The repo itself is here: https://github.com/cheezeburglar/gcc
Most of the action is in tree-dump.cc right now, but I plan to split
off all the json dumping capability in it's own file later:
https://github.com/cheezeburglar/gcc/blob/master/gcc/tree-dump.cc

And, here's an email I sent to Richard (and forgot to cc to the
mailing list) along with notes that he gave me during a meeting:

In general the idea is that the dump flag -fdump-tree-original-json
emits something of the form

{"index": 1, "tree_code": foo, "child_1": {...}, ... "child_n": {...}}

or

{"index": 1, "tree_code": foo, "children": [{...}, {...}]}

where "index" denotes not any part of the tree class but the order in which
the nodes are traversed, and the "child_n" or "children" entries contain
the index, tree_code, and other data of the child nodes.

In the above example, I don't think it's worth recursing over each child
node's own children. And if we look at, say, COMPLEX_TYPE, we'd want each
child node to contain at least some of the data, if not all of it (save the
child node's children); see the sketch below.
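
As a concrete, deliberately GCC-free sketch of that shape (names and
structure here are placeholders, not the actual patch): indices are
assigned in visitation order and child entries stay shallow.

#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct node { std::string code; std::vector<const node *> kids; };

static std::map<const node *, int> index_of;

/* Assign indices in the order nodes are first visited.  */
static int
get_index (const node *n)
{
  auto it = index_of.find (n);
  if (it == index_of.end ())
    it = index_of.emplace (n, (int) index_of.size () + 1).first;
  return it->second;
}

/* Dump one node: a full entry for the node itself, shallow entries
   (index + code only) for its children.  */
static void
dump_node (const node *n)
{
  std::printf ("{\"index\": %d, \"tree_code\": \"%s\", \"children\": [",
               get_index (n), n->code.c_str ());
  for (size_t i = 0; i < n->kids.size (); i++)
    std::printf ("%s{\"index\": %d, \"tree_code\": \"%s\"}",
                 i ? ", " : "", get_index (n->kids[i]),
                 n->kids[i]->code.c_str ());
  std::printf ("]}\n");
}

Each node is then emitted in full exactly once, and every other reference
to it is just its index, which is what would let an HTML view hyperlink
between nodes.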

One option for translating this to HTML is making each index effectively a
hyperlink to its node, so one could traverse the tree by clicking on each
index. We could also feed in a recursion-depth parameter here, so an
individual HTML document would display one node, its children, and its
children's children up to some depth (possibly 0). I haven't yet gotten the
JSON emitted in either of the above forms, so I haven't looked at the
technical details of this too closely yet.

//Look at debug_tree

//Show some examples to community, feedback for HTML once prototypical examples

// Also track addresses of tree_nodes, use for unique identifier
instead of visitation order?
// Make sure that we iterate over data fields in parent nest; in
particular make sure all those that are in RAW and maybe all

// Static_flag, e.g., has different semantics in different positions -
all in tree-core.h. Worry about overloaded meaning. see debug_tree for
inspiration
// Keep track of different semantics in differing positions.
dump_generic_node() might not do this

All of this is essentially just a debugging nicety, so I think it'd be
acceptable to write this translator in Python. One would also, probably,
call into this utility from GDB. I have not yet looked deeply into GDB's
Python API, but I *think* having the script in Python would be that much
more convenient. One would ultimately call some function from GDB that
takes in some tree and displays all of the above in a web browser.

Best,
Thor

On Wed, Jul 24, 2024 at 6:05 PM David Malcolm  wrote:
>
> On Wed, 2024-07-24 at 02:59 -0500, Thor Preimesberger via Gcc wrote:
> > Sure - we actually already emit json in optinfo-emit-json.cc,  and
> > there
> > are implementations of json and pretty-printing/dumping it out also.
> > I got
> > a hacky version of our current raw dump working with json objects,
> > but
> > using the functions and data structures in tree-dump.* "as is" would
> > require further processing of the dump's output (and I think some
> > further
> > modification) in order to make linking related nodes possible - I
> > think, at
> > least. This seems expensive computationally, so I'm currently
> > re-implementing the function dump_generic_nodes from tree-pretty-
> > print.cc
> > so that it emits JSON instead.
>
> FWIW if you're working with gcc/json.h I just pushed various
> improvements; see:
>   https://gcc.gnu.org/pipermail/gcc-patches/2024-July/658194.html
>
> In particular:
>   https://gcc.gnu.org/pipermail/gcc-patches/2024-July/658195.html
> saves me a *lot* of typing in gdb when debugging json stuff.
>
> Hope this is helpful
> Dave
>


RE: [RISC-V] Combining vfmv and .vv instructions into a .vf instruction

2024-07-24 Thread Demin Han
Hi,

For this case:
1. late_combine1: the general pattern (plus (mult a b) c) can't be combined.
2. late_combine2: vec_duplicate has already been expanded to a broadcast,
which can't be handled by late_combine.

For RVV, I think late_combine2 has no chance to combine anything because of
the full expansion.  Do I understand correctly?

After Robin's review, a more general method is needed; the policy is to
prevent vec_duplicate from being expanded to a broadcast.
So more testcases are involved, regardless of cmp.  I'm checking and fixing
the failed testcases now.

Regards,
Demin

From: 钟居哲 
Sent: 2024年7月25日 6:24
To: Artemiy Volkov ; Demin Han 
; Jeff Law 
Cc: gcc ; rdapp.gcc 
Subject: Re: [RISC-V] Combining vfmv and .vv instructions into a .vf instruction

I think Demin is working on it. And Robin is reviewer of this stuff.


juzhe.zh...@rivai.ai

From: Artemiy Volkov
Date: 2024-07-25 01:25
To: juzhe.zh...@rivai.ai; 
demin@starfivetech.com; 
jeffreya...@gmail.com
CC: gcc@gcc.gnu.org
Subject: [RISC-V] Combining vfmv and .vv instructions into a .vf instruction
Hi Juzhe, Demin, Jeff,

This email is intended to continue the discussion started in
https://marc.info/?l=gcc-patches&m=170927452922009&w=2 about combining vfmv.v.f
and vfmxx.vv instructions into the scalar-vector form vfmxx.vf.

There was a mention on that thread of the potential usefulness of the 
late-combine
pass (added last month in
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=792f97b44ffc5e6a967292b3747fd835e99396e7)
in making this transformation. However, when I tried it out with my testcase at
https://godbolt.org/z/o8oPzo7qY, I found it unable to handle these complex
post-split1 patterns for broadcast and vfmacc:

(insn 129 128 130 3 (set (reg:RVVM4SF 168 [ _61 ])
(if_then_else:RVVM4SF (unspec:RVVMF8BI [
(const_vector:RVVMF8BI [
(const_int 1 [0x1]) repeated x16
])
(const_int 16 [0x10])
(const_int 2 [0x2]) repeated x2
(const_int 0 [0])
(reg:SI 66 vl)
(reg:SI 67 vtype)
] UNSPEC_VPREDICATE)
(vec_duplicate:RVVM4SF (mem:SF (reg:SI 143 [ ivtmp.21 ]) [1 
MEM[(float *)_145]+0 S4 A32]))
(unspec:RVVM4SF [
(reg:SI 0 zero)
] UNSPEC_VUNDEF))) "/app/example.c":19:53 4019 
{*pred_broadcastrvvm4sf_zvfh}
 (nil))
[ ... ]
(insn 131 130 34 3 (set (reg:RVVM4SF 139 [ D__lsm.10 ])
(if_then_else:RVVM4SF (unspec:RVVMF8BI [
(const_vector:RVVMF8BI [
(const_int 1 [0x1]) repeated x16
])
(const_int 16 [0x10])
(const_int 2 [0x2]) repeated x2
(const_int 0 [0])
(const_int 7 [0x7])
(reg:SI 66 vl)
(reg:SI 67 vtype)
(reg:SI 69 frm)
] UNSPEC_VPREDICATE)
(plus:RVVM4SF (mult:RVVM4SF (reg/v:RVVM4SF 135 [ row ])
(reg:RVVM4SF 168 [ _61 ]))
(reg:RVVM4SF 139 [ D__lsm.10 ]))
(unspec:RVVM4SF [
(reg:SI 0 zero)
] UNSPEC_VUNDEF))) "/app/example.c":19:36 15007 
{*pred_mul_addrvvm4sf_undef}
 (nil))

I'm no expert on this, but what's stopping us from adding some vector-scalar
split patterns alongside vector-vector ones in autovec.md to fix this? For
instance, the addition of fma4_scalar insn_and_split like this:

diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index d5793ac..bf54d71 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -1229,2 +1229,22 @@

+(define_insn_and_split "fma4_scalar"
+  [(set (match_operand:V_VLSF 0 "register_operand")
+(plus:V_VLSF
+ (mult:V_VLSF
+   (vec_duplicate:V_VLSF (match_operand:SF 1 
"direct_broadcast_operand"))
+   (match_operand:V_VLSF 2 "register_operand"))
+ (match_operand:V_VLSF 3 "register_operand")))]
+  "TARGET_VECTOR && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+  {
+rtx ops[] = {operands[0], operands[1], operands[2], operands[3],
+ operands[0]};
+riscv_vector::emit_vlmax_insn (code_for_pred_mul_scalar (PLUS, mode),
+  riscv_vector::TERNARY_OP_FRM_DYN, ops);
+DONE;
+  }
+  [(set_attr "type" "vector")])
+
;; -

does lead to vfmacc.vf instructions being emitted instead of vfmacc.vv's for the
testcase linked above.

What do you think about this approach to implement this optimization? Am I
missing anything important? Maybe split1 is too early to d