date:20200810

RFC: -fno-share-inlines

2020-08-10 Thread Allan Sandfeld Jensen

Following the previous discussion, this is a proposal for a patch that adds 
the flag -fno-share-inlines that can be used when compiling individual source 
files with a different set of flags than the rest of the project.

It basically turns off comdat for inline functions, as if you compiled without 
support for 'weak' symbols, turning them all into "static" functions, even if 
that wouldn't normally be possible for that type of function. I am not sure 
whether it breaks anything, which is why I am not sending it to the patch list.

As an alternative, I also considered turning the comdat generation off later, 
during assembler output, so that all processing and optimization of comdat 
functions would occur as normal.
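
As a rough illustration of the distinction the flag targets (a sketch only; the function names are hypothetical and none of this is part of the patch): an ordinary inline function has vague linkage, so each object file that emits an out-of-line copy places it in a COMDAT group and the linker keeps a single copy, whichever flags that copy happened to be compiled with. A static inline function has internal linkage, so every translation unit keeps its own copy. The proposed flag would make the first kind behave like the second.

```cpp
// Hypothetical helpers as they might appear in a header shared by several
// source files. Both names are made up for this illustration.

// Vague linkage: an out-of-line copy is emitted into a COMDAT group and the
// linker keeps one copy for the whole program.
inline int scale_shared(int x) { return x * 3; }

// Internal linkage: every translation unit that includes the header keeps
// its own private copy, compiled with that unit's own flags.
static inline int scale_private(int x) { return x * 3; }
```

Both functions compute the same result; only the linkage of any out-of-line copy differs, which matters when one source file is built with, say, different -march flags than the rest of the project.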

Best regards
Allan

diff --git a/gcc/c-family/c.opt b/gcc/c-family/c.opt
index 2b1aca16eb4..78e1f592126 100644
--- a/gcc/c-family/c.opt
+++ b/gcc/c-family/c.opt
@@ -1803,6 +1803,10 @@ frtti
 C++ ObjC++ Optimization Var(flag_rtti) Init(1)
 Generate run time type descriptor information.

+fshare-inlines
+C C++ ObjC ObjC++ Var(flag_share_inlines) Init(1)
+Emit non-inlined inlined declared functions to be shared between object files.
+
 fshort-enums
 C ObjC C++ ObjC++ LTO Optimization Var(flag_short_enums)
 Use the narrowest integer type possible for enumeration types.
diff --git a/gcc/cp/decl2.c b/gcc/cp/decl2.c
index 33c83773d33..8de796d16fc 100644
--- a/gcc/cp/decl2.c
+++ b/gcc/cp/decl2.c
@@ -1957,10 +1957,9 @@ adjust_var_decl_tls_model (tree decl)
 void
 comdat_linkage (tree decl)
 {
-  if (flag_weak)
-    make_decl_one_only (decl, cxx_comdat_group (decl));
-  else if (TREE_CODE (decl) == FUNCTION_DECL
-	   || (VAR_P (decl) && DECL_ARTIFICIAL (decl)))
+  if ((!flag_share_inlines || !flag_weak)
+      && (TREE_CODE (decl) == FUNCTION_DECL
+	  || (VAR_P (decl) && DECL_ARTIFICIAL (decl))))
     /* We can just emit function and compiler-generated variables
        statically; having multiple copies is (for the most part) only
        a waste of space.
@@ -1978,6 +1977,8 @@ comdat_linkage (tree decl)
        should perform a string comparison, rather than an address
        comparison.  */
     TREE_PUBLIC (decl) = 0;
+  else if (flag_weak)
+    make_decl_one_only (decl, cxx_comdat_group (decl));
   else
     {
       /* Static data member template instantiations, however, cannot
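
The reordering of the branches in comdat_linkage can be modelled in isolation. The following is a simplified sketch, not GCC code: the `Flags` struct, the `Emit` enum and both function names are hypothetical, and the single boolean parameter stands in for the FUNCTION_DECL / artificial VAR_DECL test in the patch.

```cpp
// Simplified model of the branch order in comdat_linkage(); not GCC code.
enum class Emit { Static, Comdat, Other };

struct Flags {
    bool weak;           // stands in for flag_weak
    bool share_inlines;  // stands in for flag_share_inlines
};

// is_fn_or_artificial_var stands in for the test
// TREE_CODE (decl) == FUNCTION_DECL || (VAR_P (decl) && DECL_ARTIFICIAL (decl)).
Emit before_patch(Flags f, bool is_fn_or_artificial_var) {
    if (f.weak) return Emit::Comdat;
    if (is_fn_or_artificial_var) return Emit::Static;
    return Emit::Other;
}

Emit after_patch(Flags f, bool is_fn_or_artificial_var) {
    if ((!f.share_inlines || !f.weak) && is_fn_or_artificial_var)
        return Emit::Static;
    if (f.weak) return Emit::Comdat;
    return Emit::Other;
}
```

In this model the two versions agree whenever share_inlines is true (the default), and with -fno-share-inlines only functions and compiler-generated variables change from comdat to static; other declarations still take the comdat path when weak symbols are available.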

Re: 10-12% performance decrease in benchmark going from GCC8 to GCC9

2020-08-10 Thread Jonathan Wakely via Gcc

Hi Matt,

The best thing to do here is file a bug report with the code to reproduce it:
https://gcc.gnu.org/bugzill

Thanks

On Sat, 8 Aug 2020 at 23:01, Soul Studios wrote:
>
> Hi all,
> recently have been working on a new version of the plf::colony container
> (plflib.org) and found GCC9 was giving 10-12% worse performance on a
> given benchmark than GCC8.
>
> Previous versions of the colony container did not experience this
> performance loss going from GCC8 to GCC9.
> However Clang 6 and MSVC2019 show no performance loss going from the old
> colony version to the new version.
>
> The effect is repeatable across architectures - I've tested on xubuntu,
> windows running nuwen mingw, and on Core2 and Haswell CPUs, with and
> without -march=native specified.
>
> Compiler flags are: -O2;-march=native;-std=c++17
>
> Code is attached with an absolute minimum use-case - other benchmarks
> have not shown such strong performance differences - including both
> simpler and more complex tests.
> So I cannot reduce further, please do not ask me to do so.
>
> The benchmark in question inserts into a container initially then
> iterates over container elements repeatedly, randomly erasing and/or
> inserting new elements.
>
>
> In addition I've attached the assembly output under both GCC8 and GCC9.
> In this case I have output from 8.2 and 9.2 respectively, but the same
> effects apply to 8.4 and 9.3. The output for 8 is a lot larger than 9,
> wondering if there's more unrolling occurring.
>
> Any questions let me know. I will help where I can, but my knowledge of
> assembly is limited. If supplying the older version of colony is useful
> I'm happy to do so.
>
> Nanotimer is a ~nanosecond-precision sub-timeslice cross-platform timer.
> Colony is a bucket-array-like unordered sequence container.
> Thanks,
> Matt
>
>
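
The access pattern described in the quoted message (initial insertion, then repeated iteration with random erasure and insertion) might be sketched as follows. This is not the attached benchmark: std::list is only a stand-in for plf::colony, std::chrono replaces the nanotimer, and the function name, result struct, erase/insert percentages and seed handling are all assumptions made for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <list>
#include <random>

// Result of one run; the checksum keeps the traversal from being optimized out.
struct ChurnResult {
    std::size_t final_size;
    long long checksum;
    double seconds;
};

// Sketch of the described pattern. std::list stands in for plf::colony and
// every parameter here is an arbitrary assumption.
ChurnResult run_churn(std::size_t initial, int passes, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> percent(0, 99);
    std::list<int> container;

    const auto start = std::chrono::steady_clock::now();

    // Phase 1: initial insertion.
    for (std::size_t i = 0; i < initial; ++i)
        container.push_back(static_cast<int>(i));

    // Phase 2: repeated iteration with random erasure and insertion.
    long long checksum = 0;
    for (int p = 0; p < passes; ++p) {
        for (auto it = container.begin(); it != container.end();) {
            checksum += *it;
            const int roll = percent(rng);
            if (roll < 5) {
                it = container.erase(it);  // erase roughly 5% of visited elements
            } else {
                if (roll < 10)
                    container.insert(it, roll);  // insert before roughly 5% of them
                ++it;
            }
        }
    }

    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return {container.size(), checksum, elapsed.count()};
}
```

Running the same function built with each compiler version at -O2 and comparing the reported seconds would give a first indication of whether a regression shows up with a standard container as well.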

Re: 10-12% performance decrease in benchmark going from GCC8 to GCC9

2020-08-10 Thread Bill Schmidt via Gcc




On 8/10/20 3:30 AM, Jonathan Wakely via Gcc wrote:
> Hi Matt,
>
> The best thing to do here is file a bug report with the code to reproduce it:
> https://gcc.gnu.org/bugzill
>
> Thanks



Also, be sure to follow the instructions at https://gcc.gnu.org/bugs/.

Bill




Almost an order of magnitude faster __udivmodti4() for AMD64

2020-08-10 Thread Stefan Kanthak

Hi @ll,

I don't use GCC, so I don't know whether there's a benchmark
for __udivmodti4() and/or __udivmoddi4() for AMD64 and i386
processors.

If you have one: get my "slow" __udivmodti4() from

and run the benchmark, then my fast __udivmodti4() from

and repeat.
The "slow" __udivmodti4() should be slightly faster than your
current implementation for AMD64, while the fast one almost
an order of magnitude...

shows my numbers.

And while you're there, also benchmark __udivmoddi4() from
,
__umoddi3() from
,
__moddi3() from
,
as well as (after trivial editing) __udivdi3() from

and __divdi3() from


regards
Stefan
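
For readers without an existing benchmark, a minimal timing harness could look like the sketch below. It is an assumption-laden illustration, not the benchmark referred to in the message: it relies on the `unsigned __int128` extension of GCC and Clang on 64-bit targets, it reaches the runtime routine only indirectly through the `/` and `%` operators (which, as far as I know, GCC lowers to libgcc helpers built on the same division routine), and the names `u128`, `make_u128`, `check_div128` and `bench_div128` are made up here.

```cpp
#include <chrono>
#include <cstdint>
#include <random>

__extension__ typedef unsigned __int128 u128;  // GCC/Clang extension, 64-bit targets

inline u128 make_u128(std::uint64_t hi, std::uint64_t lo) {
    return (static_cast<u128>(hi) << 64) | lo;
}

// Sanity check for one division: n == q * d + r with r < d (d must be nonzero).
inline bool check_div128(u128 n, u128 d) {
    const u128 q = n / d;
    const u128 r = n % d;
    return r < d && q * d + r == n;
}

struct DivBench {
    std::uint64_t checksum;  // keeps the divisions from being optimized away
    double seconds;          // includes the cost of generating operands
};

DivBench bench_div128(int iterations, unsigned seed) {
    std::mt19937_64 rng(seed);
    std::uint64_t checksum = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        const std::uint64_t n_hi = rng();
        const std::uint64_t n_lo = rng();
        const std::uint64_t shift = rng() % 64;  // vary the divisor's magnitude
        const std::uint64_t d_hi = rng() >> shift;
        const std::uint64_t d_lo = rng() | 1;    // divisor is never zero
        const u128 n = make_u128(n_hi, n_lo);
        const u128 d = make_u128(d_hi, d_lo);
        const u128 q = n / d;  // 128-bit division goes through the runtime library
        const u128 r = n % d;
        checksum ^= static_cast<std::uint64_t>(q) ^ static_cast<std::uint64_t>(r);
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return {checksum, elapsed.count()};
}
```

A more careful benchmark would pre-generate the operands so that only the divisions are timed, and would link the candidate routine in place of the library one before repeating the measurement.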

Re: Changes to allow PowerPC to change the long double type to use the IEEE 128-bit floating point format

2020-08-10 Thread Michael Meissner via Gcc

On Sat, Aug 08, 2020 at 03:33:51PM +0200, Thomas König wrote:
> Hi Michael,
> 
> I have shortened the distribution list somewhat for the Fortran-relevant
> parts.
> 
> >I want to discuss changes that I think we need to make across the open source
> >toolchain to allow us to change the long double type on PowerPC hardware from
> >using the IBM extended double (i.e. a pair of doubles) to the IEEE 128-bit
> >format defined in IEEE 754.
> >
> >I wasn't sure whom to address this to, so I took a scatter shot approach.  I
> >likely missed a few people, and some people were added that may not need to
> >participate in the discussion.  Sorry for either not including you initially
> >or for including you by mistake.
> >
> >I added people from the following areas:
> >
> > PowerPC folk
> >
> > Language maintainers: At the moment, only the C/C++ front ends have
> > code to support both 128-bit floating point types.  The other languages
> > use just the defaults provided by the machine maintainers.  However, it
> > may be we will need to think about rules for code being compiled and
> > linked with a different long double format.
> 
> Currently, we support the IBM format with gfortran.  So, we have to
> look at a) library code called from user routines, and b) user code
> compiled with one version calling another.

I have patches I'm working through that allow the whole toolchain to be built
with the new default, and I have run through the C, C++, and Fortran test
suites with a new version of glibc.  In fact there were 2-3 tests that
traditionally fail with IBM extended double that now pass.

IMHO, changing the default is only appropriate for times like a distribution
major number changes, where backwards and forwards compatibility is carefully
controlled.  But before we can contemplate doing this, we need the ability to
change the default and have it all work together.

In this case, there were no modifications to the gfortran sources.  It is all
controlled from the rs6000 backend gcc and libgcc machine specific functions.
But there is no backwards compatibility if the user used explicit long double
(in C/C++) or real*16 (gfortran).

One of the keys is changing the names of the built-in functions.  Glibc in
math.h changes all of the long double functions it declares to be either the
IBM extended double or IEEE 128-bit versions of the functions.  Similarly
Libstdc++ is in the middle of doing the changes for that as well.

In addition, the GNU compiler will change the external names of the built-in
functions to be the IEEE 128-bit names.  This allows for C/C++ users to not use
math.h and still get the right function called (there were a few tests in the
test suite that had to be fixed).  It also allows Gfortran to use these
functions by default.

> 
> As for a), this is something that can be done using the right m4
> macros. We might even, with some hackery, be able to provide two
> versions of the functions with the new library.

Glibc and Libstdc++ are doing this right now.  Tulio and Carlos can speak for
glibc, and Jonathan can speak for libstdc++'s efforts.  Generally in a mixed
library approach, you provide both names, and you do not use long double,
instead you use the explicit types (__ibm128 and __float128) for the two
formats.  Typically they use the -mno-gnu-attributes option which says to
disable the gnu attributes that we use to mark which type of long double is
used, so that it can be linked with either ABI.
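
Which of the two formats `long double` maps to in a given build can be checked from its mantissa width, which is 106 bits for IBM extended double and 113 bits for IEEE 128-bit. The sketch below is only an illustration of that check; the enum and function names are hypothetical, and the extra cases merely cover common non-PowerPC targets.

```cpp
#include <limits>

// Hypothetical classification of the long double format by mantissa width.
enum class LongDoubleFormat {
    IeeeBinary128,      // 113-bit mantissa
    IbmExtendedDouble,  // 106-bit mantissa (pair of doubles)
    X87Extended,        // 64-bit mantissa (x86 80-bit format)
    SameAsDouble,       // 53-bit mantissa
    Unknown
};

constexpr LongDoubleFormat classify_mantissa(int mantissa_bits) {
    return mantissa_bits == 113 ? LongDoubleFormat::IeeeBinary128
         : mantissa_bits == 106 ? LongDoubleFormat::IbmExtendedDouble
         : mantissa_bits == 64  ? LongDoubleFormat::X87Extended
         : mantissa_bits == 53  ? LongDoubleFormat::SameAsDouble
                                : LongDoubleFormat::Unknown;
}

// Format used by `long double` in the translation unit being compiled.
constexpr LongDoubleFormat long_double_format() {
    return classify_mantissa(std::numeric_limits<long double>::digits);
}
```

Code that must link against either ABI would avoid `long double` altogether and name the explicit types, as described above; a check like this is only useful for diagnosing which default a particular compiler invocation uses.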

> For b), I do not have a clear solution for Fortran. But I also have
> no idea how this is supposed to work in other languages,
> when a user compiles something with "long double"
> with gcc 10 and links it against "long double" with gcc 11 -
> what are your plans for that?
> 
> It would be possible to annotate every function that uses long double
> with the new format somehow (ugh), or you could make a clear ABI break.
> This would mean a new, incompatible version of libgfortran, but we would
> have to restrict that to POWER (would that be possible?).
> We cannot impose an ABI change on everybody else for this.

We do annotate each function that has long double arguments or returns long
double values already with gnu attributes.  There are some issues with it,
and I want to delve into it deeper.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meiss...@linux.ibm.com, phone: +1 (978) 899-4797

Problem cropping up in Value Range Propagation

2020-08-10 Thread Gary Oblock via Gcc

I'm trying to debug a problem cropping up in value range propagation.
Ironically, I probably own an original 1995 copy of the paper it's
based on, but that's not going to be much help since I'm lost in the
weeds.  VRP is running on GIMPLE statements generated by an optimization
(my structure reorg optimization).

Here's the GIMPLE dump:

Function max_of_y (max_of_y, funcdef_no=1, decl_uid=4391, cgraph_uid=2, 
symbol_order=20) (executed once)

max_of_y (unsigned long data, size_t len)
{
  double value;
  double result;
  size_t i;

   [local count: 118111600]:
  field_arry_addr_14 = _reorg_base_var_type_t.y;
  index_15 = (sizetype) data_27(D);
  offset_16 = index_15 * 8;
  field_addr_17 = field_arry_addr_14 + offset_16;
  field_val_temp_13 = MEM  [(void *)field_addr_17];
  result_8 = field_val_temp_13;
  goto ; [100.00%]

   [local count: 955630225]:
  _1 = i_3 * 16;
  PPI_rhs1_cast_18 = (unsigned long) data_27(D);
  PPI_rhs2_cast_19 = (unsigned long) _1;
  PtrPlusInt_Adj_20 = PPI_rhs2_cast_19 / 16;
  PtrPlusInt_21 = PPI_rhs1_cast_18 + PtrPlusInt_Adj_20;
  dedangled_27 = (unsigned long) PtrPlusInt_21;
  field_arry_addr_23 = _reorg_base_var_type_t.y;
  index_24 = (sizetype) dedangled_27;
  offset_25 = index_24 * 8;
  field_addr_26 = field_arry_addr_23 + offset_25;
  field_val_temp_22 = MEM  [(void *)field_addr_26];
  value_11 = field_val_temp_22;
  if (result_5 < value_11)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 477815112]:

   [local count: 955630225]:
  # result_4 = PHI 
  i_12 = i_3 + 1;

   [local count: 1073741824]:
  # i_3 = PHI <1(2), i_12(5)>
  # result_5 = PHI 
  if (i_3 < len_9(D))
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 118111600]:
  # result_10 = PHI 
  return result_10;
}
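
For orientation, the source function before the structure reorganization presumably looked something like the sketch below. This is a guess reconstructed from the dump, not the actual test case: the struct name and its `x` field are hypothetical, chosen so that two doubles give the 16-byte stride seen in `i_3 * 16` and `y` is the 8-byte field read through `_reorg_base_var_type_t.y`.

```cpp
#include <cstddef>

// Hypothetical element type; only the y field is visible in the dump.
struct type_t {
    double x;
    double y;
};

// Returns the largest y among data[0..len-1]; assumes len >= 1.
double max_of_y(const type_t* data, std::size_t len) {
    double result = data[0].y;                // load in the entry block
    for (std::size_t i = 1; i < len; ++i) {   // i_3 = PHI <1(2), i_12(5)>
        double value = data[i].y;
        if (result < value)
            result = value;
    }
    return result;
}
```

After the reorganization, `data` is no longer a pointer but an unsigned long index into a per-field array, which is why the dump converts it to sizetype and multiplies by 8.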

The failure in VRP is occurring on

offset_16 = data_27(D) * 8;

which results from two adjacent statements above

  index_15 = (sizetype) data_27(D);
  offset_16 = index_15 * 8;

being merged together.

Note, the types of index_15 and offset_16 are sizetype and data_27 is
unsigned long.
The error message is:

internal compiler error: tree check: expected class ‘type’, have ‘exceptional’ 
(error_mark) in to_wide,

Things only start to look broken in value_range::lower_bound in
value-range.cc when

return wi::to_wide (t);

is passed error_mark_node in t. It's getting it from m_min just above.
My observation is that m_min is not always error_mark_node. In fact, I
seem to think you need to use set_varying to get this to even happen.

Note, the ssa_propagation_engine processed the statement "offset_16 =
data..."  multiple times before failing on it. What oh what is
happening and how in the heck did I cause it???

Please, somebody throw me a life preserver on this.

Thanks,

Gary


