Hello-

There is no shortage of PRs complaining about things that go wrong when the
line_maps data structure in libcpp starts to run into its limits. Being
restricted to a 32-bit location_t to cover all source locations means it has
the following limitations, among others:

    -Column numbers larger than 2^12 are not tracked at all, so diagnostics
     can only refer to the whole line, and things like -Wmisleading-indentation
     stop working with large column numbers. [PR89549]

    -Tokens longer than 32 characters cannot be stored compactly in the line
     map and have to be stored as ad-hoc locations with the associated
     memory and locality overhead.

    -Sufficiently large sources can exhaust the available location_t space.

The current limitations are not encountered too much, but it does happen
often enough, especially in testsuites or large PCH compiles that include,
e.g., every header in a large project.

Currently, the main workaround is to use the option -flarge-source-locations.
This sets the range bits in the line_maps to 0, effectively meaning that
every source code token is stored as an ad-hoc location. This does
significantly increase the number of locations that can be stored, but it
loses all the benefits of the line_maps infrastructure, and it does not help
with the 4096 limit for column numbers.

It seemed to me like there's no way to improve on that other than letting
location_t be 64 bits instead of 32 bits. I thought I would take a look at
what that would entail... the attached 15 (mostly small) patches are the
result. It was unfortunately necessary to touch a lot of different parts of
the compiler, since location_t is used pretty pervasively. This patch set
adds a configure option --enable-large-source-locations that makes
location_t be 64 bits. The option is off by default.

There are decisions to be made about what to do with all those new bits. I
made some choices in this patch, but it's easy to change if different
tradeoffs are preferable. Here is what I did:

    - Only 63 bits are used, the high bit is not used so that two location_t
      can be safely subtracted and stored in an int64_t.

    - As in the 32-bit case, half of the available locations are reserved
      for ad-hoc locations, and the other half for macros and ordinary
      locations.

    - The limit on the maximum column number is removed, so it's now set to
      2^31-1. (GCC right now pervasively assumes line and column numbers
      within a file can be represented by a 32-bit signed int. That could be
      changed too, but it is not part of this patch and would be another
      similarly large change.)

    - The default range bits is changed from 5 to 7. This means that
      tokens up to 128 chars in length get ordinary locations, while longer
      ones get ad-hoc locations. I think that's a big improvement from the
      current max of 32 and will result in a lot fewer ad-hoc locations
      being generated. There are probably enough bits to go around, though,
      that this could be made even larger too.

All that stuff is configured in line-map.h and is easy to change. The bulk
of the patch is dealing with the fallout from making location_t's type
configurable; in particular, C++ modules needed a lot of tweaks, since they
deal with line_map internals, and I had to add a new field to RTL which
necessitated a lot of small changes. I was not previously familiar with
RTL... I think I generally grokked the idea there and I think the changes I
made are safe, but I am pretty sure some of them are unnecessary so would
appreciate feedback on that too.

I tried to test on a variety of platforms. For each of them, I did three
bootstrap + regtest (without the patch, with the patch and without
--enable-large-source-locations, and with the patch and with
--enable-large-source-locations) with --enable-checking=yes,extra,rtl and
verified no change in the testsuite other than the expected new PASSes. Here
is what I have tested so far:

all languages: x86_64, aarch64 (cfarm185), ppc64le (cfarm135)
c/c++ only: sparc 32-bit (cfarm216)

The testing on sparc 32-bit was very productive and in particular identified
a few issues with padding and alignment that have not been apparent until
now. That platform differs from x86 32-bit in that uint64_t has a larger
alignment than a pointer does.

Also it's worth noting, even if there is no interest in supporting this
feature, patches 2, 3, 4, 5, and 6 I think are worth accepting regardless;
these are small bug fixes that came up in testing and are independent of
whether location_t is changed or not... they could be triggered by other
changes in the future too.

Regarding memory usage implications of this change. It does make a lot of
data structures larger. Here is what it does on x86-64. The base sizes of some
types change as follows:

location_t:              From 4 to 8 (+100%)
line_map_ordinary:       From 24 to 32 (+33%)
line_map_macro:          From 32 to 40 (+25%)
cpp_token:               From 24 to 32 (+33%)
cpp_macro:               From 48 to 56 (+17%)
tree_exp:                From 32 to 40 (+25%)
gimple:                  From 40 to 48 (+20%)

The compiled size of stdc++.h.gch increases from 210MB to 219MB (+4%).

Compiling testsuite/g++.dg/modules/xtreme-header.h into a module, I see:

size of the output module: From 21.9MB to 21.6MB (-1%)
peak memory usage: From 211MB to 235MB (+11%)

The reason the module gets smaller is that it streams ordinary locations
more efficiently than adhoc locations, and with the 64-bit location_t, the
number of adhoc locations was reduced from 189,891 to 101,571.

I'm curious to hear your thoughts on this one please. I think it could be
nice potentially to get it into GCC 15, and consider changing the default in
GCC 16 or later, if you think it's worth the trouble. If it seems too
invasive a change and not worth it, I'll certainly understand, and I have
enjoyed the tour through the compiler internals in any case.

FWIW I feel like this issue will need to be dealt with some time or other. I
think users generally expect not to run into arbitrary limits, and compiling
a lot of headers together into one module is going to become possibly more
common as modules become available by default. I think any attempt to deal
with it will have to share at least a lot of the features of this one.

-Lewis

01/15: Support for 64-bit location_t: libcpp parts
02/15: libcpp: Fix potential unaligned access in cpp_buffer
03/15: tree-cfg: Fix call to next_discriminator_for_locus()
04/15: tree-phinodes: Use 4 instead of 2 as the minimum number of phi args
05/15: c++: Fix tree_contains_struct for TRAIT_EXPR
06/15: gimple: Handle tail padding when computing gimple_ops_offset
07/15: Support for 64-bit location_t: toplev parts
08/15: Support for 64-bit location_t: Analyzer parts
09/15: Support for 64-bit location_t: Frontend parts
10/15: Support for 64-bit location_t: C++ modules parts
11/15: Support for 64-bit location_t: RTL parts
12/15: Support for 64-bit location_t: Backend parts
13/15: Support for 64-bit location_t: Internal parts
14/15: Support for 64-bit location_t: Testsuite parts
15/15: Support for 64-bit location_t: Configury parts

64 files changed, 841 insertions(+), 386 deletions(-)

Reply via email to