Re: [PATCH v2 6/7] Alpha: Add option to avoid data races for sub-longword memory stores [PR117759]

Jeff Law Mon, 06 Jan 2025 20:21:19 -0800



On 1/6/25 6:03 AM, Maciej W. Rozycki wrote:

With non-BWX Alpha implementations we have a problem of data races where
a 8-bit byte or 16-bit word quantity is to be written to memory in that
in those cases we use an unprotected RMW access of a 32-bit longword or
64-bit quadword width.  If contents of the longword or quadword accessed
outside the byte or word to be written are changed midway through by a
concurrent write executing on the same CPU such as by a signal handler
or a parallel write executing on another CPU such as by another thread
or via a shared memory segment, then the concluding write of the RMW
access will clobber them.  This is especially important for the safety
of RCU algorithms, but is otherwise an issue anyway.

To guard against these data races with byte and aligned word quantities
introduce the `-msafe-bwa' command-line option (standing for Safe Byte &
Word Access) that instructs the compiler to instead use an atomic RMW
access sequence where byte and word memory access machine instructions
are not available.  There is no change to code produced for BWX targets.

It would be sufficient for the secondary reload handle to use a pair of
scratch registers, as requested by `reload_out<mode>', but it would end
with poor code produced as one of the scratches would be occupied by
data retrieved and the other one would have to be reloaded with repeated
calculations, all within the LL/SC sequence.

Therefore I chose to add a dedicated `reload_out<mode>_safe_bwa' handler
and ask for more scratches there by defining a 256-bit OI integer mode.
While reload is documented in our manual to support an arbitrary number
of scratches in reality it hasn't been implemented for IRA:

/* ??? It would be useful to be able to handle only two, or more than
    three, operands, but for now we can only handle the case of having
    exactly three: output, input and one temp/scratch.  */

and it seems to be the case for LRA as well.  Do what everyone else does
then and just have one wide multi-register scratch.

I note that the atomic sequences emitted are suboptimal performance-wise
as the looping branch for the unsuccessful completion of the sequence
points backwards, which means it will be predicted as taken despite that
in most cases it will fall through.  I do not see it as a deficiency of
this change proposed as it takes care of recording that the branch is
unlikely to be taken, by calling `alpha_emit_unlikely_jump'.  Therefore
generic code elsewhere shou

Looks like this got truncated. Anyway, easy to forget how limitedbranch prediction was in this era. I haven't pondered static branchprediction in forever. I wouldn't worry too much about this case -- wecan always come back to it if generic doesn't do the right thing withcode layout.


Add test cases accordingly.

There are notable regressions between a plain `-mno-bwx' configuration
and a `-mno-bwx -msafe-bwa' one:

FAIL: gcc.dg/torture/inline-mem-cpy-cmp-1.c   -O0  execution test
FAIL: gcc.dg/torture/inline-mem-cpy-cmp-1.c   -O1  execution test
FAIL: gcc.dg/torture/inline-mem-cpy-cmp-1.c   -O2  execution test
FAIL: gcc.dg/torture/inline-mem-cpy-cmp-1.c   -O3 -g  execution test
FAIL: gcc.dg/torture/inline-mem-cpy-cmp-1.c   -Os  execution test
FAIL: gcc.dg/torture/inline-mem-cpy-cmp-1.c   -O2 -flto -fno-use-linker-plugin 
-flto-partition=none  execution test
FAIL: gcc.dg/torture/inline-mem-cpy-cmp-1.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  execution test
FAIL: g++.dg/init/array25.C  -std=c++17 execution test
FAIL: g++.dg/init/array25.C  -std=c++98 execution test
FAIL: g++.dg/init/array25.C  -std=c++26 execution test

They come from the fact that these test cases play tricks with alignment
and end up calling code that expects a reference to aligned data but is
handed one to unaligned data.

This doesn't cause a visible problem with plain `-mno-bwx' code, because
the resulting alignment exception is fixed up by Linux.  There's no such
handling currently implemented for LDL_L or LDQ_L instructions (which
are first in the sequence) and consequently the offender is issued with
SIGBUS instead.  Suitable handling will be added to Linux to complement
this change, so these regressions are seen as harmless and expected.

        gcc/
        PR target/117759
        * config/alpha/alpha-modes.def (OI): New integer mode.
        * config/alpha/alpha-protos.h (alpha_expand_mov_safe_bwa): New
        prototype.
        * config/alpha/alpha.cc (alpha_expand_mov_safe_bwa): New
        function.
        (alpha_secondary_reload): Handle TARGET_SAFE_BWA.
        * config/alpha/alpha.md (aligned_store_safe_bwa)
        (unaligned_store<mode>_safe_bwa, reload_out<mode>_safe_bwa)
        (reload_out<mode>_unaligned_safe_bwa): New expanders.
        (mov<mode>, movcqi, reload_out<mode>_aligned): Handle
        TARGET_SAFE_BWA.
        (reload_out<mode>): Guard against TARGET_SAFE_BWA.
        * config/alpha/alpha.opt (msafe-bwa): New option.
        * config/alpha/alpha.opt.urls: Regenerate.
        * doc/invoke.texi (Option Summary, DEC Alpha Options): Document
        the new option.

        gcc/testsuite/
        PR target/117759
        * gcc.target/alpha/stb.c: New file.
        * gcc.target/alpha/stb-bwa.c: New file.
        * gcc.target/alpha/stb-bwx.c: New file.
        * gcc.target/alpha/stba.c: New file.
        * gcc.target/alpha/stba-bwa.c: New file.
        * gcc.target/alpha/stba-bwx.c: New file.
        * gcc.target/alpha/stw.c: New file.
        * gcc.target/alpha/stw-bwa.c: New file.
        * gcc.target/alpha/stw-bwx.c: New file.
        * gcc.target/alpha/stwa.c: New file.
        * gcc.target/alpha/stwa-bwa.c: New file.
        * gcc.target/alpha/stwa-bwx.c: New file.
---
  NB I note that there is a warning in gcc/config/alpha/sync.md that it is
unpredictable if the lock_flag is cleared or not by a normal load or store
executed on the same CPU and therefore we need to make sure no register
spill is inserted in the sequence.  I seem not to have seen it actually
happen and testing results with actual hardware do look good.

If it's a real issue in practice and we have to revisit this code, thenwe could look at hard barriers before/after the sequence to preventscheduling into the sequence. Or we could emit the entire sequence asan atomic unit much like some ports do for inlined subword atomics.


  However out of the abundance of caution we may want to make sure it can't
happen.  It should be quite a straightforward change, but owing to the
number of issues encountered, as indicated by the size of the patchset,
and the limited time I did not get to it in time for Stage 1 closure.  So
I've chosen to post this change for review anyway with the intent to make
a suitable update in the coming weeks.  As it only affects newly-added
`-msafe-bwa' option I don't think it will be disruptive to the pre-release
stabilisation process.  I will appreciate input for this part

  NB2 I reckon the manual ought to be updated to say only one scratch is
permitted for secondary reloads.  While it's documented otherwise since
forever, it's never actually matched reality.

ISTM that could be a follow-up.  Consider such an update pre-approved.

OK for the trunk.

jeff

Re: [PATCH v2 6/7] Alpha: Add option to avoid data races for sub-longword memory stores [PR117759]

Reply via email to