https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112513

            Bug ID: 112513
           Summary: Misoptimization of argument
           Product: gcc
           Version: 12.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alexander.gr...@tu-dresden.de
  Target Milestone: ---

Created attachment 56569
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56569&action=edit
Preprocessed source

In the NVIDIA NCCL library (https://github.com/NVIDIA/nccl) I came across a
SIGSEGV in __strncmp_sse42 that happens "sometimes" when compiled with GCC 12
at -O2 and higher, but does not happen at lower optimization levels or with
GCC 11.

The original stacktrace looks like this:
#0  0x00002aaaabd82e3a in __strcmp_sse42 () from /lib64/libc.so.6
#1  0x00002aab18d83a6e in xmlGetAttrIndex (index=<synthetic pointer>,
attrName=0x2aab18e2820c "familyid", node=0x2aae9c108160) at graph/xml.h:67
#2  xmlSetAttrInt (value=143, attrName=0x2aab18e2820c "familyid",
node=0x2aae9c108160) at graph/xml.h:167
#3  ncclTopoGetXmlFromCpu (cpuNode=cpuNode@entry=0x2aae9c108160,
xml=xml@entry=0x2aae9c0d1f20) at graph/xml.cc:436

Moving the `strncmp(key, attrName, MAX_STR_LEN) == 0` out into a separate
function to see the arguments in the debugger shows this backtrace:
#0  0x00002aaaabd83c00 in __strncmp_sse42 () from /lib64/libc.so.6
#1  0x00002aab18d75a9f in cmpFromXml (attrName=0x89300800 <error: Cannot access
memory at address 0x89300800>, key=0x2aaeac107eb0 "numaid") at graph/xml.h:65
#2  xmlGetAttrIndex (index=<synthetic pointer>, attrName=0x89300800 <error:
Cannot access memory at address 0x89300800>, node=0x2aaeac107db0) at
graph/xml.h:73
#3  xmlSetAttrInt (node=node@entry=0x2aaeac107db0,
attrName=attrName@entry=0x89300800 <error: Cannot access memory at address
0x89300800>, value=143) at graph/xml.h:174
#4  0x00002aab18d77de4 in ncclTopoGetXmlFromCpu
(cpuNode=cpuNode@entry=0x2aaeac107db0, xml=xml@entry=0x2aaeac0d1b70) at
graph/xml.cc:437

So it looks like the `attrName` parameter gets corrupted somehow. The callsite
of `xmlSetAttrInt` is `NCCLCHECK(xmlSetAttrInt(cpuNode, "familyid",
familyId));`, so that parameter is a string constant already used earlier by
`NCCLCHECK(xmlGetAttrIndex(cpuNode, "familyid", &index));`

I suspect the `index` parameter to be involved.
Many modifications make the bug disappear, such as removing the `NCCLCHECK`
macro (basically an `if(error) return error;` wrapper) or adding fprintf
statements to xmlGetAttrIndex or cmpFromXml.
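
For readers without the NCCL sources at hand, the call pattern described above
can be sketched roughly as follows. This is a simplified, hypothetical
reconstruction and not the code from graph/xml.h: the struct layout, the
`Result` enum, the `CHECK` macro (standing in for `NCCLCHECK`), the constant
values and the `setFamilyId` caller are all assumptions made for illustration.
It is not expected to reproduce the crash by itself.

```cpp
// Hypothetical, simplified sketch of the lookup/set pattern from the report.
// Names other than cmpFromXml, xmlGetAttrIndex and xmlSetAttrInt are invented.
#include <cassert>
#include <cstdio>
#include <cstring>

#define MAX_STR_LEN 255      // assumed value
#define MAX_ATTR_COUNT 16    // assumed value

enum Result { Success = 0, InternalError = 1 };

// Early-return wrapper in the spirit of NCCLCHECK: propagate the first error.
#define CHECK(call) \
  do { Result res_ = (call); if (res_ != Success) return res_; } while (0)

struct XmlAttr {
  char key[MAX_STR_LEN + 1];
  char value[MAX_STR_LEN + 1];
};

struct XmlNode {
  int nAttrs;
  XmlAttr attrs[MAX_ATTR_COUNT];
};

// Comparison moved into its own function so attrName shows up in a debugger.
bool cmpFromXml(const char* key, const char* attrName) {
  return strncmp(key, attrName, MAX_STR_LEN) == 0;
}

// Stores the position of attrName in *index, or -1 if it is not present.
Result xmlGetAttrIndex(XmlNode* node, const char* attrName, int* index) {
  assert(node && attrName && index);
  *index = -1;
  for (int a = 0; a < node->nAttrs; a++) {
    if (cmpFromXml(node->attrs[a].key, attrName)) {
      *index = a;
      return Success;
    }
  }
  return Success;
}

// Sets attrName to the decimal form of value, appending the key if needed.
Result xmlSetAttrInt(XmlNode* node, const char* attrName, int value) {
  int index;
  CHECK(xmlGetAttrIndex(node, attrName, &index));
  if (index == -1) {
    if (node->nAttrs >= MAX_ATTR_COUNT) return InternalError;
    index = node->nAttrs++;
    strncpy(node->attrs[index].key, attrName, MAX_STR_LEN);
    node->attrs[index].key[MAX_STR_LEN] = '\0';
  }
  snprintf(node->attrs[index].value, MAX_STR_LEN + 1, "%d", value);
  return Success;
}

// Hypothetical caller: the same string literal is passed to both functions.
Result setFamilyId(XmlNode* cpuNode, int familyId) {
  int index;
  CHECK(xmlGetAttrIndex(cpuNode, "familyid", &index));
  if (index == -1) {
    CHECK(xmlSetAttrInt(cpuNode, "familyid", familyId));
  }
  return Success;
}
```

In this shape `index` is a local whose address is passed down through inlined
calls, which matches the `index=<synthetic pointer>` entries in the backtraces.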

The compile command is `g++ -fPIC -fvisibility=hidden -std=c++11 -O2 -g -ggdb3
-c graph/xml.cc`; the preprocessed source is attached. It needs minimization,
but as the crash only happens when the code is compiled into a library used by
a Python package from a script, I don't know how to do that. So I hope there
will be something obvious to someone familiar with the optimizations in GCC.
