Bugzilla Automation <bugzi...@freebsd.org> has asked k...@freebsd.org for maintainer-feedback:
Bug 231402: textproc/kf5-syntax-highlighting: does not build on systems with VLAN interfaces
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=231402

--- Description ---
The kf5-syntax-highlighting build fails with an undefined symbol error on a FreeBSD 11.2 system with at least one VLAN network interface. I know it is odd for the network configuration of the system to affect the build, but it is really what I found after 3 days of debugging.

Here are the error messages:

[94/132] cd /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/.build/data && /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/.build/bin/katehighlightingindexer /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/.build/data/index.katesyntax /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/syntax-highlighting-5.49.0/data/schema/language.xsd /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/.build/data/syntax-data.qrc
FAILED: data/index.katesyntax
cd /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/.build/data && /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/.build/bin/katehighlightingindexer /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/.build/data/index.katesyntax /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/syntax-highlighting-5.49.0/data/schema/language.xsd /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/.build/data/syntax-data.qrc
/usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so: Undefined symbol "_ZN17QNetworkInterfaceC1ERKS_@Qt_5"
ninja: build stopped: subcommand failed.

I guess this is a memory corruption issue in the Qt5 network module, which may pass the kernel a bad pointer and cause the kernel to overwrite data belonging to the runtime linker. The symbol '_ZN17QNetworkInterfaceC1ERKS_' does exist in /usr/local/lib/qt5/libQt5Network.so.5, and /usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so correctly lists libQt5Network.so.5 as its dependency with NEEDED, but the runtime linker rejects the symbol in libQt5Network.so.5 when comparing version tags.

Steps to reproduce the problem:

1. Install FreeBSD 11.2 amd64 and download the ports tree. Whether it is a physical machine or a virtual machine doesn't matter.
2. Create a VLAN network interface. This can be done with the command 'ifconfig vlan3 create vlan 3 vlandev re0', where 're0' is your network interface.
3. Make sure the runtime linker /libexec/ld-elf.so.1 is compiled with the -O2 option. This is the default, so you don't have to do anything in this step unless you don't use binaries distributed by the FreeBSD project.
4. Install the textproc/qt5-xmlpatterns port with portmaster.
5. Build textproc/kf5-syntax-highlighting.

It was tested on FreeBSD 11.2-RELEASE-p3 amd64 with ports revision 479821. I could reproduce it on 3 systems (physical machine, virtual machine, jail on virtual machine), each of which runs on different hardware.

I mentioned qt5-xmlpatterns above because it is an optional dependency of kf5-syntax-highlighting. kf5-syntax-highlighting can be built without problems when qt5-xmlpatterns is not installed, but in that case it also doesn't link to qt5-network. kf5-syntax-highlighting automatically picks up qt5-xmlpatterns during the configure phase, and it is qt5-xmlpatterns that causes kf5-syntax-highlighting to load qt5-network during the build.

The following are the results of my debugging. I haven't found the root cause of the problem, but I think these notes may be useful for further debugging.

I started by checking the symbol tables of both libqgenericbearer.so and libQt5Network.so.5.

$ pkg which /usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so
/usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so was installed by package qt5-network-5.11.1
$ readelf -aW /usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so
Symbol table (.dynsym) contains 140 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
    69: 0000000000000000    21 FUNC    GLOBAL DEFAULT  UND _ZN17QNetworkInterfaceC1ERKS_@Qt_5 (2)

$ pkg which /usr/local/lib/qt5/libQt5Network.so.5
/usr/local/lib/qt5/libQt5Network.so.5 was installed by package qt5-network-5.11.1
$ readelf -aW /usr/local/lib/qt5/libQt5Network.so.5
Symbol table (.dynsym) contains 2161 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
  1245: 00000000000c7790    21 FUNC    GLOBAL DEFAULT   12 _ZN17QNetworkInterfaceC1ERKS_@@Qt_5 (3)

The plugin links to libQt5Network.so.5 properly:

$ ldd /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/.build/bin/katehighlightingindexer
/tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/.build/bin/katehighlightingindexer:
        libQt5XmlPatterns.so.5 => /usr/local/lib/qt5/libQt5XmlPatterns.so.5 (0x800a00000)
        libQt5Network.so.5 => /usr/local/lib/qt5/libQt5Network.so.5 (0x801033000)
        libQt5Core.so.5 => /usr/local/lib/qt5/libQt5Core.so.5 (0x801400000)
        libc++.so.1 => /usr/lib/libc++.so.1 (0x801aec000)
        ...
$ ldd /usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so
/usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so:
        libQt5Network.so.5 => /usr/local/lib/qt5/libQt5Network.so.5 (0x80120c000)
        libQt5Core.so.5 => /usr/local/lib/qt5/libQt5Core.so.5 (0x801600000)
        libc++.so.1 => /usr/lib/libc++.so.1 (0x801cec000)
        ...

But the program which throws the undefined symbol error, katehighlightingindexer, doesn't link to libqgenericbearer.so. It suggests that libqgenericbearer.so is loaded by calling dlopen. I set a breakpoint on dlopen in GDB, and yes, it calls it with:

dlopen("/usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so", RTLD_NODELETE | RTLD_LAZY);

The return value of dlopen is correct.
It is properly loaded, and the hash of the version entry is 363045.

(gdb) b dlopen
Function "dlopen" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (dlopen) pending.
(gdb) r 1 2 3
Starting program: /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/.build/bin/katehighlightingindexer 1 2 3
[New LWP 101325 of process 74133]

Thread 1 hit Breakpoint 1, dlopen (name=0x805415498 "/usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so", mode=4097) at /usr/src/libexec/rtld-elf/rtld.c:3193
warning: Source file is more recent than executable.
3193            return (rtld_dlopen(name, -1, mode));
(gdb) finish
Run till exit from #0 dlopen (name=0x805415498 "/usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so", mode=4097) at /usr/src/libexec/rtld-elf/rtld.c:3193
0x000000080165a731 in ?? () from /usr/local/lib/qt5/libQt5Core.so.5
Value returned is $2 = (void *) 0x80067e000
(gdb) p ((Obj_Entry *)(0x80067e000))->vertab[2]
$3 = {hash = 363045, flags = 0, name = 0x807202678 "Qt_5", file = 0x8072025de "libQt5Network.so.5"}
(gdb) p ((Obj_Entry *)(0x80067e000))->path
$8 = 0x800634f40 "/usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so"

The number '2' seems to come from the '(2)' suffix in the output of readelf. I assume it means the version tag used by the symbol has index 2.

(gdb) b _rtld_bind if $_streq(obj->path, "/usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so") && obj->vertab[2].hash != 363045
Breakpoint 3 at 0x80060f907: file /usr/src/libexec/rtld-elf/rtld.c, line 810.
(gdb) c
Continuing.
[Switching to LWP 101325 of process 74133]

Thread 2 hit Breakpoint 3, _rtld_bind (obj=0x80067e000, reloff=1272) at /usr/src/libexec/rtld-elf/rtld.c:810
810         rlock_acquire(rtld_bind_lock, &lockstate);
(gdb) p obj->vertab[2]
$17 = {hash = 32, flags = 0, name = 0x807202678 "Qt_5", file = 0x8072025de "libQt5Network.so.5"}

The value of the hash field of the version entry has changed from 363045 to 32. The value '32' isn't random.
I always get the same value here.

If you follow the execution of the correct _rtld_bind call, you will find it fails to match the version tag at file /usr/src/libexec/rtld-elf/rtld.c, function matched_symbol, line 4329:

4329        if (obj->vertab[verndx].hash != req->ventry->hash ||
4330            strcmp(obj->vertab[verndx].name, req->ventry->name)) {
4331            /*
4332             * Version does not match. Look if this is a
4333             * global symbol and if it is not hidden. If
4334             * global symbol (verndx < 2) is available,
4335             * use it. Do not return symbol if we are
4336             * called by dlvsym, because dlvsym looks for
4337             * a specific version and default one is not
4338             * what dlvsym wants.
4339             */
4340            if ((req->flags & SYMLOOK_DLSYM) ||
4341                (verndx >= VER_NDX_GIVEN) ||
4342                (obj->versyms[symnum] & VER_NDX_HIDDEN))
4343                return (false);
4344        }

verndx is 2, and req->ventry->hash is 363045. If obj->vertab[2].hash hasn't been modified, the runtime linker will pick this symbol and the execution can continue.

I tried to set a hardware watchpoint on obj->vertab[2].hash in GDB, but the watchpoint never hit. I also tried to set a software watchpoint on the same address, and the result wasn't always the same. Most of the time it ran forever and I interrupted it after a few minutes, but sometimes it stopped at instructions which should not modify the memory, such as 'mov r15,QWORD PTR fs:0x10' and 'mov r15,rdi'. Therefore, I thought the hash value was modified by the kernel, but the 'catch syscall' command in GDB didn't seem to work for me. GDB kept printing 'Thread 2 received signal SIGSYS, Bad system call.' and made the program behave abnormally.
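
As a cross-check on the numbers above, the value 363045 is what the classic System V ELF hash function produces for the version name "Qt_5", which is the function the ELF symbol-versioning format specifies for its version hash fields. The following is a minimal sketch written from the generic ELF description, not copied from rtld-elf; the function name sysv_elf_hash is made up for this illustration.

```c
#include <stdint.h>

/*
 * Classic System V ELF hash of a NUL-terminated string.
 * Illustrative re-implementation (hypothetical name), not rtld's code.
 */
static uint32_t
sysv_elf_hash(const char *name)
{
	const unsigned char *p = (const unsigned char *)name;
	uint32_t h = 0, g;

	while (*p != '\0') {
		h = (h << 4) + *p++;	/* shift in the next byte */
		g = h & 0xf0000000u;	/* bits that overflowed the low 28 */
		if (g != 0)
			h ^= g >> 24;	/* fold them back into the low bits */
		h &= ~g;		/* and clear them from the top */
	}
	return h;
}
```

Calling sysv_elf_hash("Qt_5") yields 0x58a25, that is 363045, the same value GDB shows in vertab[2].hash right after dlopen returns. The later value 32 is therefore not a hash of the unchanged name string, which fits the observation that the field was overwritten from outside.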
I decided to use DTrace to track the hash value changes for me:

# dtrace -n 'syscall:::entry, syscall:::return /pid == 99608/ { printf("%s %u ==> %x %x %x %x", probefunc, *(unsigned int *)copyin(0x801242230, 4), arg0, arg1, arg2, arg3); }'
dtrace: description 'syscall:::entry, syscall:::return ' matched 2168 probes
CPU     ID                    FUNCTION:NAME
  1  80243                      ioctl:entry ioctl 363045 ==> 8 c0306938 7fffdfffd770 0
  1  80244                     ioctl:return ioctl 32 ==> 0 0 0 0

0x801242230 was the address of the hash variable obtained from GDB. It seems it was an 'ioctl(8, SIOCGIFMEDIA, 0x7fffdfffd730)' call that changed the value. 8 was a socket file descriptor created by calling 'socket(PF_INET, SOCK_DGRAM | SOCK_CLOEXEC, 0)'. 0x7fffdfffd730 looked like a pointer on the stack, as 'procstat -v' said this region grew down.

I stopped debugging here and temporarily removed the VLAN interface with 'ifconfig vlan3 destroy' to let portmaster upgrade kf5-syntax-highlighting and hundreds of other ports for me. The conclusion is that I probably have to read the code of qt5-network in order to figure out what really happens.

I found 3 ways to work around the problem on affected systems:

1. Remove all VLAN interfaces, which may not be possible if your networking environment requires them.
2. Use Clang 6 shipped with FreeBSD base to recompile /libexec/ld-elf.so.1 with -O1, -O0, or -DDEBUG.
3. Use GCC 8 from ports to recompile /libexec/ld-elf.so.1 with -O0. Using -O1 or -DDEBUG doesn't help when using GCC.

In fact, I didn't replace /libexec/ld-elf.so.1 on the system because it is risky. I did the test by either running the compiled ld-elf.so.1 under /usr/src/libexec/rtld-elf directly as an executable or modifying the interpreter path stored in the katehighlightingindexer executable with the 'patchelf --set-interpreter' command.
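
For anyone continuing from the DTrace output above, the request word c0306938 can be split into its fields by hand. The sketch below follows the bit layout of the _IOC() macro in FreeBSD's <sys/ioccom.h> (direction in the top bits, a 13-bit parameter length, then group and command number); the struct and function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Decoded pieces of a BSD ioctl request word (illustrative type). */
struct ioc_fields {
	int		copy_in;	/* IOC_IN: argument is copied into the kernel */
	int		copy_out;	/* IOC_OUT: argument is copied back to userland */
	unsigned	len;		/* size of the argument structure in bytes */
	char		group;		/* ioctl group letter */
	unsigned	num;		/* command number within the group */
};

/* Split a request word according to the _IOC() bit layout. */
static struct ioc_fields
decode_ioctl_request(uint32_t req)
{
	struct ioc_fields f;

	f.copy_in = (req & 0x80000000u) != 0;	/* IOC_IN */
	f.copy_out = (req & 0x40000000u) != 0;	/* IOC_OUT */
	f.len = (req >> 16) & 0x1fffu;		/* IOCPARM_MASK */
	f.group = (char)((req >> 8) & 0xffu);
	f.num = req & 0xffu;
	return f;
}
```

For 0xc0306938 this gives an in/out request in group 'i', command number 56, with a 48-byte argument, which is consistent with reading it as SIOCGIFMEDIA. Note that this describes only the top-level argument copied at the pointer passed to ioctl; if that structure itself contains a pointer, the kernel may write through it separately, and that would not show up in this decoding.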