On Tue, 5 Dec 2023 12:11:46 GMT, Joachim Kern <jk...@openjdk.org> wrote:
>> On AIX, repeated calls to dlopen referring to the same shared library may >> result in different, unique dl handles to be returned from libc. In that it >> differs from typical libc implementations that cache dl handles. >> >> This causes problems in the JVM with code that assumes equality of handles. >> One such problem is in the JVMTI agent handler. That problem was fixed with >> a local fix to said handler >> ([JDK-8315706](https://bugs.openjdk.org/browse/JDK-8315706)). However, this >> fix causes follow-up problems since it assumes that the file name passed to >> `os::dll_load()` is the file that has been opened. It prevents efficient, >> os_aix.cpp-local workarounds for other AIX issues like the *.so/*.a duality. >> See [JDK-8320005](https://bugs.openjdk.org/browse/JDK-8320005). As such, it >> is a hack that causes other, more uglier hacks to follow (see discussion of >> https://github.com/openjdk/jdk/pull/16604). >> >> We propose a different, cleaner way of handling this: >> >> - Handle this entirely inside the AIX versions of os::dll_load and >> os::dll_unload. >> - Cache dl handles; repeated opening of a library should return the cached >> handle. >> - Increase handle-local ref counter on open, Decrease it on close >> - Make sure calls to os::dll_load are matched to os::dll_unload (See >> [JDK-8320830](https://bugs.openjdk.org/browse/JDK-8320830)). >> >> This way we mimic dl handle equality as it is implemented on other >> platforms, and this works for all callers of os::dll_load. > > Joachim Kern has updated the pull request incrementally with one additional > commit since the last revision: > > encapsulate everything in os::Aix::dlopen Excellent, this is how I have pictured a good solution. Very nice. A number of remarks, but nothing fundamental. src/hotspot/os/aix/os_aix.cpp line 1137: > 1135: if (ebuf != nullptr && ebuflen > 0) { > 1136: ::strncpy(ebuf, "dll_load: empty filename specified", ebuflen - > 1); > 1137: } Are there any cases where we don't hand in the error buffer? If so, I would just assert ebuf and ebuflen. No need for this kind of flexibility. src/hotspot/os/aix/os_aix.cpp line 3051: > 3049: > 3050: // Simulate the library search algorithm of dlopen() (in os::dll_load) > 3051: int os::Aix::stat64x_via_LIBPATH(const char* path, struct stat64x* > stat) { - no need to export this, make it filescope static - please return bool, with false = error - please rename it to something like "search_file_in_LIBPATH" src/hotspot/os/aix/os_aix.cpp line 3055: > 3053: return -1; > 3054: > 3055: char *path2 = strdup (path); Please use os::strdup and os::free. If you really intent to use the plain libc versions, use `::strdup` and `::free` to make sure - and indicate to code readers - you use the global libc variants. src/hotspot/os/aix/os_aix.cpp line 3059: > 3057: int idx = strlen(path2) - 1; > 3058: if (path2[idx] == ')') { > 3059: while (path2[idx] != '(' && idx > 0) idx--; ? Why not `strrchr()`? src/hotspot/os/aix/os_aix.cpp line 3067: > 3065: if (path2[0] == '/' || > 3066: (path2[0] == '.' && (path2[1] == '/' || > 3067: (path2[1] == '.' && path2[2] == '/')))) { This complexity is not needed, nor is it sufficient, since it does not handle relative paths ("mydirectory/hallo.so") https://www.ibm.com/docs/en/aix/7.1?topic=d-dlopen-subroutine "If FilePath contains a slash character, FilePath is used directly, and no directories are searched. " So, just scan for a '/' - if you find one, its a path to be opened directly: const bool use_as_filepath = strchr(path2, '/'); src/hotspot/os/aix/os_aix.cpp line 3085: > 3083: strcpy(libpath, env); > 3084: for (token = strtok_r(libpath, ":", &saveptr); token != nullptr; > token = strtok_r(nullptr, ":", &saveptr)) { > 3085: sprintf(combined, "%s/%s", token, path2); You can save a lot of pain and manual labor by using `stringStream` here. stringStream combined; combined.print("%s/%s", token, path2); const char* combined_path_string = combined.base(); no need for manual allocation and byte counting. src/hotspot/os/aix/os_aix.cpp line 3099: > 3097: // filled by os::dll_load(). This way we mimic dl handle equality for a > library > 3098: // opened a second time, as it is implemented on other platforms. > 3099: void* os::Aix::dlopen(const char* filename, int Flags) { Does not need to be exported, nor does os::AIX::dlclose. Make file scope static. See my remarks in os_posix.cpp. src/hotspot/os/aix/os_aix.cpp line 3103: > 3101: struct stat64x libstat; > 3102: > 3103: if (os::Aix::stat64x_via_LIBPATH(filename, &libstat)) { Please return bool, not unix int -1, this hurts my brain :-) src/hotspot/os/aix/os_aix.cpp line 3108: > 3106: if (result != nullptr) { > 3107: assert(false, "dll_load: Could not stat() file %s, but dlopen() > worked; Have to improve stat()", filename); > 3108: } Since this is just assert code, I'd wrap all this stuff in #ifdef ASSERT. No need for needless dlopens otherwise. src/hotspot/os/aix/os_aix.cpp line 3125: > 3123: } > 3124: if (i == g_handletable_used) { > 3125: // library not still loaded. Check if there is space left in array s/still/yet src/hotspot/os/aix/os_aix.cpp line 3131: > 3129: pthread_mutex_unlock(&g_handletable_mutex); > 3130: assert(false, "max_handletable reached"); > 3131: return nullptr; Please, for the sake of release code, hand in an error buffer and fill it with something that makes sense, eg. "too many libraries loaded". The assert is still okay, I guess, since we don't expect it to fire during tests; if it does fire, it may indicate a problem in our handle table logic or a wrong assumption about handle équality. src/hotspot/os/aix/os_aix.cpp line 3133: > 3131: return nullptr; > 3132: } > 3133: // library not still loaded and still place in array, so load > library s/still/yet src/hotspot/os/aix/os_aix.cpp line 3143: > 3141: g_handletable[i].devid = libstat.st_dev; > 3142: g_handletable[i].refcount = 1; > 3143: } Error handling: on error, call dlerror and return error string inside the error buffer you should hand in. All other platforms do this too. src/hotspot/os/aix/os_aix.cpp line 3150: > 3148: } > 3149: > 3150: int os::Aix::dlclose(void* lib) { can we call lib something better, maybe "handle"? src/hotspot/os/aix/os_aix.cpp line 3165: > 3163: // refcount == 0, so we have to ::dlclose() the lib > 3164: // and delete the entry from the array. > 3165: res = ::dlclose(lib); Handle dlclose error. We expect it to work; if it doesn't, it indicates that something is wrong with the handle logic, e.g. an invalid or already closed handle had been handed in. So, assert. src/hotspot/os/aix/os_aix.hpp line 185: > 183: // opened a second time, as it is implemented on other platforms. > 184: static void* dlopen(const char* filename, int Flags); > 185: static int dlclose(void* lib); Remove; should not be exported. src/hotspot/os/posix/os_posix.cpp line 735: > 733: l_path = "<not available>"; > 734: } > 735: int res = AIX_ONLY(os::Aix)::dlclose(lib); Lets do this cleaner, and in the style of hotspot coding elsewhere: - introduce a new function "os::pd_dll_unload(handle, errorbuf, errbuflen)". Add it to os.hpp, but somewhere non-public. The implementations will live in os_aix.cpp, os_bsd.cpp and os_linux.cpp. - make os::Aix::dlclose -> os::pd_dll_unload; the only difference is that you should fill in error buffer with either ::dlerror or, if you have errors in handle table, a text describing that error - on all other posix platforms (os_bsd.cpp + os_linux.cpp), implement a minimal version of os::pd_dll_unload() that calls ::dlunload, and on error calls ::dlerror and copies the string into errbuf - Here, call os::pd_dll_unload instead of ::dlclose/os::aix::dlclose - change the JFR code below to not use ::dlerror but the string returned from the buffer ------------- Changes requested by stuefe (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16920#pullrequestreview-1762354122 PR Review Comment: https://git.openjdk.org/jdk/pull/16920#discussion_r1415568694 PR Review Comment: https://git.openjdk.org/jdk/pull/16920#discussion_r1415577023 PR Review Comment: https://git.openjdk.org/jdk/pull/16920#discussion_r1415588986 PR Review Comment: https://git.openjdk.org/jdk/pull/16920#discussion_r1415577568 PR Review Comment: https://git.openjdk.org/jdk/pull/16920#discussion_r1415585089 PR Review Comment: https://git.openjdk.org/jdk/pull/16920#discussion_r1415594396 PR Review Comment: https://git.openjdk.org/jdk/pull/16920#discussion_r1415625152 PR Review Comment: https://git.openjdk.org/jdk/pull/16920#discussion_r1415595301 PR Review Comment: https://git.openjdk.org/jdk/pull/16920#discussion_r1415596399 PR Review Comment: https://git.openjdk.org/jdk/pull/16920#discussion_r1415597594 PR Review Comment: https://git.openjdk.org/jdk/pull/16920#discussion_r1415599081 PR Review Comment: https://git.openjdk.org/jdk/pull/16920#discussion_r1415601350 PR Review Comment: https://git.openjdk.org/jdk/pull/16920#discussion_r1415607920 PR Review Comment: https://git.openjdk.org/jdk/pull/16920#discussion_r1415612828 PR Review Comment: https://git.openjdk.org/jdk/pull/16920#discussion_r1415619700 PR Review Comment: https://git.openjdk.org/jdk/pull/16920#discussion_r1415625511 PR Review Comment: https://git.openjdk.org/jdk/pull/16920#discussion_r1415638462