On Tue, 3 Jun 2025 14:09:29 GMT, Johannes Bechberger <jbechber...@openjdk.org> wrote:
>> This is the code for the [JEP 509: CPU Time based profiling for >> JFR](https://openjdk.org/jeps/509). >> >> Currently tested using [this test >> suite](https://github.com/parttimenerd/basic-profiler-tests). This runs >> profiles the [Renaissance](https://renaissance.dev/) benchmark with >> - ... different heap sizes >> - ... different GCs >> - ... different samplers (the standard JFR and the new CPU Time Sampler and >> both) >> - ... different JFR recording durations >> - ... different chunk-sizes > > Johannes Bechberger has updated the pull request incrementally with one > additional commit since the last revision: > > Rename autoadapt src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 51: > 49: static bool is_excluded(JavaThread* jt) { > 50: return jt->is_hidden_from_external_view() || > 51: jt->jfr_thread_local()->is_excluded() || These restrictions cause a large blind spot in observability. There is no technical limitation for recording cpu samples for internal threads too, even without a Java stack trace. Consider removing this restriction, although not in this PR. src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 135: > 133: while ((new_lost_samples = Atomic::cmpxchg(&_lost_samples, > lost_samples, 0)) != lost_samples) { > 134: lost_samples = new_lost_samples; > 135: } Why not `Atomic::xchg`? src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 161: > 159: return 0; > 160: } > 161: return os::active_processor_count() * 1000000000.0 / rate; If sampling period is configured as an absolute number in milliseconds, this value must be passed as is. Double conversion via `Runtime.availableProcessors()` / `active_processor_count()` is unobvious and error-prone. First, because of asymmetry: e.g. `Runtime.availableProcessors()` may be redefined by an agent so that its value is not aligned with `active_processor_count()`. Second, because number of available processors may change at runtime, e.g., by adjusting cgroup quotas. src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 198: > 196: virtual void post_run(); > 197: public: > 198: virtual const char* name() const { return "JFR CPU Time Thread > Sampler"; } Thread name is too long and does not sound right. Logically, it is not "Thread Sampler", but rather "Sampler Thread", which also aligns with the existing "JFR Sampler Thread". But I'd simplify it to `JFR CPU Time Sampler` or maybe `JFR CPU Sampler Thread`. src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 202: > 200: void run(); > 201: void on_javathread_create(JavaThread* thread); > 202: bool create_timer_for_thread(JavaThread* thread, timer_t &timerid); Should it be `private`? src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 252: > 250: timer_delete(*timer); > 251: tl->unset_cpu_timer(); > 252: tl->deallocate_cpu_time_jfr_queue(); Either this line is not needed or there is a possible resource leak: if `create_timer_for_thread` fails, queue is allocated but not deallocated. src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 281: > 279: stop_timer(); > 280: Atomic::store(&_stop_signals, true); > 281: while (Atomic::load_acquire(&_active_signal_handlers) > 0) { There can be a race when `handle_timer_signal` has already passed `_stop_signals` check but has not yet incremented `_active_signal_handlers`. src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 308: > 306: > 307: if > (Atomic::load_acquire(&_is_async_processing_of_cpu_time_jfr_requests_triggered)) > { > 308: > Atomic::release_store(&_is_async_processing_of_cpu_time_jfr_requests_triggered, > false); acquire/release seem to be used for no good reason. Also, this could be a single `cmpxchg`. src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 326: > 324: if (jt->thread_state() != _thread_in_native || > !tl->try_acquire_cpu_time_jfr_dequeue_lock()) { > 325: tl->set_do_async_processing_of_cpu_time_jfr_requests(false); > 326: continue; // thread doesn't have a last Java frame or queue is > already being processed This comment may sound confusing, since `has_last_Java_frame` is checked separately below. src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 472: > 470: > 471: void handle_timer_signal(int signo, siginfo_t* info, void* context) { > 472: assert(_instance != nullptr, "invariant"); There can be an arbitrary delay in async signal delivery. It's unlikely, but not impossible for `_instance` to be deleted by the time signal handler is called. There should be a better way to synchronize with JFR shutdown. src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 477: > 475: > 476: > 477: void JfrCPUTimeThreadSampling::handle_timer_signal(siginfo_t* info, > void* context) { It may be a good idea to validate `info->si_code` in order to protect from things like `kill -SIGPROF` after profiling has stopped. For a similar reason, `_sampler->_stop_signals` should default to `true` whenever profiler is not running. src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 509: > 507: JfrCPUTimeTraceQueue& queue = tl->cpu_time_jfr_queue(); > 508: if (!check_state(jt)) { > 509: queue.increment_lost_samples(); nit: wrong indent src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 576: > 574: } > 575: if (timer_create(clock, &sev, &t) < 0) { > 576: log_error(jfr)("Failed to register the signal handler for thread > sampling: %s", os::strerror(os::get_last_error())); If an application has many threads and current RLIMIT_SIGPENDING is low, logs will be flooded with this error message. src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 606: > 604: void JfrCPUTimeThreadSampler::init_timers() { > 605: // install sig handler for sig > 606: PosixSignals::install_generic_signal_handler(SIG, > (void*)::handle_timer_signal); SIGPROF is also used by external profilers. Need to check if SIGPROF handler is already installed and warn user. src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.hpp line 58: > 56: volatile u4 _head; > 57: > 58: volatile s4 _lost_samples; Why signed int? Can it be negative? src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.hpp line 128: > 126: static void send_lost_event(const JfrTicks& time, traceid tid, s4 > lost_samples); > 127: > 128: // Trigger sampling while a thread is not in a safepoint, from a > seperate thread typo: separate src/hotspot/share/jfr/support/jfrThreadLocal.cpp line 558: > 556: void JfrThreadLocal::set_cpu_timer(timer_t* timer) { > 557: if (_cpu_timer == nullptr) { > 558: _cpu_timer = JfrCHeapObj::new_array<timer_t>(1); `timer_t` is a primitive type, at most one machine word. Why extra indirection and allocation? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2124528320 PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2124503100 PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125157311 PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125128723 PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125130332 PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125190998 PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125203700 PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125342289 PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125249171 PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125230422 PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125241099 PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125320255 PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125411074 PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125430231 PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2124507884 PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125333563 PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125183931