On Tue, 3 Jun 2025 14:09:29 GMT, Johannes Bechberger <jbechber...@openjdk.org> 
wrote:

>> This is the code for the [JEP 509: CPU Time based profiling for 
>> JFR](https://openjdk.org/jeps/509).
>> 
>> Currently tested using [this test 
>> suite](https://github.com/parttimenerd/basic-profiler-tests). This runs 
>> profiles the [Renaissance](https://renaissance.dev/) benchmark with
>> - ... different heap sizes
>> - ... different GCs
>> - ... different samplers (the standard JFR and the new CPU Time Sampler and 
>> both)
>> - ... different JFR recording durations
>> - ... different chunk-sizes
>
> Johannes Bechberger has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Rename autoadapt

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 51:

> 49: static bool is_excluded(JavaThread* jt) {
> 50:   return jt->is_hidden_from_external_view() ||
> 51:          jt->jfr_thread_local()->is_excluded() ||

These restrictions cause a large blind spot in observability. There is no 
technical limitation for recording cpu samples for internal threads too, even 
without a Java stack trace. Consider removing this restriction, although not in 
this PR.

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 135:

> 133:   while ((new_lost_samples = Atomic::cmpxchg(&_lost_samples, 
> lost_samples, 0)) != lost_samples) {
> 134:     lost_samples = new_lost_samples;
> 135:   }

Why not `Atomic::xchg`?

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 161:

> 159:     return 0;
> 160:   }
> 161:   return os::active_processor_count() * 1000000000.0 / rate;

If sampling period is configured as an absolute number in milliseconds, this 
value must be passed as is.
Double conversion via `Runtime.availableProcessors()` / 
`active_processor_count()` is unobvious and error-prone. First, because of 
asymmetry: e.g. `Runtime.availableProcessors()` may be redefined by an agent so 
that its value is not aligned with `active_processor_count()`. Second, because 
number of available processors may change at runtime, e.g., by adjusting cgroup 
quotas.

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 198:

> 196:   virtual void post_run();
> 197: public:
> 198:   virtual const char* name() const { return "JFR CPU Time Thread 
> Sampler"; }

Thread name is too long and does not sound right. Logically, it is not "Thread 
Sampler", but rather "Sampler Thread", which also aligns with the existing "JFR 
Sampler Thread". But I'd simplify it to `JFR CPU Time Sampler` or maybe `JFR 
CPU Sampler Thread`.

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 202:

> 200:   void run();
> 201:   void on_javathread_create(JavaThread* thread);
> 202:   bool create_timer_for_thread(JavaThread* thread, timer_t &timerid);

Should it be `private`?

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 252:

> 250:   timer_delete(*timer);
> 251:   tl->unset_cpu_timer();
> 252:   tl->deallocate_cpu_time_jfr_queue();

Either this line is not needed or there is a possible resource leak: if 
`create_timer_for_thread` fails, queue is allocated but not deallocated.

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 281:

> 279:     stop_timer();
> 280:     Atomic::store(&_stop_signals, true);
> 281:     while (Atomic::load_acquire(&_active_signal_handlers) > 0) {

There can be a race when `handle_timer_signal` has already passed 
`_stop_signals` check but has not yet incremented `_active_signal_handlers`.

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 308:

> 306: 
> 307:     if 
> (Atomic::load_acquire(&_is_async_processing_of_cpu_time_jfr_requests_triggered))
>  {
> 308:       
> Atomic::release_store(&_is_async_processing_of_cpu_time_jfr_requests_triggered,
>  false);

acquire/release seem to be used for no good reason. Also, this could be a 
single `cmpxchg`.

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 326:

> 324:       if (jt->thread_state() != _thread_in_native || 
> !tl->try_acquire_cpu_time_jfr_dequeue_lock()) {
> 325:         tl->set_do_async_processing_of_cpu_time_jfr_requests(false);
> 326:         continue; // thread doesn't have a last Java frame or queue is 
> already being processed

This comment may sound confusing, since `has_last_Java_frame` is checked 
separately below.

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 472:

> 470: 
> 471: void handle_timer_signal(int signo, siginfo_t* info, void* context) {
> 472:   assert(_instance != nullptr, "invariant");

There can be an arbitrary delay in async signal delivery.
It's unlikely, but not impossible for `_instance` to be deleted by the time 
signal handler is called. There should be a better way to synchronize with JFR 
shutdown.

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 477:

> 475: 
> 476: 
> 477: void JfrCPUTimeThreadSampling::handle_timer_signal(siginfo_t* info, 
> void* context) {

It may be a good idea to validate `info->si_code` in order to protect from 
things like `kill -SIGPROF` after profiling has stopped. For a similar reason, 
`_sampler->_stop_signals` should default to `true` whenever profiler is not 
running.

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 509:

> 507:   JfrCPUTimeTraceQueue& queue = tl->cpu_time_jfr_queue();
> 508:   if (!check_state(jt)) {
> 509:       queue.increment_lost_samples();

nit: wrong indent

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 576:

> 574:   }
> 575:   if (timer_create(clock, &sev, &t) < 0) {
> 576:     log_error(jfr)("Failed to register the signal handler for thread 
> sampling: %s", os::strerror(os::get_last_error()));

If an application has many threads and current RLIMIT_SIGPENDING is low, logs 
will be flooded with this error message.

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 606:

> 604: void JfrCPUTimeThreadSampler::init_timers() {
> 605:   // install sig handler for sig
> 606:   PosixSignals::install_generic_signal_handler(SIG, 
> (void*)::handle_timer_signal);

SIGPROF is also used by external profilers. Need to check if SIGPROF handler is 
already installed and warn user.

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.hpp line 58:

> 56:   volatile u4 _head;
> 57: 
> 58:   volatile s4 _lost_samples;

Why signed int? Can it be negative?

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.hpp line 128:

> 126:   static void send_lost_event(const JfrTicks& time, traceid tid, s4 
> lost_samples);
> 127: 
> 128:   // Trigger sampling while a thread is not in a safepoint, from a 
> seperate thread

typo: separate

src/hotspot/share/jfr/support/jfrThreadLocal.cpp line 558:

> 556: void JfrThreadLocal::set_cpu_timer(timer_t* timer) {
> 557:   if (_cpu_timer == nullptr) {
> 558:     _cpu_timer = JfrCHeapObj::new_array<timer_t>(1);

`timer_t` is a primitive type, at most one machine word. Why extra indirection 
and allocation?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2124528320
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2124503100
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125157311
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125128723
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125130332
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125190998
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125203700
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125342289
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125249171
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125230422
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125241099
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125320255
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125411074
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125430231
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2124507884
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125333563
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125183931

Reply via email to