This is an interesting proposal. However, I'd suggest changes to the current proposal.
I think that the current proposal is too invasive for the Pulsar code base. "Introduce thread monitor to check if thread is blocking for long time." seems to mean multiple things. When looking at the PR, it seems to be a solution for detecting long running tasks. Just FYI, that Bookkeeper has a solution for this in it's OrderedExecutor with a setting called enableTaskExecutionStats=true . I'm not saying that it would be the preferred way to implement it. If the goal is to detect actual blocking code that is run with threads that should run only non-blocking code, there's a better tool called Reactor BlockHound (https://github.com/reactor/BlockHound) for that purpose. For actual profiling of the code base, Java Flight Recorder and Async Profiler are better solutions. It seems that one part of the problem is that there aren't metrics for the thread pools. As an alternative implementation for the proposed PIP-232, I'd suggest that basic metrics (backlog / queue size, active thread count, number of executed tasks, etc) are added for the thread pools. For example, Micrometer contains a decorator for many thread pool implementations, https://github.com/micrometer-metrics/micrometer/blob/main/micrometer-core/src/main/java/io/micrometer/core/instrument/binder/jvm/ExecutorServiceMetrics.java . A similar solution would be very useful in Pulsar for adding the thread pool metrics. Tracking individual tasks requires more resources, and that's why I'd suggest adding the basic metrics and making them enabled by default. Some more advanced metrics would be useful, such as tracking the thread pool queue waiting time. Adding a low overhead thread pool queue waiting time could be done with a sampling approach. The benefit of that is that there won't be a need to wrap all tasks that are executed. There would be several ways to implement the queue waiting time metric. I assume that "blocking" itself might not be the problem and therefore having basic metrics (backlog size, active threads, executed tasks counter, failed tasks counter) for the thread pools is more essential. There's a lot of good things about the PIP-232 proposal and I believe that iterating on the ideas will propose a good outcome. -Lari On 2022/12/19 12:17:09 adobewjl wrote: > Hello pulsar community, > I've opened `PIP-232: Introduce thread monitor to check if thread is blocked > for long time.` to discuss. > For more details, please read the PIP at > https://github.com/apache/pulsar/issues/18985 > I'm looking forward to hearing what you think. > Also the demo PR link at https://github.com/apache/pulsar/pull/18958