This is an interesting proposal. However, I'd suggest changes to the current 
proposal.
I think that the current proposal is too invasive for the Pulsar code base. 
"Introduce thread monitor to check if thread is blocking for long time." seems 
to mean multiple things. 
When looking at the PR, it seems to be a solution for detecting long running 
tasks. Just FYI, that Bookkeeper has a solution for this in it's 
OrderedExecutor with a setting called enableTaskExecutionStats=true . I'm not 
saying that it would be the preferred way to implement it.

If the goal is to detect actual blocking code that is run with threads that 
should run only non-blocking code, there's a better tool called Reactor 
BlockHound (https://github.com/reactor/BlockHound) for that purpose. 
For actual profiling of the code base, Java Flight Recorder and Async Profiler 
are better solutions.

It seems that one part of the problem is that there aren't metrics for the 
thread pools. As an alternative implementation for the proposed PIP-232, I'd 
suggest that basic metrics (backlog / queue size, active thread count, number 
of executed tasks, etc)  are added for the thread pools. For example, 
Micrometer contains a decorator for many thread pool implementations, 
https://github.com/micrometer-metrics/micrometer/blob/main/micrometer-core/src/main/java/io/micrometer/core/instrument/binder/jvm/ExecutorServiceMetrics.java
 . A similar solution would be very useful in Pulsar for adding the thread pool 
metrics.

Tracking individual tasks requires more resources, and that's why I'd suggest 
adding the basic metrics and making them enabled by default. Some more advanced 
metrics would be useful, such as tracking the thread pool queue waiting time. 
Adding a low overhead thread pool queue waiting time could be done with a 
sampling approach. The benefit of that is that there won't be a need to wrap 
all tasks that are executed. There would be several ways to implement the queue 
waiting time metric. 

I assume that "blocking" itself might not be the problem and therefore having 
basic metrics (backlog size, active threads, executed tasks counter, failed 
tasks counter) for the thread pools is more essential. There's a lot of good 
things about the PIP-232 proposal and I believe that iterating on the ideas 
will propose a good outcome.

-Lari

On 2022/12/19 12:17:09 adobewjl wrote:
> Hello pulsar community,
> I've opened `PIP-232: Introduce thread monitor to check if thread is blocked 
> for long time.` to discuss.
> For more details, please read the PIP at 
> https://github.com/apache/pulsar/issues/18985
> I'm looking forward to hearing what you think. 
> Also the demo PR link at https://github.com/apache/pulsar/pull/18958

Reply via email to