keith-turner commented on code in PR #5129:
URL: https://github.com/apache/accumulo/pull/5129#discussion_r1873494144
##########
core/src/main/java/org/apache/accumulo/core/fate/Fate.java:
##########
@@ -372,6 +375,42 @@ public Fate(T environment, FateStore<T> store, boolean
runDeadResCleaner,
break;
}
}
+ queueSizeHistory.clear();
+ } else {
+ // The property did not change, but should it based on the queue size?
Maintain
+ // the last X minutes of queue sizes. If the queue size is always
larger than the number
+ // of Fate threads multiplied by some factor, then suggest that the
+ // MANAGER_FATE_THREADPOOL_SIZE be increased.
+ final long interval = Math.min(60, TimeUnit.MILLISECONDS
+
.toMinutes(conf.getTimeInMillis(Property.MANAGER_FATE_QUEUE_CHECK_INTERVAL)));
+ if (interval == 0) {
+ queueSizeHistory.clear();
+ } else {
+ final int sizeFactor =
conf.getCount(Property.MANAGER_FATE_QUEUE_CHECK_FACTOR);
+ if (queueSizeHistory.size() >= interval * 2) { // this task runs
every 30s
+ final int warnThreshold = configured * sizeFactor;
+ boolean needMoreThreads = true;
+ for (Integer i : queueSizeHistory) {
+ if (i < warnThreshold) {
+ needMoreThreads = false;
+ break;
+ }
+ }
+ if (needMoreThreads) {
+ log.warn(
+ "Fate queue size is {} times the number of Fate threads for
the last {} minutes,"
+ + " consider increasing property: {}",
+ sizeFactor, interval,
Property.MANAGER_FATE_THREADPOOL_SIZE.getKey());
+ // Clear the history so that we don't log for another 5 minutes.
+ queueSizeHistory.clear();
+ } else {
+ while (queueSizeHistory.size() >= interval * 2) {
+ queueSizeHistory.remove();
+ }
+ }
+ }
+ queueSizeHistory.add(workQueue.size());
Review Comment:
Looking at how things work this queue size may always be zero (or something
low, not sure what it will actually be). When [this
code](https://github.com/apache/accumulo/blob/a2ffa7b956a0ba2b920c0128df194ffd782745cc/core/src/main/java/org/apache/accumulo/core/fate/Fate.java#L109)
places something on the queue it calls tryTransfer which should try to send
something directly to a waiting thread and avoid queuing.
The threads that run fate ops are in an infinite loop and just poll the
transfer queue. We could try to count the number of threads active in this
[block of
code](https://github.com/apache/accumulo/blob/a2ffa7b956a0ba2b920c0128df194ffd782745cc/core/src/main/java/org/apache/accumulo/core/fate/Fate.java#L160-L191)
and use that to detect business for this check. Could maybe use an atomic
integer and leverage the existing try/finally in the code for the counting.
This reminds of another problem I ran into related to this code. A bit ago
I was trying to use the fate thread pool metrics to understand what was going
on with fate and was not getting anything useful out of those metrics. Looking
into it determined that because the way this code works and puts task in an
infinite loop on the pool that was the reason the metrics were not useful. If
we added a counter of active fate threads for this PR it may be useful in
follow on PR to turn off merics for the fate thread pool and add a gauge that
exposes this counter. That metrics would be useful, could see how many fate
threads are busy at any given time. Can not get that info from the current
metrics.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]