gianm commented on code in PR #19378:
URL: https://github.com/apache/druid/pull/19378#discussion_r3148832368
##########
docs/ingestion/supervisor.md:
##########
@@ -193,29 +193,30 @@ The following example shows a supervisor spec with
`lagBased` autoscaler:
```
</details>
-**2. Cost-based autoscaler strategy (experimental)**
+**2. Cost-based autoscaler strategy**
-An autoscaler which computes the required supervisor task count via cost
function based on ingestion lag and poll-to-idle ratio.
-Task counts are selected from a bounded range derived from the current
partitions-per-task ratio,
-not strictly from factors/divisors of the partition count. This bounded
partitions-per-task window enables gradual scaling while
-voiding large jumps and still allowing non-divisor task counts when needed.
+The cost-based autoscaler picks the number of ingestion tasks that minimizes a
combined cost score. The score has two components:
-**It is experimental and the implementation details as well as cost function
parameters are subject to change.**
+- **Lag cost** — how long it would take to drain the current backlog at the
observed processing rate. More tasks reduce this cost.
+- **Idle cost** — how far the predicted idle ratio is from the target of ~25%.
Tasks that are too busy (under-provisioned) or too idle (over-provisioned) both
drive the score up.
+The sweet spot is roughly 25% idle, giving headroom to absorb traffic spikes
without wasting resources.
+
+At every evaluation interval, Druid computes the score for each candidate task
count and picks the one with the lowest total cost.
Note: Kinesis is not supported yet, support is in progress.
Review Comment:
Is it really in progress?
##########
indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/autoscaler/CostBasedAutoScaler.java:
##########
@@ -64,21 +66,19 @@ public class CostBasedAutoScaler implements
SupervisorTaskAutoScaler
public static final String OPTIMAL_TASK_COUNT_METRIC =
"task/autoScaler/costBased/optimalTaskCount";
public static final String INVALID_METRICS_COUNT =
"task/autoScaler/costBased/invalidMetrics";
- static final int MAX_INCREASE_IN_PARTITIONS_PER_TASK = 2;
- static final int MAX_DECREASE_IN_PARTITIONS_PER_TASK =
MAX_INCREASE_IN_PARTITIONS_PER_TASK * 2;
-
/**
- * If average partition lag crosses this value and the processing rate is
- * still zero, scaling actions are skipped and an alert is raised.
+ * Maximum number of candidate task counts to evaluate above or below the
current task count
+ * when scale-up or scale-down boundaries are enabled.
+ * <p>
+ * The misspelling is preserved to avoid unnecessary churn in this
package-private constant.
Review Comment:
I don't understand this comment. The constant is new in this patch. Please
fix the spelling.
##########
embedded-tests/src/test/java/org/apache/druid/testing/embedded/indexing/autoscaler/CostBasedAutoScalerIntegrationTest.java:
##########
@@ -152,8 +152,8 @@ public void
test_autoScaler_computesOptimalTaskCountAndProducesScaleUp()
.taskCountStart(lowInitialTaskCount)
.scaleActionPeriodMillis(500)
.minTriggerScaleActionFrequencyMillis(1000)
- .lagWeight(0.2)
- .idleWeight(0.8)
+ .lagWeight(0.8)
Review Comment:
What will be the effect of the change to lag and idle weights?
##########
indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/autoscaler/WeightedCostFunction.java:
##########
@@ -24,17 +24,23 @@
/**
* Weighted cost function using compute time as the core metric.
- * Costs represent actual time in seconds, making them intuitive and
debuggable.
- * Uses linear scaling without mode inversions for predictable behavior.
+ * Lag cost is based on recovery time in seconds; idle cost is a penalty
derived from
+ * the predicted idle ratio.
+ *
+ * <p>Idle cost uses a U-shaped penalty with minimum at {@link
#IDEAL_IDLE_RATIO}.
+ * This penalizes both under-provisioning (low idle, no safety margin, lag
risk) and
+ * over-provisioning (high idle, wasted capacity), with asymmetric severity
controlled by
+ * {@link #UNDER_PROVISIONING_PENALTY} and {@link #OVER_PROVISIONING_PENALTY}.
*/
public class WeightedCostFunction
{
private static final Logger log = new Logger(WeightedCostFunction.class);
+
/**
* Multiplier for a lag amplification factor; it was carefully chosen
* during extensive testing as the most balanced multiplier for high-lag
recovery.
*/
- static final double LAG_AMPLIFICATION_MULTIPLIER = 0.05;
+ static final double LAG_AMPLIFICATION_MULTIPLIER = 0.4;
Review Comment:
Why this change?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]