[ 
https://issues.apache.org/jira/browse/IGNITE-28834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Vinogradov resolved IGNITE-28834.
---------------------------------------
    Resolution: Fixed

[~NSAmelchev], thanks for your review!

> Test load scaling (TEST_SCALE_FACTOR / GridTestUtils.SF) is inert in RunAll 
> and unevenly applied
> ------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-28834
>                 URL: https://issues.apache.org/jira/browse/IGNITE-28834
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Anton Vinogradov
>            Assignee: Anton Vinogradov
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> h2. Problem
> The test-load scaling facility ({{GridTestUtils.SF}} / {{ScaleFactorUtil}},
> driven by the {{TEST_SCALE_FACTOR}} system property, range [0.1, 1.0]) is
> intended to shrink load loops, iteration counts and time-boxed durations on CI.
> Two issues were found:
>  # _The factor never reached the test JVMs in RunAll._ The
> {{IgniteTests24Java8_RunAll}} configuration declared a single
> {{reverse.dep.*.TEST_SCALE_FACTOR = 0.1}}, but it did not propagate to the leaf
> test suites (intermediate composite builds break the wildcard reverse-dep).
> Each suite also carried its own {{TEST_SCALE_FACTOR}} (template default
> {{1.0}}, plus {{0.1}} on Snapshots×8 and {{0.2}} on SnapshotsWithIndexes).
> The forked test JVMs in actual RunAll builds received
> {{-DTEST_SCALE_FACTOR=1.0}} everywhere, i.e. scaling was effectively disabled
> and all tests ran at full size.
>  # _Coverage is uneven._ Measured over one RunAll run (build #9159468, 37.5h /
> ~70k test runs): only ~28.8% of test wall-time is in classes that use {{SF}}
> anywhere in their hierarchy. ~17% is in non-covered classes that have a
> clearly scalable load constant; ~54% is topology/fan-out (grid start/stop,
> parametrization) where {{SF}} does not help.
> h2. Solution
> _Part 1: CI configuration (done in TeamCity; settings are not versioned in
> the repo):_
>  * Set {{TEST_SCALE_FACTOR = 0.1}} once in the {{Run tests (Java)}} template
> (single source of truth).
>  * Removed the per-suite overrides (Snapshots×8, SnapshotsWithIndexes,
> RollingUpgrade). RollingUpgrade uses a different template, so it keeps its own
> explicit {{0.1}}.
>  * Removed the dead {{reverse.dep.*.TEST_SCALE_FACTOR}} from RunAll.
>  * Result: all 154 param-bearing build configs now resolve to {{0.1}}. Still to
> be verified on the next RunAll: that a suite forks with
> {{-DTEST_SCALE_FACTOR=0.1}}.
> _Part 2: Add SF scaling to long-running, non-covered tests (this PR):_
> Added {{GridTestUtils.SF.applyLB(...)}} (with safe lower bounds, so the load
> does not collapse at 0.1) to the heaviest tests whose time is dominated by
> scalable load loops / time-boxes:
> || Class || Scaled ||
> | IgniteTxCacheWriteSynchronizationModesMultithreadedTest | load window 10s |
> | TxDeadlockDetectionNoHangsTest / TxDeadlockDetectionTest | 2-min run window |
> | IgniteCacheGetRestartTest | TEST_TIME, KEYS |
> | CrossCacheTxRandomOperationsTest | 10s window |
> | SegmentedRingByteBufferTest | 60s producer/consumer windows |
> | TxPartitionCounterStateConsistencyTest | 30s restart windows |
> | IgnitePdsTransactionsHangTest | DURATION (kept > warm-up) |
> | CacheFreeListSelfTest | grow/shrink load (200k → LB 50k) |
> | ConcurrentCheckpointAndUpdateTtlTest | checkpoint loop |
> | IgniteCachePutAllRestartTest | 2-min + 60s put windows |
> Lower bounds keep each scenario meaningful at 0.1 (e.g. ≥20s for 
> deadlock-detection,
> ≥10s for tx/restart windows, ≥50k entries for free-list grow/shrink).
> h2. Scope / stopping criterion
> Candidate selection stopped once the per-test saving dropped below ~1 minute.
> Beyond the classes above, the next tier yields ~50s/test, and the remainder is
> topology- or search-bound (no scalable load). Explicitly _not_ changed:
>  * {{GridCommandHandlerConsistencyCountersTest}}: the {{2_000}} preload is a
> semantic threshold ("enough for historical rebalance"), not load.
>  * affinity-key search loops mis-detected as load (e.g.
> {{CacheContinuousQueryAsyncFilterListenerTest}},
> {{IgniteCacheClientNodeChangingTopologyTest}}).
>  * {{CacheJdbcPojoWriteBehindStoreWithCoalescingTest}}: hang/coalescing
> regression test where volume matters; scaling could weaken it.
> h2. Validation
>  * {{mvn test-compile -pl modules/core}} passes (JDK 17; note: JDK 21+ fails
> on unrelated {{Thread.suspend/resume}} usages).
>  * Per-test durations taken from RunAll build #9159468.
> h2. Follow-up (separate tickets)
>  * The remaining ~54% (CDC suites, snapshot-restore) is fan-out/IO-bound. It
> needs either reduced data volume in {{AbstractCdcTest}} /
> {{AbstractSnapshotSelfTest}} or suite re-balancing, each to be evaluated
> separately for coverage impact.
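
For illustration, the lower-bound scaling described under Part 2 can be sketched as below. This is a minimal standalone sketch, not the Ignite implementation: the class {{ScaleFactorSketch}}, the {{readFactor}} helper and the three-argument {{applyLB}} overloads are hypothetical names, and clamping out-of-range or malformed {{TEST_SCALE_FACTOR}} values into [0.1, 1.0] (with 1.0 as the fallback) is an assumption; the ticket only states the range. The constants in {{main}} are the values and lower bounds quoted in the description.

```java
/** Hypothetical stand-in for a test-load scaling helper (not the Ignite code). */
final class ScaleFactorSketch {
    /** Range stated in the ticket for TEST_SCALE_FACTOR. */
    static final double MIN_FACTOR = 0.1;
    static final double MAX_FACTOR = 1.0;

    /**
     * Parses the raw property value and clamps it to [0.1, 1.0].
     * Assumption: a missing or malformed value means "no scaling" (1.0).
     */
    static double readFactor(String raw) {
        if (raw == null)
            return MAX_FACTOR;

        try {
            double f = Double.parseDouble(raw.trim());

            if (Double.isNaN(f))
                return MAX_FACTOR;

            return Math.max(MIN_FACTOR, Math.min(MAX_FACTOR, f));
        }
        catch (NumberFormatException e) {
            return MAX_FACTOR;
        }
    }

    /** Scales an int value by the factor, but never below the lower bound. */
    static int applyLB(int val, int lowerBound, double factor) {
        return Math.max((int)Math.round(val * factor), lowerBound);
    }

    /** Same as above for long values, e.g. durations in milliseconds. */
    static long applyLB(long val, long lowerBound, double factor) {
        return Math.max(Math.round(val * factor), lowerBound);
    }

    public static void main(String[] args) {
        double f = readFactor(System.getProperty("TEST_SCALE_FACTOR"));

        // Free-list grow/shrink load: 200k entries, lower bound 50k.
        int entries = applyLB(200_000, 50_000, f);

        // Deadlock-detection run window: 2 minutes, lower bound 20 seconds.
        long windowMs = applyLB(120_000L, 20_000L, f);

        System.out.println("factor=" + f + ", entries=" + entries + ", windowMs=" + windowMs);
    }
}
```

On JDK 11+ the sketch can be launched directly with {{java -DTEST_SCALE_FACTOR=0.1 ScaleFactorSketch.java}}. At a factor of 0.1 the 200k-entry load would scale to 20k, so the 50k lower bound applies instead; likewise the 2-minute window would scale to 12s and is held at the 20s bound.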



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
