MartijnVisser opened a new pull request, #28573:
URL: https://github.com/apache/flink/pull/28573

   ## What is the purpose of the change
   
   `YARNSessionFIFOSecuredITCase` (and the other YARN session ITCases) fails 
intermittently in the `checkForProhibitedLogContents` hook, which scans the 
TaskManager/JobManager logs for prohibited substrings such as `Exception`. A 
benign transient WARN is emitted by Pekko during association teardown when a 
remote system is gated and the reconnect is refused. That line was already 
whitelisted, but the whitelist regex assumed the cause was a bare 
`java.net.ConnectException: Connection refused: <host>`. Current Pekko/Netty 
wraps the cause in 
`org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException`,
 so the line no longer matches and the transient WARN trips the check.
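
To make the failure mode concrete, here is a minimal sketch of a prohibited-substring scan with a whitelist. It is not the actual `checkForProhibitedLogContents` implementation; the class name, method name, and parameters are hypothetical and only illustrate why a whitelist pattern that stops matching turns a benign WARN into a test failure.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the code in YarnTestBase.
public class ProhibitedLogScanSketch {

    /**
     * Returns the first log line that contains a prohibited substring and is not
     * matched by any whitelist pattern, or an empty Optional if there is none.
     */
    public static Optional<String> firstOffendingLine(
            List<String> logLines, List<String> prohibited, List<Pattern> whitelist) {
        for (String line : logLines) {
            boolean hasProhibited = prohibited.stream().anyMatch(line::contains);
            boolean isWhitelisted =
                    whitelist.stream().anyMatch(p -> p.matcher(line).find());
            if (hasProhibited && !isWhitelisted) {
                return Optional.of(line);
            }
        }
        return Optional.empty();
    }
}
```

Under this model, a WARN line that mentions `AnnotatedConnectException` contains the prohibited substring `Exception`, so it is reported unless some whitelist pattern still matches it.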
   
   This is the long-standing prohibited-log-contents flakiness tracked under 
FLINK-21995; it most recently surfaced in `testDetachedMode` (FLINK-26514). An 
example WARN line observed in CI:
   
   ```
   WARN org.apache.pekko.remote.ReliableDeliverySupervisor - Association with remote system [pekko.tcp://flink@<host>:<port>] has failed, address is now gated for [50] ms. Reason: [Association failed with [pekko.tcp://flink@<host>:<port>]] Caused by: [org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: <host>/<ip>:<port>, caused by: java.net.ConnectException: Connection refused]
   ```
   
   This change removes one source of false positives in the 
prohibited-log-contents check. It does not claim to address every flaky cause 
tracked under FLINK-21995.
   
   ## Brief change log
   
     - Extend the `Caused by:` portion of the existing gated-reassociation 
whitelist pattern in `YarnTestBase.WHITELISTED_STRINGS` to accept both the old 
direct `java.net.ConnectException` form and the new Netty-wrapped 
`AbstractChannel$AnnotatedConnectException` form. The cause class is pinned to 
those two known types (not any class mentioning "Connection refused"), so an 
unrelated exception cannot slip through; the rest of the pattern stays specific 
to the gated-reassociation WARN.
     - Add a positive probe for the wrapped form and a negative probe (a 
gated-reassociation WARN with a non-connection cause, still containing 
"Connection refused" text) to `YarnTestBaseTest`, guarding the discriminating 
property of the pattern.
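
The pinning idea can be sketched as follows. The regex below is a hypothetical reconstruction written for illustration, not the actual entry in `YarnTestBase.WHITELISTED_STRINGS`; the class name, the `isWhitelisted` helper, and the concrete host and port values are assumptions. It shows an alternation restricted to the two known cause classes, followed by three probes in the spirit of the ones described above.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; the real pattern in YarnTestBase may differ.
public class GatedAssociationWhitelistSketch {

    // The cause class is limited to exactly two fully qualified names.
    static final Pattern GATED_REASSOCIATION =
            Pattern.compile(
                    "Association with remote system \\[[^\\]]+\\] has failed, "
                            + "address is now gated for \\[\\d+\\] ms\\. "
                            + "Reason: \\[Association failed with \\[[^\\]]+\\]\\] "
                            + "Caused by: \\[(?:java\\.net\\.ConnectException"
                            + "|org\\.apache\\.flink\\.shaded\\.netty4\\.io\\.netty\\.channel"
                            + "\\.AbstractChannel\\$AnnotatedConnectException)"
                            + ": Connection refused: .*\\]");

    public static boolean isWhitelisted(String logLine) {
        return GATED_REASSOCIATION.matcher(logLine).find();
    }

    public static void main(String[] args) {
        // Example address values are made up for the probes.
        String prefix =
                "WARN org.apache.pekko.remote.ReliableDeliverySupervisor - "
                        + "Association with remote system [pekko.tcp://flink@localhost:1234] "
                        + "has failed, address is now gated for [50] ms. "
                        + "Reason: [Association failed with [pekko.tcp://flink@localhost:1234]] "
                        + "Caused by: [";
        String bare =
                prefix + "java.net.ConnectException: Connection refused: localhost/127.0.0.1:1234]";
        String wrapped =
                prefix
                        + "org.apache.flink.shaded.netty4.io.netty.channel."
                        + "AbstractChannel$AnnotatedConnectException: "
                        + "Connection refused: localhost/127.0.0.1:1234, "
                        + "caused by: java.net.ConnectException: Connection refused]";
        String decoy =
                prefix
                        + "java.lang.NullPointerException: "
                        + "Connection refused: localhost/127.0.0.1:1234]";

        System.out.println(isWhitelisted(bare));    // true: old direct form
        System.out.println(isWhitelisted(wrapped)); // true: Netty-wrapped form
        System.out.println(isWhitelisted(decoy));   // false: cause class is not pinned
    }
}
```

Matching on the class name rather than on the "Connection refused" text alone is what keeps the decoy out: the message text is identical, and only the cause type differs.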
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   
     - `YarnTestBaseTest` exercises `YarnTestBase.WHITELISTED_STRINGS` 
end-to-end. Positive probes cover the bare `java.net.ConnectException` form, 
the new Netty-wrapped `AnnotatedConnectException` form, and the existing 
`Broken pipe` line. A new negative probe asserts that a gated-reassociation 
WARN with an unrelated cause (e.g. `IllegalStateException`, and a 
`NullPointerException: Connection refused ...` decoy) is NOT whitelisted.
     - Red/green confirmed: against the previous regex the wrapped-form probe 
fails with the exact message seen in CI; against the relaxed-but-pinned regex 
it passes, and the negative probe fails if the pattern is loosened to match 
"Connection refused" alone.
     - `spotless:check` passes on `flink-yarn-tests`.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no (test-only change in 
`flink-yarn-tests`)
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [X] Yes (Claude Code, Opus 4.8)
   
   Generated-by: Claude Code (Opus 4.8)
   

