Nicholas DiPiazza created TIKA-4721:
---------------------------------------
Summary: Fix intermittent locale-sensitive test failures in
tika-pipes-integration-tests on Windows (tr_TR, de_DE)
Key: TIKA-4721
URL: https://issues.apache.org/jira/browse/TIKA-4721
Project: Tika
Issue Type: Bug
Reporter: Nicholas DiPiazza
h2. Summary
The *main jdk17 windows build (multi-locale)* CI job fails intermittently on
the {{main}} branch with test failures in the {{tika-pipes-integration-tests}}
module ({{Apache Tika pipes core integration tests}}). The failures occur
specifically under the Turkish ({{tr_TR.UTF-8}}) and German ({{de_DE.UTF-8}})
locales on Windows runners.
h2. Evidence
The following workflow runs on {{main}} failed with this issue before any
related PR changes:
* [Run 24999911562|https://github.com/apache/tika/actions/runs/24999911562] --
job: build (17, tr_TR.UTF-8)
* [Run 24942999949|https://github.com/apache/tika/actions/runs/24942999949] --
job: build (17, tr_TR.UTF-8)
* [Run 24924376894|https://github.com/apache/tika/actions/runs/24924376894] --
job: build (17, tr_TR.UTF-8)
* [Run 24887823286|https://github.com/apache/tika/actions/runs/24887823286] --
job: build (17, tr_TR.UTF-8)
The Maven build output shows:
{noformat}
[INFO] Apache Tika pipes core integration tests ....... FAILURE [08:57 min]
[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-surefire-plugin:3.5.5:test
(default-test) on project tika-pipes-integration-tests: There are test
failures.
{noformat}
h2. Root Cause (suspected)
Locale-sensitive string operations (e.g. {{String.toLowerCase()}} /
{{String.toUpperCase()}} called without an explicit {{Locale}} argument) are
the most common cause of this class of failure. In Turkish,
{{"I".toLowerCase()}} produces {{ı}} (dotless i) instead of {{"i"}}, which
breaks equality checks, map lookups, and assertions that assume English casing
semantics.
A {{MockUpperCaseFilter}} in the test suite already uses {{Locale.US}}
correctly, suggesting awareness of this issue, but the root cause may be in
production code or config parsing hit by the integration tests.
h2. Steps to Reproduce
# Run the {{main-jdk17-windows-build-multi-locale}} GitHub Actions workflow on
any branch
# Observe intermittent failures in the {{tika-pipes-integration-tests}} module
under {{tr_TR.UTF-8}} locale
h2. Suggested Fix
# Identify the failing test(s) from the surefire report in a failed CI run
# Audit all {{toLowerCase()}} / {{toUpperCase()}} calls in {{tika-pipes}} and
{{tika-pipes-integration-tests}} that do not pass an explicit {{Locale}}
# Replace with {{toLowerCase(Locale.ROOT)}} / {{toUpperCase(Locale.ROOT)}}
where locale-independence is intended
# Consider adding a static analysis rule (e.g. Checkstyle {{IllegalMethodCall}}
or SpotBugs {{DM_CONVERT_CASE}}) to prevent regressions
--
This message was sent by Atlassian Jira
(v8.20.10#820010)