Nicholas DiPiazza created TIKA-4721:
---------------------------------------

             Summary: Fix intermittent locale-sensitive test failures in 
tika-pipes-integration-tests on Windows (tr_TR, de_DE)
                 Key: TIKA-4721
                 URL: https://issues.apache.org/jira/browse/TIKA-4721
             Project: Tika
          Issue Type: Bug
            Reporter: Nicholas DiPiazza


h2. Summary

The *main jdk17 windows build (multi-locale)* CI job fails intermittently on 
the {{main}} branch with test failures in the {{tika-pipes-integration-tests}} 
module ({{Apache Tika pipes core integration tests}}). The failures occur 
specifically under the Turkish ({{tr_TR.UTF-8}}) and German ({{de_DE.UTF-8}}) 
locales on Windows runners.

h2. Evidence

The following workflow runs on {{main}} failed with this issue before any 
related PR changes:

* [Run 24999911562|https://github.com/apache/tika/actions/runs/24999911562] -- 
job: build (17, tr_TR.UTF-8)
* [Run 24942999949|https://github.com/apache/tika/actions/runs/24942999949] -- 
job: build (17, tr_TR.UTF-8)
* [Run 24924376894|https://github.com/apache/tika/actions/runs/24924376894] -- 
job: build (17, tr_TR.UTF-8)
* [Run 24887823286|https://github.com/apache/tika/actions/runs/24887823286] -- 
job: build (17, tr_TR.UTF-8)

The Maven build output shows:

{noformat}
[INFO] Apache Tika pipes core integration tests ....... FAILURE [08:57 min]
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-surefire-plugin:3.5.5:test
        (default-test) on project tika-pipes-integration-tests: There are test 
failures.
{noformat}

h2. Root Cause (suspected)

Locale-sensitive string operations (e.g. {{String.toLowerCase()}} / 
{{String.toUpperCase()}} called without an explicit {{Locale}} argument) are 
the most common cause of this class of failure. In Turkish, 
{{"I".toLowerCase()}} produces {{ı}} (dotless i) instead of {{"i"}}, which 
breaks equality checks, map lookups, and assertions that assume English casing 
semantics.

A {{MockUpperCaseFilter}} in the test suite already uses {{Locale.US}} 
correctly, suggesting awareness of this issue, but the root cause may be in 
production code or config parsing hit by the integration tests.

h2. Steps to Reproduce

# Run the {{main-jdk17-windows-build-multi-locale}} GitHub Actions workflow on 
any branch
# Observe intermittent failures in the {{tika-pipes-integration-tests}} module 
under {{tr_TR.UTF-8}} locale

h2. Suggested Fix

# Identify the failing test(s) from the surefire report in a failed CI run
# Audit all {{toLowerCase()}} / {{toUpperCase()}} calls in {{tika-pipes}} and 
{{tika-pipes-integration-tests}} that do not pass an explicit {{Locale}}
# Replace with {{toLowerCase(Locale.ROOT)}} / {{toUpperCase(Locale.ROOT)}} 
where locale-independence is intended
# Consider adding a static analysis rule (e.g. Checkstyle {{IllegalMethodCall}} 
or SpotBugs {{DM_CONVERT_CASE}}) to prevent regressions




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to