[ https://issues.apache.org/jira/browse/TIKA-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845649#comment-17845649 ]
ASF GitHub Bot commented on TIKA-4254: -------------------------------------- kaiyaok2 commented on PR #1754: URL: https://github.com/apache/tika/pull/1754#issuecomment-2106037067 @THausherr @tballison I confirmed that the two lines in `@BeforeEach` **does not** create a new repo if one exists from a previous test run: ``` TikaConfig config = TikaConfig.getDefaultConfig(); repo = config.getMimeRepository(); ``` `TikaConfig.getDefaultConfig()` simply calls the default `TikaConfig()` constructor (https://github.com/apache/tika/blob/b068e4290ad311b1e5f1ddaa6afa40be9e7bd797/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java#L390). When the system property `'tika.config'` and the environment variable `'TIKA_CONFIG'` are both not set, the `mimeTypes` field (accessible by `getMimeRepository()` - which is `repo` in our context) of the constructed config will be `getDefaultMimeTypes(getContextClassLoader())`(https://github.com/apache/tika/blob/b068e4290ad311b1e5f1ddaa6afa40be9e7bd797/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java#L246). Now take a look at `getDefaultMimeTypes()` - when a classloader is given (`getContextClassLoader()` in our context), it first tries to retrieve from a HashMap via `CLASSLOADER_SPECIFIC_DEFAULT_TYPES.get(classLoader);` (https://github.com/apache/tika/blob/b068e4290ad311b1e5f1ddaa6afa40be9e7bd797/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L150). Notice that `CLASSLOADER_SPECIFIC_DEFAULT_TYPES` is not an instance variable, but a **static** `HashMap`. So in the first test execution, the `CLASSLOADER_SPECIFIC_DEFAULT_TYPES` is empty, so `types` after the line `types = CLASSLOADER_SPECIFIC_DEFAULT_TYPES.get(classLoader);` will be `null`, and is later initialized by `MimeTypesFactory.create()` as desired. After this, the initialized `types` is put to the static `CLASSLOADER_SPECIFIC_DEFAULT_TYPES` map (https://github.com/apache/tika/blob/b068e4290ad311b1e5f1ddaa6afa40be9e7bd797/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L166). Now in the second test execution, the `CLASSLOADER_SPECIFIC_DEFAULT_TYPES` already has the key of the context class loader, with corresponding `types` initialized from the previous run. So `CLASSLOADER_SPECIFIC_DEFAULT_TYPES.get(classLoader)` will return such initialized object directly. In other words, `repo` **would be the same object across repeated test runs**. I think the essential idea of `CLASSLOADER_SPECIFIC_DEFAULT_TYPES` is 1-to-1 map between classloaders and default types, so this implementation does not seem buggy for me, but please confirm. > The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the > first run and fails in repeated runs in the same environment. > -------------------------------------------------------------------------------------------------------------------------------------------- > > Key: TIKA-4254 > URL: https://issues.apache.org/jira/browse/TIKA-4254 > Project: Tika > Issue Type: Bug > Reporter: Kaiyao Ke > Priority: Major > > ### Brief Description of the Bug > The test `TestMimeTypes#testJavaRegex` is non-idempotent, as it passes in the > first run but fails in the second run in the same environment. The source of > the problem is that each test execution initializes a new media type > (`MimeType`) instance `testType` (same problem for `testType2`), and all > media types across different test executions attempt to use the same name > pattern `"rtg_sst_grb_0\\.5\\.\\d{8}"`. Therefore, in the second execution of > the test, the line `this.repo.addPattern(testType, pattern, true);` will > throw an error, since the name pattern is already used by the `testType` > instance initiated from the first test execution. Specifically, in the second > run, the `addGlob()` method of the `Pattern` class will assert conflict > patterns and throw a`MimeTypeException`(line 123 in `Patterns.java`). > ### Failure Message in the 2nd Test Run: > ``` > org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: > rtg_sst_grb_0\.5\.\d{8} > at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123) > at org.apache.tika.mime.Patterns.add(Patterns.java:71) > at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450) > at > org.apache.tika.mime.TestMimeTypes.testJavaRegex(TestMimeTypes.java:851) > at java.base/java.lang.reflect.Method.invoke(Method.java:568) > at java.base/java.util.ArrayList.forEach(ArrayList.java:1511) > at java.base/java.util.ArrayList.forEach(ArrayList.java:1511) > ``` > ### Reproduce > Use the `NIOInspector` plugin that supports rerunning individual tests in the > same environment: > ``` > cd tika-parsers/tika-parsers-standard/tika-parsers-standard-package > mvn edu.illinois:NIOInspector:rerun > -Dtest=org.apache.tika.mime.TestMimeTypes#testJavaRegex > ``` > ### Proposed Fix > Declare `testType` and `testType2` as static variables and initialize them at > class loading time. Therefore, repeated runs of `testJavaRegex()` will not > conflict each other. All tests pass and are idempotent after the fix. > ### Necessity of Fix > A fix is recommended as unit tests shall be idempotent, and state pollution > shall be mitigated so that newly introduced tests do not fail in the future > due to polluted shared states. -- This message was sent by Atlassian Jira (v8.20.10#820010)