[ 
https://issues.apache.org/jira/browse/TIKA-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845649#comment-17845649
 ] 

ASF GitHub Bot commented on TIKA-4254:
--------------------------------------

kaiyaok2 commented on PR #1754:
URL: https://github.com/apache/tika/pull/1754#issuecomment-2106037067

   @THausherr @tballison  I confirmed that the two lines in `@BeforeEach` 
**does not** create a new repo if one exists from a previous test run:
   ```
   TikaConfig config = TikaConfig.getDefaultConfig();
   repo = config.getMimeRepository();
   ```
   
   
   `TikaConfig.getDefaultConfig()` simply calls the default `TikaConfig()` 
constructor 
(https://github.com/apache/tika/blob/b068e4290ad311b1e5f1ddaa6afa40be9e7bd797/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java#L390).
 
   
   When the system property `'tika.config'` and the environment variable 
`'TIKA_CONFIG'` are both not set, the `mimeTypes` field (accessible by 
`getMimeRepository()` - which is `repo` in our context) of the constructed 
config will be 
`getDefaultMimeTypes(getContextClassLoader())`(https://github.com/apache/tika/blob/b068e4290ad311b1e5f1ddaa6afa40be9e7bd797/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java#L246).
   
   Now take a look at `getDefaultMimeTypes()` - when a classloader is given 
(`getContextClassLoader()` in our context), it first tries to retrieve from a 
HashMap via `CLASSLOADER_SPECIFIC_DEFAULT_TYPES.get(classLoader);` 
(https://github.com/apache/tika/blob/b068e4290ad311b1e5f1ddaa6afa40be9e7bd797/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L150).
 Notice that `CLASSLOADER_SPECIFIC_DEFAULT_TYPES` is not an instance variable, 
but a **static** `HashMap`. 
   
   So in the first test execution, the `CLASSLOADER_SPECIFIC_DEFAULT_TYPES` is 
empty, so `types` after the line `types = 
CLASSLOADER_SPECIFIC_DEFAULT_TYPES.get(classLoader);` will be `null`, and is 
later initialized by `MimeTypesFactory.create()` as desired. After this, the 
initialized `types` is put to the static `CLASSLOADER_SPECIFIC_DEFAULT_TYPES` 
map 
(https://github.com/apache/tika/blob/b068e4290ad311b1e5f1ddaa6afa40be9e7bd797/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L166).
 
   
   Now in the second test execution, the `CLASSLOADER_SPECIFIC_DEFAULT_TYPES` 
already has the key of the context class loader, with corresponding `types` 
initialized from the previous run. So 
`CLASSLOADER_SPECIFIC_DEFAULT_TYPES.get(classLoader)` will return such 
initialized object directly. In other words, `repo` **would be the same object 
across repeated test runs**.
   
   I think the essential idea of `CLASSLOADER_SPECIFIC_DEFAULT_TYPES` is 1-to-1 
map between classloaders and default types, so this implementation does not 
seem buggy for me, but please confirm.  
   




> The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the 
> first run and fails in repeated runs in the same environment. 
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-4254
>                 URL: https://issues.apache.org/jira/browse/TIKA-4254
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Kaiyao Ke
>            Priority: Major
>
> ### Brief Description of the Bug
> The test `TestMimeTypes#testJavaRegex` is non-idempotent, as it passes in the 
> first run but fails in the second run in the same environment. The source of 
> the problem is that each test execution initializes a new media type 
> (`MimeType`) instance `testType` (same problem for `testType2`), and all 
> media types across different test executions attempt to use the same name 
> pattern `"rtg_sst_grb_0\\.5\\.\\d{8}"`. Therefore, in the second execution of 
> the test, the line `this.repo.addPattern(testType, pattern, true);` will 
> throw an error, since the name pattern is already used by the `testType` 
> instance initiated from the first test execution. Specifically, in the second 
> run, the `addGlob()` method of the `Pattern` class will assert conflict 
> patterns and throw a`MimeTypeException`(line 123 in `Patterns.java`).
> ### Failure Message in the 2nd Test Run:
> ```
> org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: 
> rtg_sst_grb_0\.5\.\d{8}
>       at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123)
>       at org.apache.tika.mime.Patterns.add(Patterns.java:71)
>       at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450)
>       at 
> org.apache.tika.mime.TestMimeTypes.testJavaRegex(TestMimeTypes.java:851)
>       at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>       at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
>       at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
> ```
> ### Reproduce
> Use the `NIOInspector` plugin that supports rerunning individual tests in the 
> same environment:
> ```
> cd tika-parsers/tika-parsers-standard/tika-parsers-standard-package
> mvn edu.illinois:NIOInspector:rerun 
> -Dtest=org.apache.tika.mime.TestMimeTypes#testJavaRegex
> ```
> ### Proposed Fix
> Declare `testType` and `testType2` as static variables and initialize them at 
> class loading time. Therefore, repeated runs of `testJavaRegex()` will not 
> conflict each other. All tests pass and are idempotent after the fix.
> ### Necessity of Fix
> A fix is recommended as unit tests shall be idempotent, and state pollution 
> shall be mitigated so that newly introduced tests do not fail in the future 
> due to polluted shared states.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to