zamf commented on PR #2769: URL: https://github.com/apache/tika/pull/2769#issuecomment-4259726213
> Here's what copilot says after told it that adjusting the test is the wrong priority: > > Adding a globally registered parser changes the set/order of parsers that AutoDetectParser discovers. ServiceLoader iteration order is not guaranteed, and changes in classpath/jar ordering in CI can affect parser selection/behavior. Even if the new parser isn’t intended for ODT, it can still perturb overall parser discovery and embedded parsing flow, which in turn changes the emitted SAX events and thus the extracted phone-number order. > > Additionally, EncodeOCRParser currently advertises support for some non-ocr-* image types (image/jp2, image/jpx, image/x-portable-pixmap), which makes it “more eligible” than intended and increases the chance of unintended participation. Recommended fix (don’t weaken the test; fix the regression) > > ``` > 1. Restrict EncodeOCRParser supported types to only image/ocr-* (opt-in via override), removing the non-ocr- image types. That keeps it from being considered for generic image parsing and reduces collateral changes in parse output. > ``` > > 2. If the intent is truly opt-in-only: consider removing the ServiceLoader registration and requiring explicit inclusion via [tika-config](https://issues.apache.org/jira/browse/TIKA-config).xml. That completely avoids global side effects on unrelated parsing/tests. > > (I'm just posting this, I have no opinon myself on this) THanks for the review! I made it opt-in, that was actually the intended use. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
