nddipiazza opened a new pull request, #2797:
URL: https://github.com/apache/tika/pull/2797

## Summary

Adds a `handler_type` field to the `FetchAndParseRequest` gRPC message, allowing clients to specify the output content format on a per-request basis without changing server configuration.

JIRA: https://issues.apache.org/jira/browse/TIKA-4722

## Changes

- **`tika.proto`**: Added `handler_type` (string, field 5) to `FetchAndParseRequest`
- **`TikaGrpcServerImpl.java`**: When `handler_type` is set, creates a `BasicContentHandlerFactory` with the requested type and places it in the `ParseContext`
- **`HandlerTypeTest.java`**: New e2e test verifying HTML output contains markup tags and differs from text output

## Review Focus Areas

- Proto backward compatibility: field 5 addition is safe in proto3
- `BasicContentHandlerFactory.parseHandlerType()` handles unrecognized values by falling back to TEXT
- The `BasicContentHandlerFactory` placed in `ParseContext` is picked up by `ParseHandler.getContentHandlerFactory()` on the forked server side

## Critical Files

- `tika-grpc/src/main/proto/tika.proto`
- `tika-grpc/src/main/java/org/apache/tika/pipes/grpc/TikaGrpcServerImpl.java`
- `tika-e2e-tests/tika-grpc/src/test/java/org/apache/tika/pipes/filesystem/HandlerTypeTest.java`

## Testing Instructions

```bash
cd tika-e2e-tests/tika-grpc
mvn test -Dtest=HandlerTypeTest -Dtika.e2e.useLocalServer=true
```

## Review Checklist

- [ ] proto field number does not conflict with existing fields
- [ ] Falls back gracefully to TEXT for unrecognized handler_type values
- [ ] E2E test verifies HTML vs text output differs

## Potential Concerns

- The `ContentHandlerFactory` in `ParseContext` is serialized across the IPC boundary to the forked `PipesServer` via Jackson Smile format; this works because `BasicContentHandlerFactory` is a registered parse-context component (`basic-content-handler-factory` in `parse-context.idx`)

--
This is an automated message from the Apache Git Service.
