nddipiazza opened a new pull request, #2797:
URL: https://github.com/apache/tika/pull/2797

   ## Summary
   Adds a `handler_type` field to the `FetchAndParseRequest` gRPC message, 
allowing clients to specify the output content format on a per-request basis 
without changing server configuration.
   
   JIRA: https://issues.apache.org/jira/browse/TIKA-4722
   
   ## Changes
   - **`tika.proto`**: Added `handler_type` (string, field 5) to 
`FetchAndParseRequest`
   - **`TikaGrpcServerImpl.java`**: When `handler_type` is set, creates a 
`BasicContentHandlerFactory` with the requested type and places it in the 
`ParseContext`
   - **`HandlerTypeTest.java`**: New e2e test verifying HTML output contains 
markup tags and differs from text output
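
The server-side change listed above can be sketched without the Tika or gRPC dependencies. This is a minimal illustration, not the code in this PR: `Request`, `HandlerFactory`, and `Context` are hypothetical stand-ins for `FetchAndParseRequest`, `BasicContentHandlerFactory`, and `ParseContext`. It assumes an unset proto3 string field reads back as an empty string.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-ins only; none of these are the real Tika or generated gRPC classes.
class HandlerTypeRequestSketch {

    /** Stand-in for the request message; only the new field is modeled. */
    static final class Request {
        private final String handlerType;
        Request(String handlerType) { this.handlerType = handlerType; }
        String getHandlerType() { return handlerType; }
    }

    /** Stand-in for a content handler factory that remembers the requested type. */
    static final class HandlerFactory {
        final String type;
        HandlerFactory(String type) { this.type = type; }
    }

    /** Stand-in for a parse context: a simple map keyed by class. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();
        <T> void set(Class<T> key, T value) { entries.put(key, value); }
        <T> Optional<T> get(Class<T> key) {
            return Optional.ofNullable(key.cast(entries.get(key)));
        }
    }

    /** Adds a factory to the context only when the request names a handler type. */
    static Context buildContext(Request request) {
        Context context = new Context();
        String handlerType = request.getHandlerType();
        // Assumption: an unset field arrives as "" (or null in this stand-in).
        if (handlerType != null && !handlerType.isEmpty()) {
            context.set(HandlerFactory.class, new HandlerFactory(handlerType));
        }
        return context;
    }

    public static void main(String[] args) {
        Context withType = buildContext(new Request("html"));
        Context withoutType = buildContext(new Request(""));
        System.out.println(withType.get(HandlerFactory.class).isPresent());
        System.out.println(withoutType.get(HandlerFactory.class).isPresent());
    }
}
```

Leaving the context untouched when the field is empty is what keeps the server's configured default in effect for clients that never set `handler_type`.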
   
   ## Review Focus Areas
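   - Proto backward compatibility: adding field 5 is safe in proto3 provided the number was never used or reserved; clients that omit it yield the default empty string, and older servers skip it as an unknown field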
   - `BasicContentHandlerFactory.parseHandlerType()` handles unrecognized 
values by falling back to TEXT
   - The `BasicContentHandlerFactory` placed in `ParseContext` is picked up by 
`ParseHandler.getContentHandlerFactory()` on the forked server side
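
The fallback behavior called out above can be illustrated with a small stand-alone sketch. It does not reproduce `BasicContentHandlerFactory.parseHandlerType()`: the `HandlerType` enum, its values, and the case-insensitive matching are assumptions made for this example.

```java
import java.util.Locale;

// Hypothetical sketch of "unrecognized value falls back to a default"; not Tika code.
class HandlerTypeFallbackSketch {

    /** Assumed set of handler types for illustration. */
    enum HandlerType { TEXT, HTML, XML, BODY, IGNORE }

    /** Returns the matching type, or the fallback for null or unrecognized names. */
    static HandlerType parseOrDefault(String name, HandlerType fallback) {
        if (name == null) {
            return fallback;
        }
        try {
            // Assumption: matching is case-insensitive.
            return HandlerType.valueOf(name.toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // Unknown or empty names land here and use the fallback.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault("html", HandlerType.TEXT));
        System.out.println(parseOrDefault("no-such-type", HandlerType.TEXT));
    }
}
```

Returning a default instead of throwing means a typo in `handler_type` produces text output silently, which is worth keeping in mind when reviewing the checklist item on graceful fallback.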
   
   ## Critical Files
   - `tika-grpc/src/main/proto/tika.proto`
   - `tika-grpc/src/main/java/org/apache/tika/pipes/grpc/TikaGrpcServerImpl.java`
   - `tika-e2e-tests/tika-grpc/src/test/java/org/apache/tika/pipes/filesystem/HandlerTypeTest.java`
   
   ## Testing Instructions
   ```bash
   cd tika-e2e-tests/tika-grpc
   mvn test -Dtest=HandlerTypeTest -Dtika.e2e.useLocalServer=true
   ```
   
   ## Review Checklist
   - [ ] proto field number does not conflict with existing fields
   - [ ] Falls back gracefully to TEXT for unrecognized handler_type values
   - [ ] E2E test verifies that HTML and text outputs differ
   
   ## Potential Concerns
   - The `ContentHandlerFactory` in `ParseContext` is serialized across the IPC 
boundary to the forked `PipesServer` in Jackson Smile format. This works 
because `BasicContentHandlerFactory` is a registered parse-context component 
(`basic-content-handler-factory` in `parse-context.idx`)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
