Nicholas DiPiazza created TIKA-4722:
---------------------------------------

             Summary: tika-grpc: Add handler_type field to FetchAndParseRequest 
for per-request content handler configuration
                 Key: TIKA-4722
                 URL: https://issues.apache.org/jira/browse/TIKA-4722
             Project: Tika
          Issue Type: New Feature
            Reporter: Nicholas DiPiazza


h2. Summary

Add a {{handler_type}} field to the {{FetchAndParseRequest}} gRPC message so 
callers can choose the output content handler type (e.g., {{html}}, {{text}}, 
{{xml}}) per request, without changing the server-level configuration.

h2. Background

The {{tika-grpc}} server currently creates a bare {{ParseContext}} for every 
parse request, so clients have no way to control the output format. The 
underlying {{tika-pipes-core}} infrastructure already supports a per-request 
{{ContentHandlerFactory}} via {{ParseContext}} (see 
{{ParseHandler.getContentHandlerFactory()}}), but this capability is not 
exposed through the gRPC API.
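
To make the mechanism above concrete, here is a minimal, dependency-free sketch of a class-keyed context lookup with a fallback. {{Context}}, {{HandlerFactory}}, and {{handlerFor}} are hypothetical stand-ins written for this illustration and are not Tika classes; the fallback to a {{text}} default is likewise an assumption, not a statement of how {{ParseHandler}} behaves.

```java
import java.util.HashMap;
import java.util.Map;

final class RequestContextSketch {

    /** Simplified stand-in for a ParseContext-style, class-keyed lookup (not Tika's class). */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    /** Hypothetical stand-in for a content handler factory. */
    interface HandlerFactory {
        String describe();
    }

    /** Uses the factory found in the context, else an assumed server default. */
    static String handlerFor(Context context) {
        HandlerFactory serverDefault = () -> "text"; // assumed default for this sketch
        return context.get(HandlerFactory.class, serverDefault).describe();
    }

    public static void main(String[] args) {
        Context bare = new Context();            // what the server builds today
        Context perRequest = new Context();      // what a per-request override could look like
        perRequest.set(HandlerFactory.class, () -> "html");

        System.out.println(handlerFor(bare));       // prints "text"
        System.out.println(handlerFor(perRequest)); // prints "html"
    }
}
```

A bare context always ends up on the default path, which is why the output format cannot currently be influenced by the client.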

A downstream user (Atolio) implemented this in their fork and uses it to get 
HTML output from Tika, which they then convert to Markdown. Apache Tika's gRPC 
API should expose this natively.

h2. Proposed Change

# Add {{handler_type}} (string, field 5) to {{FetchAndParseRequest}} in 
{{tika.proto}}
# In {{TikaGrpcServerImpl.fetchAndParseImpl()}}, if {{handler_type}} is set, 
resolve it to a {{BasicContentHandlerFactory}} and place it in the 
{{ParseContext}}

h2. Valid handler_type Values

* {{text}} (default) - plain text
* {{html}} - structured HTML output
* {{xml}} - XHTML output
* {{body}} - HTML body only
* {{ignore}} - no content
* {{markdown}} - Markdown output
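
As a rough illustration of the resolution step in the proposed change, the following dependency-free sketch maps a {{handler_type}} string to one of the values listed above. {{HandlerTypeResolver}} and its {{HandlerType}} enum are hypothetical names for this example only; treating an empty string as {{text}}, matching case-insensitively, and rejecting unknown values are assumptions, not decisions stated in this issue.

```java
import java.util.Locale;

final class HandlerTypeResolver {

    /** Hypothetical mirror of the handler_type values listed in this issue. */
    enum HandlerType { TEXT, HTML, XML, BODY, IGNORE, MARKDOWN }

    /**
     * Resolves the request's handler_type string.
     * An unset proto3 string field arrives as "", which is mapped to TEXT here (assumption).
     */
    static HandlerType resolve(String handlerType) {
        if (handlerType == null || handlerType.isBlank()) {
            return HandlerType.TEXT;
        }
        try {
            // Case-insensitive match is an assumption of this sketch.
            return HandlerType.valueOf(handlerType.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // Rejecting unknown values is one option; falling back to the default is another.
            throw new IllegalArgumentException("Unknown handler_type: '" + handlerType + "'", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("html")); // prints "HTML"
        System.out.println(resolve(""));     // prints "TEXT"
    }
}
```

In the actual server, the resolved value would presumably be used to build the {{BasicContentHandlerFactory}} that is placed in the {{ParseContext}}; that wiring is omitted here because it depends on Tika classes.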

h2. Example Usage (Java gRPC client)

{code:java}
FetchAndParseRequest request = FetchAndParseRequest.newBuilder()
    .setFetcherId("my-fetcher")
    .setFetchKey("document.pdf")
    .setHandlerType("html")
    .build();
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
