Nicholas DiPiazza created TIKA-4722:
---------------------------------------
Summary: tika-grpc: Add handler_type field to FetchAndParseRequest
for per-request content handler configuration
Key: TIKA-4722
URL: https://issues.apache.org/jira/browse/TIKA-4722
Project: Tika
Issue Type: New Feature
Reporter: Nicholas DiPiazza
h2. Summary
Add a {{handler_type}} field to the {{FetchAndParseRequest}} gRPC message so
callers can specify the output content handler type (e.g., {{html}}, {{text}},
{{xml}}) per request, without changing the server-level configuration.
h2. Background
The {{tika-grpc}} server currently creates a bare {{ParseContext}} for every
parse request, so clients have no way to control the output format. The
underlying {{tika-pipes-core}} infrastructure already supports a per-request
{{ContentHandlerFactory}} via {{ParseContext}} (see
{{ParseHandler.getContentHandlerFactory()}}), but this capability is not
exposed through the gRPC API.
A downstream user (Atolio) implemented this in their fork and uses it to get
HTML output from Tika, which they then convert to Markdown. Apache Tika's gRPC
API should expose this natively.
h2. Proposed Change
# Add {{handler_type}} (string, field 5) to {{FetchAndParseRequest}} in
{{tika.proto}}
# In {{TikaGrpcServerImpl.fetchAndParseImpl()}}, if {{handler_type}} is set,
resolve it to a {{BasicContentHandlerFactory}} and place it in the
{{ParseContext}}
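Step 2 above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual {{TikaGrpcServerImpl}} code: {{HandlerFactory}}, {{Context}} and {{buildContext}} are hypothetical stand-ins for {{BasicContentHandlerFactory}}, {{ParseContext}} and the relevant part of {{fetchAndParseImpl()}}. It assumes {{handler_type}} is a plain proto3 string, so an unset field reads as the empty string and the context is left bare.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-ins only; none of these are Tika classes.
final class HandlerTypeSketch {

    /** Stand-in for a content handler factory configured for one output type. */
    static final class HandlerFactory {
        final String type;

        HandlerFactory(String type) {
            this.type = type;
        }
    }

    /** Stand-in for ParseContext: stores at most one object per class key. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> Optional<T> get(Class<T> key) {
            return Optional.ofNullable(key.cast(entries.get(key)));
        }
    }

    /**
     * Builds the per-request context. A factory is added only when the
     * request carries a non-blank handler type; otherwise the context
     * stays bare and the server-level configuration applies.
     */
    static Context buildContext(String handlerType) {
        Context context = new Context();
        if (handlerType != null && !handlerType.isBlank()) {
            String normalized = handlerType.trim().toLowerCase(Locale.ROOT);
            context.set(HandlerFactory.class, new HandlerFactory(normalized));
        }
        return context;
    }

    public static void main(String[] args) {
        // A request with handler_type = "html" gets a per-request factory.
        System.out.println(buildContext("html").get(HandlerFactory.class).isPresent()); // true
        // A request that leaves handler_type unset does not.
        System.out.println(buildContext("").get(HandlerFactory.class).isPresent()); // false
    }
}
```

Keeping the "is it set?" check separate from the factory lookup means existing clients that never send {{handler_type}} would see no change in behaviour.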
h2. Valid handler_type Values
* {{text}} (default) - plain text
* {{html}} - structured HTML output
* {{xml}} - XHTML output
* {{body}} - HTML body only
* {{ignore}} - no content
* {{markdown}} - Markdown output
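One possible way to map these names onto a closed set of values is sketched below. The {{HandlerType}} enum and {{parse}} method are hypothetical and only mirror the list above. Treating a blank value as {{text}} follows the "(default)" marker in the list, while rejecting unknown names with an exception is an assumption, since this issue does not say how invalid values should be handled.

```java
import java.util.Locale;

// Hypothetical helper mirroring the list of values above; not a Tika class.
final class HandlerTypeNames {

    /** One constant per documented handler_type value. */
    enum HandlerType { TEXT, HTML, XML, BODY, IGNORE, MARKDOWN }

    /**
     * Resolves a handler_type string case-insensitively.
     * Blank or null input falls back to TEXT, the documented default.
     * Unknown names are rejected (an assumption, see the note above).
     */
    static HandlerType parse(String name) {
        if (name == null || name.isBlank()) {
            return HandlerType.TEXT;
        }
        try {
            return HandlerType.valueOf(name.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("Unknown handler_type: " + name, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("html")); // HTML
        System.out.println(parse(""));     // TEXT
    }
}
```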
h2. Example Usage (Java gRPC client)
{code:java}
FetchAndParseRequest request = FetchAndParseRequest.newBuilder()
        .setFetcherId("my-fetcher")
        .setFetchKey("document.pdf")
        .setHandlerType("html")
        .build();
{code}