[
https://issues.apache.org/jira/browse/TIKA-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18077638#comment-18077638
]
ASF GitHub Bot commented on TIKA-4722:
--------------------------------------
nddipiazza commented on code in PR #2797:
URL: https://github.com/apache/tika/pull/2797#discussion_r3173179813
##########
tika-grpc/src/main/proto/tika.proto:
##########
@@ -100,6 +100,10 @@ message FetchAndParseRequest {
string additional_fetch_config_json = 3;
// The ID of the emitter to use (optional). If not provided, no emitter will be used.
string emitter_id = 4;
+ // The content handler type to use for this request, overriding the server default.
+ // Valid values: "text" (default), "html", "xml", "body", "ignore", "markdown".
+ // Use "html" to get structured HTML output instead of plain text.
+ string handler_type = 5;
Review Comment:
@tballison I think you had intended this to be set via a parse_context_json,
right? I couldn't figure out how to do it. What do you think?
> tika-grpc: Add handler_type field to FetchAndParseRequest for per-request
> content handler configuration
> -------------------------------------------------------------------------------------------------------
>
> Key: TIKA-4722
> URL: https://issues.apache.org/jira/browse/TIKA-4722
> Project: Tika
> Issue Type: New Feature
> Reporter: Nicholas DiPiazza
> Assignee: Nicholas DiPiazza
> Priority: Major
>
> h2. Summary
> Add a {{handler_type}} field to the {{FetchAndParseRequest}} gRPC message so
> callers can specify the output content handler type (e.g., {{html}},
> {{text}}, {{xml}}) on a per-request basis, without needing to change the
> server-level configuration.
> h2. Background
> The {{tika-grpc}} server currently creates a bare {{ParseContext}} for every
> parse request with no way for clients to control the output format. The
> underlying {{tika-pipes-core}} infrastructure already supports per-request
> {{ContentHandlerFactory}} via {{ParseContext}} (see
> {{ParseHandler.getContentHandlerFactory()}}), but this capability is not
> exposed through the gRPC API.
> A downstream user (Atolio) implemented this in their fork and uses it to get
> HTML output from Tika, which they then convert to Markdown. Apache Tika's
> gRPC API should expose this natively.
> h2. Proposed Change
> # Add {{handler_type}} (string, field 5) to {{FetchAndParseRequest}} in
> {{tika.proto}}
> # In {{TikaGrpcServerImpl.fetchAndParseImpl()}}, if {{handler_type}} is set,
> resolve it to a {{BasicContentHandlerFactory}} and place it in the
> {{ParseContext}}
> h2. Valid handler_type Values
> * {{text}} (default) - plain text
> * {{html}} - structured HTML output
> * {{xml}} - XHTML output
> * {{body}} - HTML body only
> * {{ignore}} - no content
> * {{markdown}} - Markdown output
> h2. Example Usage (Java gRPC client)
> {code:java}
> FetchAndParseRequest request = FetchAndParseRequest.newBuilder()
> .setFetcherId("my-fetcher")
> .setFetchKey("document.pdf")
> .setHandlerType("html")
> .build();
> {code}
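
For the server side, the "resolve it" step in the Proposed Change might look roughly like the sketch below. This is only an illustration under stated assumptions: it is deliberately standalone and does not use Tika's classes, and `HandlerTypeResolver`, its `HandlerType` enum, and `resolve()` are hypothetical names that simply mirror the values listed in the issue. Treating an unset field as "keep the server default", accepting any letter case, and rejecting unknown values are likewise assumptions, not confirmed behavior of the PR.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not part of Tika: maps the handler_type string
// from a request to one of the values listed in the issue.
final class HandlerTypeResolver {

    // Mirrors the "Valid handler_type Values" list; not Tika's own enum.
    enum HandlerType { TEXT, HTML, XML, BODY, IGNORE, MARKDOWN }

    private HandlerTypeResolver() {
    }

    // proto3 string fields read as "" when unset, so a blank value is
    // treated here as "no override" (assumption: the server default applies).
    static Optional<HandlerType> resolve(String handlerType) {
        if (handlerType == null || handlerType.trim().isEmpty()) {
            return Optional.empty();
        }
        String normalized = handlerType.trim().toUpperCase(Locale.ROOT);
        try {
            return Optional.of(HandlerType.valueOf(normalized));
        } catch (IllegalArgumentException e) {
            // Assumption: unknown values are rejected rather than ignored.
            throw new IllegalArgumentException(
                    "Unknown handler_type: \"" + handlerType + "\"", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("html"));
        System.out.println(resolve(""));
    }
}
```

In the real server, the resolved value would then presumably be turned into a content handler factory and placed in the ParseContext, as the issue describes; that part is left out here because it depends on Tika's actual classes.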