[
https://issues.apache.org/jira/browse/TIKA-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18077639#comment-18077639
]
ASF GitHub Bot commented on TIKA-4722:
--------------------------------------
nddipiazza commented on code in PR #2797:
URL: https://github.com/apache/tika/pull/2797#discussion_r3173198813
##########
tika-grpc/src/main/proto/tika.proto:
##########
@@ -100,6 +100,10 @@ message FetchAndParseRequest {
string additional_fetch_config_json = 3;
// The ID of the emitter to use (optional). If not provided, no emitter will be used.
string emitter_id = 4;
+ // The content handler type to use for this request, overriding the server default.
+ // Valid values: "text" (default), "html", "xml", "body", "ignore", "markdown".
+ // Use "html" to get structured HTML output instead of plain text.
+ string handler_type = 5;
Review Comment:
Never mind. Claude figured it out.
> Add parse_context_json to FetchAndParseRequest for per-request ParseContext
> configuration
> -----------------------------------------------------------------------------------------
>
> Key: TIKA-4722
> URL: https://issues.apache.org/jira/browse/TIKA-4722
> Project: Tika
> Issue Type: New Feature
> Reporter: Nicholas DiPiazza
> Assignee: Nicholas DiPiazza
> Priority: Major
>
> h2. Summary
> Add a {{parse_context_json}} field to the gRPC {{FetchAndParseRequest}}
> message that lets callers configure any registered ParseContext component on
> a per-request basis.
> h2. Motivation
> Downstream users of tika-grpc need to control the content output format
> (e.g., HTML output for further processing into Markdown) on individual
> requests, not just globally. The generic {{parse_context_json}} field enables
> this and also exposes other ParseContext components such as timeout limits,
> embedded document limits, and output limits.
> h2. Changes
> * *{{tika.proto}}*: add {{string parse_context_json = 5;}} to
> {{FetchAndParseRequest}}
> * *{{TikaGrpcServerImpl.java}}*: iterate the JSON fields and call
> {{parseContext.setJsonConfig(componentName, valueJson)}} for each entry when
> the field is non-empty
> * *{{HandlerTypeTest.java}}* (e2e): new test verifying HTML vs TEXT output
> via {{parse_context_json}}
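
The TikaGrpcServerImpl.java entry in the Changes list describes a simple loop: parse the field, then hand each top-level entry to the ParseContext by component name. A minimal Python sketch of that logic follows. It is an illustration only, not the Java code in the PR. `ParseContextStub` and `apply_parse_context_json` are hypothetical names, and rejecting non-object JSON is an assumption. Only the `setJsonConfig(componentName, valueJson)` call shape comes from the issue text.

```python
import json


class ParseContextStub:
    """Hypothetical stand-in for Tika's ParseContext; stores JSON configs by name."""

    def __init__(self):
        self.json_configs = {}

    def set_json_config(self, component_name, value_json):
        # Mirrors the setJsonConfig(componentName, valueJson) call named in the issue.
        self.json_configs[component_name] = value_json


def apply_parse_context_json(parse_context, parse_context_json):
    """Apply each top-level JSON field as one component config; no-op when empty."""
    if not parse_context_json:
        return parse_context
    root = json.loads(parse_context_json)
    if not isinstance(root, dict):
        # Assumption: only a JSON object keyed by component name is accepted.
        raise ValueError("parse_context_json must be a JSON object")
    for component_name, value in root.items():
        # Re-serialize the value so each component receives its own JSON fragment.
        parse_context.set_json_config(component_name, json.dumps(value))
    return parse_context


ctx = apply_parse_context_json(
    ParseContextStub(),
    '{"basic-content-handler-factory": {"type": "HTML"}}',
)
print(ctx.json_configs)
```

An empty or missing field leaves the context untouched, which matches the "when the field is non-empty" condition in the list above.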
> h2. Usage Example
> {code:language=json}
> {
> "fetch_key": "my-doc.pdf",
> "fetcher_id": "myFetcher",
> "parse_context_json": "{\"basic-content-handler-factory\": {\"type\": \"HTML\"}}"
> }
> {code}
> Available {{basic-content-handler-factory}} types: TEXT, HTML, XML, BODY,
> IGNORE, MARKDOWN
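
Because {{parse_context_json}} is a string field, the component configuration is JSON encoded inside the request, which is why the usage example carries escaped quotes. A small Python sketch of building that payload without hand-escaping is shown here. `build_request` is a hypothetical helper, and the dict only mirrors the JSON shape of the example; it is not the generated gRPC message class.

```python
import json


def build_request(fetcher_id, fetch_key, handler_type):
    """Hypothetical helper: returns a dict shaped like the usage example."""
    parse_context = {"basic-content-handler-factory": {"type": handler_type}}
    return {
        "fetch_key": fetch_key,
        "fetcher_id": fetcher_id,
        # The field is a string, so the nested config is serialized separately.
        "parse_context_json": json.dumps(parse_context),
    }


request = build_request("myFetcher", "my-doc.pdf", "HTML")
print(json.dumps(request, indent=2))
```

Serializing the inner object first and the outer request second lets the JSON library produce the escaping, so the nested string cannot be broken by a stray quote or line wrap.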
> h2. PR
> https://github.com/apache/tika/pull/2797
--
This message was sent by Atlassian Jira
(v8.20.10#820010)