[ 
https://issues.apache.org/jira/browse/TIKA-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18077642#comment-18077642
 ] 

Nicholas DiPiazza commented on TIKA-4722:
-----------------------------------------

The e2e test passes locally:

*Test:* {{HandlerTypeTest.testParseContextJson}} in {{tika-e2e-tests/tika-grpc}}

*Test run output:*
{code}
HTML parse status: PARSE_SUCCESS
HTML content (first 200 chars): <html xmlns="http://www.w3.org/1999/xhtml">
<head>...

Text parse status: PARSE_SUCCESS
Text content (first 200 chars): Sample E2E Test Document Hello from Tika...

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
BUILD SUCCESS
{code}

The test:
# Registers a FileSystem fetcher dynamically via {{saveFetcher}} (using Ignite ConfigStore so the forked PipesServer can find it)
# Calls {{fetchAndParse}} with {{parse_context_json = {"basic-content-handler-factory": {"type": "HTML"}}}} → verifies PARSE_SUCCESS and HTML markup in response
# Calls {{fetchAndParse}} with {{parse_context_json = {"basic-content-handler-factory": {"type": "TEXT"}}}} → verifies PARSE_SUCCESS and no HTML tags
# Asserts that HTML and TEXT outputs differ
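
The last three steps can be sketched roughly as follows. This is not the code of {{HandlerTypeTest}}: the class name, the helper methods, the tag-detection regex, and the stand-in content strings are all assumptions made for illustration, and no gRPC call is made.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the HTML-vs-TEXT checks; not the actual HandlerTypeTest code.
public class HandlerTypeCheckSketch {

    // Rough heuristic for "looks like a markup tag" (an assumption of this sketch).
    private static final Pattern TAG = Pattern.compile("</?[A-Za-z][^>]*>");

    /** Builds the parse_context_json value for a handler type such as "HTML" or "TEXT". */
    static String parseContextJson(String handlerType) {
        return "{\"basic-content-handler-factory\": {\"type\": \"" + handlerType + "\"}}";
    }

    /** Returns true if the content contains something that looks like a tag. */
    static boolean containsMarkup(String content) {
        return TAG.matcher(content).find();
    }

    /** Applies the three checks described in the list; throws if any of them fails. */
    static void verifyOutputs(String htmlContent, String textContent) {
        if (!containsMarkup(htmlContent)) {
            throw new AssertionError("expected HTML markup in the HTML response");
        }
        if (containsMarkup(textContent)) {
            throw new AssertionError("expected no HTML tags in the TEXT response");
        }
        if (htmlContent.equals(textContent)) {
            throw new AssertionError("expected HTML and TEXT outputs to differ");
        }
    }

    public static void main(String[] args) {
        // Stand-in strings modeled on the output excerpt, not real server responses.
        String htmlContent = "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n<head>...";
        String textContent = "Sample E2E Test Document Hello from Tika...";

        System.out.println(parseContextJson("HTML"));
        System.out.println(parseContextJson("TEXT"));
        verifyOutputs(htmlContent, textContent);
        System.out.println("checks passed");
    }
}
```

In the real test the two content strings would come from the {{fetchAndParse}} replies; the sketch only shows the shape of the per-request JSON and of the assertions.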

PR: https://github.com/apache/tika/pull/2797


> Add parse_context_json to FetchAndParseRequest for per-request ParseContext 
> configuration
> -----------------------------------------------------------------------------------------
>
>                 Key: TIKA-4722
>                 URL: https://issues.apache.org/jira/browse/TIKA-4722
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Nicholas DiPiazza
>            Assignee: Nicholas DiPiazza
>            Priority: Major
>
> h2. Summary
> Add a {{parse_context_json}} field to the gRPC {{FetchAndParseRequest}} 
> message that lets callers configure any registered ParseContext component on 
> a per-request basis.
> h2. Motivation
> Downstream users of tika-grpc need to control the content output format 
> (e.g., HTML output for further processing into Markdown) on individual 
> requests, not just globally. The generic {{parse_context_json}} field enables 
> this and also exposes other ParseContext components such as timeout limits, 
> embedded document limits, and output limits.
> h2. Changes
> * *{{tika.proto}}*: add {{string parse_context_json = 5;}} to 
> {{FetchAndParseRequest}}
> * *{{TikaGrpcServerImpl.java}}*: iterate the JSON fields and call 
> {{parseContext.setJsonConfig(componentName, valueJson)}} for each entry when 
> the field is non-empty
> * *{{HandlerTypeTest.java}}* (e2e): new test verifying HTML vs TEXT output 
> via {{parse_context_json}}
> h2. Usage Example
> {code:language=json}
> {
>   "fetch_key": "my-doc.pdf",
>   "fetcher_id": "myFetcher",
>   "parse_context_json": "{\"basic-content-handler-factory\": {\"type\": \"HTML\"}}"
> }
> {code}
> Available {{basic-content-handler-factory}} types: TEXT, HTML, XML, BODY, 
> IGNORE, MARKDOWN
> h2. PR
> https://github.com/apache/tika/pull/2797
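
Because {{parse_context_json}} is a string field, the component configuration in the usage example has to be embedded as an escaped JSON string. A minimal sketch of that escaping, using only the Java standard library, might look like the following. The class and method names are hypothetical and are not part of Tika or the PR; a real client would more likely rely on a JSON library or the generated protobuf builder.

```java
// Hypothetical helper showing how the nested JSON ends up escaped inside the request JSON.
public class ParseContextRequestSketch {

    /** Escapes a raw string so it can sit between double quotes in a JSON document. */
    static String escapeJsonString(String raw) {
        StringBuilder sb = new StringBuilder();
        for (char c : raw.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters must be written as \\uXXXX escapes.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Builds a request body shaped like the usage example in the issue description. */
    static String buildRequestJson(String fetcherId, String fetchKey, String parseContextJson) {
        return "{\n"
                + "  \"fetch_key\": \"" + escapeJsonString(fetchKey) + "\",\n"
                + "  \"fetcher_id\": \"" + escapeJsonString(fetcherId) + "\",\n"
                + "  \"parse_context_json\": \"" + escapeJsonString(parseContextJson) + "\"\n"
                + "}";
    }

    public static void main(String[] args) {
        String parseContextJson = "{\"basic-content-handler-factory\": {\"type\": \"HTML\"}}";
        System.out.println(buildRequestJson("myFetcher", "my-doc.pdf", parseContextJson));
    }
}
```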



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
