Nicholas DiPiazza created TIKA-4680:
---------------------------------------
Summary: tika-grpc: Add unpack/all support for extracting embedded documents
Key: TIKA-4680
URL: https://issues.apache.org/jira/browse/TIKA-4680
Project: Tika
Issue Type: Improvement
Components: tika-pipes
Reporter: Nicholas DiPiazza
h2. Summary
The tika-grpc server currently only supports FetchAndParse, which returns
parsed text and metadata for a single document. There is no equivalent of the
REST server's {{PUT /unpack/all}} endpoint, which uses
RecursiveParserWrapper to extract embedded documents (attachments, slides,
worksheets) from container formats such as EML, PPTX, ZIP, and DOCX.
This was requested by Lawrence Moorehead (elemdisc) in the context of TIKA-4679
(HTTP/2 support).
h2. Proposed Design
Add a new server-side streaming RPC to the tika-grpc service:
{code:proto}
rpc Unpack(FetchAndParseRequest) returns (stream UnpackReply) {}

message UnpackReply {
  string fetch_key = 1;
  string embedded_resource_path = 2; // e.g. attachment0.pdf or word/document.xml
  bytes content = 3;                 // raw bytes of embedded doc
  map<string, string> metadata = 4;  // Tika metadata for this embedded doc
  string status = 5;
  string error_message = 6;
}
{code}
* Server implementation uses RecursiveParserWrapper with a
ContentHandlerFactory that captures each embedded document's bytes
* Each embedded document (plus the container itself) is streamed as a separate
UnpackReply message
* Aligns with REST /unpack/all semantics
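The streaming shape described above (one reply for the container, then one per embedded document) can be illustrated with a minimal, stdlib-only Java sketch. It does not use Tika or gRPC: {{java.util.zip}} stands in for RecursiveParserWrapper and a {{Consumer}} stands in for the gRPC response stream. The names {{UnpackReplySketch}}, {{UnpackSketch}} and {{sampleZip}}, the "length" metadata key, the container-first ordering, and the "OK"/"ERROR" status values are all hypothetical choices made for illustration and are not part of the proposal.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.function.Consumer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical plain-Java mirror of the proposed UnpackReply message (not generated code). */
final class UnpackReplySketch {
    final String fetchKey;
    final String embeddedResourcePath;
    final byte[] content;
    final Map<String, String> metadata;
    final String status;
    final String errorMessage;

    UnpackReplySketch(String fetchKey, String embeddedResourcePath, byte[] content,
                      Map<String, String> metadata, String status, String errorMessage) {
        this.fetchKey = fetchKey;
        this.embeddedResourcePath = embeddedResourcePath;
        this.content = content;
        this.metadata = metadata;
        this.status = status;
        this.errorMessage = errorMessage;
    }
}

final class UnpackSketch {
    private UnpackSketch() {}

    /**
     * Emits one reply for the container itself, then one per embedded file entry.
     * The observer plays the role of the server-side response stream.
     */
    static void unpack(String fetchKey, byte[] container, Consumer<UnpackReplySketch> observer) {
        observer.accept(new UnpackReplySketch(fetchKey, "", container,
                Map.of("length", String.valueOf(container.length)), "OK", ""));
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(container))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no bytes of their own
                }
                byte[] bytes = zip.readAllBytes(); // reads to the end of the current entry only
                observer.accept(new UnpackReplySketch(fetchKey, entry.getName(), bytes,
                        Map.of("length", String.valueOf(bytes.length)), "OK", ""));
            }
        } catch (IOException e) {
            // Assumption: failures are reported in-band via status and error_message.
            observer.accept(new UnpackReplySketch(fetchKey, "", new byte[0],
                    Map.of(), "ERROR", String.valueOf(e.getMessage())));
        }
    }

    /** Builds a small in-memory ZIP with two entries so the sketch can run without files. */
    static byte[] sampleZip() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buffer)) {
            zos.putNextEntry(new ZipEntry("attachment0.txt"));
            zos.write("hello".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("word/document.xml"));
            zos.write("<doc/>".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

Calling {{UnpackSketch.unpack("fetch-1", UnpackSketch.sampleZip(), replies::add)}} with a list named {{replies}} collects the emitted messages in order. Reporting failures through status and error_message rather than by terminating the stream is only one possible reading of those two fields; the ticket does not specify the error behaviour.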
h2. References
* REST UnpackerResource:
tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/UnpackerResource.java
* TIKA-4679: HTTP/2 support (sibling ticket; Lawrence's use case)