Nicholas DiPiazza created TIKA-4680:
---------------------------------------

             Summary: tika-grpc: Add unpack/all support for extracting embedded 
documents
                 Key: TIKA-4680
                 URL: https://issues.apache.org/jira/browse/TIKA-4680
             Project: Tika
          Issue Type: Improvement
          Components: tika-pipes
            Reporter: Nicholas DiPiazza


h2. Summary

The tika-grpc server currently only supports FetchAndParse, which returns 
parsed text and metadata for a single document. There is no equivalent of the 
REST server's {{PUT /unpack/all}} endpoint, which uses 
RecursiveParserWrapper to extract embedded documents (attachments, slides, 
worksheets) from container formats like EML, PPTX, ZIP, DOCX, etc.

This was requested by Lawrence Moorehead (elemdisc) in the context of TIKA-4679 
(HTTP/2 support).

h2. Proposed Design

Add a new server-side streaming RPC to the tika-grpc service:

{code:proto}
rpc Unpack(FetchAndParseRequest) returns (stream UnpackReply) {}

message UnpackReply {
  string fetch_key = 1;
  string embedded_resource_path = 2;    // e.g. attachment0.pdf or word/document.xml
  bytes  content = 3;                    // raw bytes of embedded doc
  map<string, string> metadata = 4;      // Tika metadata for this embedded doc
  string status = 5;
  string error_message = 6;
}
{code}

* Server implementation uses RecursiveParserWrapper with a 
ContentHandlerFactory that captures each embedded document's bytes
* Each embedded document (plus the container itself) is streamed as a separate 
UnpackReply message
* Aligns with REST /unpack/all semantics
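
As a rough illustration of the "one reply per embedded document, plus the container" streaming shape described above, here is a minimal, self-contained Java sketch. It is not the proposed implementation: it uses only {{java.util.zip}} in place of RecursiveParserWrapper, a plain {{Consumer}} in place of the gRPC response observer, and a hand-written {{UnpackReply}} class in place of the generated protobuf type. The class and method names ({{UnpackSketch}}, {{unpack}}, {{sampleZip}}), the container-first ordering, the {{"OK"}}/{{"ERROR"}} status values, and the {{resourceName}} metadata key are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class UnpackSketch {

    /** Hypothetical stand-in for the proposed UnpackReply message (same six fields). */
    static final class UnpackReply {
        final String fetchKey;
        final String embeddedResourcePath;
        final byte[] content;
        final Map<String, String> metadata;
        final String status;
        final String errorMessage;

        UnpackReply(String fetchKey, String embeddedResourcePath, byte[] content,
                    Map<String, String> metadata, String status, String errorMessage) {
            this.fetchKey = fetchKey;
            this.embeddedResourcePath = embeddedResourcePath;
            this.content = content;
            this.metadata = metadata;
            this.status = status;
            this.errorMessage = errorMessage;
        }
    }

    /**
     * Emits one reply for the container, then one per embedded ZIP entry.
     * The Consumer plays the role a gRPC response observer would play on the server.
     */
    static void unpack(String fetchKey, byte[] container, Consumer<UnpackReply> observer) {
        // Assumption: the container itself is sent first, with an empty resource path.
        observer.accept(new UnpackReply(fetchKey, "", container, Map.of(), "OK", ""));
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(container))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no bytes of their own
                }
                // readAllBytes() stops at the end of the current entry, not the archive.
                byte[] bytes = zip.readAllBytes();
                observer.accept(new UnpackReply(fetchKey, entry.getName(), bytes,
                        Map.of("resourceName", entry.getName()), "OK", ""));
            }
        } catch (IOException e) {
            // Report the failure in-band so the client sees it as part of the stream.
            observer.accept(new UnpackReply(fetchKey, "", new byte[0], Map.of(),
                    "ERROR", String.valueOf(e.getMessage())));
        }
    }

    /** Builds a small in-memory ZIP with two entries to act as the container. */
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("attachment0.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write("<doc/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<UnpackReply> replies = new ArrayList<>();
        unpack("doc-1", sampleZip(), replies::add);
        for (UnpackReply r : replies) {
            System.out.println(r.status + " '" + r.embeddedResourcePath + "' "
                    + r.content.length + " bytes");
        }
    }
}
```

Sending each embedded document as its own message means a client can start processing attachments before the whole container has been unpacked, and a failure can be reported via {{status}} and {{error_message}} without discarding replies already sent.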

h2. References

* REST UnpackerResource: 
tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/UnpackerResource.java
* TIKA-4679: HTTP/2 support (sibling ticket; Lawrence's use case)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
