Hey! I’m trying to figure out the best way to use the /tika endpoint in tika-server. My goal is to extract all the text recursively and, for images, either save them to disk or return them inline, e.g. as base64.
Right now I’m using the plain text output, but it seems I’ll need a more structured format. Is XHTML the only option, or is there a built-in way to get JSON instead? If not, I can probably build something to output JSON myself; please let me know if you have any pointers on where to start.

As for images, I could handle them in a new OCR module, but since I only need to extract the image binaries, is there a simpler way? Maybe by tweaking EmbeddedDocumentExtractor?

Thanks,
Cristi
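
P.S. For context, here is a minimal sketch of the kind of plain-text call described above. It assumes a tika-server running on the default localhost:9998; the helper names `build_tika_request` and `extract` are placeholders for illustration and are not part of Tika.

```python
import urllib.request

# Assumption: tika-server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, accept: str = "text/plain") -> urllib.request.Request:
    """Build (but do not send) a PUT request for the /tika endpoint.

    The Accept header selects the output format of the response.
    """
    return urllib.request.Request(
        TIKA_URL,
        data=data,
        method="PUT",
        headers={"Accept": accept},
    )


def extract(data: bytes, accept: str = "text/plain") -> str:
    """Send the document bytes to tika-server and return the response body."""
    with urllib.request.urlopen(build_tika_request(data, accept)) as resp:
        return resp.read().decode("utf-8")
```

Calling `extract(open("report.docx", "rb").read())` (the file name is made up) should give the plain text I get today; passing `accept="text/html"` is how I would request the XHTML output instead.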