First of all, thank you for the Tika project, it is a great project! Recently I have been running Tika to extract text from documents, and I found that the Java off-heap (native) memory keeps growing until the machine's memory hits 100% and the process is killed by the oom-killer.
Then I used pmap and dumped the process memory (excluding the Java heap). The data looks like OOXML (zip) entry names interleaved with "PK" zip signatures: [Content_Types].xml, _rels/.rels, word/_rels/document.xml.rels, word/document.xml, word/footer4.xml, word/header4.xml, word/footer2.xml, word/header2.xml, word/header3.xml, word/footer3.xml, word/header1.xml, word/footer1.xml, word/footnotes.xml, word/endnotes.xml, word/header5.xml, word/media/image3.png, word/media/image1.jpeg, word/media/image2.jpeg, word/theme/theme1.xml, word/settings.xml, customXml/itemProps2.xml, customXml/item2.xml, docProps/custom.xml, customXml/_rels/item1.xml.rels, customXml/_rels/item2.xml.rels, customXml/itemProps1.xml.

This is Office document content, so why is it sitting in off-heap memory? My suspicion is that parsing some particular kind of Office document causes a native memory leak. I'm sorry, I don't know what is special about those documents, and I am not able to provide a sample.

One more piece of information: when I debugged the code on my own Mac with an xlsx sample, calling Tika.detect invoked the ZipArchiveInputStream constructor twice and called java.util.zip.Inflater#end() the same number of times; but calling Tika.parseToString invoked the ZipArchiveInputStream constructor once and never called java.util.zip.Inflater#end(). Could that be the cause of the off-heap memory leak, given that Inflater uses native code?

Looking forward to your reply! Thank you very much!
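For reference, here is a minimal sketch of how I am exercising Tika while watching memory; the sample path and iteration count are just placeholders, not my exact production code:

```java
import java.io.File;
import org.apache.tika.Tika;

public class OffHeapCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder: any local .docx/.xlsx sample passed on the command line.
        File sample = new File(args[0]);
        Tika tika = new Tika();

        for (int i = 0; i < 100_000; i++) {
            // Extract text repeatedly from the same document.
            String text = tika.parseToString(sample);

            if (i % 1_000 == 0) {
                Runtime rt = Runtime.getRuntime();
                long heapUsedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
                // In my runs the on-heap usage stays roughly flat after GC, while the
                // process RSS (watched separately with `pmap -x <pid>` or
                // /proc/<pid>/status) keeps growing.
                System.out.printf("iteration=%d heapUsed=%dMB textLength=%d%n",
                        i, heapUsedMb, text.length());
            }
        }
    }
}
```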
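And this is my (possibly wrong) understanding of why a missing Inflater#end() call could matter, shown on a tiny standalone example that has nothing to do with Tika itself:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class InflaterEndDemo {
    public static void main(String[] args) throws DataFormatException {
        byte[] compressed = compress("hello inflater".getBytes(StandardCharsets.UTF_8));

        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] out = new byte[64];
        int n = inflater.inflate(out);
        System.out.println(new String(out, 0, n, StandardCharsets.UTF_8));

        // The zlib state behind an Inflater lives in native (off-heap) memory.
        // end() releases it immediately; if end() is never called, that native
        // memory is only reclaimed once the Inflater object itself is
        // garbage-collected and cleaned up.
        inflater.end();
    }

    private static byte[] compress(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        byte[] buf = new byte[256];
        int n = deflater.deflate(buf);
        deflater.end();
        byte[] result = new byte[n];
        System.arraycopy(buf, 0, result, 0, n);
        return result;
    }
}
```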