First of all, thank you for the Tika project; it is a great project!

Recently I ran Tika to extract text from documents, and I found that the
Java off-heap memory keeps growing until 100% of the machine's memory is
used and the process is killed by the oom-killer.

I then used pmap and dumped the native memory regions (excluding the Java
heap), and the contents look like this:

[Content_Types].xmlPK
_rels/.relsPK word/_rels/document.xml.relsPK word/document.xmlPK
word/footer4.xmlPK word/header4.xmlPK word/footer2.xmlPK word/header2.xmlPK
word/header3.xmlPK word/footer3.xmlPK word/header1.xmlPK word/footer1.xmlPK
word/footnotes.xmlPK word/endnotes.xmlPK word/header5.xmlPK
word/media/image3.pngPK word/media/image1.jpegPK word/media/image2.jpegPK
word/theme/theme1.xmlPK word/settings.xmlPK
customXml/itemProps2.xmlPK customXml/item2.xmlPK docProps/custom.xmlPK
customXml/_rels/item1.xml.relsPK customXml/_rels/item2.xml.relsPK
customXml/itemProps1.xmlPK



These are Office document entry names, so why are they sitting in off-heap
memory? I suspect that parsing some particular kind of Office document
causes a native memory leak.

Unfortunately, I don't know which kind of document triggers it, and I
cannot provide a sample.


One more piece of information: when I debugged the code on my own Mac with
an xlsx sample, a call to tika.detect invoked the ZipArchiveInputStream
constructor twice and called java.util.zip.Inflater#end() the same number
of times; but a call to tika.parseToString invoked the
ZipArchiveInputStream constructor once and never called
java.util.zip.Inflater#end().
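
For reference, the test I was debugging is essentially the following
sketch (the file name is just an example; any small xlsx sample will do):

    import java.io.File;
    import org.apache.tika.Tika;

    public class TikaOffHeapTest {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            File sample = new File("sample.xlsx"); // example path

            // detect(): in my debugging this constructed ZipArchiveInputStream
            // twice and called Inflater#end() twice, so the native zlib
            // memory was released.
            System.out.println("type: " + tika.detect(sample));

            // parseToString(): this constructed ZipArchiveInputStream once,
            // but Inflater#end() was never called.
            System.out.println("chars: " + tika.parseToString(sample).length());
        }
    }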

Could that be the cause of the off-heap memory leak, since Inflater is
backed by native code?
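
My understanding (which is why I ask) is roughly the following; the
inflate call is only a placeholder:

    import java.util.zip.Inflater;

    // Each Inflater holds a native zlib stream; that native memory is
    // released only when end() is called (or, eventually, when the object
    // is garbage-collected). If a parser creates Inflaters and never calls
    // end(), the memory shows up as off-heap growth, which would match
    // what pmap shows above.
    Inflater inflater = new Inflater(true);
    try {
        // inflater.setInput(...); inflater.inflate(...); // placeholder
    } finally {
        inflater.end(); // frees the native zlib memory immediately
    }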

Looking forward to your reply. Thank you very much!
