Hi Tim,
The code I showed is a minimal example that demonstrates the issue I'm
running into, which is that memory keeps growing.
In production, the loop that you see will read files off a file system and
parse them using logic close to what I showed. I use
contentHandler.toString() to get back the raw text so I can save it. Even
if I get rid of that call, I run into OOM.
Note that, if I test the exact same code against PDF or PPT or ODP or RTF
(I still have far more formats to test) I do *NOT* see the OOM issue even
when I increase the loop to 1000 -- memory usage remains steady and
stable. This is why in my original email I asked whether there is an issue
with XML files or with my code, such as failing to close or release
something.
Here is the full call stack when I get the OOM:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.StringBuffer.ensureCapacityImpl(StringBuffer.java:338)
at java.lang.StringBuffer.append(StringBuffer.java:114)
at java.io.StringWriter.write(StringWriter.java:106)
at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:93)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:136)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.TextContentHandler.characters(TextContentHandler.java:55)
at org.apache.tika.sax.TeeContentHandler.characters(TeeContentHandler.java:102)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.xerces.parsers.AbstractSAXParser.characters(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
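
The top frames show a StringWriter buffer being enlarged from
ToTextContentHandler.characters() at the moment of failure. As a rough,
Tika-free sketch of how a writer that lives across iterations keeps growing
while one created per pass does not (the class and method names below are
made up for illustration, and this only assumes the handler's buffer
behaves like a plain StringWriter; it is not Tika's code):

```java
import java.io.StringWriter;

public class HandlerReuseSketch {

    // One writer shared by every pass, similar to keeping a handler in a field.
    static int reusedLength(String text, int iterations) {
        StringWriter shared = new StringWriter();
        for (int i = 0; i < iterations; i++) {
            shared.write(text);
            shared.toString(); // reading the buffer does not clear it
        }
        return shared.getBuffer().length();
    }

    // A new writer for every pass; the previous one becomes garbage.
    static int freshLength(String text, int iterations) {
        int last = 0;
        for (int i = 0; i < iterations; i++) {
            StringWriter perPass = new StringWriter();
            perPass.write(text);
            last = perPass.getBuffer().length();
        }
        return last;
    }

    public static void main(String[] args) {
        String text = "0123456789"; // stand-in for one document's extracted text
        System.out.println("reused: " + reusedLength(text, 20)); // prints "reused: 200"
        System.out.println("fresh: " + freshLength(text, 20));   // prints "fresh: 10"
    }
}
```

If the same pattern applies to the shared contentHandler field, the buffer
would hold one more copy of the extracted text after every call.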
Thanks
Steve
On Mon, Feb 8, 2016 at 3:07 PM, Allison, Timothy B. <[email protected]>
wrote:
> I’m not sure why you’d want to append document contents across documents
> into one handler. Typically, you’d use a new ContentHandler and new
> Metadata object for each parse. Calling “toString()” does not clear the
> content handler, and you should have 20 copies of the extracted content on
> your final loop.
>
>
>
> There shouldn’t be any difference across file types in the fact that you
> are appending a new copy of the extracted text with each loop. You might
> not be seeing the memory growth if your other file types aren’t big enough
> and if you are only doing 20 loops.
>
>
>
> But the larger question…what are you trying to accomplish?
>
>
>
> *From:* Steven White [mailto:[email protected]]
> *Sent:* Monday, February 08, 2016 1:38 PM
> *To:* [email protected]
> *Subject:* Preventing OutOfMemory exception
>
>
>
> Hi everyone,
>
>
>
> I'm integrating Tika with my application and need your help to figure out
> if the OOM I'm getting is due to the way I'm using Tika or if it is an
> issue with parsing XML files.
>
>
>
> The following example code is causing OOM on the 7th iteration with
> -Xmx2g. The test will pass with -Xmx4g. The XML file I'm trying to parse
> is 51 MB in size. I do not see this issue with other file types that I
> tested so far. Memory usage keeps on growing with XML file types, but
> stays constant with other file types.
>
>
>
> public class Extractor {
>     private BodyContentHandler contentHandler = new BodyContentHandler(-1);
>     private AutoDetectParser parser = new AutoDetectParser();
>     private Metadata metadata = new Metadata();
>
>     public String extract(File file) throws Exception {
>         // Declared outside the try so the finally block can close it.
>         InputStream stream = TikaInputStream.get(file);
>         try {
>             parser.parse(stream, contentHandler, metadata);
>             return contentHandler.toString();
>         }
>         finally {
>             stream.close();
>         }
>     }
> }
>
> public static void main(...) {
>     Extractor extractor = new Extractor();
>     File file = new File("C:\\temp\\test.xml");
>     for (int i = 0; i < 20; i++) {
>         extractor.extract(file);
>     }
> }
>
>
>
> Any idea if this is an issue with XML files or if the issue is in my code?
>
>
>
> Thanks
>
>
>
> Steve
>
>
>