[jira] [Created] (TIKA-953) Tika failed to recognize non-ustar Tar file?

2012-07-13 Thread Jing Li (JIRA)
Jing Li created TIKA-953:


 Summary: Tika failed to recognize non-ustar Tar  file?
 Key: TIKA-953
 URL: https://issues.apache.org/jira/browse/TIKA-953
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.1
Reporter: Jing Li


The "file" command on Linux reports the file type as "POSIX tar archive (GNU)", 
but Tika recognizes it as "application/xhtml+xml". The class I used is 
DefaultDetector.

Below are the leading bytes of the file, as decimal values:

99, 102, 101, 114, 98, 114, 97, 99, 104, 101, 46, 48, 48, 54, 55, 54, 50, 55, 
57, 45, 53, 54, 54, 55, 50, 52, 47, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 48, 48, 48, 48, 55, 48, 48, 0, 48, 48, 48, 48, 48, 48, 48, 
0, 48, 48, 48, 48, 48, 48, 48, 0, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 
0, 49, 49, 55, 55, 55, 49, 49, 52, 50, 48, 53, 0, 48, 49, 51, 51, 51, 49, 0, 
32, 53, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 117, 
115, 116, 97, 114, 32, 32, 0, 114, 111, 111, 116, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 114, 111, 111, 116, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... 
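
The byte dump above can be read as a tar header block. The following is a minimal sketch, not Tika code, that rebuilds the 512-byte block from the field values visible in the dump, reads the magic at offset 257, and recomputes the header checksum. The helper names (field, build_header, classify_magic, header_checksum) are made up for illustration. The dump is truncated, so the sketch assumes every byte after the second "root" field is zero; the stored checksum matching under that assumption is consistent with it, but does not prove it.

```python
def field(text, width):
    """Encode an ASCII header field and pad it with NUL bytes to a fixed width."""
    raw = text.encode("ascii")
    assert len(raw) <= width
    return raw.ljust(width, b"\0")


def build_header():
    """Rebuild the 512-byte header block from the values visible in the dump.

    Assumption: everything after the gname field is zero (the dump is cut off).
    """
    parts = [
        field("cferbrache.00676279-566724/", 100),  # name
        field("0000700", 8),                        # mode
        field("0000000", 8),                        # uid
        field("0000000", 8),                        # gid
        field("00000000000", 12),                   # size
        field("11777114205", 12),                   # mtime (octal)
        b"013331\0 ",                               # chksum: 6 digits, NUL, space
        b"5",                                       # typeflag '5' = directory
        field("", 100),                             # linkname
        b"ustar  \0",                               # magic as it appears in the dump
        field("root", 32),                          # uname
        field("root", 32),                          # gname
    ]
    return b"".join(parts).ljust(512, b"\0")


def classify_magic(block):
    """Name the tar flavour suggested by the 8 bytes at offset 257."""
    magic = block[257:265]
    if magic == b"ustar\0" + b"00":
        return "POSIX ustar"
    if magic == b"ustar  \0":
        return "GNU tar (old ustar variant)"
    return "no ustar magic"


def header_checksum(block):
    """Sum all 512 header bytes, counting the 8-byte chksum field as spaces."""
    return sum(block[:148]) + 8 * 0x20 + sum(block[156:512])


if __name__ == "__main__":
    block = build_header()
    print(classify_magic(block))
    print(oct(header_checksum(block)), "stored:", block[148:154].decode("ascii"))
```

The magic here is "ustar" followed by two spaces and a NUL, which is what GNU tar writes, whereas POSIX ustar uses "ustar", a NUL, and the version "00". A detector that only matches the POSIX form at offset 257 would not match this block; whether that is what happens inside Tika 1.1 is not established by the report.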

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (TIKA-953) Tika failed to recognize non-ustar Tar file?

2012-07-13 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413609#comment-13413609
 ] 

Nick Burch commented on TIKA-953:
-

Any chance you could share a file that demonstrates the problem, or 
instructions on how to create one that does?





[jira] [Created] (TIKA-954) Tika throws OOM and GC limited exceeded on Microsoft docx file

2012-07-13 Thread Rob Tulloh (JIRA)
Rob Tulloh created TIKA-954:
---

 Summary: Tika throws OOM and GC limited exceeded on Microsoft docx file
 Key: TIKA-954
 URL: https://issues.apache.org/jira/browse/TIKA-954
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2
 Environment: Linux (CentOS 4.x)
Reporter: Rob Tulloh


Stack trace produced with attached docx file

2012-07-13_04:45:36.86910 java.lang.OutOfMemoryError: GC overhead limit exceeded
2012-07-13_04:45:36.86932 Dumping heap to /var/log/oom/content-extractor-9998.dump.1 ...
2012-07-13_04:46:47.38774 Heap dump file created [925402960 bytes in 70.518 secs]
2012-07-13_04:46:57.17658 java.lang.OutOfMemoryError: GC overhead limit exceeded
2012-07-13_04:46:57.17718   at java.lang.String.substring(String.java:1939)
2012-07-13_04:46:57.17736   at org.apache.xmlbeans.impl.store.Locale$SaxHandler.startElement(Locale.java:3254)
2012-07-13_04:46:57.17750   at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportStartTag(Piccolo.java:1082)
2012-07-13_04:46:57.17763   at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseAttributesNS(PiccoloLexer.java:1822)
2012-07-13_04:46:57.1   at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseOpenTagNS(PiccoloLexer.java:1521)
2012-07-13_04:46:57.17793   at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseTagNS(PiccoloLexer.java:1362)
2012-07-13_04:46:57.17806   at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1293)
2012-07-13_04:46:57.17819   at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
2012-07-13_04:46:57.17839   at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4808)
2012-07-13_04:46:57.17853   at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
2012-07-13_04:46:57.17868   at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
2012-07-13_04:46:57.17883   at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
2012-07-13_04:46:57.17897   at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3439)
2012-07-13_04:46:57.17911   at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1270)
2012-07-13_04:46:57.17929   at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1257)
2012-07-13_04:46:57.17945   at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
2012-07-13_04:46:57.17962   at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
2012-07-13_04:46:57.17978   at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:134)
2012-07-13_04:46:57.17991   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:159)
2012-07-13_04:46:57.18004   at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:116)
2012-07-13_04:46:57.18019   at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:53)
2012-07-13_04:46:57.18035   at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:180)
2012-07-13_04:46:57.18051   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:87)
2012-07-13_04:46:57.18066   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
2012-07-13_04:46:57.18078   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
2012-07-13_04:46:57.18090   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
2012-07-13_04:46:57.18103   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
2012-07-13_04:46:57.18115   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
2012-07-13_04:46:57.18127   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
2012-07-13_04:46:57.18146   at org.apache.tika.server.TikaResource$3.write(TikaResource.java:138)
2012-07-13_04:46:57.18158   at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:117)
2012-07-13_04:46:57.18169   at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:257)






[jira] [Updated] (TIKA-954) Tika throws OOM and GC limited exceeded on Microsoft docx file

2012-07-13 Thread Rob Tulloh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rob Tulloh updated TIKA-954:


Attachment: Word.docx

The docx file that causes the error.


[jira] [Commented] (TIKA-954) Tika throws OOM and GC limited exceeded on Microsoft docx file

2012-07-13 Thread Rob Tulloh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413850#comment-13413850
 ] 

Rob Tulloh commented on TIKA-954:
-

> curl -v -T Word.docx http://localhost:9998/tika
* About to connect() to localhost port 9998
*   Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 9998
> PUT /tika HTTP/1.1
> User-Agent: curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5
> Host: localhost:9998
> Accept: */*
> Content-Length: 4543821
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
Empty reply from server
* Connection #0 to host localhost left intact
curl: (52) Empty reply from server
* Closing connection #0



[jira] [Commented] (TIKA-954) Tika throws OOM and GC limited exceeded on Microsoft docx file

2012-07-13 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413863#comment-13413863
 ] 

Nick Burch commented on TIKA-954:
-

How much memory are you giving to the Tika process? Did you try increasing it?


[jira] [Commented] (TIKA-954) Tika throws OOM and GC limited exceeded on Microsoft docx file

2012-07-13 Thread Rob Tulloh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413904#comment-13413904
 ] 

Rob Tulloh commented on TIKA-954:
-

We have been running with 600M. We are now increasing the memory size to see 
what happens.


[jira] [Commented] (TIKA-954) Tika throws OOM and GC limited exceeded on Microsoft docx file

2012-07-13 Thread Rob Tulloh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13414076#comment-13414076
 ] 

Rob Tulloh commented on TIKA-954:
-

It turns out we are running on CentOS 5.x, so I can test with larger JVM heap 
sizes. At 1G we still get the OOM. At 2G we did not see the OOM, but got no 
response from the server.
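
A .docx file is a ZIP package, and the stack trace in the report shows POI handing the main document part to XMLBeans for parsing. The uncompressed size of the XML parts may therefore matter more than the 4543821-byte upload reported by curl. The following is a minimal sketch for listing the largest parts of such a package with the Python standard library. The function name largest_entries and the in-memory demo archive are made up for illustration; the demo content is a stand-in and is not the attached Word.docx.

```python
import io
import zipfile


def largest_entries(source, top=5):
    """Return (name, compressed_size, uncompressed_size) for the largest members.

    `source` may be a path or a binary file-like object holding a ZIP archive.
    """
    with zipfile.ZipFile(source) as archive:
        infos = sorted(archive.infolist(), key=lambda i: i.file_size, reverse=True)
        return [(i.filename, i.compress_size, i.file_size) for i in infos[:top]]


def demo_archive():
    """Build a small stand-in archive in memory (hypothetical content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Highly repetitive XML compresses well, so the stored size is far
        # smaller than what a parser has to read.
        archive.writestr("word/document.xml", "<w:p/>" * 100_000)
        archive.writestr("[Content_Types].xml", "<Types/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name, packed, unpacked in largest_entries(demo_archive()):
        print(f"{name}: {packed} bytes compressed, {unpacked} bytes uncompressed")
```

Running largest_entries("Word.docx") against the attachment would show whether one part is unusually large once unpacked. That would only be a hint, not a diagnosis of the OOM.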

> Tika throws OOM and GC limited exceeded on Microsoft docx file
> --
>
> Key: TIKA-954
> URL: https://issues.apache.org/jira/browse/TIKA-954
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.2
> Environment: Linux (CentOS 4.x)
>Reporter: Rob Tulloh
> Attachments: Word.docx
>
>
> Stack trace produced with attached docx file
> 2012-07-13_04:45:36.86910 java.lang.OutOfMemoryError: GC overhead limit 
> exceeded
> 2012-07-13_04:45:36.86932 Dumping heap to 
> /var/log/oom/content-extractor-9998.dump.1 ...
> 2012-07-13_04:46:47.38774 Heap dump file created [925402960 bytes in 70.518 
> secs]
> 2012-07-13_04:46:57.17658 java.lang.OutOfMemoryError: GC overhead limit 
> exceeded
> 2012-07-13_04:46:57.17718   at 
> java.lang.String.substring(String.java:1939)
> 2012-07-13_04:46:57.17736   at 
> org.apache.xmlbeans.impl.store.Locale$SaxHandler.startElement(Locale.java:3254)
> 2012-07-13_04:46:57.17750   at 
> org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportStartTag(Piccolo.java:1082)
> 2012-07-13_04:46:57.17763   at 
> org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseAttributesNS(PiccoloLexer.java:1822)
> 2012-07-13_04:46:57.1   at 
> org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseOpenTagNS(PiccoloLexer.java:1521)
> 2012-07-13_04:46:57.17793   at 
> org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseTagNS(PiccoloLexer.java:1362)
> 2012-07-13_04:46:57.17806   at 
> org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1293)
> 2012-07-13_04:46:57.17819   at 
> org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
> 2012-07-13_04:46:57.17839   at 
> org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4808)
> 2012-07-13_04:46:57.17853   at 
> org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
> 2012-07-13_04:46:57.17868   at 
> org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
> 2012-07-13_04:46:57.17883   at 
> org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
> 2012-07-13_04:46:57.17897   at 
> org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3439)
> 2012-07-13_04:46:57.17911   at 
> org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1270)
> 2012-07-13_04:46:57.17929   at 
> org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1257)
> 2012-07-13_04:46:57.17945   at 
> org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
> 2012-07-13_04:46:57.17962   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown
>  Source)
> 2012-07-13_04:46:57.17978   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:134)
> 2012-07-13_04:46:57.17991   at 
> org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:159)
> 2012-07-13_04:46:57.18004   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:116)
> 2012-07-13_04:46:57.18019   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:53)
> 2012-07-13_04:46:57.18035   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:180)
> 2012-07-13_04:46:57.18051   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:87)
> 2012-07-13_04:46:57.18066   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
> 2012-07-13_04:46:57.18078   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 2012-07-13_04:46:57.18090   at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
> 2012-07-13_04:46:57.18103   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 2012-07-13_04:46:57.18115   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> 2012-07-13_04:46:57.18127   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
> 2012-07-13_04:46:57.18146   at 
> org.apache.tika.server.TikaResource$3.write(TikaResource.java:138)
> 2012-07-13_04:46:57.18158   at 
> org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:117)
> 2012-07-13_04:46:57.18169   at 
> org.apache.cxf.jaxrs.interceptor.JAXRSOutInt

[jira] [Commented] (TIKA-954) Tika throws OOM and GC limited exceeded on Microsoft docx file

2012-07-13 Thread Rob Tulloh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13414118#comment-13414118
 ] 

Rob Tulloh commented on TIKA-954:
-

We can provide the JVM heap dump if you think it would be useful. This could be 
a memory leak of some kind, given that the "GC overhead limit exceeded" message 
is being produced.
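
For context, HotSpot reports "GC overhead limit exceeded" when nearly all of 
the JVM's time goes to garbage collection while very little heap is reclaimed. 
The message therefore points to a nearly full heap and does not by itself 
distinguish a leak from a document that is simply too large for the configured 
heap. The heap dump reported in the stack trace is 925402960 bytes (about 882 
MiB). Below is a minimal sketch for checking the heap limits a JVM is actually 
running with. The class name HeapReport and its helper methods are 
hypothetical; only standard java.lang.management calls are used.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of Tika: prints the heap limits of the current JVM.
public class HeapReport {

    // Converts a byte count to whole mebibytes (rounded down).
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // Formats a MemoryUsage snapshot; getMax() returns -1 when no limit is defined.
    static String describe(MemoryUsage heap) {
        long max = heap.getMax();
        String maxText = (max < 0) ? "undefined" : toMiB(max) + " MiB";
        return "heap used=" + toMiB(heap.getUsed()) + " MiB"
                + ", committed=" + toMiB(heap.getCommitted()) + " MiB"
                + ", max=" + maxText;
    }

    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        System.out.println(describe(memory.getHeapMemoryUsage()));
    }
}
```

The sketch reports on the JVM it runs in, so it is only meaningful when 
started with the same heap options as the Tika server.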

> Tika throws OOM and GC limited exceeded on Microsoft docx file
> --
>
> Key: TIKA-954
> URL: https://issues.apache.org/jira/browse/TIKA-954
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.2
> Environment: Linux (CentOS 4.x)
>Reporter: Rob Tulloh
> Attachments: Word.docx
>
>

[jira] [Commented] (TIKA-954) Tika throws OOM and GC limited exceeded on Microsoft docx file

2012-07-13 Thread Rob Tulloh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13414296#comment-13414296
 ] 

Rob Tulloh commented on TIKA-954:
-

We increased the JVM heap size to 2 GB. The call now returns an empty reply. 
Below is what Tika reported in the log file. I cannot tell whether this is a 
limitation of the server or of curl, though I am inclined to believe it is the 
server. The document in question appears to contain more than 3000 pages of 
text.

2012-07-14_00:17:40.15182 INFO: tika/12345/Word.docx (autodetecting type)
2012-07-14_01:04:14.43799 Jul 13, 2012 8:04:12 PM 
org.apache.cxf.jaxrs.impl.WebApplicationExceptionMapper toResponse
t South Africa in 2000 on my unhappy first senior England tour."
2012-07-14_01:04:14.75706 Jul 13, 2012 8:04:12 PM 
org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
 unwinding now
2012-07-14_01:04:14.75707 org.apache.cxf.interceptor.Fault: Could not send 
Message.
dleMessage(MessageSenderInterceptor.java:64)
2012-07-14_01:04:14.75709   at 
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:263)
ptor.java:77)
2012-07-14_01:04:14.75710   at 
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:263)
a:123)
nation.java:323)
n.java:289)
2012-07-14_01:04:14.76707   at 
org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:72)
2012-07-14_01:04:14.76707   at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:943)
2012-07-14_01:04:14.76708   at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:879)
2012-07-14_01:04:14.76708   at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
ion.java:250)
2012-07-14_01:04:14.76709   at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:110)
2012-07-14_01:04:14.76709   at 
org.eclipse.jetty.server.Server.handle(Server.java:345)
2012-07-14_01:04:14.76710   at 
org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:441)
ava:919)
2012-07-14_01:04:14.76712   at 
org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:588)
2012-07-14_01:04:14.76712   at 
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:218)
2012-07-14_01:04:14.76714   at 
org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:51)
2012-07-14_01:04:14.76714   at 
org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:586)
2012-07-14_01:04:14.76715   at 
org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:44)
2012-07-14_01:04:14.76715   at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:598)
2012-07-14_01:04:14.76716   at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:533)
2012-07-14_01:04:14.76716   at java.lang.Thread.run(Thread.java:662)
2012-07-14_01:04:14.76716 Caused by: org.eclipse.jetty.io.EofException
2012-07-14_01:04:14.76717   at 
org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:921)
2012-07-14_01:04:14.76717   at 
org.eclipse.jetty.server.HttpConnection.flushResponse(HttpConnection.java:612)
2012-07-14_01:04:14.76718   at 
org.eclipse.jetty.server.HttpConnection$Output.close(HttpConnection.java:995)
2012-07-14_01:04:14.76718   at 
org.apache.cxf.transport.http.AbstractHTTPDestination$WrappedOutputStream.close(AbstractHTTPDestination.java:650)
2012-07-14_01:04:14.76720   at 
org.apache.cxf.transport.AbstractConduit.close(AbstractConduit.java:56)
2012-07-14_01:04:14.76721   at 
org.apache.cxf.transport.http.AbstractHTTPDestination$BackChannelConduit.close(AbstractHTTPDestination.java:593)
2012-07-14_01:04:14.76721   at 
org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:62)
2012-07-14_01:04:14.76722   ... 23 more
2012-07-14_01:04:14.76722 Caused by: java.nio.channels.ClosedChannelException
2012-07-14_01:04:14.76722   at 
sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:135)
2012-07-14_01:04:14.76724   at 
sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:357)
2012-07-14_01:04:14.76724   at 
java.nio.channels.SocketChannel.write(SocketChannel.java:360)
2012-07-14_01:04:14.76725   at 
org.eclipse.jetty.io.nio.ChannelEndPoint.gatheringFlush(ChannelEndPoint.java:354)
2012-07-14_01:04:14.76725   at 
org.eclipse.jetty.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:292)
2012-07-14_01:04:14.76725   at 
org.eclipse.jetty.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:300)
2012-07-14_01:04:14.76726   at 
org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:848)
2012-07-14_01:04:14.76726   ... 29 more
2012-07-14_01:04:14.76727 Jul

[jira] [Commented] (TIKA-954) Tika throws OOM and GC limited exceeded on Microsoft docx file

2012-07-13 Thread Rob Tulloh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13414297#comment-13414297
 ] 

Rob Tulloh commented on TIKA-954:
-

curl output:

* Connected to localhost (127.0.0.1) port 9998
> PUT /tika/12345/Word.docx HTTP/1.1
> User-Agent: curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 
> OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5
> Host: localhost:9998
> Accept: */*
> Content-Type: application/octet-stream
> Content-Length: 4543821
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 4437k    0     0  100 4437k      0  12612  0:06:00  0:06:00 --:--:--     0
Empty reply from server
100 4437k    0     0  100 4437k      0  12612  0:06:00  0:06:00 --:--:--     0
* Connection #0 to host localhost left intact

curl: (52) Empty reply from server
* Closing connection #0
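
One way to check whether the empty reply comes from the server rather than 
from curl is to repeat the same PUT from a second client that never times out 
on its own. The sketch below does that with the JDK's HttpURLConnection. The 
class name TikaPutClient, its putFile and readAll helpers, and the 
command-line arguments are hypothetical; the Content-Type matches the curl 
trace above, and decoding the reply as UTF-8 is an assumption. If the server 
closes the connection without answering, the call is expected to fail with an 
IOException instead of returning.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical test client, not part of Tika.
public class TikaPutClient {

    // Drains a stream into a byte array; a null stream yields an empty array.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        if (in == null) {
            return buffer.toByteArray();
        }
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    // Sends body with HTTP PUT and returns "<status> <response text>".
    // A readTimeoutMillis of 0 waits for the reply indefinitely.
    static String putFile(URL endpoint, byte[] body, int readTimeoutMillis)
            throws IOException {
        HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
        try {
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/octet-stream");
            conn.setFixedLengthStreamingMode(body.length);
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(readTimeoutMillis);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            int status = conn.getResponseCode();
            // Error responses are exposed on the error stream, which may be null.
            try (InputStream reply = status >= 400
                    ? conn.getErrorStream() : conn.getInputStream()) {
                // Assumption: the extracted text is returned as UTF-8.
                return status + " " + new String(readAll(reply), StandardCharsets.UTF_8);
            }
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        URL endpoint = new URL(args[0]);                      // server URL
        byte[] body = Files.readAllBytes(Paths.get(args[1])); // document to send
        System.out.println(putFile(endpoint, body, 0));
    }
}
```

Usage, with the URL from the curl trace: 
java TikaPutClient http://localhost:9998/tika/12345/Word.docx Word.docx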


