[jira] [Created] (TIKA-953) Tika failed to recognize non-ustar Tar file?
Jing Li created TIKA-953:

Summary: Tika failed to recognize non-ustar Tar file?
Key: TIKA-953
URL: https://issues.apache.org/jira/browse/TIKA-953
Project: Tika
Issue Type: Bug
Components: mime
Affects Versions: 1.1
Reporter: Jing Li

The file type is indeed "POSIX tar archive (GNU)" according to the "file" command on Linux, but Tika recognizes it as "application/xhtml+xml". The class I used is DefaultDetector.

Below is the header data of the file:

99, 102, 101, 114, 98, 114, 97, 99, 104, 101, 46, 48, 48, 54, 55, 54, 50, 55,
57, 45, 53, 54, 54, 55, 50, 52, 47, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 48, 48, 48, 48, 55, 48, 48, 0, 48, 48, 48, 48, 48, 48,
48, 0, 48, 48, 48, 48, 48, 48, 48, 0, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48,
48, 0, 49, 49, 55, 55, 55, 49, 49, 52, 50, 48, 53, 0, 48, 49, 51, 51, 51, 49,
0, 32, 53, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
117, 115, 116, 97, 114, 32, 32, 0, 114, 111, 111, 116, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 114, 111, 111,
116, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
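The header bytes listed in the report can be read against the standard tar header layout: the entry name occupies the first 100 bytes, and an 8-byte magic/version field starts at offset 257. In the dump, the values 117, 115, 116, 97, 114, 32, 32, 0 spell "ustar" followed by two spaces and a NUL, which is the GNU tar form of the magic; the POSIX form is "ustar", a NUL, then "00". The Java sketch below only illustrates that check. The class and method names are hypothetical, and it is not a description of how Tika's detectors work.

```java
// Illustrative sketch only: TarHeaderSketch is a hypothetical helper, not Tika API.
final class TarHeaderSketch {
    /** Offset of the 8-byte magic/version field in a tar header block. */
    static final int MAGIC_OFFSET = 257;
    /** Length of the entry-name field at the start of the header. */
    static final int NAME_LENGTH = 100;

    /** Returns "posix", "gnu", or "none" depending on the bytes at offset 257. */
    static String classifyMagic(byte[] header) {
        if (header == null || header.length < MAGIC_OFFSET + 8) {
            return "none";
        }
        String magic = new String(header, MAGIC_OFFSET, 8,
                java.nio.charset.StandardCharsets.US_ASCII);
        if (magic.equals("ustar\0" + "00")) {
            return "posix"; // "ustar" NUL, version "00"
        }
        if (magic.equals("ustar  \0")) {
            return "gnu";   // "ustar", two spaces, NUL (as in the reported dump)
        }
        return "none";
    }

    /** Decodes the NUL-terminated entry name from the first 100 bytes. */
    static String entryName(byte[] header) {
        int limit = Math.min(header.length, NAME_LENGTH);
        int end = 0;
        while (end < limit && header[end] != 0) {
            end++;
        }
        return new String(header, 0, end, java.nio.charset.StandardCharsets.US_ASCII);
    }
}
```

Feeding the reported bytes into `classifyMagic` would be one way to confirm that the sample carries a tar magic at all, which helps separate "the file has no magic" from "the detector did not match the magic that is there".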
[jira] [Commented] (TIKA-953) Tika failed to recognize non-ustar Tar file?
[ https://issues.apache.org/jira/browse/TIKA-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413609#comment-13413609 ]

Nick Burch commented on TIKA-953:

Any chance you could share a file that demonstrates the problem, or instructions on how to create one that does?
[jira] [Created] (TIKA-954) Tika throws OOM and GC limited exceeded on Microsoft docx file
Rob Tulloh created TIKA-954:

Summary: Tika throws OOM and GC limited exceeded on Microsoft docx file
Key: TIKA-954
URL: https://issues.apache.org/jira/browse/TIKA-954
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.2
Environment: Linux (CentOS 4.x)
Reporter: Rob Tulloh

Stack trace produced with attached docx file

2012-07-13_04:45:36.86910 java.lang.OutOfMemoryError: GC overhead limit exceeded
2012-07-13_04:45:36.86932 Dumping heap to /var/log/oom/content-extractor-9998.dump.1 ...
2012-07-13_04:46:47.38774 Heap dump file created [925402960 bytes in 70.518 secs]
2012-07-13_04:46:57.17658 java.lang.OutOfMemoryError: GC overhead limit exceeded
2012-07-13_04:46:57.17718 at java.lang.String.substring(String.java:1939)
2012-07-13_04:46:57.17736 at org.apache.xmlbeans.impl.store.Locale$SaxHandler.startElement(Locale.java:3254)
2012-07-13_04:46:57.17750 at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportStartTag(Piccolo.java:1082)
2012-07-13_04:46:57.17763 at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseAttributesNS(PiccoloLexer.java:1822)
2012-07-13_04:46:57.1 at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseOpenTagNS(PiccoloLexer.java:1521)
2012-07-13_04:46:57.17793 at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseTagNS(PiccoloLexer.java:1362)
2012-07-13_04:46:57.17806 at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1293)
2012-07-13_04:46:57.17819 at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
2012-07-13_04:46:57.17839 at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4808)
2012-07-13_04:46:57.17853 at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
2012-07-13_04:46:57.17868 at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
2012-07-13_04:46:57.17883 at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
2012-07-13_04:46:57.17897 at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3439)
2012-07-13_04:46:57.17911 at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1270)
2012-07-13_04:46:57.17929 at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1257)
2012-07-13_04:46:57.17945 at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
2012-07-13_04:46:57.17962 at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
2012-07-13_04:46:57.17978 at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:134)
2012-07-13_04:46:57.17991 at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:159)
2012-07-13_04:46:57.18004 at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:116)
2012-07-13_04:46:57.18019 at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:53)
2012-07-13_04:46:57.18035 at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:180)
2012-07-13_04:46:57.18051 at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:87)
2012-07-13_04:46:57.18066 at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
2012-07-13_04:46:57.18078 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
2012-07-13_04:46:57.18090 at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
2012-07-13_04:46:57.18103 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
2012-07-13_04:46:57.18115 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
2012-07-13_04:46:57.18127 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
2012-07-13_04:46:57.18146 at org.apache.tika.server.TikaResource$3.write(TikaResource.java:138)
2012-07-13_04:46:57.18158 at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:117)
2012-07-13_04:46:57.18169 at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:257)
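The trace shows the error being raised while POI's XWPFDocument loads the main document part through XMLBeans. A .docx file is a ZIP container, so one hedged diagnostic step is to list the uncompressed size of each part and see whether an entry such as word/document.xml is much larger than the 4.5 MB upload suggests. The sketch below uses only java.util.zip; the class name and method are hypothetical, and it makes no claim about what the attached Word.docx actually contains.

```java
// Illustrative sketch only: DocxPartSizes is a hypothetical helper, not part of Tika or POI.
final class DocxPartSizes {
    /** Maps each ZIP entry name in the given file to its uncompressed size in bytes. */
    static java.util.Map<String, Long> uncompressedSizes(java.io.File docx)
            throws java.io.IOException {
        java.util.Map<String, Long> sizes = new java.util.LinkedHashMap<>();
        try (java.util.zip.ZipFile zip = new java.util.zip.ZipFile(docx)) {
            java.util.Enumeration<? extends java.util.zip.ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                java.util.zip.ZipEntry entry = entries.nextElement();
                // getSize() is the uncompressed size, or -1 if it is not recorded.
                sizes.put(entry.getName(), entry.getSize());
            }
        }
        return sizes;
    }
}
```

Printing the returned map for the problem file would show which parts dominate once unpacked, which may help decide whether more heap alone is a realistic fix.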
[jira] [Updated] (TIKA-954) Tika throws OOM and GC limited exceeded on Microsoft docx file
[ https://issues.apache.org/jira/browse/TIKA-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rob Tulloh updated TIKA-954:

Attachment: Word.docx

The docx file that causes the error.
[jira] [Commented] (TIKA-954) Tika throws OOM and GC limited exceeded on Microsoft docx file
[ https://issues.apache.org/jira/browse/TIKA-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413850#comment-13413850 ]

Rob Tulloh commented on TIKA-954:

> curl -v -T Word.docx http://localhost:9998/tika
* About to connect() to localhost port 9998
* Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 9998
> PUT /tika HTTP/1.1
> User-Agent: curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5
> Host: localhost:9998
> Accept: */*
> Content-Length: 4543821
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
Empty reply from server
* Connection #0 to host localhost left intact
curl: (52) Empty reply from server
* Closing connection #0
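The curl call in this comment uploads the file with an HTTP PUT to the /tika resource of the server on port 9998. For anyone reproducing this from Java rather than the shell, a rough equivalent using only the JDK's HttpURLConnection might look like the sketch below. The class and method names are hypothetical, and the endpoint URL is simply the one from the curl command.

```java
// Illustrative sketch only: TikaPutSketch is a hypothetical client, not shipped with Tika.
final class TikaPutSketch {
    /** PUTs the file to the endpoint and returns "<status> <response body>". */
    static String putFile(String endpoint, java.nio.file.Path file)
            throws java.io.IOException {
        java.net.HttpURLConnection conn =
                (java.net.HttpURLConnection) new java.net.URL(endpoint).openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Accept", "*/*");
        // Stream the body with a known length instead of buffering it in memory.
        conn.setFixedLengthStreamingMode(java.nio.file.Files.size(file));
        try (java.io.OutputStream out = conn.getOutputStream()) {
            java.nio.file.Files.copy(file, out);
        }
        int status = conn.getResponseCode();
        java.io.InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
        java.io.ByteArrayOutputStream body = new java.io.ByteArrayOutputStream();
        if (in != null) {
            try (java.io.InputStream stream = in) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = stream.read(buffer)) != -1) {
                    body.write(buffer, 0, read);
                }
            }
        }
        return status + " " + new String(body.toByteArray(),
                java.nio.charset.StandardCharsets.UTF_8);
    }
}
```

A call such as `TikaPutSketch.putFile("http://localhost:9998/tika", java.nio.file.Paths.get("Word.docx"))` mirrors the curl command. If the server drops the connection as it did for curl, the call would be expected to fail with an IOException instead of returning a status.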
[jira] [Commented] (TIKA-954) Tika throws OOM and GC limited exceeded on Microsoft docx file
[ https://issues.apache.org/jira/browse/TIKA-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413863#comment-13413863 ]

Nick Burch commented on TIKA-954:

How much memory are you giving to the Tika process? Did you try increasing it?
[jira] [Commented] (TIKA-954) Tika throws OOM and GC limited exceeded on Microsoft docx file
[ https://issues.apache.org/jira/browse/TIKA-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413904#comment-13413904 ]

Rob Tulloh commented on TIKA-954:

We have been running with 600M. We are now increasing the memory size to see what happens.
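When comparing runs at different heap sizes, it can help to confirm what limit the JVM actually picked up. The sketch below reports the value of Runtime.maxMemory(). It assumes the 600M figure was set with a flag such as -Xmx600m, which the thread does not state, and the class name is hypothetical. The reported number can differ slightly from the configured value depending on the collector.

```java
// Illustrative sketch only: HeapReportSketch is a hypothetical helper.
final class HeapReportSketch {
    /** Formats a maximum-heap value in bytes as whole mebibytes. */
    static String describe(long maxBytes) {
        if (maxBytes == Long.MAX_VALUE) {
            // Runtime.maxMemory() returns Long.MAX_VALUE when there is no inherent limit.
            return "max heap: no limit reported";
        }
        return "max heap: " + (maxBytes / (1024L * 1024L)) + " MiB";
    }

    /** Describes the maximum heap of the running JVM. */
    static String current() {
        return describe(Runtime.getRuntime().maxMemory());
    }
}
```

Logging `HeapReportSketch.current()` at server start-up would make it easier to tie each OOM report to the heap setting that was in effect.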
[jira] [Commented] (TIKA-954) Tika throws OOM and GC limited exceeded on Microsoft docx file
[ https://issues.apache.org/jira/browse/TIKA-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13414076#comment-13414076 ]

Rob Tulloh commented on TIKA-954:

Turns out we are running on CentOS 5.x so I can test with bigger JVM sizes. At 1G, still get OOM. At 2G, did not see OOM, but got no response from the server.

> Tika throws OOM and GC limited exceeded on Microsoft docx file
>
> Key: TIKA-954
> URL: https://issues.apache.org/jira/browse/TIKA-954
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.2
> Environment: Linux (CentOS 4.x)
> Reporter: Rob Tulloh
> Attachments: Word.docx
>
> Stack trace produced with attached docx file
> 2012-07-13_04:45:36.86910 java.lang.OutOfMemoryError: GC overhead limit exceeded
> 2012-07-13_04:45:36.86932 Dumping heap to /var/log/oom/content-extractor-9998.dump.1 ...
> 2012-07-13_04:46:47.38774 Heap dump file created [925402960 bytes in 70.518 secs]
> 2012-07-13_04:46:57.17658 java.lang.OutOfMemoryError: GC overhead limit exceeded
> 2012-07-13_04:46:57.17718 at java.lang.String.substring(String.java:1939)
> 2012-07-13_04:46:57.17736 at org.apache.xmlbeans.impl.store.Locale$SaxHandler.startElement(Locale.java:3254)
> 2012-07-13_04:46:57.17750 at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportStartTag(Piccolo.java:1082)
> 2012-07-13_04:46:57.17763 at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseAttributesNS(PiccoloLexer.java:1822)
> 2012-07-13_04:46:57.1 at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseOpenTagNS(PiccoloLexer.java:1521)
> 2012-07-13_04:46:57.17793 at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseTagNS(PiccoloLexer.java:1362)
> 2012-07-13_04:46:57.17806 at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1293)
> 2012-07-13_04:46:57.17819 at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
> 2012-07-13_04:46:57.17839 at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4808)
> 2012-07-13_04:46:57.17853 at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
> 2012-07-13_04:46:57.17868 at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
> 2012-07-13_04:46:57.17883 at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
> 2012-07-13_04:46:57.17897 at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3439)
> 2012-07-13_04:46:57.17911 at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1270)
> 2012-07-13_04:46:57.17929 at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1257)
> 2012-07-13_04:46:57.17945 at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
> 2012-07-13_04:46:57.17962 at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
> 2012-07-13_04:46:57.17978 at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:134)
> 2012-07-13_04:46:57.17991 at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:159)
> 2012-07-13_04:46:57.18004 at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:116)
> 2012-07-13_04:46:57.18019 at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:53)
> 2012-07-13_04:46:57.18035 at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:180)
> 2012-07-13_04:46:57.18051 at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:87)
> 2012-07-13_04:46:57.18066 at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
> 2012-07-13_04:46:57.18078 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 2012-07-13_04:46:57.18090 at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
> 2012-07-13_04:46:57.18103 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 2012-07-13_04:46:57.18115 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> 2012-07-13_04:46:57.18127 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
> 2012-07-13_04:46:57.18146 at org.apache.tika.server.TikaResource$3.write(TikaResource.java:138)
> 2012-07-13_04:46:57.18158 at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:117)
> 2012-07-13_04:46:57.18169 at org.apache.cxf.jaxrs.interceptor.JAXRSOutInt
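When comparing runs at 1G and 2G, it can help to confirm that the heap setting actually reached the server JVM. The following is a minimal sketch, not part of Tika: the class name `HeapReport` is hypothetical, and it only uses standard `java.lang` and `java.lang.management` calls. Note that the reported maximum may be somewhat lower than the `-Xmx` value, depending on the collector.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper (not a Tika class): reports the heap limits of the running JVM.
class HeapReport {
    static final long MIB = 1024L * 1024L;

    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    /** One-line summary of current heap usage, suitable for a log line. */
    static String describe() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return String.format("heap used=%d MiB, committed=%d MiB, max=%d MiB",
                heap.getUsed() / MIB, heap.getCommitted() / MIB, maxHeapMiB());
    }

    public static void main(String[] args) {
        // Run with the same options as the server, e.g. java -Xmx2g HeapReport
        System.out.println(describe());
    }
}
```

Logging `HeapReport.describe()` once at server start-up would show whether the 2G run really had a larger heap than the 1G run.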
[jira] [Commented] (TIKA-954) Tika throws OOM and GC limited exceeded on Microsoft docx file
[ https://issues.apache.org/jira/browse/TIKA-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13414118#comment-13414118 ]

Rob Tulloh commented on TIKA-954:

We can provide you the JVM heap dump if you think that is useful. Could be a memory leak of some kind due to the GC limit exceeded message being produced.
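On HotSpot, "GC overhead limit exceeded" is raised when the collector spends nearly all of its time reclaiming very little heap. That happens with a leak, but also when one large live structure nearly fills the heap. The stack trace in this issue shows XMLBeans parsing the main document part inside `XWPFDocument.onDocumentRead`, so one quick check before digging into the heap dump is how large that part is once uncompressed. A .docx file is a ZIP container, so the size can be read without parsing anything. The sketch below is an assumption-laden illustration: the class `DocxPartSizes` is hypothetical, and it assumes the conventional part name `word/document.xml`.

```java
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper (not a Tika or POI class): reads part sizes from the ZIP directory.
class DocxPartSizes {
    /** Returns the uncompressed size in bytes of the named part, or -1 if absent or unknown. */
    static long uncompressedSize(String docxPath, String partName) throws IOException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry(partName);
            return entry == null ? -1L : entry.getSize();
        }
    }

    public static void main(String[] args) throws IOException {
        // "Word.docx" is the attachment name from the issue; pass another path as the first argument.
        String path = args.length > 0 ? args[0] : "Word.docx";
        long size = uncompressedSize(path, "word/document.xml");
        System.out.println("word/document.xml: " + size + " bytes uncompressed");
    }
}
```

If the uncompressed part turns out to be very large, an in-memory tree built from it can need considerably more heap than the raw XML, which would point at document size rather than a leak.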
[jira] [Commented] (TIKA-954) Tika throws OOM and GC limited exceeded on Microsoft docx file
[ https://issues.apache.org/jira/browse/TIKA-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13414296#comment-13414296 ]

Rob Tulloh commented on TIKA-954:

We bumped the JVM size to 2 GB. We now get an empty reply from the call. Here is what tika reported in the log file. What I cannot tell is if this is a limitation of the server or perhaps curl. I am tempted to believe it is the server rather than curl. The document in question appears to be 3000+ pages of text.

2012-07-14_00:17:40.15182 INFO: tika/12345/Word.docx (autodetecting type)
2012-07-14_01:04:14.43799 Jul 13, 2012 8:04:12 PM org.apache.cxf.jaxrs.impl.WebApplicationExceptionMapper toResponse
t South Africa in 2000 on my unhappy first senior England tour."
2012-07-14_01:04:14.75706 Jul 13, 2012 8:04:12 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
unwinding now
2012-07-14_01:04:14.75707 org.apache.cxf.interceptor.Fault: Could not send Message.
dleMessage(MessageSenderInterceptor.java:64)
2012-07-14_01:04:14.75709 at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:263)
ptor.java:77)
2012-07-14_01:04:14.75710 at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:263)
a:123)
nation.java:323)
n.java:289)
2012-07-14_01:04:14.76707 at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:72)
2012-07-14_01:04:14.76707 at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:943)
2012-07-14_01:04:14.76708 at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:879)
2012-07-14_01:04:14.76708 at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
ion.java:250)
2012-07-14_01:04:14.76709 at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:110)
2012-07-14_01:04:14.76709 at org.eclipse.jetty.server.Server.handle(Server.java:345)
2012-07-14_01:04:14.76710 at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:441)
ava:919)
2012-07-14_01:04:14.76712 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:588)
2012-07-14_01:04:14.76712 at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:218)
2012-07-14_01:04:14.76714 at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:51)
2012-07-14_01:04:14.76714 at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:586)
2012-07-14_01:04:14.76715 at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:44)
2012-07-14_01:04:14.76715 at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:598)
2012-07-14_01:04:14.76716 at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:533)
2012-07-14_01:04:14.76716 at java.lang.Thread.run(Thread.java:662)
2012-07-14_01:04:14.76716 Caused by: org.eclipse.jetty.io.EofException
2012-07-14_01:04:14.76717 at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:921)
2012-07-14_01:04:14.76717 at org.eclipse.jetty.server.HttpConnection.flushResponse(HttpConnection.java:612)
2012-07-14_01:04:14.76718 at org.eclipse.jetty.server.HttpConnection$Output.close(HttpConnection.java:995)
2012-07-14_01:04:14.76718 at org.apache.cxf.transport.http.AbstractHTTPDestination$WrappedOutputStream.close(AbstractHTTPDestination.java:650)
2012-07-14_01:04:14.76720 at org.apache.cxf.transport.AbstractConduit.close(AbstractConduit.java:56)
2012-07-14_01:04:14.76721 at org.apache.cxf.transport.http.AbstractHTTPDestination$BackChannelConduit.close(AbstractHTTPDestination.java:593)
2012-07-14_01:04:14.76721 at org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:62)
2012-07-14_01:04:14.76722 ... 23 more
2012-07-14_01:04:14.76722 Caused by: java.nio.channels.ClosedChannelException
2012-07-14_01:04:14.76722 at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:135)
2012-07-14_01:04:14.76724 at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:357)
2012-07-14_01:04:14.76724 at java.nio.channels.SocketChannel.write(SocketChannel.java:360)
2012-07-14_01:04:14.76725 at org.eclipse.jetty.io.nio.ChannelEndPoint.gatheringFlush(ChannelEndPoint.java:354)
2012-07-14_01:04:14.76725 at org.eclipse.jetty.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:292)
2012-07-14_01:04:14.76725 at org.eclipse.jetty.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:300)
2012-07-14_01:04:14.76726 at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:848)
2012-07-14_01:04:14.76726 ... 29 more
2012-07-14_01:04:14.76727 Jul
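The comment above cannot tell whether the empty reply comes from the server or from curl. One way to separate the two, sketched here under assumptions, is to replay the request from a client whose timeouts are set explicitly, so a client-side timeout is reported differently from a connection the server closed. The class `TimedPut` and the timeout values are hypothetical; the default URL is taken from the PUT request logged in this issue, and only standard `java.net` calls are used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical diagnostic client (not part of Tika): PUTs a file with explicit timeouts.
class TimedPut {
    /** Sends the payload and returns a one-line outcome naming which side ended the exchange. */
    static String put(String url, byte[] payload, int connectTimeoutMs, int readTimeoutMs) {
        long start = System.nanoTime();
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            conn.setConnectTimeout(connectTimeoutMs);
            conn.setReadTimeout(readTimeoutMs);
            conn.setRequestProperty("Content-Type", "application/octet-stream");
            conn.setFixedLengthStreamingMode(payload.length);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(payload);
            }
            int status = conn.getResponseCode(); // blocks until the server answers or the read times out
            long received = 0;
            InputStream in = status < 400 ? conn.getInputStream() : conn.getErrorStream();
            if (in != null) {
                try (InputStream body = in) {
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = body.read(buf)) != -1) {
                        received += n;
                    }
                }
            }
            return "HTTP " + status + ", " + received + " bytes in " + elapsedMs(start) + " ms";
        } catch (SocketTimeoutException e) {
            // The client gave up waiting; the server may still be working.
            return "client timeout after " + elapsedMs(start) + " ms";
        } catch (IOException e) {
            // Anything else, including the server closing the connection without a reply.
            return "I/O failure after " + elapsedMs(start) + " ms: " + e;
        }
    }

    static long elapsedMs(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000;
    }

    public static void main(String[] args) throws IOException {
        String file = args.length > 0 ? args[0] : "Word.docx";
        String url = args.length > 1 ? args[1] : "http://localhost:9998/tika/12345/Word.docx";
        // Assumed values: 10 s to connect, 2 h to wait for the response.
        System.out.println(put(url, Files.readAllBytes(Paths.get(file)), 10_000, 2 * 60 * 60 * 1000));
    }
}
```

An "I/O failure" result well before the read timeout would suggest the connection was dropped on the server side, while a "client timeout" result would mean the server simply had not answered yet.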
[jira] [Commented] (TIKA-954) Tika throws OOM and GC limited exceeded on Microsoft docx file
[ https://issues.apache.org/jira/browse/TIKA-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13414297#comment-13414297 ]

Rob Tulloh commented on TIKA-954:

curl output:

* Connected to localhost (127.0.0.1) port 9998
> PUT /tika/12345/Word.docx HTTP/1.1
> User-Agent: curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5
> Host: localhost:9998
> Accept: */*
> Content-Type: application/octet-stream
> Content-Length: 4543821
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
  % Total    % Received % Xferd  Average Speed   Time     Time     Time  Current
                                 Dload  Upload   Total    Spent    Left  Speed
100 4437k    0     0  100 4437k      0  12612  0:06:00  0:06:00 --:--:--     0
Empty reply from server
100 4437k    0     0  100 4437k      0  12612  0:06:00  0:06:00 --:--:--     0
* Connection #0 to host localhost left intact
curl: (52) Empty reply from server
* Closing connection #0