[ https://issues.apache.org/jira/browse/TIKA-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-557.
-----------------------------

    Resolution: Invalid

You've set a write limit on your ContentHandler, and the text extracted from 
your PDF exceeds it.

If you don't want to restrict the size of the documents you process, use an 
unbounded handler, e.g. when creating a BodyContentHandler, don't specify a 
limit in the constructor.
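
To make the mechanism concrete, here is a minimal sketch of how a write-limited 
handler behaves. BoundedTextHandler is a hypothetical stand-in built only on the 
JDK's org.xml.sax classes; it is not Tika's WriteOutContentHandler, and the 
convention that a negative limit means "unbounded" is an assumption of this 
sketch. In the Tika releases I'm aware of, the no-argument BodyContentHandler 
constructor applies a default limit of 100,000 characters and 
new BodyContentHandler(-1) disables it, so it is worth checking the Javadoc for 
the version in use.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a write-limited handler; NOT Tika's actual class.
class BoundedTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final int writeLimit; // assumption: any negative value means unbounded

    BoundedTextHandler(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (writeLimit >= 0 && text.length() + length > writeLimit) {
            // Keep the part that still fits, then signal that the limit was hit.
            text.append(ch, start, writeLimit - text.length());
            throw new SAXException("write limit of " + writeLimit + " characters reached");
        }
        text.append(ch, start, length);
    }

    @Override
    public String toString() {
        return text.toString();
    }

    public static void main(String[] args) {
        char[] body = "hello world".toCharArray();

        BoundedTextHandler bounded = new BoundedTextHandler(5);
        try {
            bounded.characters(body, 0, body.length);
        } catch (SAXException e) {
            // The text collected before the limit is still available.
            System.out.println("limit hit, kept: " + bounded);
        }

        BoundedTextHandler unbounded = new BoundedTextHandler(-1);
        try {
            unbounded.characters(body, 0, body.length);
            System.out.println("no limit, kept: " + unbounded);
        } catch (SAXException e) {
            throw new IllegalStateException("unexpected for an unbounded handler", e);
        }
    }
}
```

The exception in the report below comes from the same kind of check: the parser 
keeps emitting characters until the handler's limit is exceeded, at which point 
the handler aborts the parse. An unbounded handler avoids that at the cost of 
holding the whole document text in memory.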

> Extract text file PDF error
> ---------------------------
>
>                 Key: TIKA-557
>                 URL: https://issues.apache.org/jira/browse/TIKA-557
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.8
>            Reporter: Them Ta
>         Attachments: QA.pdf
>
>
> File to extract text: QA.pdf
> Extracting the text with PDFBox 1.3.1 from the console worked fine, but with 
> Tika (this is the only file that fails) the log shows the following error:
> org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException
>       at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:120)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>       at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:81)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>       at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:153)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>       at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:39)
>       at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:61)
>       at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:113)
>       at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:151)
>       at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:261)
>       at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:287)
>       at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:113)
>       at org.apache.pdfbox.util.PDFTextStripper.writeLine(PDFTextStripper.java:1819)
>       at org.apache.pdfbox.util.PDFTextStripper.writePage(PDFTextStripper.java:727)
>       at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
>       at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:365)
>       at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:321)
>       at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:241)
>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:53)
>       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:90)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:150)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
