[
https://issues.apache.org/jira/browse/PDFBOX-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Huan LI updated PDFBOX-1304:
----------------------------
Attachment: fj.txt
fj.pdf
> Text extraction meets "Could not parse predefined CMAP" and returns just a
> small part of the content containing garbage chars.
> ------------------------------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-1304
> URL: https://issues.apache.org/jira/browse/PDFBOX-1304
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.6.0
> Environment: Win7 32bits
> Reporter: Huan LI
> Attachments: fj.pdf, fj.txt
>
>
> I'm using pdfbox-1.6.0 to extract text from a Chinese PDF file (see the
> attachment "fj.pdf").
>
> The extraction code looks like this:
> [code]
> stripper = new PDFTextStripper(encoding);
> txt = stripper.getText(_pdfDoc);
> [/code]
> When getText() runs, the console prints:
> [console]
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUE1-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUO1-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUF1-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUF1-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUE1-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUE1-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUF1-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
> [/console]
> After getText() returns, txt contains only a small part of the PDF
> content (much of it is missing) along with garbage characters such as
> "犖犑狌犣犎犗犝犔犻犺犅" (see attachment "fj.txt").
>
> I've heard that "org.apache.pdfbox.cos.COSString.java" had some errors in
> pdfbox-0.7.3. Has COSString.java been corrected in 1.6.0?
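
The console output quoted in the report repeats the same message for four distinct CMap names (Founder-PKU2-UCS2, Founder-PKUE1-UCS2, Founder-PKUO1-UCS2, Founder-PKUF1-UCS2). When triaging output like this, it can help to collapse the log into one line per CMap name. The following is only a minimal sketch that assumes the message format shown in the report; `CMapLogSummary` and `countFailures` are hypothetical names introduced for illustration and are not part of PDFBox.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of PDFBox): counts how often each
// predefined CMap name appears in "Could not parse" log messages.
class CMapLogSummary {
    // Assumes the message text quoted in the report; the locale-dependent
    // timestamp and level prefix are ignored on purpose.
    private static final Pattern FAILURE = Pattern.compile(
            "Could not parse predefined CMAP file for '([^']+)'");

    // Returns CMap name -> number of failures, in first-seen order.
    static Map<String, Integer> countFailures(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = FAILURE.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // A short excerpt in the shape of the reported log.
        List<String> sample = Arrays.asList(
                "SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'",
                "SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUE1-UCS2'",
                "SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'");
        countFailures(sample).forEach(
                (name, n) -> System.out.println(name + " x" + n));
    }
}
```

The pattern matches only the message text, so the same helper works whether the log was produced under a Chinese or an English locale. The resulting list of names is the set of predefined CMaps the font handling could not resolve for this document.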