[ 
https://issues.apache.org/jira/browse/PDFBOX-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964788#action_12964788
 ] 

Martijn Brinkers commented on PDFBOX-860:
-----------------------------------------

What happens is that the letter pair "fi" gets converted to the single 
Unicode ligature character "ﬁ" (U+FB01, see 
http://www.fileformat.info/info/unicode/char/fb01/index.htm). Your 
conversion to RTF (or another format) probably corrupts that character. 
PDFTextStripper normalizes the text, which should result in the ligature 
"ﬁ" being converted to "f" "i" (i.e., two separate chars). TextNormalize, 
however, is only used if the com.ibm.icu.text.* packages can be found on 
the classpath. Could it be that you are missing the ICU jar?
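
To check both points from a small test program, something like the 
following sketch may help. It is not PDFBox's own TextNormalize code: it 
uses the JDK's java.text.Normalizer with NFKC as a stand-in to show the 
U+FB01 ligature being split into "f" and "i", and it probes the classpath 
for com.ibm.icu.text.Normalizer as a representative ICU class. The class 
and method names (LigatureCheck, expandLigatures, isOnClasspath) are made 
up for illustration.

```java
import java.text.Normalizer;

class LigatureCheck {

    // Hypothetical helper: NFKC normalization replaces compatibility
    // characters such as U+FB01 (the "fi" ligature) with their plain
    // equivalents. Note that NFKC also rewrites other compatibility
    // characters, so this is broader than a ligature-only fix.
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // Returns true if the named class can be found by this class's
    // class loader, without running its static initializers.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, LigatureCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "\uFB01rst" is the ligature followed by "rst".
        System.out.println(expandLigatures("\uFB01rst"));
        // Assumption: this class is present whenever the ICU4J jar is.
        System.out.println("ICU found: "
                + isOnClasspath("com.ibm.icu.text.Normalizer"));
    }
}
```

If the second line reports that ICU is not found, adding the ICU4J jar to 
the classpath would be the first thing to try.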

> 'fi' getting converted to '?'
> -----------------------------
>
>                 Key: PDFBOX-860
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-860
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.1
>         Environment: Solaris 10
>            Reporter: Saurabh Mehrotra
>         Attachments: INSI-SURVIVAL-GUIDE-4-JOURNALISTS.zip, new_evidence.zip
>
>
> Hi
> I am trying to use PDFBox version 1.2.1 to extract text from PDF files.
> The following issue is observed in the extracted text:
> 1. Combination of the characters 'fi' is converted to a '?'
> example:  first becomes ?rst
>                   classifier becomes classi?er
>                   find becomes ?nd
> Is this a known bug? Can some setting of PDFBox be turned off to prevent 
> this?
> Thanks & Regards
> Saurabh

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
