[jira] [Updated] (NIFI-10218) ExtractDocumentText processor does not handle certain characters when extracting from a PDF

Andrew M. Lim (Jira) Mon, 11 Jul 2022 12:53:06 -0700


     [ https://issues.apache.org/jira/browse/NIFI-10218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Andrew M. Lim updated NIFI-10218:
---------------------------------
    Attachment: example.pdf

> ExtractDocumentText processor does not handle certain characters when 
> extracting from a PDF
> -------------------------------------------------------------------------------------------
>
>                 Key: NIFI-10218
>                 URL: https://issues.apache.org/jira/browse/NIFI-10218
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>            Reporter: Andrew M. Lim
>            Priority: Minor
>         Attachments: 625006.pdf, example.pdf
>
>
> When a PDF contains certain special characters ("+", "=", ">", "+-"), those 
> characters appear as different symbols in the text extracted from the document. 
> I've attached two PDFs that illustrate the issue differently:
> * 625006.pdf has multiple pages. When the text is extracted from a table, 
> certain characters show up as a ? symbol.
> * example.pdf is a single page with the same table. When the text is 
> extracted, the same characters show up as " or # symbols.
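
One way to check extracted output for the symptom described above is to count
how often the expected characters and the reported substitutes occur in the
extracted text. The following is only a minimal sketch under stated
assumptions: it is not part of NiFi or the processor, the function names
(symbol_counts, summarize) are hypothetical, and the sample string is made up
rather than taken from the attached PDFs. The substitutes also occur in
ordinary text, so a nonzero count is a hint to inspect, not proof of the bug.

```python
# Characters the report says are present in the source PDFs.
EXPECTED = ["+", "=", ">", "+-"]
# Characters the report says appear in their place after extraction.
SUBSTITUTES = ["?", '"', "#"]


def symbol_counts(extracted_text, symbols):
    """Count non-overlapping occurrences of each symbol in the text."""
    return {s: extracted_text.count(s) for s in symbols}


def summarize(extracted_text):
    """Return counts of expected symbols and of suspected substitutes."""
    return {
        "expected": symbol_counts(extracted_text, EXPECTED),
        "substitutes": symbol_counts(extracted_text, SUBSTITUTES),
    }


if __name__ == "__main__":
    # Hypothetical sample line; not taken from 625006.pdf or example.pdf.
    sample = 'value ? 7.0 " 0.2 # 5'
    print(summarize(sample))
```

Running summarize on the text extracted from each attachment and comparing the
two results against the table in the original PDF would show which expected
characters are missing and which substitutes took their place.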



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
