[
https://issues.apache.org/jira/browse/NIFI-10218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew M. Lim updated NIFI-10218:
---------------------------------
Attachment: example.pdf
> ExtractDocumentText processor does not handle certain characters when
> extracting from a PDF
> -------------------------------------------------------------------------------------------
>
> Key: NIFI-10218
> URL: https://issues.apache.org/jira/browse/NIFI-10218
> Project: Apache NiFi
> Issue Type: Bug
> Components: Extensions
> Reporter: Andrew M. Lim
> Priority: Minor
> Attachments: 625006.pdf, example.pdf
>
>
> When a PDF contains certain special characters ("+", "=", ">", "+-"), the
> text extracted from the document shows different symbols in their place.
> I've attached two PDFs that illustrate the issue in different ways:
> * 625006.pdf has multiple pages. When the text is extracted from a table,
> certain characters show up as a ? symbol.
> * example.pdf is a single page with the same table. When the text is
> extracted, the same characters show up as " or # symbols.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)