Yilei Pan created TIKA-3693:
-------------------------------
Summary: [OfficeParser] Wingdings font recognition
Key: TIKA-3693
URL: https://issues.apache.org/jira/browse/TIKA-3693
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 2.2.1
Environment: Java 8
Tika 2.2.1
Reporter: Yilei Pan
Attachments: example.doc
Hi,
Our Word documents sometimes contain Wingdings characters (such as
checkboxes).
At the moment, all unrecognized characters are parsed as '(', which causes
problems for downstream processing of the document.
I saw that a similar improvement was made in
[PDFBox|https://issues.apache.org/jira/browse/PDFBOX-570]. Indeed, when I save
the Word document as a PDF and parse the PDF, those characters are
recognized.
Would it be possible to add support for Wingdings characters, or for mapping
them to other Unicode characters?
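To illustrate the kind of mapping I have in mind, here is a minimal sketch in Java. It is not Tika or POI code: the class name WingdingsSketch, the method toUnicode, and the three table entries are hypothetical and only approximate. It assumes the caller already knows the text run uses the Wingdings font, and that Word may store symbol-font characters in the private-use range U+F020 to U+F0FF, so the high byte is dropped before the lookup.

```java
import java.util.HashMap;
import java.util.Map;

public class WingdingsSketch {
    // Hypothetical, partial table: Wingdings byte code -> approximate Unicode symbol.
    // The entries are illustrative; a real table would cover the whole font.
    private static final Map<Integer, String> APPROX = new HashMap<>();
    static {
        APPROX.put(0xA8, "\u25FB"); // empty box  -> WHITE MEDIUM SQUARE
        APPROX.put(0xFD, "\u2612"); // crossed box -> BALLOT BOX WITH X
        APPROX.put(0xFE, "\u2611"); // checked box -> BALLOT BOX WITH CHECK
    }

    // Converts text that is known to be set in Wingdings.
    // Characters without a table entry are passed through unchanged.
    static String toUnicode(String wingdingsText) {
        StringBuilder sb = new StringBuilder();
        wingdingsText.codePoints().forEach(cp -> {
            // Assumption: symbol fonts may appear as U+F020..U+F0FF; fold to 0x20..0xFF.
            int code = (cp >= 0xF020 && cp <= 0xF0FF) ? cp - 0xF000 : cp;
            String mapped = APPROX.get(code);
            if (mapped != null) {
                sb.append(mapped);
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }
}
```

With something along these lines, a checked box in a Wingdings run would come out as a Unicode ballot box instead of '('. The lookup is only valid for runs whose font is Wingdings, so the parser would need the font name of each run.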
An example document is attached (example.doc).