Tim Allison created TIKA-4015:
---------------------------------
Summary: Extract symbols as symbols from .docx
Key: TIKA-4015
URL: https://issues.apache.org/jira/browse/TIKA-4015
Project: Tika
Issue Type: New Feature
Reporter: Tim Allison
Attachments: symbol.docx.zip
[~chetab] raised this issue on the user list and supplied an example document.
The font is Symbol and the text should be: abcedefghijklmnopqrstuvwxyz
However, the text as literally stored in the docx and extracted by Tika is:
abcedefghijklmnopqrstuvwxyz
We may need to add processing for Unicode mappings or the equivalent in OOXML.
I haven't seen this before. :P
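
As a rough illustration of the kind of mapping meant above, here is a minimal sketch in Python (Tika itself is Java, so this is not proposed code). Everything in it is an assumption for illustration: the names SYMBOL_TO_UNICODE and remap_symbol_run are hypothetical, the table covers only a handful of letters from the Symbol encoding, and the private-use-area handling assumes Word sometimes stores Symbol characters offset by 0xF000. How the run's font name would be obtained from the docx is left out entirely.

```python
# Hypothetical, partial table: character as stored in a Symbol-font run
# -> the Unicode character the reader actually sees.
SYMBOL_TO_UNICODE = {
    "a": "\u03b1",  # GREEK SMALL LETTER ALPHA
    "b": "\u03b2",  # GREEK SMALL LETTER BETA
    "g": "\u03b3",  # GREEK SMALL LETTER GAMMA
    "d": "\u03b4",  # GREEK SMALL LETTER DELTA
    "p": "\u03c0",  # GREEK SMALL LETTER PI
}


def remap_symbol_run(text, font_name):
    """Return text with Symbol-font characters replaced by Unicode equivalents.

    Runs in any other font are returned unchanged. Characters missing from
    the (partial) table are passed through as stored.
    """
    if font_name.lower() != "symbol":
        return text
    out = []
    for ch in text:
        cp = ord(ch)
        # Assumption: symbol characters may be stored in the private use
        # area (U+F020..U+F0FF); fold them back to the base code point.
        if 0xF020 <= cp <= 0xF0FF:
            ch = chr(cp - 0xF000)
        out.append(SYMBOL_TO_UNICODE.get(ch, ch))
    return "".join(out)
```

For example, remap_symbol_run("abgd", "Symbol") yields the Greek letters alpha, beta, gamma, delta, while the same string in a run with any other font is left alone. A real fix would need the full Symbol table and probably similar tables for fonts such as Wingdings.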
--
This message was sent by Atlassian Jira
(v8.20.10#820010)