[jira] [Created] (TIKA-4015) Extract symbols as symbols from .docx

Tim Allison (Jira) Wed, 12 Apr 2023 12:51:09 -0700

Tim Allison created TIKA-4015:
---------------------------------

             Summary: Extract symbols as symbols from .docx
                 Key: TIKA-4015
                 URL: https://issues.apache.org/jira/browse/TIKA-4015
             Project: Tika
          Issue Type: New Feature
            Reporter: Tim Allison
         Attachments: symbol.docx.zip


[~chetab] raised this issue on the user list and supplied an example document.

The font is Symbol and the text should be: abcedefghijklmnopqrstuvwxyz

However, the text as literally stored in the docx and extracted by Tika is: 
abcedefghijklmnopqrstuvwxyz

 

We may need to add processing for Unicode mappings or the equivalent in OOXML.
I haven't seen this before. :P
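
As a rough illustration of what such a mapping step could look like, here is a minimal sketch. It is not existing Tika or POI code: the class SymbolFontMapper and its method are hypothetical, and it assumes the caller has already determined that a run uses the Symbol font. It covers only lowercase a-z, following the Adobe Symbol encoding, and also assumes that Word may store symbol characters in the private use area as 0xF000 plus the font code.

```java
// Hypothetical helper for illustration only; not part of Tika or POI.
final class SymbolFontMapper {

    // Font codes a..z, and the Greek characters the Symbol font draws for them
    // (per the Adobe Symbol encoding: a=alpha, b=beta, c=chi, d=delta, ...).
    private static final String LATIN = "abcdefghijklmnopqrstuvwxyz";
    private static final String GREEK =
            "\u03B1\u03B2\u03C7\u03B4\u03B5\u03C6\u03B3\u03B7\u03B9\u03D5\u03BA\u03BB\u03BC"
          + "\u03BD\u03BF\u03C0\u03B8\u03C1\u03C3\u03C4\u03C5\u03D6\u03C9\u03BE\u03C8\u03B6";

    private SymbolFontMapper() {
    }

    // Maps the stored text of a Symbol-font run to the Unicode characters it displays.
    // Characters outside the a..z table are passed through unchanged.
    static String toUnicode(String symbolRunText) {
        StringBuilder out = new StringBuilder(symbolRunText.length());
        for (int i = 0; i < symbolRunText.length(); i++) {
            char c = symbolRunText.charAt(i);
            // Assumption: symbol characters may be stored as 0xF000 + font code.
            char code = (c >= 0xF020 && c <= 0xF0FF) ? (char) (c - 0xF000) : c;
            int idx = LATIN.indexOf(code);
            out.append(idx >= 0 ? GREEK.charAt(idx) : c);
        }
        return out.toString();
    }
}
```

A caller would apply SymbolFontMapper.toUnicode(...) only to runs whose font is Symbol and leave all other text untouched. A real fix would need the full Symbol table (uppercase, digits, punctuation) and probably similar tables for other symbol fonts.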



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
