Yilei Pan created TIKA-3693:
-------------------------------

             Summary: [OfficeParser] Wingdings font recognition 
                 Key: TIKA-3693
                 URL: https://issues.apache.org/jira/browse/TIKA-3693
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 2.2.1
         Environment: Java 8, Tika 2.2.1
            Reporter: Yilei Pan
         Attachments: example.doc

Hi,

In our Word documents we sometimes have Wingdings characters (such as 
checkboxes).

At the moment, all the unrecognized characters are parsed to '(', which causes 
problems for later processing of the document.

 

I saw that improvements have been made in 
[PDFBox|https://issues.apache.org/jira/browse/PDFBOX-570]. Indeed, when I save 
the Word document as a PDF and parse the PDF document, those characters are 
recognized.

 

Is it possible to add support for Wingdings characters or other Unicode 
characters?
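
As a rough illustration of the requested behaviour, here is a minimal Java 
sketch that maps a few Wingdings codes to approximate Unicode equivalents. 
The class WingdingsSketch and the method toUnicode are hypothetical names, not 
part of Tika or POI. The four-entry table is an incomplete assumption based on 
commonly used approximations, and the U+F000 offset assumes the symbol-font 
characters are stored in the Private Use Area, which may not hold for every 
document.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Tika or POI.
class WingdingsSketch {

    // Assumed, incomplete table: Wingdings character code -> approximate Unicode code point.
    private static final Map<Integer, Integer> WINGDINGS = new HashMap<>();
    static {
        WINGDINGS.put(0xFB, 0x2718); // heavy ballot X
        WINGDINGS.put(0xFC, 0x2714); // heavy check mark
        WINGDINGS.put(0xFD, 0x2612); // ballot box with X
        WINGDINGS.put(0xFE, 0x2611); // ballot box with check
    }

    // Converts the text of a run that is known to use the Wingdings font.
    // Must not be applied to ordinary text, where codes such as 0xFB are real letters.
    static String toUnicode(String wingdingsRun) {
        StringBuilder out = new StringBuilder(wingdingsRun.length());
        for (int i = 0; i < wingdingsRun.length(); ) {
            int cp = wingdingsRun.codePointAt(i);
            i += Character.charCount(cp);
            // Assumption: symbol-font characters may be stored as U+F000 plus the font code.
            int code = (cp >= 0xF000 && cp <= 0xF0FF) ? cp - 0xF000 : cp;
            Integer mapped = WINGDINGS.get(code);
            // Unknown codes are passed through unchanged rather than replaced.
            out.appendCodePoint(mapped != null ? mapped : cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two Private Use Area characters standing in for a checked and a crossed box.
        String run = "\uF0FE Done, \uF0FD Rejected";
        System.out.println(toUnicode(run));
    }
}
```

A real fix would need the font name of each run, so that the mapping is applied 
only to text that is actually set in Wingdings.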

 

An example document is attached (example.doc).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
