[jira] [Created] (TIKA-2582) Tesseract 4.0 includes a FF character by default, breaking parsers

Ewan Mellor (JIRA) Wed, 21 Feb 2018 13:08:22 -0800

Ewan Mellor created TIKA-2582:
---------------------------------

             Summary: Tesseract 4.0 includes a FF character by default, breaking parsers
                 Key: TIKA-2582
                 URL: https://issues.apache.org/jira/browse/TIKA-2582
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.17
            Reporter: Ewan Mellor



Tesseract 4.0 includes a change to use form feed characters to separate pages 
by default in its text output.  Previous versions used no separator unless you 
specified the include_page_breaks option.

This confuses any parser that is not expecting the FF. 
ODFParserTest.testOO2Metadata fails, because it is expecting the output of a 
blank image to be the empty string, but now the FF is there.

I haven't seen any other failures, but I expect that user code will now see 
either FF or U+FFFD where it is not expecting them (SafeContentHandler 
replaces the FF with U+FFFD when converting text to XML).
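
To illustrate what the stray character looks like from user code, here is a minimal sketch of a post-processing filter that drops form feeds (U+000C) from OCR text. The class and method names are hypothetical and not part of Tika; this only shows the symptom and one possible workaround on the caller's side, not the fix proposed in this issue.

```java
// Hypothetical helper, not part of Tika: removes form feed (U+000C)
// characters that Tesseract 4.0 emits as page separators.
final class FormFeedFilter {
    static final char FORM_FEED = '\f'; // U+000C

    // Returns the OCR text with every form feed removed.
    static String stripFormFeeds(String ocrText) {
        return ocrText.replace(String.valueOf(FORM_FEED), "");
    }

    public static void main(String[] args) {
        // A blank image OCR'd by Tesseract 4.0 yields just the separator.
        String blankPage = "\f";
        System.out.println(stripFormFeeds(blankPage).isEmpty());
    }
}
```

Note that this removes every page boundary, so it only suits callers that never asked for page breaks in the first place.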

We should set the appropriate Tesseract options to disable this behavior unless 
explicitly requested by user code, to avoid the change in behavior.
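
As a rough sketch of what that could look like, the following builds a Tesseract command line that passes an empty page_separator through the -c option, which the commit message quoted below says suppresses page breaks. The class name, method name, and includePageBreaks flag are assumptions made for illustration and do not reflect Tika's actual TesseractOCRParser code; how older Tesseract versions react to this option would need to be checked.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical command builder, not Tika's actual implementation.
final class TesseractCommand {
    // Builds: tesseract <input> <outputBase> [-c page_separator=]
    static List<String> build(String tesseractPath, String inputImage,
                              String outputBase, boolean includePageBreaks) {
        List<String> cmd = new ArrayList<>();
        cmd.add(tesseractPath);
        cmd.add(inputImage);
        cmd.add(outputBase);
        if (!includePageBreaks) {
            // An empty page_separator suppresses the form feed between pages.
            cmd.add("-c");
            cmd.add("page_separator=");
        }
        return cmd;
    }

    public static void main(String[] args) {
        // Print the command instead of running it, so no Tesseract install is needed.
        System.out.println(String.join(" ", build("tesseract", "in.png", "out", false)));
    }
}
```

The resulting list can be handed to java.lang.ProcessBuilder; defaulting the flag to false would keep the pre-4.0 behavior unless user code opts in.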

For reference, the Tesseract change is as follows:

{{commit 2cc531e6bf0288fc8a9ad1c123a252395f00bf56
Merge: 3bb573ae aa6eb6bd
Author: zdenop <zde...@gmail.com>
Date:   Tue Sep 19 08:41:08 2017 +0200

    Merge pull request #1140 from stweil/pagebreak

    Remove Tesseract parameter "include_page_breaks" and use FF by default

commit aa6eb6bd466101a3b89880f87580471a7694359d
Author: Stefan Weil <s...@weilnetz.de>
Date:   Mon Jun 12 19:42:45 2017 +0200

    Remove Tesseract parameter "include_page_breaks" and use FF by default

    Now Tesseract adds a page break (normally form feed) by default.

    It is still possible to suppress page breaks by setting an empty
    page_separator.

    Signed-off-by: Stefan Weil <s...@weilnetz.de>
}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
