[ https://issues.apache.org/jira/browse/TIKA-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372860#comment-16372860 ]
ASF GitHub Bot commented on TIKA-2580:
--------------------------------------

tballison closed pull request #220: Fix for TIKA-2580 contributed by ewanmellor.
URL: https://github.com/apache/tika/pull/220

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/tika-core/src/main/java/org/apache/tika/sax/SafeContentHandler.java b/tika-core/src/main/java/org/apache/tika/sax/SafeContentHandler.java
index d3152c680..f82098493 100644
--- a/tika-core/src/main/java/org/apache/tika/sax/SafeContentHandler.java
+++ b/tika-core/src/main/java/org/apache/tika/sax/SafeContentHandler.java
@@ -31,7 +31,8 @@
  * ({@link #characters(char[], int, int)} or
  * {@link #ignorableWhitespace(char[], int, int)}) passed to the decorated
  * content handler contain only valid XML characters. All invalid characters
- * are replaced with spaces.
+ * are replaced with the Unicode replacement character U+FFFD (though a
+ * subclass may change this by overriding the writeReplacement method).
  * <p>
  * The XML standard defines the following Unicode character ranges as
  * valid XML characters:

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> SafeContentHandler documentation is incorrect about replacement character
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2580
>                 URL: https://issues.apache.org/jira/browse/TIKA-2580
>             Project: Tika
>          Issue Type: Bug
>          Components: documentation
>    Affects Versions: 1.17
>            Reporter: Ewan Mellor
>            Priority: Minor
>
> SafeContentHandler's doc comment states "All invalid characters are replaced
> with spaces." This has been untrue since TIKA-698 (Sep 2011).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
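
For readers unfamiliar with the behaviour the corrected comment describes, here is a minimal, self-contained sketch of the idea: invalid XML characters are replaced with U+FFFD by default, and a subclass can change that by overriding a replacement hook. This is not Tika's SafeContentHandler. The class names XmlCharSanitizer and SpaceReplacing, the sanitize method, and the StringBuilder-based signature of writeReplacement are all assumptions made for illustration; only the method name writeReplacement and the U+FFFD default come from the diff above. The validity check follows the Char production of XML 1.0.

```java
// Hypothetical illustration only; Tika's real class works on SAX events
// and its writeReplacement signature may differ from the one used here.
public class XmlCharSanitizer {

    /** Returns true if the code point matches the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Hook that a subclass may override to change what replaces an invalid character. */
    protected void writeReplacement(StringBuilder out) {
        out.append('\uFFFD'); // Unicode replacement character
    }

    /** Copies the input, calling writeReplacement once per invalid code point. */
    public String sanitize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                // Covers control characters and unpaired surrogates alike.
                writeReplacement(out);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    /** Variant that writes a space instead, as the old doc comment claimed. */
    static class SpaceReplacing extends XmlCharSanitizer {
        @Override
        protected void writeReplacement(StringBuilder out) {
            out.append(' ');
        }
    }

    public static void main(String[] args) {
        String dirty = "a\u0000b";
        System.out.println(new XmlCharSanitizer().sanitize(dirty).equals("a\uFFFDb"));
        System.out.println(new SpaceReplacing().sanitize(dirty).equals("a b"));
    }
}
```

Keeping the replacement in an overridable method rather than a constant is what lets a subclass change the output without touching the validity check.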