hoshinojyunn commented on code in PR #64794:
URL: https://github.com/apache/doris/pull/64794#discussion_r3480088214


##########
fe/fe-core/src/main/java/org/apache/doris/analysis/InvertedIndexUtil.java:
##########
@@ -143,6 +143,34 @@ private static boolean isSingleByte(String str) {
         return true;
     }
 
+    public static void checkCharFilterProperties(Map<String, String> properties) throws AnalysisException {
+        String charFilterType = properties.get(INVERTED_INDEX_PARSER_CHAR_FILTER_TYPE);
+        if (charFilterType == null) {
+            return;
+        }
+
+        String charFilterPattern = properties.get(INVERTED_INDEX_PARSER_CHAR_FILTER_PATTERN);
+        String charFilterReplacement = properties.get(INVERTED_INDEX_PARSER_CHAR_FILTER_REPLACEMENT);
+        if (!INVERTED_INDEX_CHAR_FILTER_CHAR_REPLACE.equals(charFilterType)) {
+            throw new AnalysisException("Invalid 'char_filter_type', only '"
+                    + INVERTED_INDEX_CHAR_FILTER_CHAR_REPLACE + "' is supported");
+        }
+        if (charFilterPattern == null || charFilterPattern.isEmpty()) {
+            throw new AnalysisException("Missing 'char_filter_pattern' for 'char_replace' filter type");
+        }
+        if (!isSingleByte(charFilterPattern)) {
+            throw new AnalysisException("'char_filter_pattern' must contain only ASCII characters");
+        }
+        if (charFilterReplacement != null) {
+            if (charFilterReplacement.isEmpty() || charFilterReplacement.length() != 1) {

Review Comment:
   Previously, `isSingleByte` checked `char <= 0xFF`, whereas standard ASCII
requires `char <= 0x7F`. The wider range matches single-byte encodings such as
ISO-8859-1 (Latin-1), but the BE processes data as UTF-8, where every character
above `0x7F` takes more than one byte. This caused problems during pattern
replacement in `char_filter`, where only the first byte was used as the
replacement. For instance, "é" is `0xE9` in Latin-1 but `0xC3 0xA9` (a two-byte
sequence) in UTF-8, so the BE incorrectly used only the first byte (`0xC3`) for
the replacement. The `isSingleByte` function in the FE has now been updated to
a strict `isAscii` check, so the replacement must be a single-byte character
(`<= 0x7F`).
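
   To make the difference concrete, here is a minimal standalone sketch of the
two checks applied to "é". The class and method names (`CharFilterAsciiDemo`,
`isLatin1Range`, `isAscii`, `utf8Hex`) are hypothetical and written only for
illustration; they are not the actual code in `InvertedIndexUtil`.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: hypothetical helpers, not the real InvertedIndexUtil code.
final class CharFilterAsciiDemo {

    // Stand-in for the old behaviour: accepts any char up to 0xFF (Latin-1 range).
    static boolean isLatin1Range(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    // Stand-in for the strict check: accepts only chars up to 0x7F (ASCII).
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    // Hex dump of the UTF-8 encoding, i.e. the bytes a UTF-8 consumer would see.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // "é"
        System.out.println(isLatin1Range(eAcute)); // true: passes the <= 0xFF check
        System.out.println(isAscii(eAcute));       // false: rejected by the <= 0x7F check
        System.out.println(utf8Hex(eAcute));       // 0xC3 0xA9: two bytes in UTF-8
        System.out.println(utf8Hex("a"));          // 0x61: ASCII stays one byte
    }
}
```

   The `<= 0xFF` check lets "é" through even though it encodes to two bytes in
UTF-8, while the `<= 0x7F` check only admits characters that are exactly one
byte long in UTF-8.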



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

