SUBSTITUDE opened a new pull request, #64651:
URL: https://github.com/apache/doris/pull/64651

   ## Problem
   
   In non-strict load mode, importing strings that contain multi-byte UTF-8 characters
   (e.g., Chinese characters, or special Unicode such as U+0131 'ı') into VARCHAR columns
   does not truncate them to the column's byte limit, so the rows are incorrectly rejected.
   
   ### Root Cause
   
   The `substring(str, 1, limit)` function interprets its third parameter differently
   depending on the input:
   - ASCII path: treats it as a byte count
   - UTF-8 path: treats it as a character count
   
   When `limit` (a byte count derived from VARCHAR(N)) is passed to `substring`, the UTF-8
   path fails to truncate strings whose byte count exceeds the limit but whose character
   count does not.
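   
   The mismatch can be reproduced outside Doris with a few lines of C++. This is only an
   illustrative sketch: `utf8_char_count` and `prefix_by_chars` are hypothetical helpers
   (not Doris functions), and `prefix_by_chars` merely stands in for a substring that
   counts characters. Both assume well-formed UTF-8 input.
   
   ```cpp
   #include <cstddef>
   #include <string_view>
   
   // Number of UTF-8 characters: every byte that is not a continuation
   // byte (bit pattern 10xxxxxx) starts a new character.
   std::size_t utf8_char_count(std::string_view s) {
       std::size_t n = 0;
       for (unsigned char c : s) {
           if ((c & 0xC0) != 0x80) ++n;
       }
       return n;
   }
   
   // Hypothetical stand-in for a character-counting substring(str, 1, limit):
   // keeps the first `max_chars` characters, whatever their byte length.
   std::string_view prefix_by_chars(std::string_view s, std::size_t max_chars) {
       std::size_t chars = 0;
       for (std::size_t i = 0; i < s.size(); ++i) {
           if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
               // s[i] starts a new character; stop once enough were kept.
               if (chars == max_chars) return s.substr(0, i);
               ++chars;
           }
       }
       return s;  // fewer than max_chars characters: nothing is cut
   }
   ```
   
   For a string made of '中' (3 bytes) followed by 30 ASCII digits, `size()` is 33 while
   `utf8_char_count` is 31, so `prefix_by_chars(s, 32)` returns all 33 bytes and a
   32-byte limit is still exceeded.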
   
   ### Example
   
   ```sql
   CREATE TABLE test_table (env VARCHAR(32) NOT NULL DEFAULT '')
   -- non-strict mode routine load
   
   -- Input: "${jnd${upper:ı}:ldap://test.comxxxxxx}" (33 bytes, 32 chars due to 'ı' = 2 bytes)
   -- substring(str, 1, 32) → keeps all 32 chars = 33 bytes
   -- Validation: 33 > 32 → REJECTED ❌
   
   -- Input: "中123456789012345678901234567890" (33 bytes, 31 chars due to '中' = 3 bytes)
   -- substring(str, 1, 32) → keeps all 31 chars = 33 bytes
   -- Validation: 33 > 32 → REJECTED ❌
   ```
   
   ### Fix
   
   After the substring truncation, for rows that still exceed the byte limit in non-strict mode,
   manually truncate each string at a valid UTF-8 character boundary. The algorithm walks backwards
   from the limit position, skipping UTF-8 continuation bytes (0x80-0xBF), to find the last
   non-continuation byte position.
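   
   The backward walk described above can be sketched as follows. This is a minimal
   illustration, not the code in this PR: `truncate_utf8` and `is_utf8_continuation`
   are hypothetical names, and the input is assumed to be well-formed UTF-8.
   
   ```cpp
   #include <cstddef>
   #include <string_view>
   
   // True for UTF-8 continuation bytes (0x80-0xBF, bit pattern 10xxxxxx).
   inline bool is_utf8_continuation(unsigned char c) {
       return (c & 0xC0) == 0x80;
   }
   
   // Returns the longest prefix of `s` that is at most `max_bytes` long and
   // does not end in the middle of a multi-byte character.
   std::string_view truncate_utf8(std::string_view s, std::size_t max_bytes) {
       if (s.size() <= max_bytes) return s;  // already within the limit
       std::size_t cut = max_bytes;          // s[cut] exists: cut < s.size()
       // s[cut] is the first byte that would be dropped. If it is a
       // continuation byte, the character it belongs to straddles the limit,
       // so step back to that character's lead byte and drop it entirely.
       while (cut > 0 && is_utf8_continuation(static_cast<unsigned char>(s[cut]))) {
           --cut;
       }
       return s.substr(0, cut);
   }
   ```
   
   With a 32-byte limit, '中' followed by 30 digits is cut to exactly 32 bytes, while 31
   ASCII bytes followed by 'ı' are cut to 31 bytes, because keeping only the first byte of
   'ı' would leave an invalid sequence.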
   
   ### Changes
   
   - `be/src/exec/sink/vtablet_block_convertor.cpp`: Modified `_internal_validate_column`
     to add UTF-8 boundary-aware truncation in non-strict load mode
   
   ### Related
   
   - Closes #64334
   - The same byte-vs-char mismatch also exists in `file_scanner.cpp`'s
     `_truncate_char_or_varchar_column` and should be addressed in a follow-up PR



