ongdisheng opened a new pull request, #37: URL: https://github.com/apache/asterixdb/pull/37
## Description

There are two bugs in `writeUTF8StringAsCSV` in [PrintTools.java](https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-om/src/main/java/org/apache/asterix/dataflow/data/nontagged/printers/PrintTools.java#L305C4-L305C12):

1. Incorrect loop step in the quoting scan: The loop that checks whether a string needs quoting advanced one byte at a time (`i++`). Since multi-byte characters span 2 to 4 bytes, `charAt()` could be called at an offset pointing to the middle of a character.
2. Incorrect character writing: Characters were written using `PrintStream.print(char)`, which converts the `char` through the platform default charset before writing. For emoji encoded as surrogate pairs, each surrogate half is not a valid standalone Unicode character, so `PrintStream` emitted replacement characters (`?`) instead of the correct UTF-8 bytes.

## Fix

The quoting scan loop now advances by `UTF8StringUtil.charSize()` per iteration, so `charAt()` is always called at a valid character boundary. Characters are now written directly as raw UTF-8 bytes, which is consistent with how `writeUTF8StringAsJSON` already handles the same data.

## How to Reproduce and Verify

<details>
<summary>Setup</summary>

```bash
disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode 'statement=
DROP DATAVERSE test IF EXISTS;
CREATE DATAVERSE test;
USE test;
CREATE TYPE TweetType AS { id: int, text: string };
CREATE DATASET tweets(TweetType) PRIMARY KEY id;
' "http://localhost:19002/query/service"

disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode 'statement=
USE test;
INSERT INTO tweets ({"id": 1, "text": "@ScapegoatHelp Walked out on being scapegoated again. Saw Narcs mask slip & that sneer. 
No more 💪🦋"});
' "http://localhost:19002/query/service"
```

</details>

<details>
<summary>Before fix</summary>

```bash
disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode "statement=USE test; SELECT text FROM tweets;" "http://localhost:19002/query/service"
{
	"requestID": "3056c5df-fb14-4ff8-90b6-dcc62662a563",
	"signature": {
		"*": "*"
	},
	"results": [ {"text":"@ScapegoatHelp Walked out on being scapegoated again. Saw Narcs mask slip & that sneer. No more 💪🦋"} ]
	,
	"plans":{},
	"status": "success",
	"metrics": {
		"elapsedTime": "28.199702ms",
		"executionTime": "26.761674ms",
		"compileTime": "10.872533ms",
		"queueWaitTime": "0ns",
		"resultCount": 1,
		"resultSize": 111,
		"processedObjects": 1
	}
}

disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode "statement=USE test; SELECT text FROM tweets;" "http://localhost:19002/query/service?format=csv&header=absent"
{
	"requestID": "6bbf0b3d-05a3-43e5-ad61-d3ebeee20e5e",
	"type": "text/csv; header=absent",
	"signature": {
		"*": "*"
	},
	"errors": [{
		"code": 1,
		"msg": "java.lang.IllegalArgumentException"
	} ],
	"status": "fatal",
	"metrics": {
		"elapsedTime": "26.795447ms",
		"executionTime": "25.588135ms",
		"compileTime": "11.320646ms",
		"queueWaitTime": "0ns",
		"resultCount": 0,
		"resultSize": 0,
		"processedObjects": 0,
		"bufferCacheHitRatio": "0.00%",
		"bufferCachePageReadCount": 0,
		"errorCount": 1
	}
}
```

</details>

<details>
<summary>After fix</summary>

```bash
disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode "statement=USE test; SELECT text FROM tweets;" \
  "http://localhost:19002/query/service"
{
	"requestID": "6f404f34-1726-42d7-ba7a-990d9b08cd0d",
	"signature": {
		"*": "*"
	},
	"results": [ {"text":"@ScapegoatHelp Walked out on being scapegoated again. Saw Narcs mask slip & that sneer. 
No more 💪🦋"} ]
	,
	"plans":{},
	"status": "success",
	"metrics": {
		"elapsedTime": "138.660107ms",
		"executionTime": "134.309116ms",
		"compileTime": "43.763632ms",
		"queueWaitTime": "0ns",
		"resultCount": 1,
		"resultSize": 111,
		"processedObjects": 1
	}
}

disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode "statement=USE test; SELECT text FROM tweets;" \
  "http://localhost:19002/query/service?format=csv&header=absent"
{
	"requestID": "855ebc1a-6816-49ef-bbcb-7e99fe6e5ef0",
	"type": "text/csv; header=absent",
	"signature": {
		"*": "*"
	},
	"results": [ "@ScapegoatHelp Walked out on being scapegoated again. Saw Narcs mask slip & that sneer. No more 💪🦋" ]
	,
	"plans":{},
	"status": "success",
	"metrics": {
		"elapsedTime": "34.516791ms",
		"executionTime": "32.945963ms",
		"compileTime": "11.91599ms",
		"queueWaitTime": "1ms",
		"resultCount": 1,
		"resultSize": 102,
		"processedObjects": 1
	}
}
```

</details>

## JIRA Issue

https://issues.apache.org/jira/browse/ASTERIXDB-2877

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
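For readers who want to see why the loop step matters, the boundary-aware scan described under "Fix" can be sketched roughly as below. This is a minimal standalone illustration, not the code in this PR: the class `Utf8ScanSketch`, the helper `charSize(byte)` and the method `needsQuoting(byte[])` are hypothetical names, the helper only stands in for `UTF8StringUtil.charSize()`, the set of quote-triggering characters is an assumption, and the input is plain standard UTF-8 produced by `String.getBytes`.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ScanSketch {
    // Hypothetical stand-in for UTF8StringUtil.charSize(): length in bytes of the
    // UTF-8 sequence that starts with the given lead byte.
    static int charSize(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
        if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
        // 10xxxxxx is a continuation byte: the offset points into the middle
        // of a character, which is what a byte-wise i++ loop can run into.
        throw new IllegalArgumentException("offset is not at a character boundary");
    }

    // Returns true if the UTF-8 bytes contain a character that (by assumption
    // here) forces CSV quoting: double quote, comma, CR or LF.
    static boolean needsQuoting(byte[] utf8) {
        int i = 0;
        while (i < utf8.length) {
            int size = charSize(utf8[i]);
            if (size == 1) {
                char c = (char) utf8[i];
                if (c == '"' || c == ',' || c == '\n' || c == '\r') {
                    return true;
                }
            }
            i += size; // advance by one whole character, never by a single byte
        }
        return false;
    }

    public static void main(String[] args) {
        // "No more " followed by U+1F4AA and U+1F98B, written as surrogate pairs.
        byte[] plain = "No more \uD83D\uDCAA\uD83E\uDD8B".getBytes(StandardCharsets.UTF_8);
        byte[] comma = "a,\uD83D\uDCAA".getBytes(StandardCharsets.UTF_8);
        System.out.println(needsQuoting(plain));
        System.out.println(needsQuoting(comma));
        // Emit the raw UTF-8 bytes directly instead of printing char by char.
        System.out.write(plain, 0, plain.length);
        System.out.println();
    }
}
```

Stepping with `i++` instead of `i += size` would hand the second byte of each emoji to `charSize`, which is a continuation byte and triggers the exception in this sketch.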
