Re: [PR] [feat](catalog) Support reading Hive table with MultiDelimitSerDe [doris]

via GitHub Mon, 23 Jun 2025 02:56:48 -0700


suxiaogang223 commented on code in PR #51936:
URL: https://github.com/apache/doris/pull/51936#discussion_r2160799189



##########
be/src/vec/exec/format/text/text_reader.cpp:
##########
@@ -41,18 +44,62 @@ namespace doris::vectorized {
 void HiveTextFieldSplitter::do_split(const Slice& line, std::vector<Slice>* splitted_values) {
     const char* data = line.data;
     const size_t size = line.size;
-    size_t value_start = 0;
-    for (size_t i = 0; i < size; ++i) {
-        if (data[i] == _value_sep[0]) {
-            // hive will escape the field separator in string
-            if (_escape_char != 0 && i > 0 && data[i - 1] == _escape_char) {
-                continue;
+    if (_value_sep_len == 1) {

Review Comment:
   It would be better to split the two branches into separate functions, as `PlainCsvTextFieldSplitter` does:
   ```c++
   void PlainCsvTextFieldSplitter::do_split(const Slice& line, std::vector<Slice>* splitted_values) {
       if (is_single_char_delim) {
           _split_field_single_char(line, splitted_values);
       } else {
           _split_field_multi_char(line, splitted_values);
       }
   }
   ```
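
Applied to the Hive splitter, the same shape could look roughly like the sketch below. It is only an illustration under stated assumptions: `Slice` is a minimal stand-in for the Doris type, `HiveTextFieldSplitterSketch` and its constructor are hypothetical, and the multi-character path ignores the escape character, which may not match what the PR actually does. The single-character path mirrors the escape check from the removed lines of the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-in for doris::Slice (pointer + length, no ownership).
struct Slice {
    const char* data;
    size_t size;
};

// Hypothetical sketch, not the class from the PR.
class HiveTextFieldSplitterSketch {
public:
    explicit HiveTextFieldSplitterSketch(std::string value_sep, char escape_char = 0)
            : _value_sep(std::move(value_sep)), _escape_char(escape_char) {}

    // Dispatch only; each strategy lives in its own helper.
    void do_split(const Slice& line, std::vector<Slice>* splitted_values) const {
        assert(splitted_values != nullptr);
        if (_value_sep.empty()) {
            // Assumption: with no separator the whole line is one field.
            splitted_values->push_back(line);
        } else if (_value_sep.size() == 1) {
            _split_field_single_char(line, splitted_values);
        } else {
            _split_field_multi_char(line, splitted_values);
        }
    }

private:
    void _split_field_single_char(const Slice& line, std::vector<Slice>* out) const {
        const char* data = line.data;
        const size_t size = line.size;
        size_t value_start = 0;
        for (size_t i = 0; i < size; ++i) {
            if (data[i] == _value_sep[0]) {
                // Hive escapes the field separator inside string values.
                if (_escape_char != 0 && i > 0 && data[i - 1] == _escape_char) {
                    continue;
                }
                out->push_back({data + value_start, i - value_start});
                value_start = i + 1;
            }
        }
        out->push_back({data + value_start, size - value_start});
    }

    // Simplification: escapes are not considered for multi-character separators.
    void _split_field_multi_char(const Slice& line, std::vector<Slice>* out) const {
        const std::string_view view(line.data, line.size);
        size_t value_start = 0;
        size_t pos = 0;
        while ((pos = view.find(_value_sep, pos)) != std::string_view::npos) {
            out->push_back({line.data + value_start, pos - value_start});
            pos += _value_sep.size();
            value_start = pos;
        }
        out->push_back({line.data + value_start, line.size - value_start});
    }

    std::string _value_sep;
    char _escape_char;
};
```

Keeping `do_split` as a pure dispatcher means each helper can be tested and optimized on its own, and the single-character fast path stays identical to the existing behavior.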



##########
fe/fe-core/src/main/java/org/apache/doris/datasource/hive/source/HiveScanNode.java:
##########
@@ -470,6 +472,25 @@ protected TFileAttributes getFileAttributes() throws UserException {
             fileAttributes.setHeaderType("");
             fileAttributes.setEnableTextValidateUtf8(
                     sessionVariable.enableTextValidateUtf8);
+        } else if (serDeLib.equals(HiveMetaStoreClientHelper.HIVE_MULTI_DELIMIT_SERDE)) {
+            TFileTextScanRangeParams textParams = new TFileTextScanRangeParams();
+            // set properties of MultiDelimitSerDe
+            // 1. set column separator (support multi-character delimiters)
+            textParams.setColumnSeparator(HiveProperties.getMultiDelimitFieldDelimiter(table));
+            // 2. set line delimiter
+            textParams.setLineDelimiter(HiveProperties.getLineDelimiter(table));
+            // 3. set mapkv delimiter
+            textParams.setMapkvDelimiter(HiveProperties.getMapKvDelimiter(table));
+            // 4. set collection delimiter
+            textParams.setCollectionDelimiter(HiveProperties.getCollectionDelimiter(table));
+            // 5. set escape delimiter
+            HiveProperties.getEscapeDelimiter(table).ifPresent(d -> textParams.setEscape(d.getBytes()[0]));
+            // 6. set null format
+            textParams.setNullFormat(HiveProperties.getNullFormat(table));
+            fileAttributes.setTextParams(textParams);
+            fileAttributes.setHeaderType("");
+            fileAttributes.setEnableTextValidateUtf8(
+                    sessionVariable.enableTextValidateUtf8);

Review Comment:
   This code is nearly identical to the handling of `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe`; the two branches should be merged.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

