[GitHub] [zeppelin] zhxiaoping commented on a change in pull request #4179: [ZEPPELIN-5458] fix that zepplin can not parse columns which contains chinese character

GitBox Tue, 20 Jul 2021 20:10:19 -0700


zhxiaoping commented on a change in pull request #4179:
URL: https://github.com/apache/zeppelin/pull/4179#discussion_r673628437




##########
File path: 
livy/src/main/java/org/apache/zeppelin/livy/LivySparkSQLInterpreter.java
##########
@@ -197,7 +200,18 @@ public FormType getFormType() {
     return rows;
   }
 
-  protected List<String> parseSQLOutput(String output) {
+  protected List<String> parseSQLOutput(String str) {
+    String fullWidthRegex = "([" +
+            "\u1100-\u115F" +
+            "\u2E80-\uA4CF" +
+            "\uAC00-\uD7A3" +
+            "\uF900-\uFAFF" +
+            "\uFE10-\uFE19" +
+            "\uFE30-\uFE6F" +
+            "\uFF00-\uFF60" +
+            "\uFFE0-\uFFE6" +
+            "])";
+    String output = str.replaceAll(fullWidthRegex, "$1\u0001");

Review comment:
   The regex is taken from org.apache.spark.util.Utils#fullWidthRegex.
   
   
![image](https://user-images.githubusercontent.com/47968604/126423231-b7586739-1f5c-4529-8403-cc6115dee2ba.png)
   
   When Spark pads its table output, it counts a Chinese (full-width) character as two placeholders, where one placeholder is the width of one ordinary char.
   Zeppelin counts a Chinese character as only one placeholder.
   Because the two use different standards, Zeppelin cannot parse the columns based on column size: it treats each Chinese character as one placeholder when it actually occupies two.
   
   This PR does two things.
   First, it inserts a special character (\u0001), which is never used in the output, after every Chinese character so that Zeppelin can split the string correctly. The \u0001 is replaced with an empty string before the record is added to rows.
   
   Second, it prevents Chinese characters from being escaped.
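
A minimal sketch of the padding idea described above, not the code in LivySparkSQLInterpreter. The class name `FullWidthPaddingSketch` and the helpers `padFullWidth` and `splitRow` are hypothetical names used only for illustration, and the sample assumes Spark's usual `+----+---+` border layout. The character ranges are copied from the diff.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names are hypothetical, not Zeppelin's API.
class FullWidthPaddingSketch {
  // Same ranges as fullWidthRegex in the diff above.
  static final String FULL_WIDTH_REGEX = "(["
      + "\u1100-\u115F"
      + "\u2E80-\uA4CF"
      + "\uAC00-\uD7A3"
      + "\uF900-\uFAFF"
      + "\uFE10-\uFE19"
      + "\uFE30-\uFE6F"
      + "\uFF00-\uFF60"
      + "\uFFE0-\uFFE6"
      + "])";

  // Marker assumed never to appear in real query output.
  static final String MARKER = "\u0001";

  // Append the marker after every full-width character so that the
  // number of chars matches the display width Spark used for padding.
  static String padFullWidth(String s) {
    return s.replaceAll(FULL_WIDTH_REGEX, "$1" + MARKER);
  }

  // Slice a padded row at the '+' positions of the border line,
  // then drop the marker and the padding spaces from each cell.
  static List<String> splitRow(String border, String paddedRow) {
    List<String> cells = new ArrayList<>();
    int start = 1;
    for (int i = 1; i < border.length(); i++) {
      if (border.charAt(i) == '+') {
        cells.add(paddedRow.substring(start, i).replace(MARKER, "").trim());
        start = i + 1;
      }
    }
    return cells;
  }

  public static void main(String[] args) {
    String border = "+----+---+";
    // Two Chinese characters (U+4E2D, U+6587) fill a 4-wide column.
    String row = "|\u4E2D\u6587|  1|";
    String padded = padFullWidth(row);
    System.out.println(row.length() + " chars before, " + padded.length() + " after");
    System.out.println(splitRow(border, padded));
  }
}
```

Without the marker the row has fewer chars than the border line, so slicing by the border's column positions cuts the cells in the wrong places; with the marker the two lengths agree.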




