zhxiaoping commented on a change in pull request #4179:
URL: https://github.com/apache/zeppelin/pull/4179#discussion_r673628437
##########
File path:
livy/src/main/java/org/apache/zeppelin/livy/LivySparkSQLInterpreter.java
##########
@@ -197,7 +200,18 @@ public FormType getFormType() {
return rows;
}
- protected List<String> parseSQLOutput(String output) {
+ protected List<String> parseSQLOutput(String str) {
+ String fullWidthRegex = "([" +
+ "\u1100-\u115F" +
+ "\u2E80-\uA4CF" +
+ "\uAC00-\uD7A3" +
+ "\uF900-\uFAFF" +
+ "\uFE10-\uFE19" +
+ "\uFE30-\uFE6F" +
+ "\uFF00-\uFF60" +
+ "\uFFE0-\uFFE6" +
+ "])";
+ String output = str.replaceAll(fullWidthRegex, "$1\u0001");
Review comment:
The regex is taken from org.apache.spark.util.Utils#fullWidthRegex.

Spark and Zeppelin measure column width differently. When Spark pads a
column, a Chinese (full-width) character counts as two placeholders,
while any other character counts as one. Zeppelin counts every
character, Chinese included, as one placeholder. Because Zeppelin
undercounts each Chinese character by one placeholder, it cannot split
a row into columns based on column size.

This PR does two things.
First, it inserts a special character (\u0001), which is never otherwise
used, after every Chinese character so that Zeppelin can split the
string correctly. The \u0001 characters are replaced with an empty
string before the record is added to rows.
Second, it prevents Chinese characters from being escaped.
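
To illustrate the first point, here is a minimal, self-contained sketch of the idea. It is not the actual Zeppelin implementation: the class name FullWidthSplitSketch and the helpers padFullWidth and parseTable are hypothetical, and it assumes the output is a Spark-style table whose first line is a "+----+" border that defines the column boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the code in LivySparkSQLInterpreter.
class FullWidthSplitSketch {
    // Same ranges as in the diff above (taken from Spark's fullWidthRegex).
    static final String FULL_WIDTH_REGEX = "(["
        + "\u1100-\u115F"
        + "\u2E80-\uA4CF"
        + "\uAC00-\uD7A3"
        + "\uF900-\uFAFF"
        + "\uFE10-\uFE19"
        + "\uFE30-\uFE6F"
        + "\uFF00-\uFF60"
        + "\uFFE0-\uFFE6"
        + "])";

    // Filler character (U+0001), assumed never to occur in real output.
    static final String MARKER = "\u0001";

    // Append the marker after each full-width character so that the
    // character plus marker occupies two chars, matching Spark's width.
    static String padFullWidth(String s) {
        return s.replaceAll(FULL_WIDTH_REGEX, "$1" + MARKER);
    }

    // Split a table into cells using the '+' positions of the first line.
    static List<List<String>> parseTable(String output) {
        String[] lines = padFullWidth(output).split("\n");
        List<Integer> plus = new ArrayList<>();
        for (int i = 0; i < lines[0].length(); i++) {
            if (lines[0].charAt(i) == '+') {
                plus.add(i);
            }
        }
        List<List<String>> rows = new ArrayList<>();
        for (String line : lines) {
            if (!line.startsWith("|")) {
                continue; // skip border lines
            }
            List<String> cells = new ArrayList<>();
            for (int c = 0; c + 1 < plus.size(); c++) {
                String cell = line.substring(plus.get(c) + 1, plus.get(c + 1));
                // Drop the markers before the cell is stored.
                cells.add(cell.replace(MARKER, "").trim());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        // The first cell of the data row holds two Chinese characters,
        // written as Unicode escapes to keep the source ASCII-only.
        String table = "+----+---+\n|name|age|\n+----+---+\n"
            + "|\u5f20\u4e09| 20|\n+----+---+";
        System.out.println(parseTable(table));
    }
}
```

Without the padFullWidth step, the data row would be two chars shorter than the border line, so the fixed offsets would cut the cells in the wrong places.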