(impala) branch master updated: IMPALA-889: Add trim() function matching ANSI SQL definition

csringhofer Mon, 02 Dec 2024 13:22:29 -0800

This is an automated email from the ASF dual-hosted git repository.

csringhofer pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git



The following commit(s) were added to refs/heads/master by this push:
     new 81f267388 IMPALA-889: Add trim() function matching ANSI SQL definition
81f267388 is described below

commit 81f2673883f65aa71f682bf9fda6dd73888e75a8
Author: Mihaly Szjatinya <[email protected]>
AuthorDate: Sun Dec 1 20:23:28 2024 +0100

    IMPALA-889: Add trim() function matching ANSI SQL definition
    
    As agreed in JIRA discussions, the current PR extends existing TRIM
    functionality with the support of SQL-standardized TRIM-FROM syntax:
    TRIM({[LEADING / TRAILING / BOTH] | [STRING characters]} FROM expr).
    Implemented based on the existing LTRIM / RTRIM / BTRIM family of
    functions prepared earlier in IMPALA-6059 and extended for UTF-8 in
    IMPALA-12718. Besides, partly based on abandoned PR
    https://gerrit.cloudera.org/#/c/4474 and similar EXTRACT-FROM
    functionality from https://github.com/apache/impala/commit/543fa73f3a846
    f0e4527514c993cb0985912b06c.
    
    Supported syntaxes:
    Syntax #1 TRIM(<where> FROM <string>);
    Syntax #2 TRIM(<charset> FROM <string>);
    Syntax #3 TRIM(<where> <charset> FROM <string>);
    
    "where": Case-insensitive trim direction. Valid options are "leading",
      "trailing", and "both". "leading" means trimming characters from the
      start; "trailing" means trimming characters from the end; "both" means
      trimming characters from both sides. For Syntax #2, since no "where"
      is specified, the option "both" is implied by default.
    
    "charset": Case-sensitive characters to be removed. This argument is
      regarded as a character set going to be removed. The occurrence order
      of each character doesn't matter and duplicated instances of the same
      character will be ignored. NULL argument implies " " (standard space)
      by default. Empty argument ("" or '') makes TRIM return the string
      untouched. For Syntax #1, since no "charset" is specified, it trims
      " " (standard space) by default.
    
    "string": Case-sensitive target string to trim. This argument can be
      NULL.
    
    The UTF8_MODE query option is honored by TRIM-FROM, similarly to
    existing TRIM().
    
    UTF8_TRIM-FROM can be used to force UTF8 mode regardless of the query
    option.
    
    Design Notes:
    1. No-BE. Since the existing LTRIM / RTRIM / BTRIM functions fully cover
    all needed use-cases, no backend logic is required. This differs from
    similar EXTRACT-FROM.
    
    2. Syntax wrapper. TrimFromExpr class was introduced as a syntax
    wrapper around FunctionCallExpr, which instantiates one of the regular
    LTRIM / RTRIM / BTRIM functions. TrimFromExpr's role is to maintain
    the integrity of the "phantom" TRIM-FROM built-in function.
    
    3. No TRIM keyword. Following EXTRACT-FROM, no "TRIM" keyword was
    added to the language. Although generally a keyword would allow easier
    and better parsing, on the negative side it restricts token's usage in
    general context. However, leading/trailing/both, being previously
    saved as reserved words, are now added as keywords to make possible
    their usage with no escaping.
    
    Change-Id: I3c4fa6d0d8d0684c4b6d8dac8fd531d205e4f7b4
    Reviewed-on: http://gerrit.cloudera.org:8080/21825
    Reviewed-by: Csaba Ringhofer <[email protected]>
    Tested-by: Csaba Ringhofer <[email protected]>
---
 be/src/exprs/expr-test.cc                          |  57 ++++++
 docs/topics/impala_string_functions.xml            |  27 ++-
 fe/src/main/cup/sql-parser.cup                     |  50 +++++-
 .../org/apache/impala/analysis/TrimFromExpr.java   | 193 +++++++++++++++++++++
 fe/src/main/jflex/sql-scanner.flex                 |   3 +
 .../apache/impala/analysis/AnalyzeExprsTest.java   |  61 ++++++-
 .../java/org/apache/impala/analysis/ToSqlTest.java |   7 +
 .../queries/QueryTest/utf8-string-functions.test   | 182 +++++++++++++++++++
 8 files changed, 571 insertions(+), 9 deletions(-)

diff --git a/be/src/exprs/expr-test.cc b/be/src/exprs/expr-test.cc
index f987eed9e..f186fe1bd 100644
--- a/be/src/exprs/expr-test.cc
+++ b/be/src/exprs/expr-test.cc
@@ -4765,6 +4765,63 @@ TEST_P(ExprTest, StringFunctions) {
   TestStringValue("btrim('æeioü','æü')", "eio");
   TestStringValue("btrim('\\\\abcdefg\\\\', 'ag\\\\')", "bcdef");
 
+  /// Test cases for TRIM UDF
+  // TRIM(where FROM string): trim space by default;
+  TestStringValue("trim(leading FROM '   1 2 3 4 5   ')", "1 2 3 4 5   ");
+  TestStringValue("trim(leAdiNG FrOm '  212  ')", "212  ");
+  TestStringValue("trim(leading from 'blackhole')", "blackhole");
+  TestStringValue("trim(trailing FroM '  13 1 3 4 6 1 24   ')", "  13 1 3 4 6 
1 24");
+  TestStringValue("trim(tRAIlinG fROm '  1a 9b 4c 5d     ')", "  1a 9b 4c 5d");
+  TestStringValue("trim(trailing from '  abcdefg  ')", "  abcdefg");
+  TestStringValue("trim(bOTh frOm '   ab c d e f 1234  ')", "ab c d e f 1234");
+  TestStringValue("trim(both from '  aaaaaaaaa  ')", "aaaaaaaaa");
+  TestStringValue("trim(both from 'pulsar')", "pulsar");
+
+  // TRIM(string FROM string): trim from both sides by default;
+  TestStringValue("trim('1234' FroM '23471364134612413434')", "713641346");
+  TestStringValue("trim('abc' from 'abacdefg')", "defg");
+  TestStringValue("trim('xyz' fROm 'abcdabcdabc')", "abcdabcdabc");
+  TestStringValue("trim('aaaaabbbbbccccccccffffffffg' from 'abcdefg')", "de");
+  TestStringValue("trim('' fROM '  helloworld  ')", "  helloworld  ");
+  TestStringValue("trim(' ' from '  helloworld  ')", "helloworld");
+  TestStringValue("trim(NULL FROM '  helloworld  ')", "  helloworld  ");
+  TestStringValue("trim('both' FROM 'boneth')", "ne");
+
+  // TRIM(where string FROM string): regular test cases
+  // leading/trailing
+  TestStringValue("trim(leading 'rt' froM 'rrrrssssstttttt')", "ssssstttttt");
+  TestStringValue("trim(trailing '1234' FroM '23471364134612413434')", 
"234713641346");
+  TestStringValue("trim(TRAILING 'a0' from '0ac2aa0aa00000a')", "0ac2");
+  TestStringValue("trim(trailing 'rt' FROm 'rrrrssssstttttt')", "rrrrsssss");
+  TestStringValue("trim(tRAIlinG 'rt' FROM 'rrrrssssstttttt')", "rrrrsssss");
+  // both
+  TestStringValue("trim(both 'aaaaaaaaaaaaaaaaaaaaabg' from 'abcdefg')", 
"cdef");
+  TestStringValue("trim(bOTh 'a' from 'aaaaaaaaa')", "");
+  // Empty source strings return empty results
+  TestStringValue("trim(leading 'abc' from '')", "");
+  TestStringValue("trim(trailing 'xyzabcrst' from '')", "");
+  TestStringValue("trim(both 'aeiou' from '')", "");
+  // Null source strings return null results
+  TestIsNull("trim(leading '' from NULL)", TYPE_STRING);
+  TestIsNull("trim(trailing '' from NULL)", TYPE_STRING);
+  TestIsNull("trim(both 'aeiou' from NULL)", TYPE_STRING);
+  TestIsNull("trim(BoTh 'abc' from NULL)", TYPE_STRING);
+  // Special cases for string_expr nonterminal
+  TestStringValue("trim(both 'ao' from 'aaYYBBoo')", "YYBB");
+  TestStringValue("trim(both from '  YYBB ')", "YYBB");
+  TestStringValue("trim(+'aaoo' from 'aaYYBBoo')", "YYBB");
+  TestStringValue("trim(upper('a') from 'AAYYBB')", "YYBB");
+  TestStringValue(
+      "trim(replace('hello world', 'world', 'earth') from 'aaYYBBoo')",
+      "YYBB");
+  TestStringValue("trim(left('12ao5BY',5) from 'aaYYBBoo')", "YYBB");
+  TestStringValue("trim(right('BY54ao1',5) from 'aaYYBBoo')", "YYBB");
+  TestStringValue("trim(if((1=2) ,NULL, 'ao') from 'aaYYBBoo')", "YYBB");
+  TestStringValue("trim(cast(12.45 as string) from '1YYBB2')", "YYBB");
+  TestStringValue(
+      "trim(case when 1 = 2 then NULL else 'ao' end from 'aaYYBBoo')",
+      "YYBB");
+
   TestStringValue("space(0)", "");
   TestStringValue("space(-1)", "");
   TestStringValue("space(cast(1 as bigint))", " ");
diff --git a/docs/topics/impala_string_functions.xml 
b/docs/topics/impala_string_functions.xml
index 5772911ee..59ae62b69 100644
--- a/docs/topics/impala_string_functions.xml
+++ b/docs/topics/impala_string_functions.xml
@@ -1805,7 +1805,7 @@ select replace('hello world','xyz','abc');
       <dlentry id="trim">
 
         <dt>
-          TRIM(STRING a)
+          TRIM(STRING a), TRIM({{LEADING | TRAILING | BOTH} [STRING 
chars_to_trim] | STRING chars_to_trim} FROM STRING a)
         </dt>
 
         <dd>
@@ -1819,6 +1819,31 @@ select replace('hello world','xyz','abc');
             besides spaces, see <codeph>BTRIM()</codeph>.
           </p>
 
+          <p>
+            TRIM-FROM syntax is a SQL-standardized wrapper around
+            <codeph>LTRIM</codeph> / <codeph>RTRIM</codeph> / 
<codeph>BTRIM</codeph>.
+            Depending on the first parameter <codeph>LEADING</codeph> /
+            <codeph>TRAILING</codeph> / <codeph>BOTH</codeph>, wrapper 
resolves itself to
+            the corresponding underlying function. Default value is 
<codeph>BOTH</codeph>.
+            If present, the <codeph>chars_to_trim</codeph> parameter is passed 
forward to
+            the underlying function. Thus, given syntax may come in three 
different forms:
+          </p>
+
+          <p>
+            Syntax #1: <codeph>TRIM(&lt;where&gt; FROM 
&lt;string&gt;)</codeph> --&gt;
+            <codeph>{L|R|B}TRIM(string)</codeph>;
+          </p>
+
+          <p>
+            Syntax #2: <codeph>TRIM(&lt;charset&gt; FROM 
&lt;string&gt;)</codeph>  --&gt;
+            <codeph>BTRIM(string, charset)</codeph>;
+          </p>
+
+          <p>
+            Syntax #3: <codeph>TRIM(&lt;where&gt; &lt;charset&gt; FROM 
&lt;string&gt;)</codeph>
+            --&gt;<codeph>{L|R|B}TRIM(string, charset)</codeph>.
+          </p>
+
           <p>
             <b>Return type:</b> <codeph>STRING</codeph>
           </p>
diff --git a/fe/src/main/cup/sql-parser.cup b/fe/src/main/cup/sql-parser.cup
index 34e4b771a..ab417fd79 100755
--- a/fe/src/main/cup/sql-parser.cup
+++ b/fe/src/main/cup/sql-parser.cup
@@ -303,7 +303,7 @@ parser code {:
 terminal
   KW_ADD, KW_AGGREGATE, KW_ALL, KW_ALTER, KW_ANALYTIC, KW_AND, KW_ANTI, 
KW_API_VERSION,
   KW_ARRAY, KW_AS, KW_ASC, KW_AUTHORIZATION, KW_AVRO, KW_BETWEEN, KW_BIGINT, 
KW_BINARY,
-  KW_BLOCKSIZE, KW_BOOLEAN, KW_BY, KW_CACHED, KW_CASCADE, KW_CASE, KW_CAST, 
KW_CHANGE,
+  KW_BLOCKSIZE, KW_BOOLEAN, KW_BOTH, KW_BY, KW_CACHED, KW_CASCADE, KW_CASE, 
KW_CAST, KW_CHANGE,
   KW_CHAR, KW_CLASS, KW_CLOSE_FN, KW_COLUMN, KW_COLUMNS, KW_COMMENT, 
KW_COMPRESSION,
   KW_COMPUTE, KW_CONSTRAINT, KW_CONVERT, KW_COPY, KW_CREATE, KW_CROSS, 
KW_CUBE, KW_CURRENT,
   KW_DATA, KW_DATABASE, KW_DATABASES, KW_DATE, KW_DATETIME, KW_DECIMAL, 
KW_DEFAULT, KW_DELETE,
@@ -315,8 +315,8 @@ terminal
   KW_GROUP, KW_GROUPING, KW_HASH, KW_HUDIPARQUET, KW_IGNORE, KW_HAVING, 
KW_ICEBERG, KW_IF,
   KW_ILIKE, KW_IN, KW_INCREMENTAL, KW_INIT_FN, KW_INNER, KW_INPATH, KW_INSERT, 
KW_INT,
   KW_INTERMEDIATE, KW_INTERSECT, KW_INTERVAL, KW_INTO, KW_INVALIDATE, 
KW_IREGEXP, KW_IS,
-  KW_JDBC, KW_JOIN, KW_JSONFILE, KW_KUDU, KW_LAST, KW_LEFT, KW_LEXICAL, 
KW_LIKE, KW_LIMIT,
-  KW_LINES, KW_LOAD, KW_LOCATION, KW_LOGICAL_OR, KW_MANAGED_LOCATION, KW_MAP, 
KW_MATCHED,
+  KW_JDBC, KW_JOIN, KW_JSONFILE, KW_KUDU, KW_LAST, KW_LEADING, KW_LEFT, 
KW_LEXICAL, KW_LIKE,
+  KW_LIMIT, KW_LINES, KW_LOAD, KW_LOCATION, KW_LOGICAL_OR, 
KW_MANAGED_LOCATION, KW_MAP, KW_MATCHED,
   KW_MERGE, KW_MERGE_FN, KW_METADATA, KW_MINUS, KW_NON, KW_NORELY, KW_NOT,
   KW_NOVALIDATE, KW_NULL, KW_NULLS, KW_OF, KW_OFFSET, KW_ON, KW_OPTIMIZE, 
KW_OR,
   KW_ORC, KW_ORDER, KW_OUTER,
@@ -330,7 +330,7 @@ terminal
   KW_STORAGE_HANDLER_URI, KW_STORED, KW_STRAIGHT_JOIN, KW_STRING, KW_STRUCT, 
KW_SYMBOL,
   KW_SYSTEM_TIME, KW_SYSTEM_VERSION, KW_TABLE, KW_TABLES, KW_TABLESAMPLE, 
KW_TBLPROPERTIES,
   KW_TERMINATED, KW_TEXTFILE, KW_THEN, KW_TIMESTAMP, KW_TINYINT, KW_TRUNCATE, 
KW_STATS,
-  KW_TO, KW_TRUE, KW_UDF, KW_UNBOUNDED, KW_UNCACHED, KW_UNION, KW_UNIQUE, 
KW_UNKNOWN,
+  KW_TO, KW_TRAILING, KW_TRUE, KW_UDF, KW_UNBOUNDED, KW_UNCACHED, KW_UNION, 
KW_UNIQUE, KW_UNKNOWN,
   KW_UNNEST, KW_UNSET, KW_UPDATE, KW_UPDATE_FN, KW_UPSERT, KW_USE, KW_USING, 
KW_VALIDATE,
   KW_VALUES, KW_VARCHAR, KW_VIEW, KW_VIEWS, KW_WHEN, KW_WHERE, KW_WITH, 
KW_ZORDER;
 
@@ -491,6 +491,7 @@ nonterminal PartitionKeyValue partition_key_value;
 nonterminal PartitionKeyValue static_partition_key_value;
 nonterminal Qualifier opt_set_op_qualifier;
 nonterminal SetOperator set_op;
+nonterminal TrimFromExpr.TrimOption trim_option;
 
 // For ALTER DATABASE.
 nonterminal AlterDbStmt alter_db_stmt;
@@ -3841,6 +3842,15 @@ non_pred_expr ::=
   {: RESULT = s; :}
   ;
 
+trim_option ::=
+  KW_BOTH
+  {: RESULT = TrimFromExpr.TrimOption.BOTH; :}
+  | KW_LEADING
+  {: RESULT = TrimFromExpr.TrimOption.LEADING; :}
+  | KW_TRAILING
+  {: RESULT = TrimFromExpr.TrimOption.TRAILING; :}
+  ;
+
 function_call_expr ::=
   function_name:fn_name LPAREN RPAREN
   {:
@@ -3849,9 +3859,29 @@ function_call_expr ::=
   :}
   | function_name:fn_name LPAREN function_params:params RPAREN
   {: RESULT = FunctionCallExpr.createExpr(fn_name, params, 
parser.getQueryOptions()); :}
-  // Below is a special case for EXTRACT. Idents are used to avoid adding new 
keywords.
-  | function_name:fn_name LPAREN ident_or_default:u KW_FROM expr:t RPAREN
-  {:  RESULT = new ExtractFromExpr(fn_name, u, t); :}
+  // Below are special cases for function calls with KW_FROM inside. Idents 
are used to
+  // avoid adding new keywords.
+  // Common syntax for "EXTRACT(unit FROM TIMESTAMP/DATE ts)"
+  // and "TRIM(where FROM string)":
+  | function_name:fn_name LPAREN ident_or_default:i KW_FROM expr:e RPAREN
+  {:
+    if (fn_name.toString().toLowerCase().endsWith("extract")) {
+      RESULT = new ExtractFromExpr(fn_name, i, e);
+    } else {
+      // Special case for "TRIM(<charset> FROM <string>)", where  charset is a 
simple
+      // slot reference.
+      RESULT = new TrimFromExpr(fn_name, i, e);
+    }
+  :}
+  // "TRIM(<where> FROM <string>)":
+  | function_name:fn_name LPAREN trim_option:w KW_FROM expr:s RPAREN
+  {: RESULT = new TrimFromExpr(fn_name, w, s); :}
+  // Syntax: "TRIM(<charset> FROM <string>)":
+  | function_name:fn_name LPAREN expr:c KW_FROM expr:s RPAREN
+  {: RESULT = new TrimFromExpr(fn_name, c, s); :}
+  // Syntax: "TRIM(<where> <charset> FROM <string>)":
+  | function_name:fn_name LPAREN trim_option:w expr:c KW_FROM expr:s RPAREN
+  {: RESULT = new TrimFromExpr(fn_name, w, c, s); :}
   ;
 
 // TODO: allow an arbitrary expr here instead of agg/fn call, and check during 
analysis?
@@ -4315,6 +4345,8 @@ word ::=
   {: RESULT = r.toString(); :}
   | KW_BOOLEAN:r
   {: RESULT = r.toString(); :}
+  | KW_BOTH:r
+  {: RESULT = r.toString(); :}
   | KW_BUCKETS:r
   {: RESULT = r.toString(); :}
   | KW_BY:r
@@ -4509,6 +4541,8 @@ word ::=
   {: RESULT = r.toString(); :}
   | KW_LAST:r
   {: RESULT = r.toString(); :}
+  | KW_LEADING:r
+  {: RESULT = r.toString(); :}
   | KW_LEFT:r
   {: RESULT = r.toString(); :}
   | KW_LEXICAL:r
@@ -4697,6 +4731,8 @@ word ::=
   {: RESULT = r.toString(); :}
   | KW_TINYINT:r
   {: RESULT = r.toString(); :}
+  | KW_TRAILING:r
+  {: RESULT = r.toString(); :}
   | KW_TRUNCATE:r
   {: RESULT = r.toString(); :}
   | KW_STATS:r
diff --git a/fe/src/main/java/org/apache/impala/analysis/TrimFromExpr.java 
b/fe/src/main/java/org/apache/impala/analysis/TrimFromExpr.java
new file mode 100644
index 000000000..c2bc99b6b
--- /dev/null
+++ b/fe/src/main/java/org/apache/impala/analysis/TrimFromExpr.java
@@ -0,0 +1,193 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package org.apache.impala.analysis;
+
+import com.google.common.collect.Lists;
+import com.google.common.base.Enums;
+
+import org.apache.impala.catalog.BuiltinsDb;
+import org.apache.impala.catalog.Type;
+import org.apache.impala.common.AnalysisException;
+
+import java.util.stream.Collectors;
+import java.util.Arrays;
+
+/**
+ * Representation of the following TRIM expressions:
+ * TRIM(<Where> FROM <String_to_be_trimmed>)
+ * TRIM(<Characters> FROM <String_to_be_trimmed>)
+ * TRIM(<Where> <Characters> FROM <String_to_be_trimmed>)
+ * Such forms of TRIM are not handled by FunctionCallExpr.
+ *
+ * Given class serves as a syntax wrapper around FunctionCallExpr, which 
instantiates
+ * a corresponding *trim function with regular syntax based on the arguments.
+ */
+public class TrimFromExpr extends FunctionCallExpr {
+  private final FunctionName trimFromFnName_;
+  private final TrimOption where_;
+  private final Expr charset_;
+  private final Expr srcExpr_;
+  private SlotRef slotRef_ = null;
+
+  // Trim option set.
+  public enum TrimOption {
+    LEADING,
+    TRAILING,
+    BOTH
+  }
+
+  private static FunctionName baseFnName(FunctionName fnName, TrimOption 
where) {
+    String fn = fnName.toString();
+    if (where == null) {
+      // Default to BOTH if where is null
+      where = TrimOption.BOTH;
+    }
+    switch (where) {
+      case LEADING:
+        fn = fn.replaceAll("trim", "ltrim");
+        break;
+      case TRAILING:
+        fn = fn.replaceAll("trim", "rtrim");
+        break;
+      case BOTH:
+      default:
+        fn = fn.replaceAll("trim", "btrim");
+        break;
+    }
+    return new FunctionName(fn);
+  }
+
+  // For "TRIM(<where> FROM <string>)"
+  public TrimFromExpr(FunctionName fnName, TrimOption where, Expr srcexpr) {
+    this(fnName, where, null, srcexpr);
+  }
+
+  // For "TRIM(<charset> FROM <string>)"
+  public TrimFromExpr(FunctionName fnName, Expr charset, Expr srcexpr) {
+    this(fnName, null, charset, srcexpr);
+  }
+
+  // Special case for "TRIM(<charset> FROM <string>)": charset is a simple 
slot reference.
+  public TrimFromExpr(FunctionName fnName, String slot, Expr srcexpr) {
+    this(fnName, null, null, srcexpr);
+    slotRef_ = new SlotRef(Arrays.asList(slot));
+  }
+
+  // For generic "TRIM(<where> <charset> FROM <string>)"
+  public TrimFromExpr(FunctionName fnName, TrimOption where, Expr charset, 
Expr srcexpr) {
+    super(baseFnName(fnName, where),
+        charset == null ? Lists.newArrayList(srcexpr) :
+                          Lists.newArrayList(srcexpr, charset));
+    trimFromFnName_ = fnName;
+    where_ = where;
+    charset_ = charset;
+    srcExpr_ = srcexpr;
+    type_ = Type.STRING;
+  }
+
+  /**
+   * Copy c'tor used in clone().
+   */
+  protected TrimFromExpr(TrimFromExpr other) {
+    super(other);
+    trimFromFnName_ = other.trimFromFnName_;
+    where_ = other.where_;
+    charset_ = other.charset_;
+    srcExpr_ = other.srcExpr_;
+    slotRef_ = other.slotRef_;
+  }
+
+  @Override
+  protected void analyzeImpl(Analyzer analyzer) throws AnalysisException {
+    // Do these sanity checks before calling the super class to get the 
expected error
+    // messages if trim() is used in an invalid way.
+    trimFromFnName_.analyze(analyzer);
+    if (!trimFromFnName_.getFunction().equals("trim")
+        && !trimFromFnName_.getFunction().equals("utf8_trim")) {
+      throw new AnalysisException("Function "
+          + trimFromFnName_.getFunction().toUpperCase()
+          + " does not accept the keyword FROM.");
+    }
+    if ((trimFromFnName_.getDb() != null)
+        && !trimFromFnName_.getDb().equals(BuiltinsDb.NAME)) {
+      throw new AnalysisException("Function "
+          + trimFromFnName_.toString() + " conflicts "
+          + "with the TRIM builtin.");
+    }
+
+    super.analyzeImpl(analyzer);
+
+    if (slotRef_ != null) {
+      try {
+        slotRef_.analyze(analyzer);
+      } catch (AnalysisException e) {
+        // In case 'where' parameter cannot be interpreted as a slot, trim 
option error
+        // is more informative.
+        throw new AnalysisException("Trim option '" + slotRef_.toSql()
+            + "' in expression '" + toSql() + "' is invalid. Expected one of: "
+            + Arrays.stream(TrimOption.values()).map(option -> option.name())
+                .collect(Collectors.joining(", "))
+            + ". Note: TRIM-FROM syntax also accepts a column name identifier 
as a "
+            + "charset argument.");
+      }
+      if (super.getParams().size() == 1) {
+        super.getParams().exprs().add(slotRef_);
+        super.children_.add(slotRef_);
+      }
+    }
+  }
+
+  @Override
+  protected String getFunctionNotFoundError(Type[] argTypes) {
+    StringBuilder errMsg = new StringBuilder();
+    if (charset_ != null && charset_.getType() != Type.STRING) {
+      errMsg.append("Expression '" + charset_.toSql() + "' has a return type 
of "
+                    + charset_.getType().toSql() + " but a STRING is 
required.");
+    }
+    if (slotRef_ != null && slotRef_.getType() != Type.STRING) {
+      errMsg.append("Expression '" + slotRef_.toSql() + "' has a return type 
of "
+                    + slotRef_.getType().toSql() + " but a STRING is 
required.");
+    }
+    if (srcExpr_ != null && srcExpr_.getType() != Type.STRING) {
+      if (errMsg.length() > 0) errMsg.append(" ");
+      errMsg.append("Expression '" + srcExpr_.toSql() + "' has a return type 
of "
+                    + srcExpr_.getType().toSql() + " but a STRING is 
required.");
+    }
+    return errMsg.toString();
+  }
+
+  @Override
+  public String toSqlImpl(ToSqlOptions options) {
+    StringBuilder strBuilder = new StringBuilder();
+    strBuilder.append(trimFromFnName_ + "(");
+    if (where_ != null) {
+      strBuilder.append(where_.name() + " ");
+    }
+    if (slotRef_ != null) {
+      strBuilder.append(slotRef_.toSql(options) + " ");
+    }
+    if (charset_ != null) {
+      strBuilder.append(charset_.toSql(options) + " ");
+    }
+    strBuilder.append("FROM " + srcExpr_.toSql(options) + ")");
+    return strBuilder.toString();
+  }
+
+  @Override
+  public Expr clone() { return new TrimFromExpr(this); }
+}
diff --git a/fe/src/main/jflex/sql-scanner.flex 
b/fe/src/main/jflex/sql-scanner.flex
index 10856e2a3..bfbc3c72f 100644
--- a/fe/src/main/jflex/sql-scanner.flex
+++ b/fe/src/main/jflex/sql-scanner.flex
@@ -81,6 +81,7 @@ import org.apache.impala.thrift.TReservedWordsVersion;
     keywordMap.put("bigint", SqlParserSymbols.KW_BIGINT);
     keywordMap.put("binary", SqlParserSymbols.KW_BINARY);
     keywordMap.put("block_size", SqlParserSymbols.KW_BLOCKSIZE);
+    keywordMap.put("both", SqlParserSymbols.KW_BOTH);
     keywordMap.put("boolean", SqlParserSymbols.KW_BOOLEAN);
     keywordMap.put("buckets", SqlParserSymbols.KW_BUCKETS);
     keywordMap.put("by", SqlParserSymbols.KW_BY);
@@ -180,6 +181,7 @@ import org.apache.impala.thrift.TReservedWordsVersion;
     keywordMap.put("jsonfile", SqlParserSymbols.KW_JSONFILE);
     keywordMap.put("kudu", SqlParserSymbols.KW_KUDU);
     keywordMap.put("last", SqlParserSymbols.KW_LAST);
+    keywordMap.put("leading", SqlParserSymbols.KW_LEADING);
     keywordMap.put("left", SqlParserSymbols.KW_LEFT);
     keywordMap.put("lexical", SqlParserSymbols.KW_LEXICAL);
     keywordMap.put("like", SqlParserSymbols.KW_LIKE);
@@ -276,6 +278,7 @@ import org.apache.impala.thrift.TReservedWordsVersion;
     keywordMap.put("then", SqlParserSymbols.KW_THEN);
     keywordMap.put("timestamp", SqlParserSymbols.KW_TIMESTAMP);
     keywordMap.put("tinyint", SqlParserSymbols.KW_TINYINT);
+    keywordMap.put("trailing", SqlParserSymbols.KW_TRAILING);
     keywordMap.put("to", SqlParserSymbols.KW_TO);
     keywordMap.put("true", SqlParserSymbols.KW_TRUE);
     keywordMap.put("truncate", SqlParserSymbols.KW_TRUNCATE);
diff --git a/fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java 
b/fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
index a8a577202..5c11f4b38 100644
--- a/fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
+++ b/fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
@@ -1902,7 +1902,7 @@ public class AnalyzeExprsTest extends AnalyzerTest {
         "functional_orc_def.complextypes_structs",
         "No matching function with signature: lower(STRUCT<b:BOOLEAN>).");
 
-    // Special cases for FROM in function call
+    // Special cases for FROM in EXTRACT function call
     AnalyzesOk("select extract(year from now())");
     AnalyzesOk("select extract(year from cast(now() as date))");
     AnalyzesOk("select extract(year from date_col) from functional.date_tbl");
@@ -1924,6 +1924,65 @@ public class AnalyzeExprsTest extends AnalyzerTest {
     AnalysisError("select date_part(year from now())",
         "Function DATE_PART does not accept the keyword FROM");
 
+    // Special cases for FROM in TRIM in function call
+    // TRIM(where FROM string): trim space by default
+    AnalyzesOk("select trim(trailing FROM '     &@&127+  &@   ')");
+    AnalyzesOk("select tRIm(trailing FROM '     &@&127+  &@   ')");
+    AnalyzesOk("select utf8_trim(trailing FROM '     &@&127+  &@   ')");
+    AnalysisError("select trim(foo from '     &@&127+  &@   ')",
+        "Trim option 'foo' in expression 'trim(foo FROM '     &@&127+  &@   
')' is " +
+        "invalid. Expected one of: LEADING, TRAILING, BOTH.");
+    AnalysisError("select utf8_trim(foo from '     &@&127+  &@   ')",
+        "Trim option 'foo' in expression 'utf8_trim(foo FROM '     &@&127+  &@ 
  ')' " +
+        "is invalid. Expected one of: LEADING, TRAILING, BOTH.");
+    AnalysisError("select trim(leading from 0)",
+        "Expression '0' has a return type of TINYINT but a STRING is 
required.");
+    AnalysisError("select date_part(both from ' xyz ')",
+        "Function DATE_PART does not accept the keyword FROM.");
+
+    // TRIM(string FROM string): trim from both sides by default
+    AnalyzesOk("select trim('+-*' from '&$^$)*(^*^++--*')");
+    AnalyzesOk("select trim(NULL from '&$^$)*(^*^++--*')");
+    AnalyzesOk("select trim('' from '&$^$)*(^*^++--*')");
+    AnalysisError("select trim('cosmos' from 10)",
+        "Expression '10' has a return type of TINYINT but a STRING is 
required.");
+    AnalysisError("select trim(100 from '  universe  ');",
+        "Expression '100' has a return type of TINYINT but a STRING is 
required.");
+    AnalysisError("select trim(3.1415926 from 2.718281828);",
+        "Expression '3.1415926' has a return type of DECIMAL(8,7) but a STRING 
is " +
+        "required. Expression '2.718281828' has a return type of DECIMAL(10,9) 
but a " +
+        "STRING is required.");
+    AnalysisError("select trim(int_col from 'abc') from functional.alltypes",
+        "Expression 'int_col' has a return type of INT but a STRING is 
required.");
+
+    // TRIM(where string FROM string): regular test cases
+    AnalyzesOk("select trim(trailing 'a0-' from 'c2aa0a+&$%-a00000-a')");
+    AnalyzesOk("select trim(leAdiNG 'rt' From 'rrrrssssstttttt')");
+    AnalyzesOk("select trim(tRAIlinG 'rt' FROM 'rrrrssssstttttt')");
+    AnalysisError("select trim(xyz ' ' from now())",
+        "Syntax error in line 1:\n" +
+        "select trim(xyz ' ' from now())\n" +
+        "                ^\n" +
+        "Encountered: STRING LITERAL\n" +
+        "Expected: AND, BETWEEN, DIV, FROM, IGNORE, ILIKE, IN, IREGEXP, IS, 
LIKE, ||, " +
+        "NOT, OR, REGEXP, RLIKE, COMMA\n\n");
+    AnalysisError("select trim(both ' ' from now())",
+        "Expression 'now()' has a return type of TIMESTAMP but a STRING is 
required.");
+    AnalysisError("select trim(xyz ' ' from ' Quantum ')",
+        "Syntax error in line 1:\n" +
+        "select trim(xyz ' ' from ' Quantum ')\n" +
+        "                ^\n" +
+        "Encountered: STRING LITERAL\n" +
+        "Expected: AND, BETWEEN, DIV, FROM, IGNORE, ILIKE, IN, IREGEXP, IS, 
LIKE, ||, " +
+        "NOT, OR, REGEXP, RLIKE, COMMA\n\n");
+    AnalysisError("select trim(both 6.022140857E23 from '  Avogadro constant 
')",
+        "Expression '6.022140857E+23' has a return type of DECIMAL(24,0) but a 
STRING " +
+        "is required.");
+    AnalysisError("select trim(both 1.6E-19 from 6.62606896E-34)",
+        "Expression '1.6E-19' has a return type of DECIMAL(20,20) but a STRING 
is " +
+        "required. Expression '6.62606896E-34' has a return type of DOUBLE but 
a " +
+        "STRING is required.");
+
     // IGNORE NULLS may only be used with first_value/last_value
     AnalysisError("select lower('FOO' ignore nulls)",
         "Function LOWER does not accept the keyword IGNORE NULLS.");
diff --git a/fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java 
b/fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
index d17465f81..4ef81b33e 100644
--- a/fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
+++ b/fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
@@ -1507,6 +1507,13 @@ public class ToSqlTest extends FrontendTestBase {
     // FunctionCallExpr.
     testToSql("select pi(), (pi()), trim('a'), (trim('a'))",
         "SELECT pi(), (pi()), trim('a'), (trim('a'))");
+    // TrimFromExpr.
+    testToSql("select trim('aa' from 'abc'), (trim('aa' from 'abc')), " +
+        "trim(leading 'aa' from 'abc'), trim(string_col from 'string') " +
+        "from functional.alltypes",
+        "SELECT trim('aa' FROM 'abc'), (trim('aa' FROM 'abc')), " +
+        "trim(LEADING 'aa' FROM 'abc'), trim(string_col FROM 'string') " +
+        "FROM functional.alltypes");
     // LiteralExpr.
     testToSql("select 10, (10), 20.0, (20.0), NULL, (NULL), 'abc', ('abc')",
         "SELECT 10, (10), 20.0, (20.0), NULL, (NULL), 'abc', ('abc')");
diff --git 
a/testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
 
b/testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
index 7a4690113..d506de4c1 100644
--- 
a/testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
+++ 
b/testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
@@ -329,6 +329,23 @@ INT,STRING,STRING
 ====
 ---- QUERY
 set utf8_mode=true;
+select id, name, trim(leading substr(name, 1, 1) from name) from utf8_str_tiny;
+---- RESULTS: RAW_STRING
+1,'张三','三'
+2,'李四','四'
+3,'王五','五'
+4,'李小龙','小龙'
+5,'Alice','lice'
+6,'陈Bob','Bob'
+7,'Бopиc','opиc'
+8,'Jörg','örg'
+9,'ひなた','なた'
+10,'서연','연'
+---- TYPES
+INT,STRING,STRING
+====
+---- QUERY
+set utf8_mode=true;
 select id, name, rtrim(name, substr(name, 2)) from utf8_str_tiny;
 ---- RESULTS: RAW_STRING
 1,'张三','张'
@@ -346,6 +363,23 @@ INT,STRING,STRING
 ====
 ---- QUERY
 set utf8_mode=true;
+select id, name, trim(trailing substr(name, 2) from name) from utf8_str_tiny;
+---- RESULTS: RAW_STRING
+1,'张三','张'
+2,'李四','李'
+3,'王五','王'
+4,'李小龙','李'
+5,'Alice','A'
+6,'陈Bob','陈'
+7,'Бopиc','Б'
+8,'Jörg','J'
+9,'ひなた','ひ'
+10,'서연','서'
+---- TYPES
+INT,STRING,STRING
+====
+---- QUERY
+set utf8_mode=true;
 select id, name, ltrim(name, '张李王小A陈БJひ서') from utf8_str_tiny;
 ---- RESULTS: RAW_STRING
 1,'张三','三'
@@ -363,6 +397,23 @@ INT,STRING,STRING
 ====
 ---- QUERY
 set utf8_mode=true;
+select id, name, trim(leading '张李王小A陈БJひ서' from name) from utf8_str_tiny;
+---- RESULTS: RAW_STRING
+1,'张三','三'
+2,'李四','四'
+3,'王五','五'
+4,'李小龙','龙'
+5,'Alice','lice'
+6,'陈Bob','Bob'
+7,'Бopиc','opиc'
+8,'Jörg','örg'
+9,'ひなた','なた'
+10,'서연','연'
+---- TYPES
+INT,STRING,STRING
+====
+---- QUERY
+set utf8_mode=true;
 select id, name, rtrim(name, '三四五小龙') from utf8_str_tiny;
 ---- RESULTS: RAW_STRING
 1,'张三','张'
@@ -380,6 +431,23 @@ INT,STRING,STRING
 ====
 ---- QUERY
 set utf8_mode=true;
+select id, name, trim(trailing '三四五小龙' from name) from utf8_str_tiny;
+---- RESULTS: RAW_STRING
+1,'张三','张'
+2,'李四','李'
+3,'王五','王'
+4,'李小龙','李'
+5,'Alice','Alice'
+6,'陈Bob','陈Bob'
+7,'Бopиc','Бopиc'
+8,'Jörg','Jörg'
+9,'ひなた','ひなた'
+10,'서연','서연'
+---- TYPES
+INT,STRING,STRING
+====
+---- QUERY
+set utf8_mode=true;
 select id, name, btrim(name, '！？。，：；‘’“”≠≥≤∞ε∑∫√') from utf8_str_tiny;
 ---- RESULTS: RAW_STRING
 1,'张三','张三'
@@ -395,3 +463,117 @@ select id, name, btrim(name, '！？。，：；‘’“”≠≥≤∞ε∑∫
 ---- TYPES
 INT,STRING,STRING
 ====
+---- QUERY
+set utf8_mode=true;
+select id, name, trim('！？。，：；‘’“”≠≥≤∞ε∑∫√' from name) from utf8_str_tiny;
+---- RESULTS: RAW_STRING
+1,'张三','张三'
+2,'李四','李四'
+3,'王五','王五'
+4,'李小龙','李小龙'
+5,'Alice','Alice'
+6,'陈Bob','陈Bob'
+7,'Бopиc','Бopиc'
+8,'Jörg','Jörg'
+9,'ひなた','ひなた'
+10,'서연','서연'
+---- TYPES
+INT,STRING,STRING
+====
+---- QUERY
+set utf8_mode=false;
+select id, name, utf8_trim('！？。，：；‘’“”≠≥≤∞ε∑∫√' from name) from utf8_str_tiny;
+---- RESULTS: RAW_STRING
+1,'张三','张三'
+2,'李四','李四'
+3,'王五','王五'
+4,'李小龙','李小龙'
+5,'Alice','Alice'
+6,'陈Bob','陈Bob'
+7,'Бopиc','Бopиc'
+8,'Jörg','Jörg'
+9,'ひなた','ひなた'
+10,'서연','서연'
+---- TYPES
+INT,STRING,STRING
+====
+---- QUERY
+set utf8_mode=false;
+select id, name, utf8_trim('！？。，：；‘’“”≠≥≤∞ε∑∫√' from name) from utf8_str_tiny;
+---- RESULTS: RAW_STRING
+1,'张三','张三'
+2,'李四','李四'
+3,'王五','王五'
+4,'李小龙','李小龙'
+5,'Alice','Alice'
+6,'陈Bob','陈Bob'
+7,'Бopиc','Бopиc'
+8,'Jörg','Jörg'
+9,'ひなた','ひなた'
+10,'서연','서연'
+---- TYPES
+INT,STRING,STRING
+====
+---- QUERY
+set utf8_mode=true;
+select id, unnest(arr2) as item, trim(unnest(arr2) from "aaYYBBoo") from 
complextypes_arrays order by id, item;
+---- RESULTS: RAW_STRING
+1,'five','aaYYBBoo'
+1,'four','aaYYBB'
+1,'one','aaYYBB'
+1,'three','aaYYBBoo'
+1,'two','aaYYBB'
+2,'five','aaYYBBoo'
+2,'one','aaYYBB'
+2,'three','aaYYBBoo'
+2,'two','aaYYBB'
+2,'NULL','aaYYBBoo'
+3,'ten','aaYYBBoo'
+4,'eight','aaYYBBoo'
+4,'nine','aaYYBBoo'
+4,'ten','aaYYBBoo'
+5,'eleven','aaYYBBoo'
+5,'ten','aaYYBBoo'
+5,'thirteen','aaYYBBoo'
+5,'twelve','aaYYBBoo'
+6,'str1','aaYYBBoo'
+6,'str2','aaYYBBoo'
+9,'str1','aaYYBBoo'
+9,'str2','aaYYBBoo'
+---- TYPES
+INT,STRING,STRING
+====
+---- QUERY
+set utf8_mode=true;
+select id, name, trim(first_value(name) over (order by id) from name) from 
utf8_str_tiny;
+---- RESULTS: RAW_STRING
+1,'张三',''
+2,'李四','李四'
+3,'王五','王五'
+4,'李小龙','李小龙'
+5,'Alice','Alice'
+6,'陈Bob','陈Bob'
+7,'Бopиc','Бopиc'
+8,'Jörg','Jörg'
+9,'ひなた','ひなた'
+10,'서연','서연'
+---- TYPES
+INT,STRING,STRING
+====
+---- QUERY
+set utf8_mode=true;
+select id, name, trim(name from name) from utf8_str_tiny;
+---- RESULTS: RAW_STRING
+1,'张三',''
+2,'李四',''
+3,'王五',''
+4,'李小龙',''
+5,'Alice',''
+6,'陈Bob',''
+7,'Бopиc',''
+8,'Jörg',''
+9,'ひなた',''
+10,'서연',''
+---- TYPES
+INT,STRING,STRING
+====

(impala) branch master updated: IMPALA-889: Add trim() function matching ANSI SQL definition

Reply via email to