Copilot commented on code in PR #38452:
URL: https://github.com/apache/superset/pull/38452#discussion_r2892514234


##########
superset/models/helpers.py:
##########
@@ -218,6 +219,32 @@ def validate_adhoc_subquery(
     return parsed_statement.format()
 
 
+def validate_rls_clause(
+    sql: str,
+    engine: str,
+) -> None:
+    """
+    Ensure the RLS clause is a valid SQL fragment.
+    This is a lighter validation than validate_adhoc_subquery as it
+    is intended for administrator-defined security rules.
+
+    :param sql: RLS clause expression
+    :raise SupersetParseError if sql is syntactically invalid
+    """
+    from superset.sql.parse import SQLStatement
+
+    # We just need to ensure it parses correctly as a predicate/expression
+    parsed_statement = SQLStatement(sql, engine)
+    if parsed_statement.has_subquery():
+        raise SupersetSecurityException(
+            SupersetError(
+                error_type=SupersetErrorType.ADHOC_SUBQUERY_NOT_ALLOWED_ERROR,
+                message=_("Custom SQL fields cannot contain sub-queries."),
+                level=ErrorLevel.ERROR,
+            )
+        )
+
+

Review Comment:
   The `validate_rls_clause` function is defined but never called anywhere in 
the codebase. It is not imported or referenced by any file, including the RLS 
API (`superset/row_level_security/api.py`), the RLS commands 
(`superset/commands/security/create.py`, `update.py`), or the 
`get_sqla_row_level_filters` method in `superset/connectors/sqla/models.py`.
   
   For this feature to actually prevent subqueries in RLS clauses, the function 
must be wired into the RLS rule creation and update flow (e.g., called in 
`CreateRLSRuleCommand.validate()` or `UpdateRLSRuleCommand.validate()`). 
Without this integration, adding the function provides no security benefit.
   ```suggestion
   
   ```



##########
superset/sql/parse.py:
##########
@@ -878,14 +878,31 @@ def has_subquery(self) -> bool:
 
         :return: True if the statement has a subquery.
         """
-        return bool(self._parsed.find(exp.Subquery)) or (
-            isinstance(self._parsed, exp.Select)
-            and any(
-                isinstance(expression, exp.Select)
-                for expression in self._parsed.walk()
-                if expression != self._parsed
-            )
-        )
+        # catch nested subqueries
+        if bool(self._parsed.find(exp.Subquery)):
+            return True
+
+        # recursive walker to catch all forms of sub-statements
+        for node in self._parsed.walk():
+            # don't catch the root Select/Union etc.
+            if node == self._parsed:
+                continue
+            if isinstance(
+                node,
+                (
+                    exp.Select,
+                    exp.Union,
+                    exp.Except,
+                    exp.Intersect,
+                    exp.Update,
+                    exp.Delete,
+                    exp.Insert,
+                    exp.Merge,
+                ),

Review Comment:
   In the `isinstance` check, `exp.Except` and `exp.Intersect` are both 
subclasses of `exp.Union` in sqlglot. Including all three in the tuple is 
redundant: any instance of `exp.Except` or `exp.Intersect` will already match 
`isinstance(node, exp.Union)`. The redundant entries can be removed to simplify 
the code.



##########
tests/unit_tests/sql/parse_tests.py:
##########
@@ -3134,3 +3134,28 @@ def test_backtick_invalid_sql_still_fails() -> None:
     sql = "SELECT * FROM `table` WHERE"
     with pytest.raises(SupersetParseError):
         SQLScript(sql, "base")
+
+
+def test_extract_tables_raw_expression() -> None:
+    """
+    Test that tables inside raw expressions (like RLS clauses) are parsed 
correctly.
+    """
+    # basic subquery inside raw expression
+    assert extract_tables_from_sql(
+        "bot_profile__name = 'giphy' OR "
+        "EXISTS (SELECT 1 FROM public.birth_names WHERE gender='girl')"
+    ) == {Table("birth_names", "public")}
+
+    # nested subquery with order by and limit
+    assert extract_tables_from_sql(
+        "bot_profile__name = 'giphy' OR EXISTS (SELECT 1 FROM (SELECT 
table_name "
+        "FROM information_schema.tables WHERE table_schema='public' ORDER BY "
+        "table_name ASC LIMIT 1 OFFSET 0) AS t WHERE "
+        "ascii(substring(t.table_name, 1, 1)) = 32)"
+    ) == {Table("tables", "information_schema")}
+
+    # ensure CTE with the same name as the table does not filter out the 
actual table
+    assert extract_tables_from_sql(
+        "bot_profile__name = 'giphy' OR EXISTS (WITH birth_names AS (SELECT 1 "
+        "FROM public.birth_names) SELECT 1 FROM birth_names)"
+    ) == {Table("birth_names", "public")}

Review Comment:
   The new tests in `test_extract_tables_raw_expression` test the 
`extract_tables_from_sql` helper which calls `extract_tables_from_statement`. 
However, the PR's core fix is about `has_subquery()` correctly detecting 
subqueries in raw SQL expressions like RLS predicates. There are no new test 
cases added to `test_has_subquery` covering raw expressions (e.g., `"EXISTS 
(SELECT 1 FROM some_table)"` as a standalone predicate). Without these test 
cases, the regression being described in the PR (subqueries in RLS filters not 
being detected) is not actually tested.



##########
superset/models/helpers.py:
##########
@@ -191,6 +190,8 @@ def validate_adhoc_subquery(
     default_schema: str,
     engine: str,
 ) -> str:
+    from superset.sql.parse import SQLStatement
+

Review Comment:
   The local import `from superset.sql.parse import SQLStatement` was inserted 
before the existing docstring of `validate_adhoc_subquery`, which means the 
docstring is no longer recognized as a Python docstring (it becomes an 
unreachable string literal instead). In Python, the docstring must be the first 
statement in the function body. The import should be moved after the docstring, 
or the docstring should be placed before any executable statements.



##########
superset/sql/parse.py:
##########
@@ -878,14 +878,31 @@ def has_subquery(self) -> bool:
 
         :return: True if the statement has a subquery.
         """
-        return bool(self._parsed.find(exp.Subquery)) or (
-            isinstance(self._parsed, exp.Select)
-            and any(
-                isinstance(expression, exp.Select)
-                for expression in self._parsed.walk()
-                if expression != self._parsed
-            )
-        )
+        # catch nested subqueries
+        if bool(self._parsed.find(exp.Subquery)):
+            return True
+
+        # recursive walker to catch all forms of sub-statements
+        for node in self._parsed.walk():
+            # don't catch the root Select/Union etc.
+            if node == self._parsed:
+                continue
+            if isinstance(
+                node,
+                (
+                    exp.Select,
+                    exp.Union,
+                    exp.Except,
+                    exp.Intersect,
+                    exp.Update,
+                    exp.Delete,
+                    exp.Insert,
+                    exp.Merge,
+                ),
+            ):
+                return True

Review Comment:
   The new `has_subquery` implementation has a behavioral regression for 
top-level UNION/INTERSECT/EXCEPT queries. When the root AST node is `exp.Union` 
(e.g., `SELECT * FROM t1 UNION SELECT * FROM t2`), the walk will visit both 
constituent `exp.Select` child nodes. Since they are not equal to 
`self._parsed` (the Union root), they match `isinstance(node, exp.Select)` and 
the method incorrectly returns `True`.
   
   Under the old implementation, the second condition required 
`isinstance(self._parsed, exp.Select)`, which meant top-level UNION queries 
returned `False`. Any adhoc SQL expressions containing a top-level UNION (even 
if technically valid) would now be incorrectly classified as subqueries and 
could be blocked when `ALLOW_ADHOC_SUBQUERY` is disabled.
   
   A fix could be to also skip the direct children of the root when the root is 
a Union/Except/Intersect, or to check whether the `exp.Select` or `exp.Union` 
node is actually nested (i.e., appears as a subquery context rather than a 
sibling select in a set operation). One approach would be to check if the 
matched node is contained within an `exp.Subquery`, `exp.In`, `exp.Exists`, or 
similar construct.



##########
tests/unit_tests/sql/parse_tests.py:
##########
@@ -3134,3 +3134,28 @@ def test_backtick_invalid_sql_still_fails() -> None:
     sql = "SELECT * FROM `table` WHERE"
     with pytest.raises(SupersetParseError):
         SQLScript(sql, "base")
+
+
+def test_extract_tables_raw_expression() -> None:
+    """
+    Test that tables inside raw expressions (like RLS clauses) are parsed 
correctly.
+    """
+    # basic subquery inside raw expression
+    assert extract_tables_from_sql(
+        "bot_profile__name = 'giphy' OR "
+        "EXISTS (SELECT 1 FROM public.birth_names WHERE gender='girl')"
+    ) == {Table("birth_names", "public")}
+
+    # nested subquery with order by and limit
+    assert extract_tables_from_sql(
+        "bot_profile__name = 'giphy' OR EXISTS (SELECT 1 FROM (SELECT 
table_name "
+        "FROM information_schema.tables WHERE table_schema='public' ORDER BY "
+        "table_name ASC LIMIT 1 OFFSET 0) AS t WHERE "
+        "ascii(substring(t.table_name, 1, 1)) = 32)"
+    ) == {Table("tables", "information_schema")}
+
+    # ensure CTE with the same name as the table does not filter out the 
actual table
+    assert extract_tables_from_sql(
+        "bot_profile__name = 'giphy' OR EXISTS (WITH birth_names AS (SELECT 1 "
+        "FROM public.birth_names) SELECT 1 FROM birth_names)"
+    ) == {Table("birth_names", "public")}

Review Comment:
   The PR description states "Enhanced extract_tables_from_statement to 
explicitly traverse subquery nodes when standard scope traversal yields no 
results", but no changes were made to `extract_tables_from_statement` in the 
diff. The new tests in `test_extract_tables_raw_expression` test the table 
extraction behavior for raw SQL expressions but the implementation was not 
actually modified. This could mean the tests are either verifying 
already-working behavior (making the tests useful as non-regression tests) or 
that the tests may fail due to the described enhancement not being implemented.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to