Copilot commented on code in PR #38452:
URL: https://github.com/apache/superset/pull/38452#discussion_r2892514234
##########
superset/models/helpers.py:
##########
@@ -218,6 +219,32 @@ def validate_adhoc_subquery(
return parsed_statement.format()
+def validate_rls_clause(
+ sql: str,
+ engine: str,
+) -> None:
+ """
+ Ensure the RLS clause is a valid SQL fragment.
+ This is a lighter validation than validate_adhoc_subquery as it
+ is intended for administrator-defined security rules.
+
+ :param sql: RLS clause expression
+ :raise SupersetParseError if sql is syntactically invalid
+ """
+ from superset.sql.parse import SQLStatement
+
+ # We just need to ensure it parses correctly as a predicate/expression
+ parsed_statement = SQLStatement(sql, engine)
+ if parsed_statement.has_subquery():
+ raise SupersetSecurityException(
+ SupersetError(
+ error_type=SupersetErrorType.ADHOC_SUBQUERY_NOT_ALLOWED_ERROR,
+ message=_("Custom SQL fields cannot contain sub-queries."),
+ level=ErrorLevel.ERROR,
+ )
+ )
+
+
Review Comment:
The `validate_rls_clause` function is defined but never called anywhere in
the codebase. It is not imported or referenced by any file, including the RLS
API (`superset/row_level_security/api.py`), the RLS commands
(`superset/commands/security/create.py`, `update.py`), or the
`get_sqla_row_level_filters` method in `superset/connectors/sqla/models.py`.
For this feature to actually prevent subqueries in RLS clauses, the function
must be wired into the RLS rule creation and update flow (e.g., called in
`CreateRLSRuleCommand.validate()` or `UpdateRLSRuleCommand.validate()`).
Without this integration, adding the function provides no security benefit.
```suggestion
```
##########
superset/sql/parse.py:
##########
@@ -878,14 +878,31 @@ def has_subquery(self) -> bool:
:return: True if the statement has a subquery.
"""
- return bool(self._parsed.find(exp.Subquery)) or (
- isinstance(self._parsed, exp.Select)
- and any(
- isinstance(expression, exp.Select)
- for expression in self._parsed.walk()
- if expression != self._parsed
- )
- )
+ # catch nested subqueries
+ if bool(self._parsed.find(exp.Subquery)):
+ return True
+
+ # recursive walker to catch all forms of sub-statements
+ for node in self._parsed.walk():
+ # don't catch the root Select/Union etc.
+ if node == self._parsed:
+ continue
+ if isinstance(
+ node,
+ (
+ exp.Select,
+ exp.Union,
+ exp.Except,
+ exp.Intersect,
+ exp.Update,
+ exp.Delete,
+ exp.Insert,
+ exp.Merge,
+ ),
Review Comment:
In the `isinstance` check, `exp.Except` and `exp.Intersect` are both
subclasses of `exp.Union` in sqlglot. Including all three in the tuple is
redundant: any instance of `exp.Except` or `exp.Intersect` will already match
`isinstance(node, exp.Union)`. The redundant entries can be removed to simplify
the code.
##########
tests/unit_tests/sql/parse_tests.py:
##########
@@ -3134,3 +3134,28 @@ def test_backtick_invalid_sql_still_fails() -> None:
sql = "SELECT * FROM `table` WHERE"
with pytest.raises(SupersetParseError):
SQLScript(sql, "base")
+
+
+def test_extract_tables_raw_expression() -> None:
+ """
+ Test that tables inside raw expressions (like RLS clauses) are parsed
correctly.
+ """
+ # basic subquery inside raw expression
+ assert extract_tables_from_sql(
+ "bot_profile__name = 'giphy' OR "
+ "EXISTS (SELECT 1 FROM public.birth_names WHERE gender='girl')"
+ ) == {Table("birth_names", "public")}
+
+ # nested subquery with order by and limit
+ assert extract_tables_from_sql(
+ "bot_profile__name = 'giphy' OR EXISTS (SELECT 1 FROM (SELECT
table_name "
+ "FROM information_schema.tables WHERE table_schema='public' ORDER BY "
+ "table_name ASC LIMIT 1 OFFSET 0) AS t WHERE "
+ "ascii(substring(t.table_name, 1, 1)) = 32)"
+ ) == {Table("tables", "information_schema")}
+
+ # ensure CTE with the same name as the table does not filter out the
actual table
+ assert extract_tables_from_sql(
+ "bot_profile__name = 'giphy' OR EXISTS (WITH birth_names AS (SELECT 1 "
+ "FROM public.birth_names) SELECT 1 FROM birth_names)"
+ ) == {Table("birth_names", "public")}
Review Comment:
The new tests in `test_extract_tables_raw_expression` test the
`extract_tables_from_sql` helper which calls `extract_tables_from_statement`.
However, the PR's core fix is about `has_subquery()` correctly detecting
subqueries in raw SQL expressions like RLS predicates. There are no new test
cases added to `test_has_subquery` covering raw expressions (e.g., `"EXISTS
(SELECT 1 FROM some_table)"` as a standalone predicate). Without these test
cases, the regression being described in the PR (subqueries in RLS filters not
being detected) is not actually tested.
##########
superset/models/helpers.py:
##########
@@ -191,6 +190,8 @@ def validate_adhoc_subquery(
default_schema: str,
engine: str,
) -> str:
+ from superset.sql.parse import SQLStatement
+
Review Comment:
The local import `from superset.sql.parse import SQLStatement` was inserted
before the existing docstring of `validate_adhoc_subquery`, which means the
docstring is no longer recognized as a Python docstring (it becomes an
unreachable string literal instead). In Python, the docstring must be the first
statement in the function body. The import should be moved after the docstring,
or the docstring should be placed before any executable statements.
##########
superset/sql/parse.py:
##########
@@ -878,14 +878,31 @@ def has_subquery(self) -> bool:
:return: True if the statement has a subquery.
"""
- return bool(self._parsed.find(exp.Subquery)) or (
- isinstance(self._parsed, exp.Select)
- and any(
- isinstance(expression, exp.Select)
- for expression in self._parsed.walk()
- if expression != self._parsed
- )
- )
+ # catch nested subqueries
+ if bool(self._parsed.find(exp.Subquery)):
+ return True
+
+ # recursive walker to catch all forms of sub-statements
+ for node in self._parsed.walk():
+ # don't catch the root Select/Union etc.
+ if node == self._parsed:
+ continue
+ if isinstance(
+ node,
+ (
+ exp.Select,
+ exp.Union,
+ exp.Except,
+ exp.Intersect,
+ exp.Update,
+ exp.Delete,
+ exp.Insert,
+ exp.Merge,
+ ),
+ ):
+ return True
Review Comment:
The new `has_subquery` implementation has a behavioral regression for
top-level UNION/INTERSECT/EXCEPT queries. When the root AST node is `exp.Union`
(e.g., `SELECT * FROM t1 UNION SELECT * FROM t2`), the walk will visit both
constituent `exp.Select` child nodes. Since they are not equal to
`self._parsed` (the Union root), they match `isinstance(node, exp.Select)` and
the method incorrectly returns `True`.
Under the old implementation, the second condition required
`isinstance(self._parsed, exp.Select)`, which meant top-level UNION queries
returned `False`. Any adhoc SQL expressions containing a top-level UNION (even
if technically valid) would now be incorrectly classified as subqueries and
could be blocked when `ALLOW_ADHOC_SUBQUERY` is disabled.
A fix could be to also skip the direct children of the root when the root is
a Union/Except/Intersect, or to check whether the `exp.Select` or `exp.Union`
node is actually nested (i.e., appears as a subquery context rather than a
sibling select in a set operation). One approach would be to check if the
matched node is contained within an `exp.Subquery`, `exp.In`, `exp.Exists`, or
similar construct.
##########
tests/unit_tests/sql/parse_tests.py:
##########
@@ -3134,3 +3134,28 @@ def test_backtick_invalid_sql_still_fails() -> None:
sql = "SELECT * FROM `table` WHERE"
with pytest.raises(SupersetParseError):
SQLScript(sql, "base")
+
+
+def test_extract_tables_raw_expression() -> None:
+ """
+ Test that tables inside raw expressions (like RLS clauses) are parsed
correctly.
+ """
+ # basic subquery inside raw expression
+ assert extract_tables_from_sql(
+ "bot_profile__name = 'giphy' OR "
+ "EXISTS (SELECT 1 FROM public.birth_names WHERE gender='girl')"
+ ) == {Table("birth_names", "public")}
+
+ # nested subquery with order by and limit
+ assert extract_tables_from_sql(
+ "bot_profile__name = 'giphy' OR EXISTS (SELECT 1 FROM (SELECT
table_name "
+ "FROM information_schema.tables WHERE table_schema='public' ORDER BY "
+ "table_name ASC LIMIT 1 OFFSET 0) AS t WHERE "
+ "ascii(substring(t.table_name, 1, 1)) = 32)"
+ ) == {Table("tables", "information_schema")}
+
+ # ensure CTE with the same name as the table does not filter out the
actual table
+ assert extract_tables_from_sql(
+ "bot_profile__name = 'giphy' OR EXISTS (WITH birth_names AS (SELECT 1 "
+ "FROM public.birth_names) SELECT 1 FROM birth_names)"
+ ) == {Table("birth_names", "public")}
Review Comment:
The PR description states "Enhanced extract_tables_from_statement to
explicitly traverse subquery nodes when standard scope traversal yields no
results", but no changes were made to `extract_tables_from_statement` in the
diff. The new tests in `test_extract_tables_raw_expression` test the table
extraction behavior for raw SQL expressions but the implementation was not
actually modified. This could mean the tests are either verifying
already-working behavior (making the tests useful as non-regression tests) or
that the tests may fail due to the described enhancement not being implemented.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]