Re: [PR] AIP-99: Add DataFusionToolset [airflow]

via GitHub Wed, 04 Mar 2026 13:54:53 -0800


kaxil commented on PR #62850:
URL: https://github.com/apache/airflow/pull/62850#issuecomment-4000541971


   ## Code Review
   
   ### [HIGH] Debug `print()` left in `_get_schema`
   
   ```python
   def _get_schema(self, table_name: str) -> str:
       engine = self._get_engine()
       if table_name not in engine.registered_tables:
           return json.dumps({"error": f"Table {table_name!r} is not 
available"})
       print(engine.get_schema(table_name))  # <-- debug leftover
       return engine.get_schema(table_name)
   ```
   
   This `print()` leaks schema info to stdout in production and calls 
`engine.get_schema()` twice per invocation. Remove it.
   
   ---
   
   ### [HIGH] `_get_schema` returns non-JSON on success, JSON on error
   
   `DataFusionEngine.get_schema()` returns `str(arrow_schema)` — a plain-text 
Arrow schema representation (e.g., `"id: int64\namount: float64"`). But the 
error branch returns `json.dumps({"error": ...})`. The LLM gets inconsistent 
response formats from the same tool.
   
   The test masks this by mocking `engine.get_schema.return_value = 
json.dumps([...])` — in production, `DataFusionEngine.get_schema` does NOT 
return JSON.
   
   **Fix**: Parse the Arrow schema into structured JSON to match the error path 
and `SQLToolset`'s pattern:
   
   ```python
   table = engine.session_context.table(table_name)
   schema = table.schema()
   columns = [{"name": f.name, "type": str(f.type)} for f in schema]
   return json.dumps(columns)
   ```
   
   At minimum, the test should use a realistic return value matching what 
`DataFusionEngine.get_schema` actually returns.
   
   ---
   
   ### [MEDIUM] Bare `except Exception` in `_query` swallows programming errors
   
   ```python
   except Exception as ex:
       return json.dumps({"error": str(ex), "query": sql})
   ```
   
   This catches everything including `TypeError`, `AttributeError`, etc. — 
programming bugs in the toolset itself get silently returned as JSON to the LLM 
rather than surfacing for debugging. `SQLToolset._query` does not do this.
   
   Catch only expected exceptions (`SQLSafetyError`, query execution errors) or 
at minimum log the exception before returning it.
   
   ---
   
   ### [MEDIUM] Inconsistent error handling across the three tools
   
   `_query` catches exceptions → returns error JSON. `_list_tables` and 
`_get_schema` let exceptions propagate uncaught. Within the same toolset, the 
LLM gets different error behavior depending on which tool fails. Pick one 
approach and apply it consistently.
   
   ---
   
   ### [MEDIUM] `allow_writes` should be keyword-only
   
   ```python
   def __init__(
       self,
       datasource_configs: list[DataSourceConfig],
       allow_writes: bool = False,  # positional
       *,
       max_rows: int = 50,
   ) -> None:
   ```
   
   `SQLToolset` makes `allow_writes` keyword-only (after `*`). For a 
security-sensitive parameter that enables data-modifying SQL, positional usage 
is error-prone — `DataFusionToolset([cfg], True)` reads poorly. Move it after 
`*` to match `SQLToolset`.
   
   ---
   
   ### [LOW] Test name `test_id_includes_sorted_table_names` is misleading
   
   The test asserts `ts.id == "sql_datafusion_beta_alpha"` — that's insertion 
order, not sorted order. Either sort the table names in the `id` property, or 
rename the test to `test_id_includes_table_names_in_order`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] AIP-99: Add DataFusionToolset [airflow]

Reply via email to