kaxil commented on PR #62850:
URL: https://github.com/apache/airflow/pull/62850#issuecomment-4000541971
## Code Review
### [HIGH] Debug `print()` left in `_get_schema`
```python
def _get_schema(self, table_name: str) -> str:
engine = self._get_engine()
if table_name not in engine.registered_tables:
return json.dumps({"error": f"Table {table_name!r} is not
available"})
print(engine.get_schema(table_name)) # <-- debug leftover
return engine.get_schema(table_name)
```
This `print()` leaks schema info to stdout in production and calls
`engine.get_schema()` twice per invocation. Remove it.
---
### [HIGH] `_get_schema` returns non-JSON on success, JSON on error
`DataFusionEngine.get_schema()` returns `str(arrow_schema)` — a plain-text
Arrow schema representation (e.g., `"id: int64\namount: float64"`). But the
error branch returns `json.dumps({"error": ...})`. The LLM gets inconsistent
response formats from the same tool.
The test masks this by mocking `engine.get_schema.return_value =
json.dumps([...])` — in production, `DataFusionEngine.get_schema` does NOT
return JSON.
**Fix**: Parse the Arrow schema into structured JSON to match the error path
and `SQLToolset`'s pattern:
```python
table = engine.session_context.table(table_name)
schema = table.schema()
columns = [{"name": f.name, "type": str(f.type)} for f in schema]
return json.dumps(columns)
```
At minimum, the test should use a realistic return value matching what
`DataFusionEngine.get_schema` actually returns.
---
### [MEDIUM] Bare `except Exception` in `_query` swallows programming errors
```python
except Exception as ex:
return json.dumps({"error": str(ex), "query": sql})
```
This catches everything including `TypeError`, `AttributeError`, etc. —
programming bugs in the toolset itself get silently returned as JSON to the LLM
rather than surfacing for debugging. `SQLToolset._query` does not do this.
Catch only expected exceptions (`SQLSafetyError`, query execution errors) or
at minimum log the exception before returning it.
---
### [MEDIUM] Inconsistent error handling across the three tools
`_query` catches exceptions → returns error JSON. `_list_tables` and
`_get_schema` let exceptions propagate uncaught. Within the same toolset, the
LLM gets different error behavior depending on which tool fails. Pick one
approach and apply it consistently.
---
### [MEDIUM] `allow_writes` should be keyword-only
```python
def __init__(
self,
datasource_configs: list[DataSourceConfig],
allow_writes: bool = False, # positional
*,
max_rows: int = 50,
) -> None:
```
`SQLToolset` makes `allow_writes` keyword-only (after `*`). For a
security-sensitive parameter that enables data-modifying SQL, positional usage
is error-prone — `DataFusionToolset([cfg], True)` reads poorly. Move it after
`*` to match `SQLToolset`.
---
### [LOW] Test name `test_id_includes_sorted_table_names` is misleading
The test asserts `ts.id == "sql_datafusion_beta_alpha"` — that's insertion
order, not sorted order. Either sort the table names in the `id` property, or
rename the test to `test_id_includes_table_names_in_order`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]