kaxil opened a new pull request, #62785:
URL: https://github.com/apache/airflow/pull/62785

   Part of AIP-99 (https://github.com/orgs/apache/projects/586): toolsets that expose Airflow hooks as pydantic-ai agent tools.
   
   - **HookToolset** — generic adapter that exposes any Airflow Hook's methods as pydantic-ai tools via signature introspection. Requires an explicit `allowed_methods` list (no auto-discovery). Builds JSON Schema from method signatures and enriches tool descriptions from docstrings (Sphinx and Google style).
   - **SQLToolset** — curated four-tool database toolset (`list_tables`, `get_schema`, `query`, `check_query`) wrapping `DbApiHook`. Read-only by default with SQL validation, `allowed_tables` metadata filtering, and `max_rows` truncation.
   
   Both implement pydantic-ai's `AbstractToolset` interface.
   
   ## Design rationale
   
   **Why custom introspection instead of pydantic-ai's `_function_schema`?** 
Hook methods are bound methods with `self`, decorators like 
`@provide_bucket_name`, and complex signatures. Our lightweight approach
   (`inspect.signature` + `get_type_hints`) avoids coupling to pydantic-ai 
internals.
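
   As a rough sketch of that approach (all names here are illustrative, not the actual implementation), building a JSON Schema from a bound method's signature looks something like:

```python
import inspect
from typing import get_type_hints

# Minimal mapping from Python annotations to JSON Schema types; unknown
# annotations fall back to "string" in this sketch.
_TYPE_MAP = {int: "integer", float: "number", str: "string", bool: "boolean"}


def build_json_schema(method) -> dict:
    """Build a JSON Schema object describing a bound method's parameters."""
    sig = inspect.signature(method)
    hints = get_type_hints(method)
    properties, required = {}, []
    for name, param in sig.parameters.items():
        if name == "self":  # bound methods omit self, but guard anyway
            continue
        properties[name] = {"type": _TYPE_MAP.get(hints.get(name), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {"type": "object", "properties": properties, "required": required}


class DemoHook:
    def fetch(self, endpoint: str, timeout: int = 30) -> str:
        """Fetch a resource."""
        return endpoint


schema = build_json_schema(DemoHook().fetch)
```

   Because this only relies on `inspect` and `typing`, it keeps working regardless of how pydantic-ai's private `_function_schema` evolves.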
   
   **Why `sequential=True` on all tool definitions?** Hook methods perform 
synchronous I/O and share connection state. Concurrent execution would be 
unsafe.
   
   **Why `allowed_tables` is metadata-only, not query-level validation?** 
Parsing SQL for table references (CTEs, subqueries, aliases, vendor-specific 
syntax) is complex and error-prone. We chose not to provide a
    false sense of security. Real access control belongs at the DB permission 
level.
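
   To illustrate what "metadata-only" means (a hypothetical sketch, not the shipped code): the filter trims what the discovery tools return and does nothing else:

```python
# Hypothetical sketch: `allowed_tables` filters discovery output
# (list_tables / get_schema); it never inspects SQL text.
def filter_tables(all_tables, allowed_tables):
    """Return only tables present in the allowlist (case-insensitive)."""
    if allowed_tables is None:
        return list(all_tables)
    allowed = {t.lower() for t in allowed_tables}
    return [t for t in all_tables if t.lower() in allowed]


visible = filter_tables(["customers", "orders", "secrets"], ["customers", "orders"])
# A query like "SELECT * FROM secrets" would still reach the database.
```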
   
   **Why HookToolset requires explicit `allowed_methods`?** Auto-discovery 
would expose every public method on a hook (including `run()`, 
`get_connection()`, etc.), giving an LLM broad unintended access. Explicit
    listing forces DAG authors to think about the blast radius.
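
   A minimal sketch of that contract (hypothetical names, not the real class): the adapter validates the allowlist up front instead of discovering methods:

```python
# Illustrative sketch of the explicit-allowlist contract: the adapter
# refuses to guess which hook methods to expose.
class HookToolsetSketch:
    def __init__(self, hook, allowed_methods):
        if not allowed_methods:
            raise ValueError("allowed_methods must be a non-empty list")
        missing = [m for m in allowed_methods if not callable(getattr(hook, m, None))]
        if missing:
            raise ValueError(f"hook has no callable method(s): {missing}")
        # Only the listed methods become tools.
        self.tools = {m: getattr(hook, m) for m in allowed_methods}
```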
   
   ## Usage
   
   ```python
   from airflow.providers.common.ai.toolsets.hook import HookToolset
   from airflow.providers.common.ai.toolsets.sql import SQLToolset
   from airflow.providers.http.hooks.http import HttpHook

   # SQL toolset — 4 curated tools for database access
   sql_tools = SQLToolset(
       db_conn_id="postgres_default",
       allowed_tables=["customers", "orders"],
       max_rows=20,
   )

   # Hook toolset — wrap any hook's methods as tools
   http_tools = HookToolset(
       HttpHook(http_conn_id="my_api"),
       allowed_methods=["run"],
       tool_name_prefix="http_",
   )
   ```
   
   ## Gotchas / Tradeoffs
   
   - `allowed_tables` hides tables from `list_tables`/`get_schema` but does NOT 
parse SQL queries. An LLM can `SELECT * FROM secrets` if it guesses the name. 
Use DB permissions for real access control.
   - `HookToolset` exposes whatever methods you list — the agent controls the arguments. Don't expose credential-leaking methods such as `get_connection()`, and think twice before exposing something as powerful as `DbApiHook.run()` (arbitrary SQL).
   - `allow_writes=False` (default) validates SQL through `validate_sql()` and 
rejects INSERT/UPDATE/DELETE/DROP.
   - SQLToolset lazy-resolves the `DbApiHook` on first use via 
`BaseHook.get_connection(conn_id).get_hook()`. Non-DbApiHook connections raise 
`ValueError`.
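
   For illustration, a read-only gate in the spirit of `validate_sql()` might look like the following (a simplified sketch; the actual validation may differ):

```python
import re

# Simplified read-only gate: reject statements whose first keyword is a
# write/DDL verb. A real implementation would also handle statement
# batches, comments, and vendor-specific syntax.
_WRITE_KEYWORDS = {"insert", "update", "delete", "drop", "alter", "truncate", "create"}


def is_read_only(sql: str) -> bool:
    """Return True if the statement's leading keyword is not a write verb."""
    first = re.split(r"\s+", sql.strip().lstrip("("))[0].lower()
    return first not in _WRITE_KEYWORDS
```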
   
   
   

