cetingokhan opened a new pull request, #62963:
URL: https://github.com/apache/airflow/pull/62963

   ## AIP-99 LLMDataQualityOperator
   
   This pull request introduces a new **LLMDataQualityOperator** that generates 
and executes data-quality checks from natural-language prompts using an LLM, 
along with supporting utilities for database/schema introspection and example 
usage. The changes add an operator for data-quality validation and enable 
schema context resolution for both relational and object-storage sources.
   
   
   ### How It Works
   **Plan Generation (LLM-backed)**: The operator accepts a prompts dict 
mapping check names to natural-language expectations (e.g. "email_nulls": "Less 
than 5% of emails should be null"). It introspects the target database schema 
and sends prompts + schema context to the configured LLM, which produces a 
DQPlan — a set of optimised SQL query groups.
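
To make the inputs concrete, here is a minimal sketch of how a prompts dict and schema context might be combined into a single LLM request. `build_llm_request`, the `schema_context` shape, and the second prompt are hypothetical names chosen for illustration; the operator's actual prompt format may differ.

```python
# Check names mapped to natural-language expectations.
prompts = {
    "email_nulls": "Less than 5% of emails should be null",
    "row_count": "The table should not be empty",  # hypothetical extra check
}

# Hypothetical schema context: table name -> {column name: column type}.
schema_context = {"customers": {"id": "INTEGER", "email": "TEXT"}}


def build_llm_request(prompts: dict[str, str], schema: dict[str, dict[str, str]]) -> str:
    """Render schema and named checks as one text request (illustrative only)."""
    schema_lines = []
    for table, columns in schema.items():
        column_list = ", ".join(f"{name} {col_type}" for name, col_type in columns.items())
        schema_lines.append(f"{table}({column_list})")
    check_lines = [f"- {name}: {text}" for name, text in prompts.items()]
    return "Schema:\n" + "\n".join(schema_lines) + "\n\nChecks:\n" + "\n".join(check_lines)


request = build_llm_request(prompts, schema_context)
```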
   
   **Plan Caching**: Generated plans are serialised and stored in an Airflow 
Variable (key: dq_plan_<version>_<sha256[:16]>). The cache key is computed from a 
sorted serialisation of prompts + prompt_version, so it is independent of the 
order in which prompts are listed and changes whenever the prompts or 
prompt_version change. This avoids redundant LLM calls on reruns.
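
A minimal sketch of how such a key could be derived, assuming JSON with sorted keys as the serialisation. `plan_cache_key` is a hypothetical helper, not necessarily the function used in this PR.

```python
import hashlib
import json


def plan_cache_key(prompts: dict[str, str], prompt_version: str) -> str:
    """Build a key of the form dq_plan_<version>_<sha256[:16]>."""
    # sort_keys=True makes the serialisation independent of dict insertion order.
    payload = json.dumps(
        {"prompts": prompts, "prompt_version": prompt_version}, sort_keys=True
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"dq_plan_{prompt_version}_{digest[:16]}"


key_a = plan_cache_key({"a": "no nulls in id", "b": "row count > 0"}, "v1")
key_b = plan_cache_key({"b": "row count > 0", "a": "no nulls in id"}, "v1")
key_c = plan_cache_key({"a": "no nulls in id", "b": "row count > 0"}, "v2")
```

Here `key_a` and `key_b` are identical because only the prompt order differs, while `key_c` differs because the version changed.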
   
   **Execution**: Each SQL group is executed against the target DB via a 
DbApiHook. Results are collected per check name into a results_map.
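
The execution step can be sketched as follows. This uses the standard-library `sqlite3` module as a stand-in for a DbApiHook connection, and the group layout (`sql` plus `checks`, with one result column per check name) is an assumption for illustration rather than the actual DQPlan schema.

```python
import sqlite3

# Hypothetical plan: one SQL statement per group, one aliased column per check.
query_groups = [
    {
        "sql": "SELECT AVG(email IS NULL) AS email_nulls, COUNT(*) AS row_count FROM customers",
        "checks": ["email_nulls", "row_count"],
    },
]


def run_plan(conn: sqlite3.Connection, groups: list[dict]) -> dict:
    """Execute each group and collect one metric value per check name."""
    results_map = {}
    for group in groups:
        cursor = conn.execute(group["sql"])
        columns = [desc[0] for desc in cursor.description]
        values = dict(zip(columns, cursor.fetchone()))
        for check in group["checks"]:
            results_map[check] = values[check]
    return results_map


# In-memory database standing in for the target DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, "c@example.com"), (4, "d@example.com")],
)
results_map = run_plan(conn, query_groups)
```

Grouping several checks into one statement means the table is scanned once for both metrics, which is presumably the motivation for the query groups described above.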
   
   **Validation**: Each metric value is passed to the corresponding callable in 
validators. A check passes if no validator is provided (metrics are collected 
but not gated) or if the validator returns True. Failures record the reason.
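
The pass/fail rule above can be sketched as below. `validate_results`, the outcome dict layout, and the wording of the failure reason are illustrative assumptions, not the operator's actual output format.

```python
from typing import Any, Callable


def validate_results(
    results_map: dict[str, Any],
    validators: dict[str, Callable[[Any], bool]],
) -> dict[str, dict]:
    """Apply each check's validator; checks without one pass automatically."""
    outcomes = {}
    for check, value in results_map.items():
        validator = validators.get(check)
        if validator is None:
            # Metric is collected but not gated.
            outcomes[check] = {"value": value, "passed": True, "reason": None}
            continue
        passed = bool(validator(value))
        reason = None if passed else f"validator rejected value {value!r}"
        outcomes[check] = {"value": value, "passed": passed, "reason": reason}
    return outcomes


outcomes = validate_results(
    {"email_nulls": 0.25, "row_count": 4},
    {"email_nulls": lambda ratio: ratio < 0.05},  # less than 5% nulls
)
```

In this example `email_nulls` fails because 0.25 is not below 0.05, and `row_count` passes because no validator is registered for it.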
   
   **Dry Run Mode**: When dry_run=True, the plan is generated/cached but not 
executed. 
   
   <!-- SPDX-License-Identifier: Apache-2.0
         https://www.apache.org/licenses/LICENSE-2.0 -->
   
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   
   - [X] Yes 
   Claude Sonnet 4.6 & Gemini 3.1 Pro
   Some method bodies were filled in and tests were created via Copilot
   
   
   ---
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
