cetingokhan commented on PR #62963:
URL: https://github.com/apache/airflow/pull/62963#issuecomment-4007934653
Sample Task Log;
.......
[2026-03-05T17:41:42.539957Z] INFO - Registered object store for schema:
s3://
[2026-03-05T17:41:42.918575Z] INFO - Registered data source format parquet
for table: sales_data
[2026-03-05T17:41:42.920314Z] INFO - Using schema context:
Table: sales_data
Columns: price: int64
area: int64
bedrooms: int64
bathrooms: int64
stories: int64
mainroad: string_view
guestroom: string_view
basement: string_view
hotwaterheating: string_view
airconditioning: string_view
parking: int64
prefarea: string_view
furnishingstatus: string_view
[2026-03-05T17:41:42.947772Z] INFO - DQ plan cache miss — generating via LLM
(key: 'dq_plan_v8_535ba38ccff9be9a').
[2026-03-05T17:41:42.948151Z] INFO - Using system prompt:
You are a data-quality SQL expert.
Given a set of named data-quality checks and a database schema, produce a
DQPlan that minimises the number of SQL queries executed.
PRIMARY RULE — combine everything on the same table into ONE SELECT:
All checks that query the same table MUST be merged into a single SELECT
statement. Each check becomes one output column in that statement.
CORRECT (one query for two checks on the same table):
SELECT
(COUNT(*) FILTER (WHERE email IS NULL) * 100.0 / COUNT(*)) AS
null_email_pct,
COUNT(*) FILTER (WHERE bathrooms < 0) AS
invalid_bathrooms
FROM customers
WRONG (two separate queries for the same table):
SELECT COUNT(*) FILTER (WHERE email IS NULL) AS null_email_count FROM
customers
SELECT COUNT(*) FILTER (WHERE bathrooms < 0) AS invalid_bathrooms FROM
customers
GROUPING STRATEGY:
Assign a group_id that describes the table being queried (e.g.
"sales_data_checks").
Only split into multiple groups when the checks genuinely require
different tables
or subqueries that cannot be expressed as columns of a single SELECT.
OUTPUT RULES:
1. Each output column must be aliased to exactly the metric_key of its
check.
Example: ... AS null_email_pct
2. Each check_name must exactly match the key in the prompts dict.
3. metric_key values must be valid SQL column aliases (snake_case, no
spaces).
4. Generates only SELECT queries — no INSERT, UPDATE, DELETE, DROP, or DDL.
5. Use SQL syntax.
6. Each check must appear in exactly ONE group.
7. Return a valid DQPlan object. No extra commentary.
Available schema:
Table: sales_data
Columns: price: int64
area: int64
bedrooms: int64
bathrooms: int64
stories: int64
mainroad: string_view
guestroom: string_view
basement: string_view
hotwaterheating: string_view
airconditioning: string_view
parking: int64
prefarea: string_view
furnishingstatus: string_view
[2026-03-05T17:41:42.948518Z] INFO - Using user message:
Generate a DQPlan for the following data-quality checks.
IMPORTANT: All checks that query the same table MUST be combined into a
single SELECT with one output column per check.
Checks:
- check_name="null_mainroad": Check the percentage of rows where mainroad
is NULL
- check_name="invalid_bathrooms": Count rows where bathrooms is negative
or NULL
- check_name="bathroom_data": Count rows where bathrooms is greater than 2
- check_name="furnishingstatus_values": furnishingstatus column contains
only 'furnished', 'semi-furnished', or 'unfurnished'
......
[2026-03-05T17:41:49.436353Z] INFO - Registered data source format parquet
for table: sales_data
[2026-03-05T17:41:49.438619Z] INFO - Executing query: SELECT
(COUNT(*) FILTER (WHERE mainroad IS NULL) * 100.0 / COUNT(*)) AS
null_mainroad,
COUNT(*) FILTER (WHERE bathrooms IS NULL OR bathrooms < 0) AS
invalid_bathrooms,
COUNT(*) FILTER (WHERE bathrooms > 2) AS bathroom_data,
COUNT(*) FILTER (WHERE furnishingstatus NOT IN ('furnished',
'semi-furnished', 'unfurnished')) AS furnishingstatus_values
FROM sales_data
[2026-03-05T17:41:49.617786Z] INFO - All 4 data-quality check(s) passed.
[2026-03-05T17:41:49.619021Z] INFO - Pushing xcom
ti=RuntimeTaskInstance(id=UUID('019cbf17-4648-7359-ab77-9e24023d0c69'),
task_id='validate_sales_events', dag_id='example_llm_dq_s3_parquet',
run_id='manual__2026-03-05T17:41:39.517609+00:00', try_number=1,
dag_version_id=UUID('019cbf09-ae3e-7f97-afb3-cc5ec0a4160f'), map_index=-1,
hostname='b483f76e2940', context_carrier={},
task=<Task(LLMDataQualityOperator): validate_sales_events>,
bundle_instance=LocalDagBundle(name=dags-folder), max_tries=0,
start_date=datetime.datetime(2026, 3, 5, 17, 41, 40, 355724,
tzinfo=datetime.timezone.utc), end_date=None, state=<TaskInstanceState.RUNNING:
'running'>, is_mapped=False, rendered_map_index=None, sentry_integration='')
[2026-03-05T17:41:49.669975Z] INFO - Task instance in success state
[2026-03-05T17:41:49.670192Z] INFO - Previous state of the Task instance:
TaskInstanceState.RUNNING
[2026-03-05T17:41:49.670349Z] INFO - Task
operator:<Task(LLMDataQualityOperator): validate_sales_events>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]