yibo wen created CALCITE-7612:
---------------------------------
Summary: Track whether a column origin is derived from an aggregate
Key: CALCITE-7612
URL: https://issues.apache.org/jira/browse/CALCITE-7612
Project: Calcite
Issue Type: Improvement
Components: core
Affects Versions: 1.38.0
Reporter: yibo wen
Description:
RelColumnOrigin currently exposes whether an output column is derived from an
origin column via isDerived(), but it does not distinguish ordinary expression
derivation from aggregate derivation.
For example:
SELECT a + b AS c FROM t
and
SELECT SUM(a) AS s FROM t
both produce derived column origins, but downstream lineage or
impact-analysis tools may need to distinguish whether the output column was
derived by an aggregate call.
Expected behavior:
Column-origin metadata should be able to tell whether an origin is derived
from an aggregate expression.
Motivation:
For column lineage, data governance, and impact analysis, aggregate-derived
columns often need to be handled differently from ordinary expression-derived
columns. For example, SUM(a), COUNT(a), AVG(a) and a + b all depend on source
columns, but their lineage semantics differ: an aggregate combines values
from many input rows into one result, whereas a + b is computed row by row.
Possible design direction:
Extend RelColumnOrigin or related metadata to expose aggregate-derived
information. This may require discussion because RelColumnOrigin is part of
Calcite's public metadata API.
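To make the "derivation kind" option concrete, here is a minimal standalone
sketch. It does not use or modify Calcite: ColumnOriginSketch, DerivationKind
and isDerivedFromAggregate() are hypothetical names chosen for illustration,
and the table is modelled as a plain String rather than a RelOptTable. The
point is only that an enum can keep the existing isDerived() answer unchanged
while adding the aggregate distinction.

```java
import java.util.Objects;

/** Hypothetical stand-in for a column origin; NOT Calcite's RelColumnOrigin. */
final class ColumnOriginSketch {
  /** Hypothetical derivation kinds; the names are illustrative only. */
  public enum DerivationKind { NONE, EXPRESSION, AGGREGATE }

  private final String table;        // assumption: table name as a String
  private final int columnOrdinal;
  private final DerivationKind kind;

  public ColumnOriginSketch(String table, int columnOrdinal, DerivationKind kind) {
    this.table = Objects.requireNonNull(table, "table");
    this.columnOrdinal = columnOrdinal;
    this.kind = Objects.requireNonNull(kind, "kind");
  }

  public String getTable() {
    return table;
  }

  public int getColumnOrdinal() {
    return columnOrdinal;
  }

  public DerivationKind getKind() {
    return kind;
  }

  /** Same answer a plain boolean would give: true for any derivation. */
  public boolean isDerived() {
    return kind != DerivationKind.NONE;
  }

  /** The extra information requested here: derived through an aggregate call. */
  public boolean isDerivedFromAggregate() {
    return kind == DerivationKind.AGGREGATE;
  }
}
```

Under this sketch, the origin of c in SELECT a + b AS c FROM t would carry
EXPRESSION and the origin of s in SELECT SUM(a) AS s FROM t would carry
AGGREGATE; both still report isDerived() == true, so callers that only check
the boolean would see no change.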
Open questions:
Should aggregate derivation be represented as a new boolean flag, a
derivation kind enum, or a separate metadata API?
Should this information affect equals/hashCode semantics of RelColumnOrigin?
How should aggregate calls with zero arguments, such as COUNT(*), be
represented?
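The equals/hashCode question has a visible effect whenever origins are
collected into a Set. The standalone sketch below illustrates this with a
hypothetical Origin class (again not Calcite's RelColumnOrigin; the
aggregateDerived flag and both counting helpers are assumptions made for
illustration). It models an output column that reaches t.a both directly and
through SUM(a), as could happen across the branches of a UNION ALL.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

/** Hypothetical sketch; Origin is a stand-in, NOT Calcite's RelColumnOrigin. */
final class OriginEqualitySketch {
  public static final class Origin {
    final String table;
    final int column;
    final boolean aggregateDerived; // hypothetical flag

    public Origin(String table, int column, boolean aggregateDerived) {
      this.table = Objects.requireNonNull(table, "table");
      this.column = column;
      this.aggregateDerived = aggregateDerived;
    }

    // Variant under discussion: the flag participates in equality.
    @Override public boolean equals(Object o) {
      if (!(o instanceof Origin)) {
        return false;
      }
      Origin other = (Origin) o;
      return table.equals(other.table)
          && column == other.column
          && aggregateDerived == other.aggregateDerived;
    }

    @Override public int hashCode() {
      return Objects.hash(table, column, aggregateDerived);
    }
  }

  /** Distinct origins when the flag is part of equals/hashCode. */
  public static int countWithFlag(List<Origin> origins) {
    return new HashSet<>(origins).size();
  }

  /** Distinct origins when only (table, column) identifies an origin. */
  public static int countIgnoringFlag(List<Origin> origins) {
    Set<List<Object>> keys = new HashSet<>();
    for (Origin o : origins) {
      keys.add(Arrays.<Object>asList(o.table, o.column));
    }
    return keys.size();
  }
}
```

For the two origins (t, 0, false) and (t, 0, true), countWithFlag returns 2
and countIgnoringFlag returns 1. Including the flag in equality therefore
keeps both derivation paths visible to lineage tools, at the cost of larger
origin sets for callers that only care about which source columns are
involved.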