kaori-seasons commented on issue #60951:
URL: https://github.com/apache/doris/issues/60951#issuecomment-4002877928

   > Thank you for such a comprehensive and well-structured proposal! This is a 
high-quality design document that clearly outlines the technical path for 
Snowflake dialect compatibility. We really appreciate the detailed analysis of 
the function signatures and the "decomposable" implementation strategy.
   > 
   > We agree with your design principles, especially **G-5 (not altering 
existing function behavior)** and the prioritization of FE-layer solutions. The 
tiered classification (Tier 0-5) is very practical.
   > 
   > Before we proceed, we have a few questions regarding the overall strategy 
and future maintainability:
   > 
   > 1. **Compatibility Coverage Measurement**: How do you plan to measure the 
final compatibility coverage? For the goal of ">90% high-frequency function 
compatibility," will you be using a curated list of Snowflake functions to 
validate this metric?
   > 2. **Handling Hard-to-Compat Functions**: For functions that cannot be 
easily handled by FE aliases or rewriting (e.g., complex `LATERAL FLATTEN` 
usage or `OBJECT_CONSTRUCT` NULL-skipping), do you envision these being handled 
primarily by the Dialect Converter Plugin? If so, should we consider 
consolidating more logic into the SQL Converter to keep the core engine clean, 
or do you see value in implementing specific BE functions for them?
   > 3. **Dialect Conflict Management**: Regarding potential conflicts, if we 
introduce support for other dialects in the future (e.g., Databricks SQL), how 
do we plan to handle function name collisions or semantic differences (like the 
`SPLIT` parameter order you mentioned)? Should the alias registration mechanism 
be made dialect-aware to ensure isolation?
   > 
   > We look forward to your thoughts on the questions above!
   
   
   @morningman 
Thank you for the insightful and comprehensive questions. Please allow me to address each of them in turn:
   
### Question 1: How to Measure Compatibility Coverage?
   
   **Question**:
   > How do you plan to measure the final compatibility coverage? For the goal 
of ">90% high-frequency function compatibility," will you be using a curated 
list of Snowflake functions to validate this metric?
   
   **Thoughts**:
   
   A single percentage metric can be misleading—for example, "supporting 80% of 
functions" might include many low-frequency niche functions. We recommend 
adopting a **multi-dimensional measurement framework**:
   
   | Measurement Dimension | Description | Data Source |
   |----------------------|-------------|-------------|
   | **Function-level Coverage** | Compatible functions / Total functions | 
Snowflake official documentation |
   | **High-frequency Weighted Coverage** | Weighted calculation by usage 
frequency | Real customer query logs (anonymized) |
   | **SQL Direct Execution Rate** | Proportion of SQL that executes without 
modification | 10,000 migration SQL corpus |
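As a concrete illustration, the high-frequency weighted metric in the table above could be computed like this (the function names and counts below are made-up placeholders, not real query-log data):

```python
# Sketch of high-frequency weighted coverage; usage counts would come from
# anonymized customer query logs in practice.
def weighted_coverage(usage_counts, supported):
    """usage_counts: {function_name: call_count}; supported: set of names."""
    total = sum(usage_counts.values())
    covered = sum(n for f, n in usage_counts.items() if f in supported)
    return covered / total

# Toy numbers: a heavily used function moves the metric far more than a
# niche one, which is why a flat function count can mislead.
usage = {"IFF": 800, "LEN": 150, "OBJECT_CONSTRUCT": 40, "BITXOR": 10}
print(weighted_coverage(usage, {"IFF", "LEN"}))  # 0.95
```

Supporting just two high-frequency functions already yields 95% weighted coverage here, while unweighted function-level coverage would report only 50%.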
   
   **Phased Goals**:
   
   ```
   After Phase 1 → High-frequency Top 50 coverage ≥ 60%
   After Phase 2 → High-frequency Top 50 coverage ≥ 85%
   After Phase 3 → SQL direct execution rate ≥ 90%
   ```
   
**Follow-up Actions**: We are willing to maintain a compatibility matrix document that clearly marks each Snowflake function's status (Fully Compatible / Approximately Compatible / Not Compatible).
   
   ---
   
### Question 2: How to Handle Hard-to-Compat Functions?
   
   **Question**:
   > For functions that cannot be easily handled by FE aliases or rewriting 
(e.g., complex LATERAL FLATTEN usage or OBJECT_CONSTRUCT NULL-skipping), do you 
envision these being handled primarily by the Dialect Converter Plugin? If so, 
should we consider consolidating more logic into the SQL Converter to keep the 
core engine clean, or do you see value in implementing specific BE functions 
for them?
   
   **Thoughts**:
   
   This is an excellent architectural trade-off question. Our core position is: 
**Converter handles syntax-level differences, FE/BE handles runtime semantic 
differences**.
   
   **Why can't OBJECT_CONSTRUCT rely on Converter?**
   
Snowflake's `OBJECT_CONSTRUCT` has a special semantic: when a value is NULL, that key is skipped. This requires **runtime judgment**—because the Converter works at the SQL text level, it cannot know whether an expression (possibly a column reference, a subquery result, etc.) is NULL.
   
```sql
   -- Converter can only see this text, cannot know if column_value is NULL at 
runtime
   OBJECT_CONSTRUCT('key1', column_value, 'key2', 'static_value')
   ```
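A reference model of the NULL-skipping semantic (a Python sketch of the behavior, not the proposed BE code) makes the runtime dependency explicit:

```python
# Model of Snowflake's OBJECT_CONSTRUCT: key/value pairs whose value is
# NULL (None here) are dropped from the result object.
def object_construct(*args):
    if len(args) % 2 != 0:
        raise ValueError("expected an even number of key/value arguments")
    # The skip decision inspects the *runtime* value -- exactly what a
    # text-level converter cannot see for a column reference.
    return {k: v for k, v in zip(args[::2], args[1::2]) if v is not None}

column_value = None  # only known at execution time
print(object_construct("key1", column_value, "key2", "static_value"))
# {'key2': 'static_value'}
```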
   
   **Decision Matrix**:
   
| Function/Syntax | Placement | Reason |
   |----------------|------|--------|
   | `OBJECT_CONSTRUCT` NULL-skipping | New BE function | Requires runtime 
judgment |
   | `FLATTEN` TVF basic version | New BE Table Function | Outputs multiple 
rows and columns, Converter cannot simulate |
   | `LATERAL FLATTEN` syntax | Dialect Converter | Pure syntax-level 
transformation |
   | `QUALIFY` clause | Dialect Converter | Rewrite to subquery |
   | `SPLIT` parameter order conflict | Dialect Converter | Don't change 
existing behavior, swap under dialect mode |
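To illustrate the `FLATTEN` row of the matrix, here is a minimal model of why it needs engine support (illustrative Python, not the proposed BE table function): one input array expands into N output rows, which no scalar-expression rewrite can express.

```python
# Model of FLATTEN as a table function: a single array value produces
# multiple output rows, each carrying INDEX and VALUE columns.
def flatten(input_array):
    for index, value in enumerate(input_array):
        yield {"INDEX": index, "VALUE": value}

rows = list(flatten(["a", "b", "c"]))
print(len(rows))  # 3
print(rows[0])    # {'INDEX': 0, 'VALUE': 'a'}
```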
   
   **View on "Keeping the Core Engine Clean"**:
   
   We understand the community's concerns, but **semantic correctness must come 
first**. If Converter cannot guarantee correctness, new logic should be 
implemented in FE/BE. This also aligns with the G-2 principle in our proposal.
   
   ---
   
### Question 3: How to Manage Multi-Dialect Conflicts?
   
   **Question**:
   > Regarding potential conflicts, if we introduce support for other dialects 
in the future (e.g., Databricks SQL), how do we plan to handle function name 
collisions or semantic differences (like the SPLIT parameter order you 
mentioned)? Should the alias registration mechanism be made dialect-aware to 
ensure isolation?
   
   **Thoughts**:
   
   This is a key question for architectural foresight. We recommend adopting a 
**progressive dialect isolation strategy**:
   
   #### Short-term Solution (Phase 1-2)
   
**No dialect awareness needed**. The 16 pure aliases + 7 expression rewrite functions at this stage **do not involve conflicts**: `IFF`, `LEN`, `CHARINDEX`, etc., do not exist in Doris today, so registering them causes no collisions.
   
   For the `SPLIT` parameter order conflict (Snowflake: `SPLIT(str, delim)` vs 
Doris: `SPLIT(delim, str)`), we document the difference and suggest Snowflake 
users explicitly use `split_by_string(delim, str)`.
   
   #### Mid-term Solution (Phase 5)
   
   Introduce **Session-level dialect mode**:
   
   ```sql
   -- Set dialect mode
   SET SESSION sql_dialect = 'snowflake';
   
   -- Under this session, SPLIT automatically swaps parameters
   SELECT SPLIT('hello,world', ',');  -- Internally converts to 
split_by_string(',', 'hello,world')
   ```
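The argument swap under session mode could be sketched as a small FE-side rewrite hook (a hypothetical helper, following the `split_by_string(delim, str)` signature used above):

```python
# Under the snowflake dialect, SPLIT(str, delim) is rewritten onto the
# existing split_by_string(delim, str) function; other dialects pass through.
def rewrite_split(args, sql_dialect):
    if sql_dialect == "snowflake":
        string, delim = args  # Snowflake order: string first
        return ("split_by_string", (delim, string))
    return ("split", args)

print(rewrite_split(("hello,world", ","), "snowflake"))
# ('split_by_string', (',', 'hello,world'))
```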
   
   #### Long-term Architecture
   
   Function lookup supports **(function_name, dialect, parameter_types)** 
triple-key indexing:
   
   ```
   ("split", DORIS,      (VARCHAR, VARCHAR)) → SplitByString     // delim first
   ("split", SNOWFLAKE,  (VARCHAR, VARCHAR)) → SplitSnowflake    // string first
    ("split", DATABRICKS, (VARCHAR, VARCHAR)) → SplitSnowflake    // string first
   ```
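A sketch of that triple-key lookup using plain dicts (the implementation names are placeholders); falling back to the Doris entry keeps dialects that share Doris semantics from needing duplicate registrations:

```python
# (function_name, dialect, parameter_types) -> implementation name.
REGISTRY = {
    ("split", "DORIS",     ("VARCHAR", "VARCHAR")): "SplitByString",
    ("split", "SNOWFLAKE", ("VARCHAR", "VARCHAR")): "SplitSnowflake",
}

def resolve(name, dialect, arg_types):
    key = (name.lower(), dialect, arg_types)
    # A dialect-specific entry wins; otherwise fall back to Doris semantics.
    return REGISTRY.get(key, REGISTRY.get((name.lower(), "DORIS", arg_types)))

print(resolve("SPLIT", "SNOWFLAKE", ("VARCHAR", "VARCHAR")))  # SplitSnowflake
print(resolve("SPLIT", "HIVE", ("VARCHAR", "VARCHAR")))       # SplitByString
```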
   
   **Outlook for Databricks SQL**:
   
If Databricks SQL is supported in the future, we expect `SPLIT` to conflict in the same way: Databricks and Snowflake share the same parameter order (string first), which differs from Doris. This reinforces the need for a general dialect isolation mechanism.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

