comphead commented on code in PR #3974: URL: https://github.com/apache/datafusion-comet/pull/3974#discussion_r3112078179
########## .claude/skills/review-datafusion-pr/SKILL.md: ########## @@ -0,0 +1,295 @@ +--- +name: review-datafusion-pr +description: Review an Apache DataFusion pull request for correctness, Spark compatibility of `datafusion-spark` expressions, and for breaking API changes that may affect DataFusion Comet. Provides guidance to a reviewer rather than posting comments directly. +argument-hint: <pr-number> +--- + +Review DataFusion PR #$ARGUMENTS + +## Before You Start + +### Gather PR Metadata + +Fetch the PR details to understand the scope: + +```bash +gh pr view $ARGUMENTS --repo apache/datafusion --json title,body,author,isDraft,state,files +``` + +### Review Existing Comments First + +Before forming your review: + +1. **Read all existing review comments** on the PR +2. **Check the conversation tab** for any discussion +3. **Avoid duplicating feedback** that others have already provided +4. **Build on existing discussions** rather than starting new threads on the same topic +5. **If you have no additional concerns beyond what's already discussed, say so** +6. **Ignore Copilot reviews** - do not reference or build upon comments from GitHub Copilot + +```bash +gh pr view $ARGUMENTS --repo apache/datafusion --comments +``` + +### Classify the PR + +Look at the changed files and decide which review tracks apply. More than one may apply. + +- **`datafusion/spark/` expression change** — adds or modifies a Spark-compatible function. Apply the Spark expression track. +- **`datafusion/sqllogictest/test_files/spark/` test change** — modifies `.slt` tests for Spark functions. Apply the Spark expression track. +- **Core DataFusion change** — modifies `datafusion/core`, `datafusion/physical-expr`, `datafusion/physical-plan`, `datafusion/expr`, `datafusion/common`, `datafusion/datasource`, or `datafusion/physical-expr-adapter`. Apply the Comet API impact track. These are the crates Comet depends on directly. +- **Other** — protocol-level changes, release infrastructure, docs. Comet impact is usually low, but still skim for breaking changes. + +The Comet native workspace declares its DataFusion dependencies in `native/Cargo.toml`. The crates Comet currently pulls in are `datafusion`, `datafusion-datasource`, `datafusion-physical-expr-adapter`, and `datafusion-spark`. Changes to public items in those crates are the ones most likely to affect Comet. + +--- + +## Review Workflow + +### 1. Gather Context + +```bash +# View the full diff +gh pr diff $ARGUMENTS --repo apache/datafusion +``` + +For expression PRs, check how similar expressions are implemented in `datafusion/spark/src/function/`. The subdirectories mirror expression categories (`math/`, `string/`, `datetime/`, `array/`, `hash/`, etc.). Each function typically has a Rust module and a matching `.slt` file under `datafusion/sqllogictest/test_files/spark/<category>/<name>.slt`. + +### 2. Spark Expression Track + +For PRs that add or modify a `datafusion-spark` function, correctness means **matching Spark behavior**, not DataFusion's own semantics. SLT files use DataFusion syntax and run against DataFusion only, so they cannot verify Spark equivalence on their own. The reviewer must cross-check behavior against the real Spark implementation. + +#### 2.1 Read the Spark source code + +1. **Clone or update the Spark repo:** + + ```bash + if [ ! -d /tmp/spark ]; then + git clone --depth 1 https://github.com/apache/spark.git /tmp/spark + fi + ``` + +2. **Find the expression class:** + + ```bash + find /tmp/spark/sql/catalyst/src/main/scala -name "*.scala" \ + | xargs grep -l "case class <ExpressionName>\b" + ``` + +3. **Read the Spark implementation carefully.** Pay attention to: + - `eval`, `nullSafeEval`, and `doGenCode` (these define the canonical behavior) + - `inputTypes` and `dataType` (accepted input types and return type) + - Null handling (`nullable`, `nullSafeEval`) + - ANSI mode branches (look for `SQLConf.get.ansiEnabled` or `failOnError`) + - Special cases, guards, `require` assertions, runtime exceptions + +4. **Read the Spark tests for the expression:** + + ```bash + find /tmp/spark/sql -name "*.scala" -path "*/test/*" \ + | xargs grep -l "<ExpressionName>" + ``` + +5. **Compare the Spark behavior against the DataFusion implementation in the PR.** Identify: + - Edge cases tested in Spark but not covered in the `.slt` file + - Data types supported in Spark but not handled in the PR + - Behavioral differences that the PR does not document + +#### 2.2 Review the Rust implementation + +Location: `datafusion/spark/src/function/<category>/<name>.rs` + +Check: + +- Null handling propagates correctly +- Overflow and underflow return `Err` when Spark would throw under ANSI mode, return the Spark-defined value otherwise +- Type dispatch covers all types that Spark supports for this function +- ANSI vs non-ANSI branches match Spark behavior. The SLT framework exposes `datafusion.execution.enable_ansi_mode` for this. +- Uses Arrow compute kernels where available. Row-by-row loops are a red flag. +- No panics. Return `DataFusionError` variants instead. +- UDF registration is wired up in the function module's `mod.rs` and in the category's `mod.rs`. + +#### 2.3 Review the `.slt` test file + +Location: `datafusion/sqllogictest/test_files/spark/<category>/<name>.slt` + +**The `.slt` test format, in brief:** + +```sql +# Scalar input +query I +SELECT abs(-1::INT); +---- +1 + +# Array input +query I +SELECT abs(a) FROM (VALUES (-1::INT), (NULL)) AS t(a); +---- +1 +NULL + +# ANSI mode error +statement ok +set datafusion.execution.enable_ansi_mode = true; + +query error DataFusion error: Arrow error: Compute error: Int32 overflow on abs\(\-2147483648\) +select abs((-2147483648)::INT); + +statement ok +set datafusion.execution.enable_ansi_mode = false; +``` + +Verify the test file against the `datafusion-spark` testing guide in `datafusion/sqllogictest/test_files/spark/README.md`: + +- [ ] Both scalar and array inputs are exercised (the README requires this) +- [ ] All accepted Spark input types are tested with explicit casts (`0::INT`, `0::BIGINT`, etc.) — DataFusion and Spark do not infer types the same way +- [ ] Null input is tested +- [ ] Edge cases: empty string, boundary values (e.g., `INT_MIN`), `NaN`, `Infinity`, `-0.0`, negative values for numeric functions +- [ ] ANSI mode behavior is wrapped in `set datafusion.execution.enable_ansi_mode = true/false` pairs where Spark differs between modes +- [ ] Test only contains `SELECT` statements for the function under test, with no unrelated setup +- [ ] Header comments cite the upstream source if ported (the existing files show the pattern) Review Comment: for nested types it is needed to test empty values if applicable, like empty array, map and mix of empty and non empty entries -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
