Re: [PR] feat: add review-datafusion-pr Claude Code skill [datafusion-comet]

via GitHub Mon, 20 Apr 2026 09:15:00 -0700


comphead commented on code in PR #3974:
URL: https://github.com/apache/datafusion-comet/pull/3974#discussion_r3112078179



##########
.claude/skills/review-datafusion-pr/SKILL.md:
##########
@@ -0,0 +1,295 @@
+---
+name: review-datafusion-pr
+description: Review an Apache DataFusion pull request for correctness, Spark 
compatibility of `datafusion-spark` expressions, and for breaking API changes 
that may affect DataFusion Comet. Provides guidance to a reviewer rather than 
posting comments directly.
+argument-hint: <pr-number>
+---
+
+Review DataFusion PR #$ARGUMENTS
+
+## Before You Start
+
+### Gather PR Metadata
+
+Fetch the PR details to understand the scope:
+
+```bash
+gh pr view $ARGUMENTS --repo apache/datafusion --json 
title,body,author,isDraft,state,files
+```
+
+### Review Existing Comments First
+
+Before forming your review:
+
+1. **Read all existing review comments** on the PR
+2. **Check the conversation tab** for any discussion
+3. **Avoid duplicating feedback** that others have already provided
+4. **Build on existing discussions** rather than starting new threads on the 
same topic
+5. **If you have no additional concerns beyond what's already discussed, say 
so**
+6. **Ignore Copilot reviews** - do not reference or build upon comments from 
GitHub Copilot
+
+```bash
+gh pr view $ARGUMENTS --repo apache/datafusion --comments
+```
+
+### Classify the PR
+
+Look at the changed files and decide which review tracks apply. More than one 
may apply.
+
+- **`datafusion/spark/` expression change** — adds or modifies a 
Spark-compatible function. Apply the Spark expression track.
+- **`datafusion/sqllogictest/test_files/spark/` test change** — modifies 
`.slt` tests for Spark functions. Apply the Spark expression track.
+- **Core DataFusion change** — modifies `datafusion/core`, 
`datafusion/physical-expr`, `datafusion/physical-plan`, `datafusion/expr`, 
`datafusion/common`, `datafusion/datasource`, or 
`datafusion/physical-expr-adapter`. Apply the Comet API impact track. These are 
the crates Comet depends on directly.
+- **Other** — protocol-level changes, release infrastructure, docs. Comet 
impact is usually low, but still skim for breaking changes.
+
+The Comet native workspace declares its DataFusion dependencies in 
`native/Cargo.toml`. The crates Comet currently pulls in are `datafusion`, 
`datafusion-datasource`, `datafusion-physical-expr-adapter`, and 
`datafusion-spark`. Changes to public items in those crates are the ones most 
likely to affect Comet.
+
+---
+
+## Review Workflow
+
+### 1. Gather Context
+
+```bash
+# View the full diff
+gh pr diff $ARGUMENTS --repo apache/datafusion
+```
+
+For expression PRs, check how similar expressions are implemented in 
`datafusion/spark/src/function/`. The subdirectories mirror expression 
categories (`math/`, `string/`, `datetime/`, `array/`, `hash/`, etc.). Each 
function typically has a Rust module and a matching `.slt` file under 
`datafusion/sqllogictest/test_files/spark/<category>/<name>.slt`.
+
+### 2. Spark Expression Track
+
+For PRs that add or modify a `datafusion-spark` function, correctness means 
**matching Spark behavior**, not DataFusion's own semantics. SLT files use 
DataFusion syntax and run against DataFusion only, so they cannot verify Spark 
equivalence on their own. The reviewer must cross-check behavior against the 
real Spark implementation.
+
+#### 2.1 Read the Spark source code
+
+1. **Clone or update the Spark repo:**
+
+   ```bash
+   if [ ! -d /tmp/spark ]; then
+     git clone --depth 1 https://github.com/apache/spark.git /tmp/spark
+   fi
+   ```
+
+2. **Find the expression class:**
+
+   ```bash
+   find /tmp/spark/sql/catalyst/src/main/scala -name "*.scala" \
+     | xargs grep -l "case class <ExpressionName>\b"
+   ```
+
+3. **Read the Spark implementation carefully.** Pay attention to:
+   - `eval`, `nullSafeEval`, and `doGenCode` (these define the canonical 
behavior)
+   - `inputTypes` and `dataType` (accepted input types and return type)
+   - Null handling (`nullable`, `nullSafeEval`)
+   - ANSI mode branches (look for `SQLConf.get.ansiEnabled` or `failOnError`)
+   - Special cases, guards, `require` assertions, runtime exceptions
+
+4. **Read the Spark tests for the expression:**
+
+   ```bash
+   find /tmp/spark/sql -name "*.scala" -path "*/test/*" \
+     | xargs grep -l "<ExpressionName>"
+   ```
+
+5. **Compare the Spark behavior against the DataFusion implementation in the 
PR.** Identify:
+   - Edge cases tested in Spark but not covered in the `.slt` file
+   - Data types supported in Spark but not handled in the PR
+   - Behavioral differences that the PR does not document
+
+#### 2.2 Review the Rust implementation
+
+Location: `datafusion/spark/src/function/<category>/<name>.rs`
+
+Check:
+
+- Null handling propagates correctly
+- Overflow and underflow return `Err` when Spark would throw under ANSI mode, 
return the Spark-defined value otherwise
+- Type dispatch covers all types that Spark supports for this function
+- ANSI vs non-ANSI branches match Spark behavior. The SLT framework exposes 
`datafusion.execution.enable_ansi_mode` for this.
+- Uses Arrow compute kernels where available. Row-by-row loops are a red flag.
+- No panics. Return `DataFusionError` variants instead.
+- UDF registration is wired up in the function module's `mod.rs` and in the 
category's `mod.rs`.
+
+#### 2.3 Review the `.slt` test file
+
+Location: `datafusion/sqllogictest/test_files/spark/<category>/<name>.slt`
+
+**The `.slt` test format, in brief:**
+
+```sql
+# Scalar input
+query I
+SELECT abs(-1::INT);
+----
+1
+
+# Array input
+query I
+SELECT abs(a) FROM (VALUES (-1::INT), (NULL)) AS t(a);
+----
+1
+NULL
+
+# ANSI mode error
+statement ok
+set datafusion.execution.enable_ansi_mode = true;
+
+query error DataFusion error: Arrow error: Compute error: Int32 overflow on 
abs\(\-2147483648\)
+select abs((-2147483648)::INT);
+
+statement ok
+set datafusion.execution.enable_ansi_mode = false;
+```
+
+Verify the test file against the `datafusion-spark` testing guide in 
`datafusion/sqllogictest/test_files/spark/README.md`:
+
+- [ ] Both scalar and array inputs are exercised (the README requires this)
+- [ ] All accepted Spark input types are tested with explicit casts (`0::INT`, 
`0::BIGINT`, etc.) — DataFusion and Spark do not infer types the same way
+- [ ] Null input is tested
+- [ ] Edge cases: empty string, boundary values (e.g., `INT_MIN`), `NaN`, 
`Infinity`, `-0.0`, negative values for numeric functions
+- [ ] ANSI mode behavior is wrapped in `set 
datafusion.execution.enable_ansi_mode = true/false` pairs where Spark differs 
between modes
+- [ ] Test only contains `SELECT` statements for the function under test, with 
no unrelated setup
+- [ ] Header comments cite the upstream source if ported (the existing files 
show the pattern)

Review Comment:
   for nested types it is needed to test empty values if applicable, like empty 
array, map and mix of empty and non empty entries



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: add review-datafusion-pr Claude Code skill [datafusion-comet]

Reply via email to