n0r0shi opened a new pull request, #20487:
URL: https://github.com/apache/datafusion/pull/20487
## Which issue does this PR close?
Closes #<TBD>.
## Rationale for this change
Apache Spark's `levenshtein(str1, str2, threshold)` supports an optional
third argument that caps the computation: if the edit distance exceeds the
threshold, it returns -1 with early termination. DataFusion currently only
supports the 2-argument form. This addition enables the downstream [Apache
DataFusion Comet](https://github.com/apache/datafusion-comet) project to
natively support Spark's 3-argument `levenshtein` instead of falling back to
Spark.
## What changes are included in this PR?
- **`datafusion-common`**: Added `levenshtein_with_threshold` and
`levenshtein_with_threshold_and_buffer` to the `datafusion_strsim` module. The
algorithm uses banded dynamic programming (only computing cells within
`threshold` distance of the diagonal) with early termination, ported from
Apache Spark's `UTF8String.levenshteinDistance` / Apache Commons Text
`LevenshteinDistance.limitedCompare`.
- **`datafusion-functions`**: Extended `LevenshteinFunc` to accept an
optional third `Int32` argument (with implicit coercion from `Int64` so bare
integer literals like `2` work without explicit `::int` cast). Null threshold
produces null output. The existing 2-argument behavior is unchanged.
- **sqllogictests**: Uncommented the spark levenshtein tests and added 13
new test cases covering various thresholds, empty strings, and null handling.
### Design note: separate implementations
The threshold algorithm is kept separate from the existing
`levenshtein_with_buffer`. The existing implementation returns `usize`
(non-negative distance) using a generic iterator-based approach, while the
threshold version returns `i32` (can be -1) and requires index-based access for
the banded DP computation. We kept them separate to avoid changing the existing
function's return type and interface, but would welcome community input on
whether unification is preferred.
## Are these changes tested?
- 4 unit tests (basic, threshold, null handling, edge cases)
- 3 doc-tests for the new functions
- 2 spark sqllogictest queries (2-arg and 3-arg)
- 13 new sqllogictest queries in `string_literal.slt`
## Are there any user-facing changes?
Yes. `levenshtein` now accepts an optional third argument:
```sql
-- Existing (unchanged)
SELECT levenshtein('kitten', 'sitting'); -- 3
-- New: with threshold
SELECT levenshtein('kitten', 'sitting', 2); -- -1 (distance 3 >
threshold 2)
SELECT levenshtein('kitten', 'sitting', 5); -- 3 (distance 3 <=
threshold 5)
SELECT levenshtein('kitten', 'sitting', NULL); -- NULL
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]