[PR] feat: support optional threshold parameter for levenshtein function [datafusion]

via GitHub Mon, 23 Feb 2026 00:19:33 -0800


n0r0shi opened a new pull request, #20487:
URL: https://github.com/apache/datafusion/pull/20487


   
   ## Which issue does this PR close?
   
     Closes #<TBD>.
   
     ## Rationale for this change
   
     Apache Spark's `levenshtein(str1, str2, threshold)` supports an optional 
third argument that caps the computation: if the edit distance exceeds the 
threshold, it returns -1 with early termination. DataFusion currently only 
supports the 2-argument form. This addition enables the downstream [Apache 
DataFusion Comet](https://github.com/apache/datafusion-comet) project to 
natively support Spark's 3-argument `levenshtein` instead of falling back to 
Spark.
   
     ## What changes are included in this PR?
   
     - **`datafusion-common`**: Added `levenshtein_with_threshold` and 
`levenshtein_with_threshold_and_buffer` to the `datafusion_strsim` module. The 
algorithm uses banded dynamic programming (only computing cells within 
`threshold` distance of the diagonal) with early termination, ported from 
Apache Spark's `UTF8String.levenshteinDistance` / Apache Commons Text 
`LevenshteinDistance.limitedCompare`.
   
     - **`datafusion-functions`**: Extended `LevenshteinFunc` to accept an 
optional third `Int32` argument (with implicit coercion from `Int64` so bare 
integer literals like `2` work without explicit `::int` cast). Null threshold 
produces null output. The existing 2-argument behavior is unchanged.
   
     - **sqllogictests**: Uncommented the spark levenshtein tests and added 13 
new test cases covering various thresholds, empty strings, and null handling.
   
     ### Design note: separate implementations
   
     The threshold algorithm is kept separate from the existing 
`levenshtein_with_buffer`. The existing implementation returns `usize` 
(non-negative distance) using a generic iterator-based approach, while the 
threshold version returns `i32` (can be -1) and requires index-based access for 
the banded DP computation. We kept them separate to avoid changing the existing 
function's return type and interface, but would welcome community input on 
whether unification is preferred.
   
     ## Are these changes tested?
   
     - 4 unit tests (basic, threshold, null handling, edge cases)
     - 3 doc-tests for the new functions
     - 2 spark sqllogictest queries (2-arg and 3-arg)
     - 13 new sqllogictest queries in `string_literal.slt`
   
     ## Are there any user-facing changes?
   
     Yes. `levenshtein` now accepts an optional third argument:
   
     ```sql
     -- Existing (unchanged)
     SELECT levenshtein('kitten', 'sitting');        -- 3
   
     -- New: with threshold
     SELECT levenshtein('kitten', 'sitting', 2);     -- -1 (distance 3 > 
threshold 2)
     SELECT levenshtein('kitten', 'sitting', 5);     -- 3  (distance 3 <= 
threshold 5)
     SELECT levenshtein('kitten', 'sitting', NULL);  -- NULL
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] feat: support optional threshold parameter for levenshtein function [datafusion]

Reply via email to