Re: [PR] Spark soundex function implementation [datafusion]

via GitHub Sat, 14 Mar 2026 00:25:27 -0700


davidlghellin commented on code in PR #20725:
URL: https://github.com/apache/datafusion/pull/20725#discussion_r2934869806



##########
datafusion/sqllogictest/test_files/spark/string/soundex.slt:
##########
@@ -15,13 +15,37 @@
 # specific language governing permissions and limitations
 # under the License.
 
-# This file was originally created by a porting script from:
-#   https://github.com/lakehq/sail/tree/43b6ed8221de5c4c4adbedbb267ae1351158b43c/crates/sail-spark-connect/tests/gold_data/function
-# This file is part of the implementation of the datafusion-spark function library.
-# For more information, please see:
-#   https://github.com/apache/datafusion/issues/15914
-
-## Original Query: SELECT soundex('Miller');
-## PySpark 3.5.5 Result: {'soundex(Miller)': 'M460', 'typeof(soundex(Miller))': 'string', 'typeof(Miller)': 'string'}
-#query
-#SELECT soundex('Miller'::string);
+query T
+SELECT soundex('Miller');
+----
+M460
+
+query T
+SELECT soundex(NULL);
+----
+NULL
+
+query T
+SELECT soundex('');
+----
+(empty)
+
+query T
+SELECT soundex('Apache Spark');
+----
+A122
+
+query T
+SELECT soundex('123');
+----
+123
+
+query T
+SELECT soundex('a123');
+----
+A000
+
+query T
+SELECT soundex('Datafusion');
+----
+D312

Review Comment:
   
   Hey! I had started working on a Spark soundex implementation too and
   didn't realize there was already a PR for it. Happy to see this moving
   forward!
   
   I put together a battery of edge-case tests, validated against Spark on
   the JVM, that might be useful. The current SLT coverage is a bit thin, and
   some Soundex behaviors are easy to get wrong:
   ```python
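   from pyspark.sql import SparkSession

   # Reuse the active session (e.g. in the pyspark shell) or start a new one.
   spark = SparkSession.builder.getOrCreate()

   tests = [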
       # H/W transparency (must NOT separate same codes)
       ("H/W transparency", "SELECT soundex('Ashcroft') AS result"),
       # Separators (digit, space, vowel MUST separate same codes)
       ("Digit separates same-code", "SELECT soundex('B1B') AS result"),
       ("Space separates same-code", "SELECT soundex('B B') AS result"),
       ("Vowel separates same-code", "SELECT soundex('BAB') AS result"),
       # Non-alpha first character (returns input unchanged)
       ("Non-alpha first char", "SELECT soundex('#hello') AS result"),
       ("Space first char", "SELECT soundex(' hello') AS result"),
       ("Only spaces", "SELECT soundex('   ') AS result"),
       ("Tab prefix", "SELECT soundex('\thello') AS result"),
       ("Emoji prefix", "SELECT soundex('😀hello') AS result"),
       ("Only digits", "SELECT soundex('123') AS result"),
       ("Starts with digit", "SELECT soundex('1abc') AS result"),
       # Basic behavior
       ("Single character", "SELECT soundex('A') AS result"),
       ("All same-code letters", "SELECT soundex('BFPV') AS result"),
       ("Similar names Robert", "SELECT soundex('Robert') AS result"),
       ("Similar names Rupert", "SELECT soundex('Rupert') AS result"),
       ("NULL", "SELECT soundex(NULL) AS result"),
       ("Empty string", "SELECT soundex('') AS result"),
       # Case insensitivity
       ("Lowercase", "SELECT soundex('robert') AS result"),
       ("Mixed case same", "SELECT soundex('rObErT') AS result"),
       # Unicode
       ("Unicode umlaut", "SELECT soundex('Müller') AS result"),
       # Truncation (only first 3 codes after initial)
       ("Long string", "SELECT soundex('Abcdefghijklmnop') AS result"),
       # Extra edge cases
       ("Adjacent same codes collapse", "SELECT soundex('Lloyd') AS result"),
       ("W between same codes", "SELECT soundex('BWB') AS result"),
       ("H between same codes", "SELECT soundex('BHB') AS result"),
       ("Double letters", "SELECT soundex('Tymczak') AS result"),
       ("All vowels after first", "SELECT soundex('Aeiou') AS result"),
       ("First char digit rest alpha", "SELECT soundex('1Robert') AS result"),
       ("Hyphen in name", "SELECT soundex('Smith-Jones') AS result"),
       ("Single non-alpha", "SELECT soundex('#') AS result"),
       ("Newline prefix", "SELECT soundex('\nhello') AS result"),
   ]
   
   for label, sql in tests:
       r = spark.sql(sql).collect()
       print(f"{label}: {repr(r[0].result)}")
   
   # Multi-row column test
   print("\nColumn test:")
   spark.sql("""
       SELECT soundex(name) AS result 
       FROM VALUES ('Robert'), ('Rupert'), (NULL), (''), ('123') AS t(name)
   """).show()
   ```
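
For reference, the rules those labels describe can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `soundex_sketch`, `CODES`, and `is_ascii_letter` are hypothetical names, not code from this PR or from Spark, and the sketch assumes that only ASCII letters are coded, that H and W are transparent, and that any other character acts as a separator. The values checked against it are the SLT expectations from the diff above.

```python
# Letter -> Soundex digit. Vowels and Y map to "0" (never emitted, but they
# separate equal codes). H and W are deliberately absent: they are transparent.
CODES = {}
for digit, letters in {
    "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
    "4": "L", "5": "MN", "6": "R", "0": "AEIOUY",
}.items():
    for letter in letters:
        CODES[letter] = digit


def is_ascii_letter(ch):
    # Assumption: only ASCII letters count; 'ü' or an emoji is a non-letter.
    return "A" <= ch <= "Z" or "a" <= ch <= "z"


def soundex_sketch(s):
    if s is None:
        return None
    # Empty input or a non-letter first character: return the input unchanged.
    if s == "" or not is_ascii_letter(s[0]):
        return s
    first = s[0].upper()
    out = [first]
    last = CODES.get(first)  # None when the first letter is H or W
    for ch in s[1:]:
        if not is_ascii_letter(ch):
            last = "0"  # digits, spaces, punctuation separate equal codes
            continue
        up = ch.upper()
        if up in "HW":
            continue  # transparent: does not reset the previous code
        code = CODES[up]
        if code != "0" and code != last:
            out.append(code)
            if len(out) == 4:  # initial letter plus three digits
                break
        last = code
    return "".join(out).ljust(4, "0")


# Expectations taken from the SLT cases in this PR.
for name, expected in [("Miller", "M460"), ("Apache Spark", "A122"),
                       ("123", "123"), ("a123", "A000"), ("Datafusion", "D312")]:
    assert soundex_sketch(name) == expected
```

Under these assumptions `BHB` and `BWB` collapse to `B000`, while `B1B`, `B B`, and `BAB` all give `B100`, which is the distinction the first two groups of tests are meant to pin down.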
   
   Output on Spark 3.5:
   
   <img width="358" height="594" alt="Image" src="https://github.com/user-attachments/assets/3d18c6fc-3682-413c-ae3c-84f5b74120ff" />
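
If it helps, the Spark results could be turned into SLT records mechanically. The helper below is a rough sketch: `to_slt` is a hypothetical name, and it only mirrors the record layout already used in this file (`query T`, the statement, `----`, then the value, with `NULL` and `(empty)` as placeholders). Results containing tabs or newlines would still need manual handling.

```python
def to_slt(sql, result):
    """Render one query and its single string result as a sqllogictest record."""
    if result is None:
        expected = "NULL"      # placeholder used in this file for SQL NULL
    elif result == "":
        expected = "(empty)"   # placeholder used in this file for empty strings
    else:
        expected = result
    # Normalize the statement so it ends with exactly one semicolon.
    return f"query T\n{sql.rstrip(';')};\n----\n{expected}\n"


print(to_slt("SELECT soundex('Miller')", "M460"))
```

In the script above this would be called as `to_slt(sql.replace(" AS result", ""), r[0].result)` inside the loop, assuming the alias is not wanted in the SLT file.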



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


