Dianshu Liao created CODEC-330:
----------------------------------

             Summary: org.apache.commons.codec.language.DaitchMokotoffSoundex.cleanup(String) does not remove special characters (e.g., punctuation)
                 Key: CODEC-330
                 URL: https://issues.apache.org/jira/browse/CODEC-330
             Project: Commons Codec
          Issue Type: Bug
    Affects Versions: 1.18.1
         Environment: JDK 8, MacOS
            Reporter: Dianshu Liao
         Attachments: Screenshot 2025-05-19 at 1.01.11 am.png

File: org.apache.commons.codec.language.DaitchMokotoffSoundex
Method: private String cleanup(String input)

h1. Problem

The private method "private String cleanup(final String input)" in DaitchMokotoffSoundex is intended to sanitize the input string before the phonetic transformation is applied. The implementation does not remove special characters such as !, @, #, or digits. These characters are preserved in the cleaned string, which can lead to incorrect or unexpected phonetic results.

h1. Test Code

package org.apache.commons.codec.language;

import org.junit.Test;

import java.lang.reflect.Method;

import static org.junit.Assert.assertEquals;

public class language_DaitchMokotoffSoundex_cleanup_Test {

    // Declares "throws Exception" so that reflection failures fail the test
    // instead of being caught and only printed.
    @Test(timeout = 4000)
    public void testCleanup() throws Exception {
        // Instantiate the class
        DaitchMokotoffSoundex soundex = new DaitchMokotoffSoundex();

        // Access the private method using reflection
        Method cleanupMethod = DaitchMokotoffSoundex.class.getDeclaredMethod("cleanup", String.class);
        cleanupMethod.setAccessible(true);

        // Test input with whitespace
        String input = " Hello World ";
        String expectedOutput = "helloworld";
        String actualOutput = (String) cleanupMethod.invoke(soundex, input);
        assertEquals(expectedOutput, actualOutput);

        // Test input with special characters
        // (this assertion fails: the actual output is "te$t!@#")
        input = "Te$t!@#";
        expectedOutput = "test";
        actualOutput = (String) cleanupMethod.invoke(soundex, input);
        assertEquals(expectedOutput, actualOutput);
    }
}

h1. Expected Result

All non-letter characters (e.g., !, @, #, digits) should be removed as part of the cleanup process to ensure reliable phonetic encoding.

h1. Actual Result

Special characters are preserved. For example:

"Te$t!@#" -> "te$t!@#"

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
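
The expected result described in this report can be sketched as a small standalone helper. This is only an illustration of the requested behaviour, not the Commons Codec implementation: the class name CleanupSketch and the method cleanupLettersOnly are hypothetical, and the sketch deliberately leaves out any other processing the real cleanup method may perform.

```java
public class CleanupSketch {

    /**
     * Hypothetical helper (not part of Commons Codec): drops every character
     * that is not a letter, which also removes whitespace, punctuation and
     * digits, and lowercases the letters that remain.
     */
    public static String cleanupLettersOnly(final String input) {
        final StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            final char ch = input.charAt(i);
            // Character.isLetter(char) works per UTF-16 unit; supplementary
            // characters (surrogate pairs) are simply dropped by this sketch.
            if (Character.isLetter(ch)) {
                sb.append(Character.toLowerCase(ch));
            }
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        // The two inputs used in the test code of this report.
        System.out.println(cleanupLettersOnly(" Hello World "));
        System.out.println(cleanupLettersOnly("Te$t!@#"));
    }
}
```

With the two inputs from the test code, this helper returns "helloworld" and "test". Whether stripping non-letters belongs in cleanup itself, or should be left to callers before encoding, is a design decision for the maintainers.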