[ https://issues.apache.org/jira/browse/CODEC-330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17952523#comment-17952523 ]
Dianshu Liao commented on CODEC-330:
------------------------------------

Hi [~ggregory],

Thanks for your guidance!

I've reviewed how the {{cleanup}} method is used in the project. It is invoked by the {{soundex(final String source, final boolean branching)}} method in {{org/apache/commons/codec/language/DaitchMokotoffSoundex.java}}, specifically at this line:

    final String input = cleanup(source);

The {{cleanup}} method currently removes whitespace, lowercases the input, and performs ASCII folding as expected. For example:
* {{"Hello World"}} → {{"helloworld"}}
* {{"HeLLo WoRLD"}} → {{"helloworld"}}
* {{"JaVa PrOgRaMmInG"}} → {{"javaprogramming"}}

However, I found that {{cleanup}} does not handle unexpected special characters that appear between letters. For example:
* {{"Hello$World"}} → {{"hello$world"}}

In this case, the dollar sign ({{$}}) remains in the cleaned string, which is undesirable. It should be removed so that the result becomes {{"helloworld"}}. Otherwise, it may interfere with Soundex rule matching or cause inconsistent results.

I believe this can be addressed with a small change to the {{cleanup}} method by adding a check to skip non-letter characters:

    if (!Character.isLetter(ch)) {
        continue; // Ignore non-letter characters like $, @, -, etc.
    }

This enhancement would improve the robustness of the {{DaitchMokotoffSoundex}} implementation, especially when dealing with user input that may contain accidental non-alphabetic characters.

As per your earlier recommendation, I'm now using JUnit 5 to validate this behavior. With the {{Character.isLetter(ch)}} check added, the cases above are handled correctly and all tests pass. Without this change, special characters are left in the cleaned string, so the tests for inputs that contain them fail.

Thanks for considering this!
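For illustration, the proposed check could look roughly like the minimal sketch below. {{CleanupSketch}} is a hypothetical stand-alone class, not the actual Commons Codec source: it assumes the loop shape described in the comment above (iterate over characters, skip unwanted ones, lowercase the rest) and leaves out ASCII folding for brevity.

```java
// Hypothetical stand-in for the private DaitchMokotoffSoundex.cleanup(String).
// Assumption: ASCII folding is omitted; only filtering and lowercasing are shown.
final class CleanupSketch {

    static String cleanup(final String input) {
        final StringBuilder sb = new StringBuilder(input.length());
        for (final char ch : input.toCharArray()) {
            if (!Character.isLetter(ch)) {
                continue; // skips whitespace, digits, and punctuation such as $, @, -
            }
            sb.append(Character.toLowerCase(ch));
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        System.out.println(cleanup("Hello$World"));   // prints helloworld
        System.out.println(cleanup(" Te$t!@# 123 ")); // prints test
    }
}
```

Note that {{Character.isLetter}} also accepts non-ASCII letters (for example accented characters), so in this sketch such characters would still be kept and would reach any folding step that follows. Whether digits should be dropped as well is a design choice; here they are, matching the expected result stated in the issue.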
> org.apache.commons.codec.language.DaitchMokotoffSoundex.cleanup(String) does not remove special characters (e.g., punctuation)
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CODEC-330
>                 URL: https://issues.apache.org/jira/browse/CODEC-330
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.18.0
>         Environment: JDK 8, MacOS
>            Reporter: Dianshu Liao
>            Priority: Major
>         Attachments: Screenshot 2025-05-19 at 1.01.11 am.png
>
>
> File: org.apache.commons.codec.language.DaitchMokotoffSoundex
> Method: private String cleanup(String input)
>
> h1. Problem
> The private method "private String cleanup(final String input)" in
> DaitchMokotoffSoundex is intended to sanitize the input string before
> applying the actual phonetic transformation. The implementation does not
> remove any special characters such as !, @, #, or numbers. These characters
> are preserved in the cleaned string, which can lead to incorrect or
> unexpected phonetic results.
>
> h1. Test Code
> package org.apache.commons.codec.language;
>
> import org.apache.commons.codec.language.DaitchMokotoffSoundex;
> import org.junit.Test;
> import java.lang.reflect.Method;
> import static org.junit.Assert.assertEquals;
>
> public class language_DaitchMokotoffSoundex_cleanup_Test {
>     @Test(timeout = 4000)
>     public void testCleanup() {
>         try {
>             // Instantiate the class
>             DaitchMokotoffSoundex soundex = new DaitchMokotoffSoundex();
>             // Access the private method using reflection
>             Method cleanupMethod =
>                 DaitchMokotoffSoundex.class.getDeclaredMethod("cleanup", String.class);
>             cleanupMethod.setAccessible(true);
>             // Test input with whitespace
>             String input = " Hello World ";
>             String expectedOutput = "helloworld";
>             String actualOutput = (String) cleanupMethod.invoke(soundex, input);
>             assertEquals(expectedOutput, actualOutput);
>             // Test input with special characters
>             input = "Te$t!@#";
>             expectedOutput = "test";
>             actualOutput = (String) cleanupMethod.invoke(soundex, input);
>             assertEquals(expectedOutput, actualOutput);
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
>     }
> }
>
> h1. Expected Result
> All non-letter characters (e.g., !, @, #, digits) should be removed as part
> of the cleanup process to ensure reliable phonetic encoding.
>
> h1. Actual Result
> Special characters are preserved. For example "Te$t!@#" -> "te$t!@#"

--
This message was sent by Atlassian Jira
(v8.20.10#820010)