[ https://issues.apache.org/jira/browse/CODEC-330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17952590#comment-17952590 ]
Dianshu Liao edited comment on CODEC-330 at 5/19/25 1:00 PM:
-------------------------------------------------------------

Hello [~ggregory],

No, I'm not using AI to write these comments 😂. I just want to express potential issues as clearly and thoroughly as possible.

Yes, I can provide a failing unit test for this issue. I'm currently writing my tests manually using JUnit 5 with camelCase naming conventions.

Here's the test code:

{code:java}
package org.apache.commons.codec.language;

import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.*;

public class DMSoundexIndirectCleanupTest {

  @Test
  public void testCleanupRemovesSpecialCharacters() {
    // Input containing a special character
    String input = "Hello$World";
    // Expected: the special character is stripped and the letters are lowercased
    String expectedOutput = "helloworld";
    DaitchMokotoffSoundex soundex = new DaitchMokotoffSoundex();
    // The issue describes cleanup(String) as private; this call assumes it is accessible to the test
    String actualOutput = soundex.cleanup(input);
    assertEquals(expectedOutput, actualOutput);
  }
}
{code}

Currently, this test fails because the actual output is "hello$world", not "helloworld" as expected. As mentioned earlier, I believe this is because the cleanup method does not filter out non-letter characters. If I modify the method to include:

{code:java}
if (!Character.isLetter(ch)) {
  continue;
}
{code}

and run the test again, it passes, producing the correct result "helloworld".

I've also tried other inputs with special characters, such as "Te#st" and "He%llo", and observed the same issue. That's why I believe this is a bug in the current implementation.

Thanks again for your time!


was (Author: JIRAUSER309579):
Hello [~ggregory],

No, I'm not using AI to write these comments 😂. I just want to express potential issues as clearly and thoroughly as possible.

Yes, I can provide a failing unit test for this issue. I'm currently writing my tests manually using JUnit 5 with camelCase naming conventions.

Here's the test code:

{code:java}
package org.apache.commons.codec.language;

import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.*;

public class DMSoundexIndirectCleanupTest {

  @Test
  public void testCleanupRemovesSpecialCharacters() {
    // Input containing a special character
    String input = "He%llo";
    String expectedOutput = "hello";
    DaitchMokotoffSoundex soundex = new DaitchMokotoffSoundex();
    String actualOutput = soundex.cleanup(input);
    assertEquals(expectedOutput, actualOutput);
  }
}
{code}

Currently, this test fails because the actual output is "hello$world", not "helloworld" as expected. As mentioned earlier, I believe this is due to the fact that the cleanup method does not filter out non-letter characters. If I modify the method to include:

{code:java}
if (!Character.isLetter(ch)) {continue;}
{code}

and run the test again, it passes, producing the correct result "helloworld".

I've also tried other inputs with special characters, such as "Te#st" and "He%llo", and observed the same issue. That's why I believe this is a bug in the current implementation.

Thanks again for your time!


> org.apache.commons.codec.language.DaitchMokotoffSoundex.cleanup(String) does not remove special characters (e.g., punctuation)
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CODEC-330
>                 URL: https://issues.apache.org/jira/browse/CODEC-330
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.18.0
>         Environment: JDK 8, macOS
>            Reporter: Dianshu Liao
>            Priority: Major
>
> Method: org.apache.commons.codec.language.DaitchMokotoffSoundex.cleanup(String input)
>
> h1. Problem
>
> The private method {{cleanup(final String input)}} in {{DaitchMokotoffSoundex}} is responsible for sanitizing the input string before the phonetic encoding is applied.
> While it correctly removes whitespace and performs ASCII folding, it does *not* remove non-letter special characters such as {{$}}, {{@}}, {{#}}, {{!}}, or digits. These characters remain in the cleaned string.
> As a result, special characters may interfere with phonetic rule matching in downstream methods like {{soundex}} and {{encode}}, potentially leading to incorrect or inconsistent results.
> For example, cleanup("Hello$World") -> "hello$world"
> The dollar sign ({{$}}) should have been removed, but it remains in the result.
> The expected result should be "helloworld"
>
> h1. Suggested Fix
>
> Modify the {{cleanup()}} method to include a check for non-letter characters:
>
>   if (!Character.isLetter(ch)) {
>     continue; // Ignore non-letter characters like $, @, -, etc.
>   }
>
> This small change will make the method more robust when processing real-world input strings that may contain unexpected non-letter characters.
>
> h1. Additional Context
>
> This issue was identified during unit testing using JUnit 5. After applying the above fix, all test cases involving inputs with special characters pass successfully. Without this fix, the current implementation fails to process inputs containing unexpected special characters.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
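
To illustrate the check proposed in the issue, here is a minimal, self-contained sketch. It is not the Commons Codec implementation: the class {{CleanupSketch}} and its {{cleanup}} method are hypothetical stand-ins, and the ASCII folding step mentioned in the issue is left out. It only shows how a {{Character.isLetter}} filter behaves on the inputs from the comment.

```java
// Hypothetical stand-in for the cleanup step discussed in CODEC-330.
// Assumption: this is NOT the library code; ASCII folding is omitted.
final class CleanupSketch {

    static String cleanup(final String input) {
        final StringBuilder sb = new StringBuilder(input.length());
        for (final char ch : input.toCharArray()) {
            // Proposed check: skip anything that is not a letter.
            // Whitespace, punctuation and digits all fall into this case.
            if (!Character.isLetter(ch)) {
                continue;
            }
            sb.append(Character.toLowerCase(ch));
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        System.out.println(cleanup("Hello$World")); // helloworld
        System.out.println(cleanup("Te#st"));       // test
        System.out.println(cleanup("He%llo"));      // hello
    }
}
```

Note that a single {{Character.isLetter}} check also drops whitespace and digits, so whether digits should be removed is a design decision the sketch makes implicitly.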