[ https://issues.apache.org/jira/browse/CODEC-330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17952646#comment-17952646 ]
Gary D. Gregory edited comment on CODEC-330 at 5/19/25 2:56 PM:
----------------------------------------------------------------

Hello [~ilikecode]

Unfortunately the test above is irrelevant and the method is now private. If you think there is a bug in the Soundex produced by the class then please test for that, not the internals. Please see {{DaitchMokotoffSoundexTest}}.

> org.apache.commons.codec.language.DaitchMokotoffSoundex.cleanup(String) does not remove special characters (e.g., punctuation)
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CODEC-330
>                 URL: https://issues.apache.org/jira/browse/CODEC-330
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.18.0
>         Environment: JDK 8, MacOS
>            Reporter: Dianshu Liao
>            Priority: Major
>
> Method: org.apache.commons.codec.language.DaitchMokotoffSoundex.cleanup(String input)
>
> h1. Problem
>
> The private method {{cleanup(final String input)}} in {{DaitchMokotoffSoundex}} is responsible for sanitizing the input string before the phonetic encoding is applied. While it correctly removes whitespace and performs ASCII folding, it does *not* remove non-letter special characters such as {{$}}, {{@}}, {{#}}, {{!}}, or digits. These characters remain in the cleaned string.
> As a result, special characters may interfere with phonetic rule matching in downstream methods like {{soundex}} and {{encode}}, potentially leading to incorrect or inconsistent results.
> For example, cleanup("Hello$World") -> "hello$world"
> The dollar sign ({{$}}) should have been removed, but it remains in the result.
> The expected result should be "helloworld"
>
> h1. Suggested Fix
>
> Modify the {{cleanup()}} method to include a check for non-letter characters:
>
>     if (!Character.isLetter(ch)) {
>         continue; // Ignore non-letter characters like $, @, -, etc.
>     }
>
> This small change will make the method more robust when processing real-world input strings that may contain unexpected non-letter characters.
>
> h1. Additional Context
>
> This issue was identified during unit testing using JUnit 5. After applying the above fix, all test cases involving inputs with special characters pass successfully. Without this fix, the current implementation fails to process inputs containing unexpected special characters.
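
For reference, the behaviour the report asks for can be sketched in isolation. This is a minimal illustration only, not the Commons Codec implementation: the class {{CleanupSketch}} and its {{cleanup}} method are hypothetical stand-ins, and the ASCII folding that the real method is reported to perform is left out.

```java
final class CleanupSketch {

    private CleanupSketch() {
    }

    /**
     * Hypothetical stand-in for the private cleanup step described in the report.
     * Drops every non-letter character (whitespace, digits, punctuation) and
     * lower-cases the letters that remain. ASCII folding is intentionally omitted.
     */
    static String cleanup(final String input) {
        final StringBuilder sb = new StringBuilder(input.length());
        for (final char ch : input.toCharArray()) {
            if (!Character.isLetter(ch)) {
                continue; // skips whitespace, digits, and characters such as $, @, #, !
            }
            sb.append(Character.toLowerCase(ch));
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        System.out.println(cleanup("Hello$World")); // prints "helloworld"
        System.out.println(cleanup("O'Brien 42"));  // prints "obrien"
    }
}
```

Because {{cleanup}} is private in the library, a sketch like this only shows the intended filtering. Whether special characters actually change the encoded result would still need to be checked through the public {{soundex}} and {{encode}} methods, as suggested in the comment above.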