[ 
https://issues.apache.org/jira/browse/CODEC-330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17952646#comment-17952646
 ] 

Gary D. Gregory edited comment on CODEC-330 at 5/19/25 2:56 PM:
----------------------------------------------------------------

Hello [~ilikecode]

Unfortunately the test above is irrelevant and the method is now private. If 
you think there is a bug in the Soundex produced by the class then please test 
for that, not the internals. Please see {{DaitchMokotoffSoundexTest}}.



> org.apache.commons.codec.language.DaitchMokotoffSoundex.cleanup(String) does 
> not remove special characters (e.g., punctuation)
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CODEC-330
>                 URL: https://issues.apache.org/jira/browse/CODEC-330
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.18.0
>         Environment: JDK 8, MacOS
>            Reporter: Dianshu Liao
>            Priority: Major
>
> Method: 
> org.apache.commons.codec.language.DaitchMokotoffSoundex.cleanup(String input)
>  
> h1. Problem
>  
> The private method {{cleanup(final String input)}} in 
> {{DaitchMokotoffSoundex}} is responsible for sanitizing the input string 
> before the phonetic encoding is applied. While it correctly removes 
> whitespace and performs ASCII folding, it does *not* remove non-letter 
> special characters such as {{$}}, {{@}}, {{#}}, {{!}}, or 
> digits. These characters remain in the cleaned string.
> As a result, special characters may interfere with phonetic rule matching in 
> downstream methods like {{soundex}} and {{encode}}, potentially 
> leading to incorrect or inconsistent results.
> For example, {{cleanup("Hello$World")}} returns {{"hello$world"}}.
> The dollar sign ({{$}}) should have been removed, but it remains in the 
> result.
> The expected result is {{"helloworld"}}.
>  
>  
> h1. Suggested Fix
>  
> Modify the {{cleanup()}} method to include a check for non-letter characters:
> if (!Character.isLetter(ch)) {
>     continue; // Ignore non-letter characters like $, @, -, etc.
> }
> This small change will make the method more robust when processing real-world 
> input strings that may contain unexpected non-letter characters.
>  
>  
> h1. Additional Context
>  
> This issue was identified during unit testing using JUnit 5. After applying 
> the above fix, all test cases involving inputs with special characters pass 
> successfully. Without this fix, the current implementation fails to process 
> inputs containing unexpected special characters.
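
For readers following the quoted report, here is a minimal, self-contained sketch of the letter-only filtering it proposes. This is not the Commons Codec implementation: the class name {{CleanupSketch}} and its {{cleanup}} method are hypothetical, and the sketch only lower-cases and filters characters. It does not reproduce the ASCII folding that the report says the real private method performs.

```java
// Hypothetical stand-in for the behaviour proposed in the report.
// NOT the Commons Codec source; names and details are assumptions.
final class CleanupSketch {

    // Keeps only letters and lower-cases them. Whitespace, digits and
    // punctuation are all dropped because none of them is a letter.
    static String cleanup(final String input) {
        final StringBuilder sb = new StringBuilder(input.length());
        for (final char ch : input.toCharArray()) {
            if (!Character.isLetter(ch)) {
                continue; // Ignore non-letter characters like $, @, -, etc.
            }
            sb.append(Character.toLowerCase(ch));
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        System.out.println(cleanup("Hello$World")); // prints "helloworld"
    }
}
```

Note that {{Character.isLetter}} also accepts non-ASCII letters, so a filter like this would still need some form of folding for accented input. Whether the encoder's public output is affected at all by such characters is a separate question, and one best checked through the public {{soundex}} and {{encode}} methods, as the comment above suggests.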



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
