[ 
https://issues.apache.org/jira/browse/CODEC-330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17952523#comment-17952523
 ] 

Dianshu Liao commented on CODEC-330:
------------------------------------

Hi [~ggregory],

Thanks for your guidance!

I've reviewed how the {{cleanup}} method is used in the project.

It is invoked by the {{soundex(final String source, final boolean branching)}} 
method in {{org/apache/commons/codec/language/DaitchMokotoffSoundex.java}}, 
specifically at this line: {{final String input = cleanup(source);}}

The {{cleanup}} method currently removes whitespace, lowercases the input, 
and performs ASCII folding as expected. For example:
 * {{"Hello World"}} → {{"helloworld"}}

 * {{"HeLLo WoRLD"}} → {{"helloworld"}}

 * {{"JaVa PrOgRaMmInG"}} → {{"javaprogramming"}}

However, I found that {{cleanup}} does not handle unexpected special 
characters that appear between letters. For example:
 * {{"Hello$World"}} → {{"hello$world"}}

In this case, the dollar sign ({{$}}) remains in the cleaned string, which 
is undesirable. It should be removed so that the result becomes 
{{"helloworld"}}. Otherwise, it may interfere with Soundex rule matching or 
cause inconsistent results.

I believe this can be addressed with a small change to the {{cleanup}} 
method by adding a check to skip non-letter characters:

if (!Character.isLetter(ch)) {
    continue; // Ignore non-letter characters like $, @, -, etc.
}
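
To make the proposed change concrete, here is a minimal, self-contained sketch of how the check could sit inside a cleanup-style loop. It is a simplification and not the actual Commons Codec source: the class name CleanupSketch and both method names are hypothetical, and the ASCII-folding step is left out for brevity.

```java
public class CleanupSketch {

    // Hypothetical stand-in for the behavior described above:
    // drop whitespace and lowercase everything else (folding omitted).
    static String cleanupCurrent(final String input) {
        final StringBuilder sb = new StringBuilder();
        for (final char ch : input.toCharArray()) {
            if (Character.isWhitespace(ch)) {
                continue;
            }
            sb.append(Character.toLowerCase(ch));
        }
        return sb.toString();
    }

    // Same loop with the proposed check: anything that is not a letter is skipped.
    static String cleanupProposed(final String input) {
        final StringBuilder sb = new StringBuilder();
        for (final char ch : input.toCharArray()) {
            if (!Character.isLetter(ch)) {
                continue; // Ignore whitespace, digits, and symbols such as $, @, -
            }
            sb.append(Character.toLowerCase(ch));
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        System.out.println(cleanupCurrent("Hello$World"));  // hello$world
        System.out.println(cleanupProposed("Hello$World")); // helloworld
        System.out.println(cleanupProposed("Te$t!@#"));     // test
    }
}
```

Because {{Character.isLetter}} already rejects whitespace, the sketch does not need a separate whitespace check in the proposed variant. Whether the real method should keep both checks is a design choice for the maintainers.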

 

This enhancement would improve the robustness of the 
{{DaitchMokotoffSoundex}} implementation, especially when dealing with 
user input that may contain accidental non-alphabetic characters.
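
One point worth checking before adopting the filter is what {{Character.isLetter}} actually accepts. It is Unicode-aware, so accented letters pass through (and would still reach the folding step), while digits, punctuation, and whitespace are rejected. The small check below illustrates this; the class and method names are hypothetical and exist only for this example.

```java
public class IsLetterCheck {

    // Keeps only the characters that Character.isLetter accepts.
    static String lettersOnly(final String input) {
        final StringBuilder sb = new StringBuilder();
        for (final char ch : input.toCharArray()) {
            if (Character.isLetter(ch)) {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        // Apostrophe, hyphen, space, and the digit are dropped.
        System.out.println(lettersOnly("O'Brien-Smith 3rd")); // OBrienSmithrd
        // \u00fc (u with diaeresis) is a letter and is kept; the trailing $ is dropped.
        System.out.println(lettersOnly("M\u00fcller$").equals("M\u00fcller")); // true
    }
}
```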

As per your earlier recommendation, I’m now using JUnit 5 to validate this 
behavior. With the {{Character.isLetter(ch)}} check added, the edge cases I 
tested are handled correctly and all tests pass. Without this change, the 
current implementation leaves unexpected special characters in the cleaned 
string.

Thanks for considering this!

 

> org.apache.commons.codec.language.DaitchMokotoffSoundex.cleanup(String) does 
> not remove special characters (e.g., punctuation)
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CODEC-330
>                 URL: https://issues.apache.org/jira/browse/CODEC-330
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.18.0
>         Environment: JDK 8, MacOS
>            Reporter: Dianshu Liao
>            Priority: Major
>         Attachments: Screenshot 2025-05-19 at 1.01.11 am.png
>
>
> File: org.apache.commons.codec.language.DaitchMokotoffSoundex
> Method: private String cleanup(String input)
> h1. Problem
> The private method "private String cleanup(final String input)" in 
> DaitchMokotoffSoundex is intended to sanitize the input string before 
> applying the actual phonetic transformation. However, the implementation 
> does not remove special characters such as !, @, #, or digits. These 
> characters are preserved in the cleaned string, which can lead to incorrect 
> or unexpected phonetic results.
>  
> h1. Test Code
> package org.apache.commons.codec.language;
>
> import static org.junit.Assert.assertEquals;
>
> import java.lang.reflect.Method;
>
> import org.junit.Test;
>
> public class language_DaitchMokotoffSoundex_cleanup_Test {
>
>     // Declares "throws Exception" so that reflection failures fail the test
>     // instead of being printed and silently ignored.
>     @Test(timeout = 4000)
>     public void testCleanup() throws Exception {
>         // Instantiate the class
>         DaitchMokotoffSoundex soundex = new DaitchMokotoffSoundex();
>         // Access the private method using reflection
>         Method cleanupMethod =
>                 DaitchMokotoffSoundex.class.getDeclaredMethod("cleanup", String.class);
>         cleanupMethod.setAccessible(true);
>         // Test input with whitespace
>         String input = "  Hello World  ";
>         String expectedOutput = "helloworld";
>         String actualOutput = (String) cleanupMethod.invoke(soundex, input);
>         assertEquals(expectedOutput, actualOutput);
>         // Test input with special characters
>         input = "Te$t!@#";
>         expectedOutput = "test";
>         actualOutput = (String) cleanupMethod.invoke(soundex, input);
>         assertEquals(expectedOutput, actualOutput);
>     }
> }
> h1. Expected Result
> All non-letter characters (e.g., !, @, #, digits) should be removed as part 
> of the cleanup process to ensure reliable phonetic encoding.
> h1. Actual Result
> Special characters are preserved. For example, "Te$t!@#" -> "te$t!@#".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
