[jira] [Comment Edited] (CODEC-330) org.apache.commons.codec.language.DaitchMokotoffSoundex.cleanup(String) does not remove special characters (e.g., punctuation)

Dianshu Liao (Jira) Mon, 19 May 2025 06:08:05 -0700


    [ 
https://issues.apache.org/jira/browse/CODEC-330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17952590#comment-17952590
 ]


Dianshu Liao edited comment on CODEC-330 at 5/19/25 1:07 PM:
-------------------------------------------------------------

Hello [~ggregory],

No, I'm not using AI to write these comments 😂 — I just want to express 
potential issues as clearly and thoroughly as possible.

Yes, I can provide a failing unit test for this issue.

I'm currently writing my tests manually using JUnit 5 with camelCase naming 
conventions. Here's the test code:

 
{code:java}
package org.apache.commons.codec.language;

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

public class DMSoundexIndirectCleanupTest {

    // Note: cleanup(String) is private (see the issue description), so this direct
    // call assumes the method has been made accessible to the test.
    @Test
    public void testCleanupRemovesSpecialCharacters() {
        // Input containing a special character
        final String input = "Hello$World";
        // The '$' should be dropped, leaving only lowercased letters
        final String expectedOutput = "helloworld";
        final DaitchMokotoffSoundex soundex = new DaitchMokotoffSoundex();
        final String actualOutput = soundex.cleanup(input);
        assertEquals(expectedOutput, actualOutput);
    }
}
  {code}
 

 

Currently, this test fails because the actual output is "hello$world", not 
"helloworld" as expected.

As mentioned earlier, I believe this is because the cleanup method does not 
filter out non-letter characters. The fix I propose is to add the following 
check inside the loop:

{code:java}
if (!Character.isLetter(ch)) {
    continue;
}
{code}

With this change, the body of "cleanup" becomes:

{code:java}
final StringBuilder sb = new StringBuilder();
for (char ch : input.toCharArray()) {
    if (Character.isWhitespace(ch)) {
        continue;
    }
    // Added: skip anything that is not a letter (punctuation, symbols, digits)
    if (!Character.isLetter(ch)) {
        continue;
    }
    ch = Character.toLowerCase(ch);
    final Character character = FOLDINGS.get(ch);
    if (folding && character != null) {
        ch = character;
    }
    sb.append(ch);
}
return sb.toString();
{code}

When I run the test again with this change applied, it passes, producing the 
correct result "helloworld".

I've also tried other inputs with special characters, such as "Te#st" and 
"He%llo", and observed the same issue. That's why I believe this is a bug in 
the current implementation.
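
For reference, the filtering described above can be tried outside the library with a minimal standalone sketch. The class name CleanupSketch, the explicit folding parameter, and the two-entry FOLDINGS map are hypothetical stand-ins chosen for illustration; they are not the Commons Codec source, and the library's real folding table is different.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the cleanup logic discussed above; not the library source.
final class CleanupSketch {

    // Assumed, tiny folding table for illustration only.
    private static final Map<Character, Character> FOLDINGS = new HashMap<>();
    static {
        FOLDINGS.put('\u00e9', 'e'); // e with acute accent -> e
        FOLDINGS.put('\u00f6', 'o'); // o with diaeresis -> o
    }

    public static String cleanup(final String input, final boolean folding) {
        final StringBuilder sb = new StringBuilder();
        for (char ch : input.toCharArray()) {
            // Character.isLetter is false for whitespace, digits and punctuation,
            // so this single check also covers the whitespace case.
            if (!Character.isLetter(ch)) {
                continue;
            }
            ch = Character.toLowerCase(ch);
            final Character folded = FOLDINGS.get(ch);
            if (folding && folded != null) {
                ch = folded;
            }
            sb.append(ch);
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        System.out.println(cleanup("Hello$World", true)); // prints "helloworld"
        System.out.println(cleanup("Te#st", true));       // prints "test"
        System.out.println(cleanup("He%llo", true));      // prints "hello"
    }
}
```

Since whitespace is never a letter, the separate Character.isWhitespace check becomes redundant once the isLetter check is present; the sketch drops it for brevity.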

Thanks again for your time!

 



> org.apache.commons.codec.language.DaitchMokotoffSoundex.cleanup(String) does 
> not remove special characters (e.g., punctuation)
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CODEC-330
>                 URL: https://issues.apache.org/jira/browse/CODEC-330
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.18.0
>         Environment: JDK 8, MacOS
>            Reporter: Dianshu Liao
>            Priority: Major
>
> Method: 
> org.apache.commons.codec.language.DaitchMokotoffSoundex.cleanup(String input)
>  
> h1. Problem
>  
> The private method {{cleanup(final String input)}} in 
> {{DaitchMokotoffSoundex}} is responsible for sanitizing the input string 
> before the phonetic encoding is applied. While it correctly removes 
> whitespace and performs ASCII folding, it does *not* remove non-letter 
> special characters such as {{$}}, {{@}}, {{#}}, {{!}}, or digits. These 
> characters remain in the cleaned string.
> As a result, special characters may interfere with phonetic rule matching in 
> downstream methods like {{soundex}} and {{encode}}, potentially leading to 
> incorrect or inconsistent results.
> For example, cleanup("Hello$World") -> "hello$world"
> The dollar sign ({{$}}) should have been removed, but it remains in the 
> result.
> The expected result is "helloworld".
>  
>  
> h1. Suggested Fix
>  
> Modify the {{cleanup()}} method to include a check for non-letter characters:
> if (!Character.isLetter(ch)) {
>     continue; // Ignore non-letter characters like $, @, -, etc.
> }
> This small change will make the method more robust when processing real-world 
> input strings that may contain unexpected non-letter characters.
>  
>  
> h1. Additional Context
>  
> This issue was identified during unit testing using JUnit 5. After applying 
> the above fix, all test cases involving inputs with special characters pass 
> successfully. Without this fix, the current implementation fails to process 
> inputs containing unexpected special characters.
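
One consequence of the suggested check is worth making concrete. Character.isLetter is Unicode-aware, so it keeps accented letters (which the folding step can still normalize) but drops digits, apostrophes, and hyphens along with symbols such as $. The sketch below only exercises that JDK method; the class name IsLetterProbe and the helper lettersOnly are hypothetical and exist only for this illustration.

```java
// Hypothetical probe class; it only calls java.lang.Character from the JDK.
final class IsLetterProbe {

    // Returns the input with every char removed for which Character.isLetter is false.
    public static String lettersOnly(final String input) {
        final StringBuilder sb = new StringBuilder();
        for (final char ch : input.toCharArray()) {
            if (Character.isLetter(ch)) {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        System.out.println(lettersOnly("Hello$World"));     // HelloWorld
        System.out.println(lettersOnly("O'Brien"));         // OBrien
        System.out.println(lettersOnly("Smith-Jones 3rd")); // SmithJonesrd
        // The u with diaeresis is a letter, so it is kept unchanged.
        System.out.println(lettersOnly("M\u00fcller").length()); // 6
    }
}
```

Whether names containing apostrophes or hyphens should be merged this way is a design decision for the maintainers; the sketch only shows what the check would do.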



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
