[ 
https://issues.apache.org/jira/browse/CODEC-331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dianshu Liao updated CODEC-331:
-------------------------------
    Description: 
Component: org.apache.commons.codec.language.bm.Rule

Method: private static PhonemeExpr parsePhonemeExpr(String ph)

 
h1. Problem

When the input string is *(()|)*, the method *parsePhonemeExpr(String)* first 
strips the outer parentheses, producing:
*body = "()|"*
Then it executes *body.split("[|]")*
Because String.split with the default limit discards trailing empty strings, 
the empty string after the {*}|{*} is dropped, resulting in *["()"]*
To compensate for this, the following logic is used:
if (body.startsWith("|") || body.endsWith("|")) {
    phs.add(new Phoneme("", Languages.ANY_LANGUAGE));
}

However, the *"()"* entry already results in a *Phoneme("")* when parsed.
As a result, the list ends up containing two empty phonemes, which seems 
unintended.
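
The split behavior described above can be reproduced in isolation. The following is a minimal sketch using plain strings only; the class and method names are hypothetical and are not part of Commons Codec, and the step that turns each part into a Phoneme is deliberately not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of Commons Codec.
public class SplitTrailingEmptyDemo {

    // Default split (limit 0): trailing empty strings are removed.
    static String[] splitDefault(String body) {
        return body.split("[|]");
    }

    // Negative limit: trailing empty strings are kept.
    static String[] splitKeepTrailing(String body) {
        return body.split("[|]", -1);
    }

    // Mirrors the compensation quoted in this report, with an empty
    // String standing in for the manually added empty phoneme.
    static List<String> partsWithCompensation(String body) {
        List<String> parts = new ArrayList<>(Arrays.asList(splitDefault(body)));
        if (body.startsWith("|") || body.endsWith("|")) {
            parts.add("");
        }
        return parts;
    }

    public static void main(String[] args) {
        String body = "()|";
        System.out.println(Arrays.toString(splitDefault(body)));      // [()]
        System.out.println(Arrays.toString(splitKeepTrailing(body))); // [(), ]
        System.out.println(partsWithCompensation(body).size());       // 2
    }
}
```

For the body "()|" the compensated list therefore holds two entries: the "()" part returned by split and the manually added empty entry.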
h1. Expected Result

Only one empty phoneme should be added for (()|).

 
h1. Actual Result

 

Two empty phonemes are returned:
 - One from parsing "()"

 - One manually added due to .endsWith("|")

 

  was:
Component: org.apache.commons.codec.language.bm.Rule

Method: private static PhonemeExpr parsePhonemeExpr(String ph)

 
h1. Problem

When the input string is *(()|)*

The method *parsePhonemeExpr(String)* first strips the parentheses, producing: 
*body = "()|"*
Then it executes *body.split("[|]")*
Due to Java's default behavior, the trailing empty string (after the {*}|{*}) 
is discarded, resulting in *["()"]*
To compensate for this, the following logic is used:
if (body.startsWith("|") || body.endsWith("|")) {
    phs.add(new Phoneme("", Languages.ANY_LANGUAGE));
}
However, the *"()"* entry already results in a *Phoneme("")* when parsed.
As a result, the list ends up containing two empty phonemes, which seems 
unintended.

 
h1. Test Code

 

package org.apache.commons.codec.language.bm;

import static org.junit.Assert.assertEquals;

import java.lang.reflect.Method;

import org.apache.commons.codec.language.bm.Rule.PhonemeExpr;
import org.apache.commons.codec.language.bm.Rule.PhonemeList;
import org.junit.Test;

public class language_bm_Rule_parsePhonemeExpr_Test {

    @Test(timeout = 4000)
    public void testParsePhonemeExpr_withEmptyBracketedInput() throws Exception {
        String input = "(()|)";

        // parsePhonemeExpr is private static, so it is invoked via reflection.
        Method method = Rule.class.getDeclaredMethod("parsePhonemeExpr", String.class);
        method.setAccessible(true);
        PhonemeExpr result = (PhonemeExpr) method.invoke(null, input);

        // The test expects the bracketed input to come back as a PhonemeList.
        PhonemeList phonemeList = (PhonemeList) result;
        assertEquals(1, phonemeList.size()); // Expecting one empty phoneme
    }

}

 
h1. Expected Result



Only one empty phoneme should be added for (()|).

 
h1. Actual Result

 

Two empty phonemes are returned:

- One from parsing "()"

- One manually added due to .endsWith("|")

 


> org.apache.commons.codec.language.bm.Rule.parsePhonemeExpr(String) adds 
> duplicate empty phoneme when input ends with |
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: CODEC-331
>                 URL: https://issues.apache.org/jira/browse/CODEC-331
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.18.0
>         Environment: Affected Version: 1.18.1 (I found this version from my 
> pom.xml)
> MacOS
> JDK 8
>            Reporter: Dianshu Liao
>            Priority: Major
>         Attachments: Screenshot 2025-05-19 at 8.11.02 am.png
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
