[ https://issues.apache.org/jira/browse/CODEC-331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dianshu Liao updated CODEC-331:
-------------------------------
    Description:
Component: org.apache.commons.codec.language.bm.Rule
Method: private static PhonemeExpr parsePhonemeExpr(String ph)

h1. Problem

When the input string is *(()|)*
The method *parsePhonemeExpr(String)* first strips the parentheses, producing: *body = "()|"*
Then it executes *body.split("[|]")*
Due to Java's default behavior, the trailing empty string (after the {*}|{*}) is discarded, resulting in *["()"]*
To compensate for this, the following logic is used:
if (body.startsWith("|") || body.endsWith("|")) { phs.add(new Phoneme("", Languages.ANY_LANGUAGE)); }
However, the *"()"* entry already results in a *Phoneme("")* when parsed.
As a result, the list ends up containing two empty phonemes, which seems unintended.

h1. Expected Result

Only one empty phoneme should be added for (()|).

h1. Actual Result

Two empty phonemes are returned:
- One from parsing "()"
- One manually added due to .endsWith("|")

  was:
Component: org.apache.commons.codec.language.bm.Rule
Method: private static PhonemeExpr parsePhonemeExpr(String ph)

h1. Problem

When the input string is *(()|)*
The method *parsePhonemeExpr(String)* first strips the parentheses, producing: *body = "()|"*
Then it executes *body.split("[|]")*
Due to Java's default behavior, the trailing empty string (after the {*}|{*}) is discarded, resulting in *["()"]*
To compensate for this, the following logic is used:
if (body.startsWith("|") || body.endsWith("|")) { phs.add(new Phoneme("", Languages.ANY_LANGUAGE)); }
However, the *"()"* entry already results in a *Phoneme("")* when parsed.
As a result, the list ends up containing two empty phonemes, which seems unintended.

h1. Test Code

package org.apache.commons.codec.language.bm;

import org.apache.commons.codec.language.bm.Rule;
import org.apache.commons.codec.language.bm.Rule.Phoneme;
import org.apache.commons.codec.language.bm.Rule.PhonemeExpr;
import org.apache.commons.codec.language.bm.Rule.PhonemeList;
import org.apache.commons.codec.language.bm.Languages;
import org.junit.Test;
import static org.junit.Assert.*;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class language_bm_Rule_parsePhonemeExpr_Test {

    @Test(timeout = 4000)
    public void testParsePhonemeExpr_withEmptyBracketedInput() {
        String input = "(()|)";
        try {
            Method method = Rule.class.getDeclaredMethod("parsePhonemeExpr", String.class);
            method.setAccessible(true);
            PhonemeExpr result = (PhonemeExpr) method.invoke(null, input);
            PhonemeList phonemeList = (PhonemeList) result;
            assertEquals(1, phonemeList.size()); // Expecting one empty phoneme
        } catch (Exception e) {
            fail("Exception should not have been thrown: " + e.getMessage());
        }
    }
}

h1. Expected Result

Only one empty phoneme should be added for (()|).

h1. Actual Result

Two empty phonemes are returned:
- One from parsing "()"
- One manually added due to .endsWith("|")

> org.apache.commons.codec.language.bm.Rule.parsePhonemeExpr(String) adds duplicate empty phoneme when input ends with |
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: CODEC-331
>                 URL: https://issues.apache.org/jira/browse/CODEC-331
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.18.0
>         Environment: Affected Version: 1.18.1 (I found this version from my pom.xml)
> MacOS
> JDK 8
>            Reporter: Dianshu Liao
>            Priority: Major
>         Attachments: Screenshot 2025-05-19 at 8.11.02 am.png

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
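The split behavior described in the Problem section can be checked in isolation, without commons-codec on the classpath. The sketch below only exercises java.lang.String.split on the body "()|"; the class and method names (SplitTrailingEmptyDemo, splitDefault, splitKeepEmpty) are made up for illustration and do not exist in Rule. It makes no claim about what parsePhonemeExpr does with each resulting part.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of commons-codec.
final class SplitTrailingEmptyDemo {

    // One-argument split behaves like split(regex, 0):
    // trailing empty strings are removed from the result.
    static String[] splitDefault(String body) {
        return body.split("[|]");
    }

    // A negative limit keeps trailing empty strings in the result.
    static String[] splitKeepEmpty(String body) {
        return body.split("[|]", -1);
    }

    public static void main(String[] args) {
        String body = "()|"; // "(()|)" with the outer parentheses stripped

        // prints [()]
        System.out.println(Arrays.toString(splitDefault(body)));

        // prints [(), ] (second element is the empty string)
        System.out.println(Arrays.toString(splitKeepEmpty(body)));
    }
}
```

With the default call the part after the trailing | is dropped, which is the situation the startsWith/endsWith check quoted above compensates for. Whether switching to a negative limit would be an acceptable fix for Rule is a separate question that this sketch does not settle.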