Hi Amey,

You created a byte array from the original string (which may contain surrogate 
chars). But then you created a copy string with `final String copy = new 
String(bytes, charset);`. That round trip encodes to UTF-8 and decodes back, and 
the encoding step may fail for some values (an unpaired surrogate, for 
instance), leading to the error you reported, I suspect.

If you try `final String copy = new String(bytes);` instead, the bytes will 
still be decoded, just with the default system charset.
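
To make the round-trip problem concrete, here is a minimal sketch. The class and 
method names are mine (hypothetical, not from [text] or the original test), and 
it assumes the failure comes from `substring` leaving a lone surrogate, which I 
have not confirmed for every failing run. `String.getBytes(Charset)` replaces 
input it cannot encode with the charset's replacement bytes, so a lone surrogate 
does not survive UTF-8:

```java
import java.nio.charset.StandardCharsets;

class SurrogateRoundTrip {

    // Encodes to UTF-8 and decodes back, as the original test does.
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String pair = "\uD83D\uDE00";      // one code point stored as two chars
        String cut = pair.substring(0, 1); // lone high surrogate, like a cut pair

        System.out.println(roundTrip(pair).equals(pair)); // true
        System.out.println(roundTrip(cut).equals(cut));   // false
    }
}
```

The complete pair comes back unchanged, while the cut string comes back as a 
different character, which is what a char-by-char comparison would then report.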

So I think the safest is to compare codepoints. Perhaps with something like 
this:

    @Test
    public void testSubStringWithSurrogatePair() {
        for (int j = 0; j < 10; j++) {
            final int size = 5000;
            RandomStringGenerator generator = new RandomStringGenerator.Builder().build();
            String orig = generator.generate(size).substring(0, 2500);
            final String copy = new String(orig);
            for (int i = 0; i < orig.length() && i < copy.length(); i++) {
                final int o = orig.codePointAt(i);
                final int c = copy.codePointAt(i);
                assertEquals(String.format("Differs where j = %d, i = %d, o = %d, and c = %d", j, i, o, c), o, c);
            }
        }
    }

Running it 10 times, I was able to consistently reproduce the initial issue: it 
failed about 4 times out of 10. I think I remember seeing unit tests in [rng], 
or in another Commons component, that use loops for randomly generated values, 
but I may be mistaken (I don't trust my own memory), so feel free to leave that 
part out if you prefer. I tried the code above with j going up to 1000. After a 
few seconds, the test passed too.

With `final String copy = new String(orig);`, the value of the original string 
is copied in full into the new string, so comparing the code points should do 
the trick. We may even want to add another assert statement before the for loop 
to confirm both strings have the same length?
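
As a rough sketch of that idea without JUnit, the helper below (the name 
`firstDifference` is hypothetical) compares two strings code point by code 
point and also catches a length mismatch. Unlike the loop above, it advances by 
`Character.charCount`, so each surrogate pair is visited once rather than once 
per char:

```java
class CodePointCompare {

    // Returns the char index of the first differing code point,
    // or -1 if both strings are identical.
    static int firstDifference(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        int i = 0;
        while (i < limit) {
            int cpA = a.codePointAt(i);
            int cpB = b.codePointAt(i);
            if (cpA != cpB) {
                return i;
            }
            // Step over one char for BMP code points, two for surrogate pairs.
            i += Character.charCount(cpA);
        }
        // Same prefix: the strings differ only if one is longer.
        return a.length() == b.length() ? -1 : limit;
    }
}
```

In the test this could become something like 
`assertEquals(-1, CodePointCompare.firstDifference(orig, copy));`, preceded by 
an `assertEquals(orig.length(), copy.length());` if we want a clearer message 
for the length case.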

Hope that helps,
Bruno


________________________________


From: Amey Jadiye <ameyjad...@gmail.com>
To: Commons Developers List <dev@commons.apache.org> 
Sent: Monday, 11 September 2017 12:15 AM
Subject: [text] Invalid unicode sequences on .substring of RandomStringGenerator



Hi Folks,

While working on RandomStringGenerator I found that when I call .substring on a
generated random string, it fails intermittently on a sequence of surrogate
pairs.

The same bug was raised in commons-lang:
https://issues.apache.org/jira/browse/LANG-100

Is this a possible bug in RandomStringGenerator, or is this expected?


@Test
public void testSubStringWithSurrogatePair() {
    final int size = 5000;
    final Charset charset = Charset.forName("UTF-8");
    RandomStringGenerator generator = new RandomStringGenerator.Builder().build();
    String orig = generator.generate(size).substring(0,2500);

    final byte[] bytes = orig.getBytes(charset);
    final String copy = new String(bytes, charset);

    for (int i=0; i < orig.length() && i < copy.length(); i++) {
        final char o = orig.charAt(i);
        final char c = copy.charAt(i);
        assertEquals("differs at " + i + "(" + Integer.toHexString(new Character(o).hashCode()) + "," +
                Integer.toHexString(new Character(c).hashCode()) + ")", o, c);
    }
}


Regards,

Amey



