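Hi Amey,

You created a byte array from the original string (which may contain surrogate chars), and then created the copy with `final String copy = new String(bytes, charset);`. That is a round trip through UTF-8: `getBytes` encodes the string and `new String(bytes, charset)` decodes it again. Some values may fail to encode, for example an unpaired surrogate left behind when `substring` cuts a pair in half, and I suspect that is what leads to the error you reported.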
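To make the suspicion above concrete, here is a minimal sketch of a UTF-8 round trip that does not give back the original string. It assumes the cut lands in the middle of a surrogate pair, which may or may not be what happens in the failing runs. The class and method names (`SurrogateRoundTrip`, `roundTrip`) are made up for illustration and are not part of commons-text.

```java
import java.nio.charset.StandardCharsets;

final class SurrogateRoundTrip {

    /** Encodes s to UTF-8 and decodes it again, like the getBytes / new String pair in the test. */
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+1F600 is a supplementary code point, stored as two chars (a surrogate pair).
        String whole = "a" + new String(Character.toChars(0x1F600));
        // Assumption for this sketch: the cut falls between the two halves of the pair,
        // so only the high surrogate survives.
        String cut = whole.substring(0, 2);

        System.out.println(roundTrip(whole).equals(whole));           // true
        System.out.println(roundTrip(cut).equals(cut));               // false
        System.out.println(Character.isHighSurrogate(cut.charAt(1))); // true
    }
}
```

`String.getBytes(Charset)` replaces input it cannot encode with the charset's replacement bytes, so the lone high surrogate comes back as `?` and the char-by-char comparison differs at that index.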
If you try `final String copy = new String(bytes);` instead, the bytes are still decoded, this time with the default system charset. So I think the safest approach is to compare code points. Perhaps with something like this:

    @Test
    public void testSubStringWithSurrogatePair() {
        for (int j = 0; j < 10; j++) {
            final int size = 5000;
            RandomStringGenerator generator = new RandomStringGenerator.Builder().build();
            String orig = generator.generate(size).substring(0, 2500);
            // Copy the string directly, with no byte encoding or decoding in between.
            final String copy = new String(orig);
            for (int i = 0; i < orig.length() && i < copy.length(); i++) {
                final int o = orig.codePointAt(i);
                final int c = copy.codePointAt(i);
                assertEquals(String.format("Differs where j = %d, i = %d, o = %d, and c = %d", j, i, o, c), o, c);
            }
        }
    }

Running the original test 10 times, I was able to reproduce the initial issue consistently: it failed about 4 times out of 10.

I vaguely remember seeing unit tests in [rng], or in another Commons component, that use loops for randomly generated values, but I may be mistaken (I don't trust my own memory). So feel free to leave the loop out if you prefer.

I also tried the code above with j going up to 1000. After a few seconds, the test passed too.

With `final String copy = new String(orig);` the value of the original string is copied in full into the new string, so comparing the code points should do the trick. We may even want to add another assert statement before the for loop to confirm that both strings have the same length?

Hope that helps,
Bruno

________________________________
From: Amey Jadiye <ameyjad...@gmail.com>
To: Commons Developers List <dev@commons.apache.org>
Sent: Monday, 11 September 2017 12:15 AM
Subject: [text] Invalid unicode sequences on .substring of RandomStringGenerator

Hi Folks,

While working on RandomStringGenerator I found that when I call .substring on a generated random string, it fails intermittently with a sequence of surrogate pairs.
The same bug was raised in commons-lang: https://issues.apache.org/jira/browse/LANG-100

Is this a possible bug in RandomStringGenerator, or is this expected?

    @Test
    public void testSubStringWithSurrogatePair() {
        final int size = 5000;
        final Charset charset = Charset.forName("UTF-8");
        RandomStringGenerator generator = new RandomStringGenerator.Builder().build();
        String orig = generator.generate(size).substring(0, 2500);
        final byte[] bytes = orig.getBytes(charset);
        final String copy = new String(bytes, charset);
        for (int i = 0; i < orig.length() && i < copy.length(); i++) {
            final char o = orig.charAt(i);
            final char c = copy.charAt(i);
            assertEquals("differs at " + i + "("
                    + Integer.toHexString(new Character(o).hashCode()) + ","
                    + Integer.toHexString(new Character(c).hashCode()) + ")", o, c);
        }
    }

Regards,
Amey
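As a side note on the `.substring(0, 2500)` call in the test above: one possible way to shorten a string without ever splitting a surrogate pair is to count code points instead of chars. This is only a sketch using standard `String` methods; the class and helper names (`CodePointCut`, `firstCodePoints`) are hypothetical, and nothing in this thread says the test should be changed this way.

```java
final class CodePointCut {

    /** Returns the first count code points of s, never cutting through a surrogate pair. */
    static String firstCodePoints(String s, int count) {
        // Clamp to the number of code points actually present.
        int available = s.codePointCount(0, s.length());
        // Translate a code point count into a char index that sits on a pair boundary.
        int end = s.offsetByCodePoints(0, Math.min(count, available));
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // "a", then U+1F600 (two chars), then "b": 4 chars, 3 code points.
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";
        String cut = firstCodePoints(s, 2);

        System.out.println(cut.length());                        // 3
        System.out.println(cut.codePointCount(0, cut.length())); // 2
    }
}
```

The result can be longer than `count` chars, because each supplementary code point takes two chars, so a test that needs an exact char length would have to handle that separately.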