Empty regexp replaceall and surrogate pairs results in corrupted utf16

Dawid Weiss Sun, 27 May 2012 05:29:17 -0700

Hi, I'm a committer to the Apache Lucene project. We have randomized
tests and we hit the following (simplified) scenario:


    String s1 = "AB\uD840\uDC00C";
    String s2 = s1.replaceAll("", "X");

The input contains a supplementary Unicode character (any surrogate pair
will do). The pattern is an empty string (in fact, it was randomized
as "]|", but it's the same problem, so I omit the details). The problem
is that, after applying this pattern, replaceAll inserts X between the
two code units of the surrogate pair, and this results in invalid UTF-16:

AB𠀀C
XAXBX?X?XCX
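
To make the corruption easy to check, here is a minimal sketch. The class
and method names (SurrogateCheck, isWellFormedUtf16, interleaveByCodePoint)
are hypothetical helpers written for this message, not part of the JDK or
Lucene. The first method reports whether a string contains unpaired
surrogates; the second shows what I would expect an insertion "between
characters" to look like if it respected code point boundaries. What
replaceAll itself returns may depend on the JDK version, so main only
prints the result of the check.

```java
final class SurrogateCheck {

    // Returns true if every surrogate code unit in s belongs to a
    // valid high/low pair, i.e. the string is well-formed UTF-16.
    static boolean isWellFormedUtf16(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be followed by a low surrogate.
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low half of the pair
            } else if (Character.isLowSurrogate(c)) {
                // A low surrogate without a preceding high surrogate.
                return false;
            }
        }
        return true;
    }

    // Inserts sep before, between, and after the code points of s,
    // never splitting a surrogate pair.
    static String interleaveByCodePoint(String s, String sep) {
        StringBuilder sb = new StringBuilder(sep);
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(cp).append(sep);
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s1 = "AB\uD840\uDC00C";
        String s2 = s1.replaceAll("", "X");
        System.out.println("replaceAll result well-formed: "
                + isWellFormedUtf16(s2));
        System.out.println("code point interleave well-formed: "
                + isWellFormedUtf16(interleaveByCodePoint(s1, "X")));
    }
}
```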

Is this a bug (and if so, where should I file it), or is it an
inherent feature of the current implementation? Thanks,

Dawid
