Hi, I'm a committer to the Apache Lucene project. We have randomized tests and we hit the following (simplified) scenario:
  String s1 = "AB\uD840\uDC00C";
  String s2 = s1.replaceAll("", "X");

The input contains a supplementary Unicode character (any surrogate pair will do). The pattern is an empty string (in fact, it was randomized as "]|", but it's the same problem, so I omit the details).

The problem is that, when this pattern is applied, replaceAll inserts X between the two halves of the surrogate pair, which results in invalid UTF-16 (s1, then s2):

  AB𠀀C
  XAXBX?X?XCX

Is this a bug (and if so, where should I file it), or is it an inherent feature of the current implementation?

Thanks,
Dawid
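
P.S. In case it helps with reproducing this, here is a minimal sketch of how one might check the result. The class and method names (SurrogateSafeInsert, isWellFormedUtf16, insertAtCodePointBoundaries) are made up for illustration and are not part of Lucene or the JDK. The second helper only shows what I would have expected an empty-pattern replacement to produce; it is not a claim about how replaceAll is or should be implemented. What the first line prints may depend on the JDK in use.

```java
public class SurrogateSafeInsert {

    // Returns true if every surrogate in s belongs to a high+low pair,
    // i.e. the string is well-formed UTF-16.
    static boolean isWellFormedUtf16(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false; // high surrogate without a following low surrogate
                }
                i++; // skip the low half of the pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate without a preceding high surrogate
            }
        }
        return true;
    }

    // Hypothetical helper: inserts sep before every code point and at the end,
    // stepping by code point so a surrogate pair is never split.
    static String insertAtCodePointBoundaries(String s, String sep) {
        StringBuilder sb = new StringBuilder(sep);
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(cp).append(sep);
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s1 = "AB\uD840\uDC00C";
        String s2 = s1.replaceAll("", "X");
        System.out.println("replaceAll result well-formed:  "
                + isWellFormedUtf16(s2));
        System.out.println("code-point insert well-formed: "
                + isWellFormedUtf16(insertAtCodePointBoundaries(s1, "X")));
    }
}
```

The check walks UTF-16 code units, while the insertion walks code points via codePointAt and Character.charCount, which is the distinction at the heart of the question.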