rmuir commented on issue #13271:
URL: https://github.com/apache/lucene/issues/13271#issuecomment-2038141226
I don't understand what that change has to do with the analysis chain... inconsistent offsets are about what the TokenStream is doing, not the index. Be sure you aren't getting confused by the fact that this failure does not reproduce 100% of the time.

As far as the ICU charfilter goes, I added some prints so we can see what's happening:
```
2> TEST FAIL: useCharFilter=true
text='\u0003\ufb87\udacd\uddf6d\uf1f4\u02e6\u89f8\uda06\udfd2\u01e6'
1> Normalizer2: NFKC
2> Exception from random analyzer:
...
```
So ultimately, on this string, the ICU charfilter only changes one character (the Arabic presentation form U+FB87 to U+068E). It won't change the UTF-16 length of the string nor impact any offsets:
```
-\u0003\uFB87\uDACD\uDDF6\u0064\uF1F4\u02E6\u89F8\uDA06\uDFD2\u005C\u0075\u0030\u0031\u0065\u0036
+\u0003\u068E\uDACD\uDDF6\u0064\uF1F4\u02E6\u89F8\uDA06\uDFD2\u005C\u0075\u0030\u0031\u0065\u0036
```
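The single-character rewrite can be sanity-checked without ICU: NFKC is the same normalization the stdlib `unicodedata` module implements, so a quick Python check (a sketch, not the Lucene code path) confirms U+FB87 folds to U+068E with no length change:

```python
import unicodedata

# U+FB87 is an Arabic presentation form; NFKC maps it to U+068E.
src = "\ufb87"
out = unicodedata.normalize("NFKC", src)
assert out == "\u068e"
# Both sides are a single BMP code point, so the UTF-16 length
# is unchanged and no offsets should shift.
assert len(src) == len(out) == 1
```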
But that charfilter tries to do this incrementally, so it could have bugs depending on how data is being "spoon-fed" to it (spoon-feeding is happening: that's the `useCharFilter=true`). I suspect the bug may still be in the charfilter logic...
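As an aside on why spoon-feeding is hazardous for any incremental normalizer: normalization is not closed under concatenation, so naively normalizing each chunk at an arbitrary boundary can differ from normalizing the whole input. A minimal stdlib illustration (Python, using NFC for brevity rather than the charfilter itself):

```python
import unicodedata

# Normalizing the whole string composes "e" + combining acute into U+00E9.
whole = unicodedata.normalize("NFC", "e\u0301")

# Normalizing two chunks split inside the combining sequence cannot compose,
# so the concatenated result is *not* in NFC.
chunked = (unicodedata.normalize("NFC", "e")
           + unicodedata.normalize("NFC", "\u0301"))

assert whole != chunked  # the chunk boundary fell inside a combining sequence
```

This is exactly the class of bug the randomized spoon-feeding in the test harness is designed to flush out.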
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]