I just stumbled upon this stop word appearing in one of our indexes: thе
Look closely. Can you see it? I doubt it; I couldn't either. This is the hex dump of that token:

74 68 d0 b5

which means that thе and the are two different things. The letter after "th" is U+0435, CYRILLIC SMALL LETTER IE:

https://www.fileformat.info/info/unicode/char/0435/index.htm

To my surprise, I couldn't find it in the ASCII folding filter:

https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java

Does anybody remember whether the omission of Cyrillic characters was intentional? There are quite a few of them that are nearly identical in appearance to Latin letters.

Dawid
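
P.S. In case anyone wants to reproduce the check without a hex editor, here is a minimal Python sketch of the comparison. The names (describe, latin, mixed) are made up for illustration and have nothing to do with Lucene; it only dumps the UTF-8 bytes and the Unicode character names of the two tokens.

```python
import unicodedata

latin = "the"       # all three letters are Latin
mixed = "th\u0435"  # U+0435 (Cyrillic) in place of the Latin "e"


def describe(word):
    """Return the UTF-8 bytes as spaced hex and the Unicode name of each character."""
    utf8_hex = word.encode("utf-8").hex(" ")  # separator argument needs Python 3.8+
    names = [unicodedata.name(ch) for ch in word]
    return utf8_hex, names


for word in (latin, mixed):
    utf8_hex, names = describe(word)
    print(utf8_hex, names)

# The two strings render almost identically but do not compare equal.
print(latin == mixed)
```

The second token comes out as 74 68 d0 b5, matching the dump above, and its last character is reported as CYRILLIC SMALL LETTER IE.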