I’m working on an encoding of numbers / data into indexed terms. In the
past I limited the encoding to ASCII but now I’m doing it at a more
raw/byte level. Do I have to be aware of UTF-8 / sorting issues when I do
this? I noticed the following code in NumericUtils.java, line 186:

    while (nChars > 0) {
      // Store 7 bits per byte for compatibility
      // with UTF-8 encoding of terms
      bytes.bytes[nChars--] = (byte)(sortableBits & 0x7f);
      sortableBits >>>= 7;
    }

It’s the comment more than anything that has my attention. Do I have to
limit my bytes to only the low 7 bits? If so, why? I’ve already written a
bunch of code that generates the terms without consideration for this, and
I think a bug I’m looking at could be related to this.
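
To make the concern concrete, here is a minimal sketch of what I suspect the comment is guarding against. This is not Lucene code: the class name SevenBitSketch and the helpers encode7Bit and survivesUtf8RoundTrip are made up for illustration, and the assumption (unconfirmed) is that terms may pass through a UTF-8 String conversion somewhere along the way. It shows that bytes limited to the low 7 bits survive such a round trip unchanged, while bytes with the high bit set do not, and that Java's signed byte comparison orders 0x80 before 0x7f.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SevenBitSketch {

    // Hypothetical helper: packs the low bits of 'value' into nBytes bytes,
    // 7 bits per byte, most significant group first (same idea as the loop above).
    static byte[] encode7Bit(long value, int nBytes) {
        byte[] out = new byte[nBytes];
        for (int i = nBytes - 1; i >= 0; i--) {
            out[i] = (byte) (value & 0x7f); // high bit is always clear
            value >>>= 7;
        }
        return out;
    }

    // Hypothetical helper: true if the bytes come back unchanged after being
    // decoded as UTF-8 and encoded again. Malformed input is replaced with
    // U+FFFD by String's decoder, so arbitrary bytes may not survive.
    static boolean survivesUtf8RoundTrip(byte[] raw) {
        String s = new String(raw, StandardCharsets.UTF_8);
        return Arrays.equals(raw, s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] sevenBit = encode7Bit(123456789L, 10);
        byte[] rawBytes = { (byte) 0x80, (byte) 0xff };

        // 7-bit bytes are plain ASCII, so they round-trip unchanged.
        System.out.println(survivesUtf8RoundTrip(sevenBit)); // true
        // Lone bytes >= 0x80 are not valid UTF-8 on their own.
        System.out.println(survivesUtf8RoundTrip(rawBytes)); // false

        // Signed comparison puts 0x80 (-128) before 0x7f (127) ...
        System.out.println(Byte.compare((byte) 0x80, (byte) 0x7f) < 0); // true
        // ... while unsigned comparison puts it after.
        System.out.println(Integer.compare(0x80 & 0xff, 0x7f & 0xff) > 0); // true
    }
}
```

If my terms are ever decoded as UTF-8 or compared as signed bytes, either effect could plausibly explain the bug, but I have not confirmed that either happens in my code path.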
~ David
P.S. Sorry to be CC’ing some folks directly, but the mailing list is having
problems.