On Fri, Apr 1, 2011 at 7:58 AM, Dawid Weiss <[email protected]> wrote:
> Mike, can you remember what ordering is required for
> add(CharSequence)? I see it requires INPUT_TYPE.BYTE4
>
> assert fst.getInputType() == FST.INPUT_TYPE.BYTE4;
>
> but this would imply the order of full unicode codepoints on the
> input? Is this what String comparators do by default (I doubt, but
> wanted to check if you know first).
>

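(sorry, not Mike, but) you are right: String.compareTo() compares in
UTF-16 code unit order by default. This is not consistent with the order
the FST builder expects (UTF-8/UTF-32 order, i.e. code point order). The
two orders differ only when a supplementary character is compared
against a character in U+E000..U+FFFF.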

So if you are going to order the terms before passing them to Builder,
you should either use a UTF-16-in-UTF-8-order comparator* or simply
use codePointAt and friends and compare those ints (probably
slower...).

* Different ways of implementing the comparator are described here:
* http://icu-project.org/docs/papers/utf16_code_point_order.html
* http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf
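
For illustration, here is a minimal sketch of the fix-up style of
comparator described in the ICU paper above. The class name
CodePointOrder and its members are hypothetical (this is not Lucene's
own code), and it assumes well-formed UTF-16 input with no unpaired
surrogates:

```java
import java.util.Comparator;

/** Hypothetical helper (not part of Lucene): orders Strings by code point. */
final class CodePointOrder {
    private CodePointOrder() {}

    /** Comparator view of compare(), e.g. for Collections.sort(). */
    static final Comparator<String> COMPARATOR = CodePointOrder::compare;

    /** Compares two UTF-16 strings in code point (UTF-8/UTF-32) order. */
    static int compare(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        for (int i = 0; i < limit; i++) {
            char ca = a.charAt(i);
            char cb = b.charAt(i);
            if (ca != cb) {
                // The two orders only disagree when both code units are
                // at or above the start of the surrogate range.
                if (ca >= 0xD800 && cb >= 0xD800) {
                    return fixUp(ca) - fixUp(cb);
                }
                return ca - cb;
            }
        }
        // One string is a prefix of the other: the shorter sorts first.
        return a.length() - b.length();
    }

    /** Remaps code units so that surrogates sort above U+E000..U+FFFF. */
    private static int fixUp(char c) {
        return c >= 0xE000 ? c - 0x800 : c + 0x2000;
    }
}
```

Sorting the terms with CodePointOrder.COMPARATOR before adding them
should then give the order discussed above. For example, U+FFFF sorts
before U+10000 with this comparator, whereas String.compareTo() puts
U+10000 (encoded as the surrogate pair D800 DC00) first.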
