On Fri, Apr 1, 2011 at 7:58 AM, Dawid Weiss <[email protected]> wrote:
> Mike, can you remember what ordering is required for
> add(CharSequence)? I see it requires INPUT_TYPE.BYTE4
>
> assert fst.getInputType() == FST.INPUT_TYPE.BYTE4;
>
> but this would imply the order of full unicode codepoints on the
> input? Is this what String comparators do by default (I doubt, but
> wanted to check if you know first).
>
(Sorry, not Mike, but) you are right: String.compareTo() compares in
UTF-16 code unit order by default. This is not consistent with the order
the FST builder expects (UTF-8/UTF-32 order, i.e. code point order). The
two orders disagree only when a supplementary character (a surrogate pair
in UTF-16) is compared against a BMP character in U+E000..U+FFFF.

So if you are going to order the terms before passing them to Builder,
you should either use a UTF-16-in-UTF-8-order comparator [*], or simply
use codePointAt and friends and compare those ints (probably slower...).

[*] Different ways of implementing the comparator:
* http://icu-project.org/docs/papers/utf16_code_point_order.html
* http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf
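
For illustration, here is a rough sketch of the simpler codePointAt variant. The class name CodePointOrder and its compare method are made up for this example and are not part of Lucene; it also assumes the input strings are well-formed UTF-16. It only shows how the two orders can disagree, and is not tuned for speed.

```java
import java.util.Arrays;

// Hypothetical helper (not a Lucene class): orders UTF-16 strings by
// Unicode code point, which matches UTF-8/UTF-32 binary order for
// well-formed input.
final class CodePointOrder {

    static int compare(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        int i = 0;
        while (i < limit) {
            // codePointAt combines a surrogate pair into one code point.
            int cpA = a.codePointAt(i);
            int cpB = b.codePointAt(i);
            if (cpA != cpB) {
                return cpA < cpB ? -1 : 1;
            }
            // Equal code points occupy the same number of chars in both
            // strings, so a single index is enough.
            i += Character.charCount(cpA);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return a.length() - b.length();
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";         // U+FF5E, a single UTF-16 code unit
        String supp = "\uD835\uDD04";  // U+1D504, a surrogate pair

        String[] terms = { supp, bmp, "abc" };
        Arrays.sort(terms, CodePointOrder::compare);

        // Print the sign of each comparison so the two orders can be compared.
        System.out.println("String.compareTo sign: " + Integer.signum(bmp.compareTo(supp)));
        System.out.println("code point order sign: " + Integer.signum(compare(bmp, supp)));
    }
}
```

Terms sorted with CodePointOrder::compare (for example via Arrays.sort or Collections.sort) should then arrive in the order the builder expects. The fix-up approach described in the ICU paper avoids decoding code points and is likely faster.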
