Why do the Japanese analyser FST files change every release?

Trejkaz Thu, 06 Aug 2015 18:05:45 -0700

I recently upgraded from Lucene 3.6 to 4.x, and then from 4.x to 5.2.

During this process, I noticed that the FST used by the Japanese
analyser (AKA Kuromoji) changed between releases. Since I am wary of
breaking backwards compatibility, I worried that the dictionary
had changed, so I wrote a little program to read it in and print the
words out in order.

What I found is that in all three releases, the list of words is
exactly the same, even though the files differ subtly from
release to release.

What's up with that? I can think of a few possibilities:

(a) the dictionary _has_ actually changed, and merely printing the
list of words was not enough (e.g., the parts of speech changed)

(b) the dictionary hasn't changed, but the files change when the FST
format changes

(c) the dictionary hasn't changed, but the files change because
they're rebuilt every time Lucene is built and there is
something non-deterministic about the process (e.g., something is using
a HashMap internally)
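
As a rough illustration of how (a) and (c) can be told apart, here is a
minimal sketch in plain Java. It does not use any Lucene API. The class
name DictionaryDigest, the sample words and the part-of-speech labels are
all made up for the example, and a LinkedHashMap with a different
insertion order stands in for whatever order-unstable container a build
might use. The idea is that hashing the entries in iteration order is
sensitive to ordering, while hashing them in sorted order only changes
when the content (words or their attached data) changes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class DictionaryDigest {

    // Hex SHA-256 over "word<TAB>data" lines, in the map's own iteration order.
    public static String digest(Map<String, String> entries) {
        MessageDigest sha;
        try {
            sha = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
        for (Map.Entry<String, String> e : entries.entrySet()) {
            String line = e.getKey() + "\t" + e.getValue() + "\n";
            sha.update(line.getBytes(StandardCharsets.UTF_8));
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : sha.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Same digest, but over entries sorted by word, so build order is irrelevant.
    public static String canonicalDigest(Map<String, String> entries) {
        return digest(new TreeMap<>(entries));
    }

    public static void main(String[] args) {
        // Hypothetical entries: word -> part of speech.
        Map<String, String> buildA = new LinkedHashMap<>();
        buildA.put("tokyo", "noun");
        buildA.put("sushi", "noun");

        // Same content, written in a different order (possibility (c)).
        Map<String, String> buildB = new LinkedHashMap<>();
        buildB.put("sushi", "noun");
        buildB.put("tokyo", "noun");

        // Same words, but one part of speech differs (possibility (a)).
        Map<String, String> buildC = new LinkedHashMap<>();
        buildC.put("tokyo", "noun");
        buildC.put("sushi", "verb");

        System.out.println("A vs B, raw order:    "
                + digest(buildA).equals(digest(buildB)));
        System.out.println("A vs B, sorted order: "
                + canonicalDigest(buildA).equals(canonicalDigest(buildB)));
        System.out.println("A vs C, same words:   "
                + buildA.keySet().equals(buildC.keySet()));
        System.out.println("A vs C, sorted order: "
                + canonicalDigest(buildA).equals(canonicalDigest(buildC)));
    }
}
```

In this sketch, builds A and B differ only when hashed in raw order, which
is the signature of (c). Builds A and C have identical word lists but
different sorted-order digests, which is the signature of (a) and is
exactly what printing only the words would miss.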

I'm hoping that it's (b), but does anybody know?

TX
