I recently upgraded from Lucene 3.6 to 4.x, and then from 4.x to 5.2. Along the way, I noticed that the FST used by the Japanese analyser (AKA Kuromoji) changed between releases. Since I am wary of breakages in backwards compatibility, I worried that the dictionary itself had changed, so I wrote a little program to read it in and print the words out in order.
What I found is that in all three releases the list of words is exactly the same, even though the files themselves change subtly from release to release. What's up with that?

I can think of a few possibilities:

(a) the dictionary _has_ actually changed, and merely printing the list of words was not enough to show it (e.g., the parts of speech changed)
(b) the dictionary hasn't changed, but the files change whenever the FST format changes
(c) the dictionary hasn't changed, but the files change because they're rebuilt every time Lucene is built and something in the process is non-deterministic (e.g., something uses a HashMap internally)

I'm hoping that it's (b), but does anybody know?

Thanks
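One cheap way to narrow the possibilities down might be to hash every dictionary resource inside two analyzer jars and list the entries whose bytes differ. The following is only a rough sketch in Python using the standard library. The resource prefix `org/apache/lucene/analysis/ja/dict/` is an assumption about where the Kuromoji dictionary files live inside the jar, and the jar file names in the usage note are hypothetical.

```python
import hashlib
import zipfile

# Assumption: the Kuromoji dictionary resources sit under this path in the jar.
DICT_PREFIX = "org/apache/lucene/analysis/ja/dict/"


def dict_digests(jar_path, prefix=DICT_PREFIX):
    """Map each entry name under `prefix` to the SHA-256 of its bytes."""
    digests = {}
    with zipfile.ZipFile(jar_path) as jar:
        for info in jar.infolist():
            if info.filename.startswith(prefix) and not info.is_dir():
                data = jar.read(info)
                digests[info.filename] = hashlib.sha256(data).hexdigest()
    return digests


def changed_entries(old_jar, new_jar, prefix=DICT_PREFIX):
    """Return the sorted names of entries that differ or exist in only one jar."""
    old = dict_digests(old_jar, prefix)
    new = dict_digests(new_jar, prefix)
    names = old.keys() | new.keys()
    return sorted(n for n in names if old.get(n) != new.get(n))
```

For example, `changed_entries("kuromoji-old.jar", "kuromoji-new.jar")` (hypothetical file names) would list the dictionary files that differ between the two jars. If two independent builds of the same release already produce different bytes, that would point towards (c). If only the FST file differs across releases while the other dictionary files are byte-identical, that would be at least consistent with (b), although it would not prove it.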