It is often possible to put another stage in the chain to create a variant analyzer. This is not very hard at all in either Lucene or ElasticSearch. Extra filters can be used to tweak the overall processing, adding a late tokenization step to handle something the main tokenizer overlooked (breaking on colons would be a simple example). Likewise, adding a char filter or tokenization stage before the others can turn a token that seems incorrectly processed into one handled the way you like.
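For the colon example, here is one way this might look as ElasticSearch index settings (the names `break_on_colon` and `my_variant_analyzer` are just placeholders): a `pattern_replace` char filter runs before the standard tokenizer, so colons become token breaks.

```json
{
  "settings": {
    "analysis": {
      "char_filter": {
        "break_on_colon": {
          "type": "pattern_replace",
          "pattern": ":",
          "replacement": " "
        }
      },
      "analyzer": {
        "my_variant_analyzer": {
          "type": "custom",
          "char_filter": ["break_on_colon"],
          "tokenizer": "standard"
        }
      }
    }
  }
}
```

With this in place, input like "foo:bar" would tokenize as "foo" and "bar" instead of a single token. The same idea applies in Lucene directly, e.g. via a char filter or an extra TokenFilter in a custom Analyzer.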
Trejkaz, I haven't tried to use ICU yet, but from what I understand, I think you'll find that ICU is more in agreement with your views: it embraces the idea of refining the tokenization etc. as needed, not relying on the curious (and often flawed) choices of some design committee somewhere. [ICU]

> -----Original Message-----
> ... no specialisation for straight Roman script, but I guess it could
> always be added.

That would be one of the main points of the whole ICU infrastructure.

-Paul