On Wed, Jan 9, 2013 at 6:30 AM, saisantoshi <saisantosh...@gmail.com> wrote:
> Does Lucene StandardAnalyzer work for all the languages for tokenizing before
> indexing (since we are using java, I think the content is converted to UTF-8
> before tokenizing/indexing)?
No. There are multiple cases where it chooses not to break something which it should break. Some of these cases give undesirable behaviour even for English, so I would be surprised if there were a single language it handles acceptably. It does follow the Unicode standard for how to tokenise text (the UAX #29 word segmentation rules), but that standard was written by people who didn't quite know what they were doing, so following it is really just passing the buck. I don't think Lucene should have chosen to follow that standard in the first place, because it rarely (if ever) gives acceptable results.

The worst examples for English, at least for us, were that it does not break on colon (:) or underscore (_). The colon was explained by some languages using it like an apostrophe. Personally I think you should break on an apostrophe as well, so I'm not really happy with that reasoning, but OK. The underscore was completely baffling to me, so I asked someone at Unicode about it. They explained that it was because it is "used by programmers to separate words in identifiers". This explanation is exactly as stupid as it sounds, and I hope they will realise their stupidity some day.

> or do we need to use special analyzers for each of the languages?

I do think that StandardTokenizer can at least form a good base for an analyser. You just have to add a ton of filters to fix each additional case you find where people don't like the result. For instance, it returns runs of Katakana as a single token; if you index that as-is, people won't find what they are searching for, so you add a filter to split those runs back out into multiple tokens. It would help if there were a single, core-maintained analyser for "StandardAnalyzer with all the things people hate fixed"... but I don't know whether anyone is interested in maintaining it.

> In this case, if a document has mixed content (English +
> Japanese), what analyzer should we use and how can we figure it out
> dynamically before indexing?

Some language detection libraries will give you back the fragments of the text and tell you which language is used for each fragment, so that is a perfectly viable option as well. You'd just make your own analyser which concatenates the results.

> Also, while searching, if the query text contains both English and
> Japanese, how does this work? Any criteria in choosing the analyzers?

I guess you could either ask the user which language they're searching in, or look at which characters are in their query, decide which language(s) it matches, and build the query from there. It might match multiple...

TX
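For concreteness, here is a minimal sketch of the kind of "StandardTokenizer plus fix-up filters" analyser described above, written against the Lucene 4.x analysis API. The class names (SplitOnPunctuationFilter, FixedStandardAnalyzer), the Version constant, and the choice to re-split only on ':' and '_' are illustrative assumptions, not anything that ships with Lucene, and the filter deliberately leaves offsets alone to stay short.

    import java.io.IOException;
    import java.io.Reader;
    import java.util.ArrayDeque;
    import java.util.Deque;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    /** Hypothetical filter: re-splits tokens that StandardTokenizer keeps whole, e.g. "foo:bar", "foo_bar". */
    final class SplitOnPunctuationFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final Deque<String> pending = new ArrayDeque<String>();

        SplitOnPunctuationFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!pending.isEmpty()) {
                // Emit the next piece of a previously split token (offsets are left as-is for brevity).
                termAtt.setEmpty().append(pending.poll());
                return true;
            }
            if (!input.incrementToken()) {
                return false;
            }
            String[] parts = termAtt.toString().split("[:_]+");
            if (parts.length > 1) {
                // Queue up all non-empty pieces and emit the first one now.
                for (String part : parts) {
                    if (part.length() > 0) {
                        pending.add(part);
                    }
                }
                termAtt.setEmpty().append(pending.poll());
            }
            return true;
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            pending.clear();
        }
    }

    /** Hypothetical "StandardAnalyzer with fixes": StandardTokenizer plus whatever extra filters you need. */
    final class FixedStandardAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = new StandardTokenizer(Version.LUCENE_40, reader);  // assumes Lucene 4.x
            TokenStream result = new LowerCaseFilter(Version.LUCENE_40, source);
            result = new SplitOnPunctuationFilter(result);
            return new TokenStreamComponents(source, result);
        }
    }

A Katakana-splitting filter would slot into the same chain after the tokenizer in exactly the same way.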
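And for the query side, a rough illustration of "look at what characters are in their query", using java.lang.Character.UnicodeScript from Java 7. The class name QueryScriptSniffer and the rule for what counts as Japanese are made-up assumptions; the point is only that the scripts present in the query tell you which per-language analysers or fields to bring in.

    import java.util.EnumSet;
    import java.util.Set;

    final class QueryScriptSniffer {
        /** Collects the Unicode scripts of the letters that appear in the query text. */
        static Set<Character.UnicodeScript> scriptsIn(String query) {
            Set<Character.UnicodeScript> scripts = EnumSet.noneOf(Character.UnicodeScript.class);
            for (int i = 0; i < query.length(); ) {
                int cp = query.codePointAt(i);
                if (Character.isLetter(cp)) {
                    scripts.add(Character.UnicodeScript.of(cp));
                }
                i += Character.charCount(cp);
            }
            return scripts;
        }

        /** Very rough heuristic: should the Japanese-analysed field be searched too? */
        static boolean looksJapanese(Set<Character.UnicodeScript> scripts) {
            return scripts.contains(Character.UnicodeScript.HIRAGANA)
                || scripts.contains(Character.UnicodeScript.KATAKANA)
                || scripts.contains(Character.UnicodeScript.HAN);
        }
    }

If both Latin and Japanese scripts show up, you would build a sub-query with each language's analyser and OR them together, which is the "it might match multiple" case above.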