Re: Japanese analyzer

2013-01-18 Thread Jerome Lanneluc
Thanks Dawid, that was it. I'm now using an empty stoptags set and I'm seeing all the expected tokens. Jerome From: Dawid Weiss To: java-user@lucene.apache.org, Date: 01/18/2013 02:52 PM Subject: Re: Japanese analyzer Jerome, Some of the tokens are removed bec

Re: Japanese analyzer

2013-01-18 Thread Dawid Weiss
Jerome, Some of the tokens are removed because their part-of-speech tags are in the stoptags file? That's my guess at least -- you can always try to copy/paste the Japanese analyzer and change the token stream components: protected TokenStreamComponents createComponents(String fieldName, R
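
To make the stoptags guess concrete, here is a minimal toy sketch in plain Java of how a part-of-speech stop-tag filter drops tokens. It does not use the Lucene API: the `StopTagDemo` class, its `Token` type, the romanized token texts, and the English tag names are all hypothetical stand-ins, not the real Kuromoji tags or the contents of the shipped stoptags file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: not the Lucene implementation.
class StopTagDemo {

    // A token paired with a (hypothetical) part-of-speech tag.
    static final class Token {
        final String text;
        final String posTag;

        Token(String text, String posTag) {
            this.text = text;
            this.posTag = posTag;
        }
    }

    // Keep only the tokens whose part-of-speech tag is NOT in stopTags.
    static List<String> filter(List<Token> tokens, Set<String> stopTags) {
        List<String> kept = new ArrayList<>();
        for (Token t : tokens) {
            if (!stopTags.contains(t.posTag)) {
                kept.add(t.text);
            }
        }
        return kept;
    }

    // Illustrative tokens for a sentence meaning "I am Japanese" (assumed segmentation).
    static List<Token> sample() {
        return Arrays.asList(
                new Token("watashi", "pronoun"),
                new Token("wa", "particle"),
                new Token("nihonjin", "noun"),
                new Token("desu", "auxiliary-verb"));
    }

    public static void main(String[] args) {
        Set<String> stopTags = new HashSet<>(Arrays.asList("particle", "auxiliary-verb"));

        // With stop tags configured, the particle and the auxiliary verb are dropped.
        System.out.println(filter(sample(), stopTags));

        // With an empty stop-tag set, every token is returned.
        System.out.println(filter(sample(), Collections.<String>emptySet()));
    }
}
```

In this model the stop word list plays no role: a token is removed purely because of its tag, which is why checking the stop word file alone would not explain the missing tokens.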

Re: Japanese analyzer

2013-01-18 Thread Jerome Lanneluc
Thanks for your answer. No, those words are not part of the stop word file (I'm using the one that comes with the Japanese analyzer in lucene-kuromoji-3.6.1.jar). My Japanese contact told me that the first sentence means "I am Japanese" and the second one is a unit of length.

Re: Japanese analyzer

2013-01-18 Thread Swapnil Patil
Hi, I just translated these words using Google Translate; they look like Japanese I [ Can you check whether these words are in your stop word file? If these words exist in your stop word file, then you will not get them in the token stream. -Swapnil On Fri, Jan 18, 2013 at 6:58 PM, Jerome Lanneluc wrote: >

Japanese analyzer

2013-01-18 Thread Jerome Lanneluc
I have searched this mailing list but I could not find the answer to the following problem. I'm using the 3.6.1 Japanese analyzer and it seems that when tokenizing some Japanese words, some characters are ignored and they are not returned in the tokens. In the attached example, the outp