Thanks Dawid, that was it. I'm now using an empty stoptags set and I'm
seeing all the expected tokens.
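For the record, a minimal sketch of that setup, assuming the Lucene 3.6
JapaneseAnalyzer constructor that takes a user dictionary, mode, stop words
and stop tags:

import java.util.Collections;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.apache.lucene.util.Version;

// Keep the default stop words, but pass an empty stoptags set so that no
// tokens are dropped based on their part-of-speech tag.
JapaneseAnalyzer analyzer = new JapaneseAnalyzer(
    Version.LUCENE_36,
    null,                                 // no user dictionary
    Mode.SEARCH,                          // default tokenizer mode
    JapaneseAnalyzer.getDefaultStopSet(), // bundled stop words
    Collections.<String>emptySet());      // empty stoptags: keep all POS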
Jerome
From: Dawid Weiss
To: java-user@lucene.apache.org,
Date: 01/18/2013 02:52 PM
Subject: Re: Japanese analyzer
Jerome,
Some of the tokens are removed because their part of speech tags are
in the stoptags file? That's my guess at least -- you can always try
to copy/paste the JapaneseAnalyzer and change the token stream
components:
protected TokenStreamComponents createComponents(String fieldName, Reader reader)
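A minimal sketch of that approach, assuming the Lucene 3.6 Kuromoji classes;
the class name is made up and the stock analyzer has a few more filter stages,
the point is only that the part-of-speech stop filter is left out:

import java.io.Reader;
import org.apache.lucene.analysis.ReusableAnalyzerBase;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseBaseFormFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;

public class NoPosStopJapaneseAnalyzer extends ReusableAnalyzerBase {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Kuromoji tokenizer: no user dictionary, discard punctuation, search mode.
    Tokenizer tokenizer = new JapaneseTokenizer(reader, null, true, Mode.SEARCH);
    // Normalize inflected forms to their base form.
    TokenStream stream = new JapaneseBaseFormFilter(tokenizer);
    // The stock JapaneseAnalyzer would add a JapanesePartOfSpeechStopFilter
    // (plus stop word and other filters) here; it is deliberately omitted.
    return new TokenStreamComponents(tokenizer, stream);
  }
}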
Thanks for your answer.
No, those words are not part of the stop word file (I'm using the one that
comes with the Japanese analyzer in lucene-kuromoji-3.6.1.jar).
My Japanese contact told me that the first sentence means "I am Japanese"
and the second one is a unit of length.
Hi,
I just translated these words using Google Translate and they look like Japanese.
I [
Can you check if these words are in your stopword file?
If these words exist in your stop word file then you will not get them in the
token stream.
-Swapnil
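One quick way to check, assuming JapaneseAnalyzer.getDefaultStopSet() is the
stop word set bundled with the Kuromoji jar (the term below is a placeholder):

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;

public class StopSetCheck {
  public static void main(String[] args) {
    // Default stop words shipped with the Japanese analyzer.
    CharArraySet stopWords = JapaneseAnalyzer.getDefaultStopSet();
    // Replace "word-to-check" with the term in question.
    System.out.println(stopWords.contains("word-to-check"));
  }
}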
On Fri, Jan 18, 2013 at 6:58 PM, Jerome Lanneluc wrote:
>
I have searched this mailing list but I could not find the answer to the
following problem.
I'm using the 3.6.1 Japanese analyzer and it seems that when tokenizing
some Japanese words, some characters are ignored and they are not returned
in the tokens.
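For context, a token dump along the lines of this sketch (the field name and
sample string are placeholders) is how the dropped characters show up:

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenDump {
  public static void main(String[] args) throws Exception {
    JapaneseAnalyzer analyzer = new JapaneseAnalyzer(Version.LUCENE_36);
    // "text" and "..." stand in for the real field name and Japanese input.
    TokenStream stream = analyzer.reusableTokenStream("text", new StringReader("..."));
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      System.out.println(term.toString());
    }
    stream.end();
    stream.close();
  }
}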
In the attached example, the output