Jerome,

Perhaps some of the tokens are being removed because their part-of-speech tags are listed in the stoptags file? That's my guess, at least. You can always copy/paste the Japanese analyzer and change the token stream components:

    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer tokenizer = new JapaneseTokenizer(reader, userDict, true, mode);
        TokenStream stream = new JapaneseBaseFormFilter(tokenizer);
        // <-- this is the filter I was talking about; it drops tokens by part-of-speech tag
        stream = new JapanesePartOfSpeechStopFilter(true, stream, stoptags);
        stream = new CJKWidthFilter(stream);
        stream = new StopFilter(matchVersion, stream, stopwords);
        stream = new JapaneseKatakanaStemFilter(stream);
        stream = new LowerCaseFilter(matchVersion, stream);
        return new TokenStreamComponents(tokenizer, stream);
    }

Dawid

On Fri, Jan 18, 2013 at 2:46 PM, Jerome Lanneluc <jerome_lanne...@fr.ibm.com> wrote:
> Thanks for your answer.
>
> No, those words are not part of the stop word file (I'm using the one that
> comes with the Japanese analyzer in lucene-kuromoji-3.6.1.jar).
>
> My Japanese contact told me that the first sentence means "I am Japanese"
> and the second one is a unit of length.
>
> Jerome
>
> From: Swapnil Patil <ping.swap...@gmail.com>
> To: java-user@lucene.apache.org
> Date: 01/18/2013 02:33 PM
> Subject: Re: Japanese analyzer
>
> Hi,
>
> I just translated these words, using google translate look like Japanese
> I [
> Can you check if these words are in your stopword file?
> If these words exist in your stop word file then you will not get them in
> the token stream.
>
> -Swapnil
>
> On Fri, Jan 18, 2013 at 6:58 PM, Jerome Lanneluc
> <jerome_lanne...@fr.ibm.com> wrote:
>
>> [私 日本人
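
P.S. To make the part-of-speech guess at the top of this message concrete, here is a minimal standalone sketch of what a stop-tag filter does conceptually. It does not use Lucene at all: the TaggedToken class, the removeStopTags method, the tag names ("pronoun", "particle", "noun") and the stop-tag set are all made up for illustration and are not what the real tokenizer or stoptags file contain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PosStopFilterSketch {

    /** A token paired with a part-of-speech tag (hypothetical type, not a Lucene class). */
    static final class TaggedToken {
        final String text;
        final String pos;

        TaggedToken(String text, String pos) {
            this.text = text;
            this.pos = pos;
        }
    }

    /** Returns the text of every token whose tag is NOT in stopTags, in the original order. */
    static List<String> removeStopTags(List<TaggedToken> tokens, Set<String> stopTags) {
        List<String> kept = new ArrayList<String>();
        for (TaggedToken token : tokens) {
            if (!stopTags.contains(token.pos)) {
                kept.add(token.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Invented tags; the real tokenizer assigns its own (Japanese) tag names.
        List<TaggedToken> tokens = Arrays.asList(
                new TaggedToken("私", "pronoun"),
                new TaggedToken("は", "particle"),
                new TaggedToken("日本人", "noun"));

        // Invented stop-tag set; the real one is loaded from the stoptags file.
        Set<String> stopTags = new HashSet<String>(Arrays.asList("particle"));

        System.out.println(removeStopTags(tokens, stopTags));
    }
}
```

The point is only that a token can vanish without being in the stopword list: if its tag is in the stop-tag set, it is dropped regardless of its text. Removing or emptying that set (the equivalent of dropping the JapanesePartOfSpeechStopFilter line in a copied analyzer) would be one way to check whether that is what happens to your tokens.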