I just found the cause of the error, and you were right about my code being the source. I used "Character.getNumericValue(termBuffer[0]) == -1" to test whether termBuffer[0] is still the null character, but apparently special characters also return -1 when passed as the parameter.
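For anyone who hits the same pitfall, here is a minimal, self-contained sketch
(the class name is invented just for this demo) showing that
Character.getNumericValue() maps both the zero char and punctuation to -1,
while a direct comparison against '\u0000' tells the two cases apart:

    // Why getNumericValue() can't serve as a "slot is still empty" test:
    // the zero char and special characters both map to -1.
    public class NumericValuePitfall {
        public static void main(String[] args) {
            System.out.println(Character.getNumericValue('\u0000')); // -1 (no numeric value)
            System.out.println(Character.getNumericValue(':'));      // -1 as well
            System.out.println(Character.getNumericValue(')'));      // -1 as well
            System.out.println(Character.getNumericValue('a'));      // 10 (letters map to 10-35)
            System.out.println(Character.getNumericValue('7'));      // 7

            // A freshly allocated char slot holds the default value '\u0000',
            // so a direct comparison distinguishes "empty" from punctuation.
            char[] termBuffer = new char[3];
            System.out.println(termBuffer[0] == '\u0000'); // true: slot still empty
            termBuffer[0] = ':';
            System.out.println(termBuffer[0] == '\u0000'); // false: punctuation kept
        }
    }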
Thank you for your help.

2013/1/14 Hankyu Kim <gksr...@gmail.com>

> I did intend to ignore all the spaces, so that's not the problem.
>
> Here's the tokenization chain in the customAnalyser class, extending
> Analyzer:
>
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName,
>             Reader reader) {
>         NGramTokenizer src = new NGramTokenizer(matchVersion, reader); // My NGramTokenizer
>         TokenStream tok = new LowerCaseFilter(matchVersion, src);
>         return new TokenStreamComponents(src, tok);
>     }
>
> NGramTokenizer's incrementToken() method:
>
>     @Override
>     public boolean incrementToken() throws IOException
>     {
>         clearAttributes();
>         char[] termBuffer = termAtt.buffer();
>         termAtt.setLength(GRAM_SIZE);
>
>         startOffset++; // Values for the offset attribute
>         offsetAtt.setOffset(startOffset, startOffset + GRAM_SIZE - 1);
>
>         do
>         {
>             termBuffer[0] = termBuffer[1]; // Shift characters to the left
>             termBuffer[1] = termBuffer[2];
>
>             // Get the next non-whitespace character
>             int c = ' ';
>             while (Character.isWhitespace(c))
>             {
>                 if (position >= dataLength) // Refill the buffer if position goes out of bounds
>                 {
>                     if (charUtils.fill(iobuffer, input))
>                     {
>                         dataLength = iobuffer.getLength();
>                         position = 0;
>                     }
>                     else // EOF
>                         return false;
>                 }
>
>                 c = charUtils.codePointAt(iobuffer.getBuffer(), position); // Get the next character
>                 position++;
>             }
>
>             Character.toChars(c, termBuffer, GRAM_SIZE - 1);
>             // System.out.print("'" + termBuffer[0] + termBuffer[1] + termBuffer[2] + "', ");
>             // This is how I got the output in the last email
>         }
>         while (Character.getNumericValue(termBuffer[0]) == -1);
>
>         return true;
>     }
>
> 2013/1/14 Ian Lea <ian....@gmail.com>
>
>> In fact I see you are ignoring all spaces between words. Maybe that's
>> deliberate. Break it down into the smallest possible complete code
>> sample that shows the problem and post that.
>>
>> --
>> Ian.
>>
>> On Mon, Jan 14, 2013 at 11:02 AM, Ian Lea <ian....@gmail.com> wrote:
>> > It won't be IndexWriter or IndexWriterConfig. What exactly does your
>> > analyzer do - what is the full chain of tokenization? Are you saying
>> > that ':)a' and ')an' are not indexed? Surely that is correct given
>> > your input with a space after the :). And before as well, so 's:)' is
>> > also suspect.
>> >
>> > --
>> > Ian.
>> >
>> > On Mon, Jan 14, 2013 at 7:42 AM, Hankyu Kim <gksr...@gmail.com> wrote:
>> >> I'm working with Lucene 4.0 and I didn't use Lucene's QueryParser, so
>> >> setAllowLeadingWildcard() is irrelevant.
>> >> I also realised the issue wasn't with querying; it was the indexing
>> >> which left out the terms with a leading special character.
>> >>
>> >> My goal was to do a fuzzy match by creating a trigram index. The idea
>> >> is to tokenize the documents into trigrams, not by words, during
>> >> indexing and searching, so Lucene can search for part of a word or
>> >> phrase.
>> >>
>> >> Say the original text in the document said: "Sample text with special
>> >> characters :) and such"
>> >> It's tokenized into
>> >> 'sam', 'amp', 'mpl', 'ple', 'let', 'ete', 'tex', 'ext', 'xtw', 'twi',
>> >> 'wit', 'ith', 'ths', 'hsp', 'spe', 'pec', 'eci', 'cia', 'ial', 'alc',
>> >> 'lch', 'cha', 'har', 'ara', 'rac', 'act', 'cte', 'ter', 'ers', 'rs:',
>> >> 's:)', ':)a', ')an', 'and', 'nds', 'dsu', 'suc', 'uch'.
>> >> The above is output from my tokenizer, so there's nothing wrong with
>> >> creating trigrams. However, when I check the index with lukeall, all
>> >> the other trigrams are indexed correctly except for the terms ':)a'
>> >> and ')an'.
>> >> Since the missing terms are related to Lucene's special characters, I
>> >> don't think it's got to do with my custom code.
>> >>
>> >> I only changed the analyser in IndexFiles.java from the demo to index
>> >> the file. Honestly, I can't even locate the exact class in which the
>> >> problem is caused. I'm only guessing that IndexWriterConfig or
>> >> IndexWriter is discarding the terms with leading special characters.
>> >>
>> >> I hope the above information helps.
>> >>
>> >> 2013/1/11 Ian Lea <ian....@gmail.com>
>> >>
>> >>> QueryParser has a setAllowLeadingWildcard() method. Could that be
>> >>> relevant?
>> >>>
>> >>> What version of Lucene? Can you post some simple examples of what
>> >>> does/doesn't work? Post the smallest possible, but complete, code
>> >>> that demonstrates the problem?
>> >>>
>> >>> With any question that mentions a custom version of something, that
>> >>> custom version has to be the prime suspect for any problems.
>> >>>
>> >>> --
>> >>> Ian.
>> >>>
>> >>> On Thu, Jan 10, 2013 at 12:08 PM, Hankyu Kim <gksr...@gmail.com> wrote:
>> >>> > Hi.
>> >>> >
>> >>> > I've created a custom analyzer that treats special characters just
>> >>> > like any other. The index works fine all the time, even when the
>> >>> > query includes special characters, except when the special
>> >>> > characters come at the beginning of the query.
>> >>> >
>> >>> > I'm using SpanTermQuery and WildcardQuery, and they both seem to
>> >>> > suffer the same issue with queries beginning with special
>> >>> > characters. Is it a limitation of Lucene or am I missing something?
>> >>> >
>> >>> > Thanks
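P.S. For the archives: a simplified, runnable sketch of the fix. It mirrors
the shifting-window loop from my incrementToken() above but reads from a
String instead of a Reader, so the buffer refill and attribute plumbing are
stripped out (those simplifications are mine, not the real tokenizer). The
only substantive change is the loop condition, which tests for the zero char
directly instead of calling getNumericValue():

    import java.util.ArrayList;
    import java.util.List;

    public class TrigramDemo {
        private static final int GRAM_SIZE = 3;

        public static List<String> trigrams(String text) {
            List<String> result = new ArrayList<>();
            char[] termBuffer = new char[GRAM_SIZE]; // slots default to '\u0000'
            int position = 0;

            while (true) {
                do {
                    termBuffer[0] = termBuffer[1]; // shift characters to the left
                    termBuffer[1] = termBuffer[2];

                    // Get the next non-whitespace character.
                    char c = ' ';
                    while (Character.isWhitespace(c)) {
                        if (position >= text.length())
                            return result; // EOF
                        c = text.charAt(position++);
                    }
                    termBuffer[GRAM_SIZE - 1] = Character.toLowerCase(c);
                }
                // Corrected test: keep looping only until the window is full.
                // The old test, getNumericValue(termBuffer[0]) == -1, also
                // skipped trigrams starting with punctuation.
                while (termBuffer[0] == '\u0000');

                result.add(new String(termBuffer));
            }
        }

        public static void main(String[] args) {
            // ':)a' and ')an' now survive tokenization.
            System.out.println(trigrams("Sample text with special characters :) and such"));
        }
    }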