I did intend to ignore all the spaces, so that's not the problem.

Here's the tokenization chain in my customAnalyser class, which extends Analyzer:

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        NGramTokenizer src = new NGramTokenizer(matchVersion, reader); // My NGramTokenizer
        TokenStream tok = new LowerCaseFilter(matchVersion, src);
        return new TokenStreamComponents(src, tok);
    }

NGramTokenizer's incrementToken() method:

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        char[] termBuffer = termAtt.buffer();
        termAtt.setLength(GRAM_SIZE);
        startOffset++;
        // Values for offset attribute
        offsetAtt.setOffset(startOffset, startOffset + GRAM_SIZE - 1);
        do {
            termBuffer[0] = termBuffer[1]; // Shift characters to left
            termBuffer[1] = termBuffer[2];
            // Get next non-whitespace character
            int c = ' ';
            while (Character.isWhitespace(c)) {
                if (position >= dataLength) { // Read in buffer, if position gets out of bound
                    if (charUtils.fill(iobuffer, input)) {
                        dataLength = iobuffer.getLength();
                        position = 0;
                    } else { // EOF
                        return false;
                    }
                }
                c = charUtils.codePointAt(iobuffer.getBuffer(), position); // Get next character
                position++;
            }
            Character.toChars(c, termBuffer, GRAM_SIZE - 1);
            // System.out.print("'" + termBuffer[0] + termBuffer[1] + termBuffer[2] + "', ");
            // This is how I got the output in the last email
        } while (Character.getNumericValue(termBuffer[0]) == -1);
        return true;
    }

2013/1/14 Ian Lea <ian....@gmail.com>

> In fact I see you are ignoring all spaces between words. Maybe that's
> deliberate. Break it down into the smallest possible complete code
> sample that shows the problem and post that.
>
>
> --
> Ian.
>
>
> On Mon, Jan 14, 2013 at 11:02 AM, Ian Lea <ian....@gmail.com> wrote:
> > It won't be IndexWriter or IndexWriterConfig. What exactly does your
> > analyzer do - what is the full chain of tokenization? Are you saying
> > that ':)a' and ')an' are not indexed? Surely that is correct given
> > your input with a space after the :). And before as well so 's:)', is
> > also suspect.
> >
> > --
> > Ian.
> >
> >
> > On Mon, Jan 14, 2013 at 7:42 AM, Hankyu Kim <gksr...@gmail.com> wrote:
> >> I'm working with Lucene 4.0 and I didn't use lucene's QueryParser, so
> >> setAllowLeadingWildcard() is irrelevant.
> >> I also realised the issue wasn't with querying, but it was indexing which
> >> left the terms with leading special characters out.
> >>
> >> My goal was to do a fuzzy match by creating a trigram index. The idea is to
> >> tokenize the documents into trigrams, not by words during indexing and
> >> searching so lucene can search for part of a word or phrase.
> >>
> >> Say the original text in the document said : "Sample text with special
> >> characters :) and such"
> >> It's tokenized into
> >> 'sam', 'amp', 'mpl', 'ple', 'let', 'ete', 'tex', 'ext', 'xtw', 'twi',
> >> 'wit', 'ith', 'ths', 'hsp', 'spe', 'pec', 'eci', 'cia', 'ial', 'alc',
> >> 'lch', 'cha', 'har', 'ara', 'rac', 'act', 'cte', 'ter', 'ers', 'rs:',
> >> 's:)', ':)a', ')an', 'and', 'nds', 'dsu', 'suc', 'uch'.
> >> The above is output from my tokenizer so there's nothing wrong with
> >> creating trigrams. However, when I check the index with lukeall, all the
> >> other trigrams are indexed correctly except for the terms ':)a' and ')an'.
> >> Since the missing indexes are related to lucene's special characters, I
> >> don't think it's got to do with my custom code.
> >>
> >> I only changed analyser in the IndexFiles.java from demo to index the file.
> >> Honestly, I can't locate even the exact class in which the problem is
> >> caused. I'm only guessing IndexWriterConfig or IndexWriter is discarding
> >> the terms with leading special characters.
> >>
> >> I hope the above information helps.
> >>
> >> 2013/1/11 Ian Lea <ian....@gmail.com>
> >>
> >>> QueryParser has a setAllowLeadingWildcard() method. Could that be
> >>> relevant?
> >>>
> >>> What version of lucene? Can you post some simple examples of what
> >>> does/doesn't work? Post the smallest possible, but complete, code that
> >>> demonstrates the problem?
> >>>
> >>>
> >>> With any question that mentions a custom version of something, that
> >>> custom version has to be the prime suspect for any problems.
> >>>
> >>>
> >>> --
> >>> Ian.
> >>>
> >>>
> >>> On Thu, Jan 10, 2013 at 12:08 PM, Hankyu Kim <gksr...@gmail.com> wrote:
> >>> > Hi.
> >>> >
> >>> > I've created a custom analyzer that treats special characters just like any
> >>> > other. The index works fine all the time even when the query includes
> >>> > special characters, except when the special characters come to the beginning
> >>> > of the query.
> >>> >
> >>> > I'm using spanTermQuery and wildCardQuery, and they both seem to suffer the
> >>> > same issue with queries beginning with special characters. Is it a
> >>> > limitation of Lucene or am I missing something?
> >>> >
> >>> > Thanks
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>> For additional commands, e-mail: java-user-h...@lucene.apache.org
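
A possible lead, offered as a guess from reading the incrementToken() posted at the top of this thread rather than as a confirmed diagnosis: the do/while loop repeats while Character.getNumericValue(termBuffer[0]) == -1, and that method returns -1 for any character that has no numeric value. That covers the unfilled '\0' slots of a fresh buffer, but also ':' and ')'. Because the debug print sits inside the loop, it would still show trigrams that the loop then moves past. The sketch below is plain Java with no Lucene dependency; the class TrigramSkipCheck and both of its methods are hypothetical names made up for illustration. It rebuilds the trigrams of the sample sentence and lists those for which the loop condition would hold.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; it does not use Lucene.
class TrigramSkipCheck {

    // Lowercase the text, drop whitespace, and emit every 3-character window.
    // Assumes BMP characters only, which is enough for the sample sentence.
    static List<String> trigrams(String text) {
        StringBuilder compact = new StringBuilder();
        for (char ch : text.toLowerCase(Locale.ROOT).toCharArray()) {
            if (!Character.isWhitespace(ch)) {
                compact.append(ch);
            }
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= compact.length(); i++) {
            grams.add(compact.substring(i, i + 3));
        }
        return grams;
    }

    // Return the trigrams whose first character makes the posted condition
    // Character.getNumericValue(termBuffer[0]) == -1 evaluate to true.
    static List<String> skippedByLoopCondition(List<String> grams) {
        List<String> skipped = new ArrayList<>();
        for (String gram : grams) {
            if (Character.getNumericValue(gram.charAt(0)) == -1) {
                skipped.add(gram);
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        List<String> grams =
                trigrams("Sample text with special characters :) and such");
        System.out.println(grams.size() + " trigrams");
        System.out.println("skipped: " + skippedByLoopCondition(grams));
    }
}
```

Run as written, this reports 38 trigrams and lists ':)a' and ')an' as the only ones the condition would skip, which matches the two terms reported missing in Luke. If that turns out to be the cause, a check that skips only the initial unfilled slots (for example, by counting how many characters have been read so far) might be worth trying in place of getNumericValue.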