No problem. Glad you found the error. It's always in the custom code somewhere.

--
Ian.

On Mon, Jan 14, 2013 at 12:04 PM, Hankyu Kim <gksr...@gmail.com> wrote:
> I just found the cause of the error, and you were right about my code
> being the source.
> I used "Character.getNumericValue(termBuffer[0]) == -1" to test whether
> termBuffer[0] was still null ('\u0000'), but apparently special
> characters return -1 as well when given as a parameter.
>
> Thank you for your help.
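The pitfall is easy to reproduce in isolation: Character.getNumericValue
returns -1 for any character that has no numeric value, which covers both
the '\u0000' padding left in a freshly cleared term buffer and ordinary
punctuation such as ':' or ')'. A minimal standalone sketch (the class
name is invented for this demo, not from the thread):

    // Demo of the pitfall: getNumericValue cannot distinguish an
    // unwritten '\u0000' buffer slot from punctuation.
    public class NumericValueDemo {
        public static void main(String[] args) {
            System.out.println(Character.getNumericValue('\u0000')); // -1: empty buffer slot
            System.out.println(Character.getNumericValue(':'));      // -1: punctuation too
            System.out.println(Character.getNumericValue(')'));      // -1: punctuation too
            System.out.println(Character.getNumericValue('a'));      // 10: letters map to 10..35

            // A test that only matches the genuinely empty slot:
            char emptySlot = '\u0000';
            char special = ':';
            System.out.println(emptySlot == '\u0000'); // true: the real empty slot
            System.out.println(special == '\u0000');   // false: ':' is a real character
        }
    }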
>
> 2013/1/14 Hankyu Kim <gksr...@gmail.com>
>
>> I did intend to ignore all the spaces, so that's not the problem.
>>
>> Here's the tokenization chain in my custom Analyzer subclass:
>>
>>     @Override
>>     protected TokenStreamComponents createComponents(String fieldName,
>>             Reader reader) {
>>         // My NGramTokenizer
>>         NGramTokenizer src = new NGramTokenizer(matchVersion, reader);
>>         TokenStream tok = new LowerCaseFilter(matchVersion, src);
>>         return new TokenStreamComponents(src, tok);
>>     }
>>
>> NGramTokenizer's incrementToken() method:
>>
>>     @Override
>>     public boolean incrementToken() throws IOException {
>>         clearAttributes();
>>         char[] termBuffer = termAtt.buffer();
>>         termAtt.setLength(GRAM_SIZE);
>>
>>         // Values for the offset attribute
>>         startOffset++;
>>         offsetAtt.setOffset(startOffset, startOffset + GRAM_SIZE - 1);
>>
>>         do {
>>             // Shift characters to the left
>>             termBuffer[0] = termBuffer[1];
>>             termBuffer[1] = termBuffer[2];
>>
>>             // Get the next non-whitespace character
>>             int c = ' ';
>>             while (Character.isWhitespace(c)) {
>>                 // Refill the buffer if position gets out of bounds
>>                 if (position >= dataLength) {
>>                     if (charUtils.fill(iobuffer, input)) {
>>                         dataLength = iobuffer.getLength();
>>                         position = 0;
>>                     } else { // EOF
>>                         return false;
>>                     }
>>                 }
>>                 c = charUtils.codePointAt(iobuffer.getBuffer(), position);
>>                 position++;
>>             }
>>
>>             Character.toChars(c, termBuffer, GRAM_SIZE - 1);
>>             // This is how I got the output in the last email:
>>             // System.out.print("'" + termBuffer[0] + termBuffer[1]
>>             //         + termBuffer[2] + "', ");
>>         } while (Character.getNumericValue(termBuffer[0]) == -1);
>>
>>         return true;
>>     }
>>
>> 2013/1/14 Ian Lea <ian....@gmail.com>
>>
>>> In fact I see you are ignoring all spaces between words. Maybe that's
>>> deliberate. Break it down into the smallest possible complete code
>>> sample that shows the problem and post that.
>>>
>>> --
>>> Ian.
>>>
>>> On Mon, Jan 14, 2013 at 11:02 AM, Ian Lea <ian....@gmail.com> wrote:
>>> > It won't be IndexWriter or IndexWriterConfig. What exactly does your
>>> > analyzer do - what is the full chain of tokenization? Are you saying
>>> > that ':)a' and ')an' are not indexed? Surely that is correct, given
>>> > your input with a space after the :). And before as well, so 's:)'
>>> > is also suspect.
>>> >
>>> > --
>>> > Ian.
>>> >
>>> > On Mon, Jan 14, 2013 at 7:42 AM, Hankyu Kim <gksr...@gmail.com> wrote:
>>> >> I'm working with Lucene 4.0 and I didn't use Lucene's QueryParser,
>>> >> so setAllowLeadingWildcard() is irrelevant.
>>> >> I also realised the issue wasn't with querying; it was indexing,
>>> >> which left the terms with a leading special character out.
>>> >>
>>> >> My goal was to do a fuzzy match by creating a trigram index. The
>>> >> idea is to tokenize the documents into trigrams, not by words,
>>> >> during indexing and searching, so Lucene can search for part of a
>>> >> word or phrase.
>>> >>
>>> >> Say the original text in the document said: "Sample text with
>>> >> special characters :) and such"
>>> >> It's tokenized into
>>> >> 'sam', 'amp', 'mpl', 'ple', 'let', 'ete', 'tex', 'ext', 'xtw',
>>> >> 'twi', 'wit', 'ith', 'ths', 'hsp', 'spe', 'pec', 'eci', 'cia',
>>> >> 'ial', 'alc', 'lch', 'cha', 'har', 'ara', 'rac', 'act', 'cte',
>>> >> 'ter', 'ers', 'rs:', 's:)', ':)a', ')an', 'and', 'nds', 'dsu',
>>> >> 'suc', 'uch'.
>>> >> The above is output from my tokenizer, so there's nothing wrong
>>> >> with creating trigrams. However, when I check the index with
>>> >> lukeall, all the other trigrams are indexed correctly except for
>>> >> the terms ':)a' and ')an'. Since the missing terms start with
>>> >> Lucene's special characters, I don't think it's got to do with my
>>> >> custom code.
>>> >>
>>> >> I only changed the analyzer in IndexFiles.java from the demo to
>>> >> index the file. Honestly, I can't locate even the exact class in
>>> >> which the problem is caused. I'm only guessing that
>>> >> IndexWriterConfig or IndexWriter is discarding the terms with
>>> >> leading special characters.
>>> >>
>>> >> I hope the above information helps.
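Besides lukeall, the same check can be done programmatically by
enumerating a field's indexed terms. A minimal sketch against the Lucene
4.0 API; the index path "index" and field name "contents" are assumptions
borrowed from the demo setup and would need adjusting:

    import java.io.File;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.BytesRef;

    public class DumpTerms {
        public static void main(String[] args) throws Exception {
            // "index" and "contents" are assumptions from the demo setup
            IndexReader reader = DirectoryReader.open(
                    FSDirectory.open(new File("index")));
            Terms terms = MultiFields.getTerms(reader, "contents");
            if (terms != null) {
                TermsEnum te = terms.iterator(null); // pass null: no enum to reuse
                BytesRef term;
                while ((term = te.next()) != null) {
                    // Every indexed trigram prints here; ':)a' and ')an'
                    // should appear once the tokenizer bug is fixed.
                    System.out.println("'" + term.utf8ToString() + "'");
                }
            }
            reader.close();
        }
    }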
>>> >>
>>> >> 2013/1/11 Ian Lea <ian....@gmail.com>
>>> >>
>>> >>> QueryParser has a setAllowLeadingWildcard() method. Could that be
>>> >>> relevant?
>>> >>>
>>> >>> What version of Lucene? Can you post some simple examples of what
>>> >>> does/doesn't work? Post the smallest possible, but complete, code
>>> >>> that demonstrates the problem?
>>> >>>
>>> >>> With any question that mentions a custom version of something,
>>> >>> that custom version has to be the prime suspect for any problems.
>>> >>>
>>> >>> --
>>> >>> Ian.
>>> >>>
>>> >>> On Thu, Jan 10, 2013 at 12:08 PM, Hankyu Kim <gksr...@gmail.com> wrote:
>>> >>> > Hi.
>>> >>> >
>>> >>> > I've created a custom analyzer that treats special characters
>>> >>> > just like any other. The index works fine all the time, even
>>> >>> > when the query includes special characters, except when the
>>> >>> > special characters come at the beginning of the query.
>>> >>> >
>>> >>> > I'm using SpanTermQuery and WildcardQuery, and they both seem to
>>> >>> > suffer the same issue with queries beginning with special
>>> >>> > characters. Is it a limitation of Lucene or am I missing
>>> >>> > something?
>>> >>> >
>>> >>> > Thanks
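On the query-side question that opened the thread: QueryParser rejects a
leading wildcard by default, which is what setAllowLeadingWildcard()
relaxes, but that restriction only applies to parsed query strings. Built
directly, WildcardQuery and SpanTermQuery accept terms beginning with
special characters; the terms just have to exist in the index. A minimal
sketch against the Lucene 4.0 API, assuming a hypothetical field name
"contents" and an already-open IndexSearcher:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.WildcardQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class LeadingSpecialCharQueries {
        // Searcher setup omitted; field name "contents" is an assumption.
        static void run(IndexSearcher searcher) throws Exception {
            // Exact trigram term starting with a special character
            SpanTermQuery exact = new SpanTermQuery(new Term("contents", ":)a"));
            TopDocs hits = searcher.search(exact, 10);
            System.out.println("':)a' matches: " + hits.totalHits);

            // A leading wildcard is legal when the query is built
            // directly; only QueryParser blocks this pattern by default.
            WildcardQuery wild = new WildcardQuery(new Term("contents", "*)a"));
            hits = searcher.search(wild, 10);
            System.out.println("'*)a' matches: " + hits.totalHits);
        }
    }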