Don't ever do this:

    String word = new String(ref.bytes);

This has the following problems:

- It ignores the character set! In general: never, ever use new String(byte[]) without the second charset parameter. A byte[] is not a String; depending on the default charset of your machine, this returns garbage.
- It ignores ref.length.
- It ignores ref.offset.

The last two points are almost certainly where your strange source words come from: the byte array behind a BytesRef is reused from term to term, so decoding the whole array makes a short term pick up trailing bytes left over from a longer, earlier term ("applee" is presumably "apple" plus a leftover byte of "apache").

Use the following code to convert a UTF-8 encoded BytesRef to a String:

    String word = ref.utf8ToString();
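To illustrate (a minimal, self-contained sketch; the buffer contents are made up to mimic the reuse effect, but the public bytes/offset/length fields and utf8ToString() are the real BytesRef API):

    import java.nio.charset.StandardCharsets; // Java 7+; on Java 6 use Charset.forName("UTF-8")

    import org.apache.lucene.util.BytesRef;

    public class BytesRefDemo {
        public static void main(String[] args) {
            // A backing array that still holds bytes of an earlier, longer
            // term ("apache"), with "apple" written over the first 5 bytes:
            byte[] buffer = "applee".getBytes(StandardCharsets.UTF_8);
            BytesRef ref = new BytesRef(buffer, 0, 5); // the actual term is "apple"

            System.out.println(new String(ref.bytes)); // "applee" - whole buffer, default charset
            System.out.println(ref.utf8ToString());    // "apple"  - correct
            // The hand-written equivalent, honoring offset, length and charset:
            System.out.println(new String(ref.bytes, ref.offset, ref.length, StandardCharsets.UTF_8));
        }
    }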
Thanks :-)

P.S.: I posted this here because I want to prevent the code you posted from being used by anybody else.
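For reference, this is the term loop from your reindexOn method below with just that one line changed (a fragment, not compilable on its own; all other names are taken from your code):

    BytesRef ref = null;
    while ((ref = iter.next()) != null) {
        String word = ref.utf8ToString(); // was: new String(ref.bytes)
        if (word.length() < 3) {
            continue;
        }
        if (wordsMap.containsKey(word)) {
            throw new IllegalStateException("Word " + word + " Already Exists");
        }
        wordsMap.put(word, indexReader.docFreq(new Term(keyword, word)));
    }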
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Mansour Al Akeel [mailto:mansour.alak...@gmail.com]
> Sent: Saturday, June 23, 2012 12:26 AM
> To: java-user@lucene.apache.org
> Subject: StandardTokenizer and split tokens
>
> Hello all,
>
> I am trying to write a simple autosuggest functionality. I was looking at some
> autosuggest code and came across this post:
> http://stackoverflow.com/questions/120180/how-to-do-query-auto-completion-suggestions-in-lucene
> I have been stuck on some strange words, trying to see how they are
> generated. Here's the Analyzer:
>
> public class AutoCompleteAnalyzer extends Analyzer {
>     public TokenStream tokenStream(String fieldName, Reader reader) {
>         TokenStream result = new StandardTokenizer(Version.LUCENE_36, reader);
>         result = new EdgeNGramTokenFilter(result, EdgeNGramTokenFilter.Side.FRONT, 1, 20);
>         return result;
>     }
> }
>
> And this is the relevant method that does the indexing. It's called with
> reindexOn("title"):
>
> private void reindexOn(String keyword) throws CorruptIndexException, IOException {
>     log.info("indexing on " + keyword);
>     Analyzer analyzer = new AutoCompleteAnalyzer();
>     IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
>     IndexWriter analyticalWriter = new IndexWriter(suggestIndexDirectory, config);
>     analyticalWriter.commit(); // needed to create the initial index
>     IndexReader indexReader = IndexReader.open(productsIndexDirectory);
>     Map<String, Integer> wordsMap = new HashMap<String, Integer>();
>     LuceneDictionary dict = new LuceneDictionary(indexReader, keyword);
>     BytesRefIterator iter = dict.getWordsIterator();
>     BytesRef ref = null;
>     while ((ref = iter.next()) != null) {
>         String word = new String(ref.bytes);
>         int len = word.length();
>         if (len < 3) {
>             continue;
>         }
>         if (wordsMap.containsKey(word)) {
>             String msg = "Word " + word + " Already Exists";
>             throw new IllegalStateException(msg);
>         }
>         wordsMap.put(word, indexReader.docFreq(new Term(keyword, word)));
>     }
>
>     for (String word : wordsMap.keySet()) {
>         Document doc = new Document();
>         Field field = new Field(SOURCE_WORD_FIELD, word, Field.Store.YES, Field.Index.NOT_ANALYZED);
>         doc.add(field);
>         field = new Field(GRAMMED_WORDS_FIELD, word, Field.Store.YES, Field.Index.ANALYZED);
>         doc.add(field);
>         String count = Integer.toString(wordsMap.get(word));
>         field = new Field(COUNT_FIELD, count, Field.Store.NO, Field.Index.NOT_ANALYZED); // count
>         doc.add(field);
>         analyticalWriter.addDocument(doc);
>     }
>     analyticalWriter.commit();
>     analyticalWriter.close();
>     indexReader.close();
> }
>
> private static final String GRAMMED_WORDS_FIELD = "words";
> private static final String SOURCE_WORD_FIELD = "sourceWord";
> private static final String COUNT_FIELD = "count";
>
> And now, my unit test:
>
> @BeforeClass
> public static void setUp() throws CorruptIndexException, IOException {
>     String idxFileName = "myIndexDirectory";
>     Indexer indexer = new Indexer(idxFileName);
>     indexer.addDoc("Apache Lucene in Action");
>     indexer.addDoc("Lord of the Rings");
>     indexer.addDoc("Apache Solr in Action");
>     indexer.addDoc("apples and Oranges");
>     indexer.addDoc("apple iphone");
>     indexer.reindexKeywords();
>     search = new SearchEngine(idxFileName);
> }
>
> The strange part: looking under the index, I found there are sourceWords
> (lordne, applee, solres). I understand that the ngram filter will produce
> prefixes of each word, e.g.:
>
> l
> lo
> lor
> lord
>
> All of these go into one field, but where do "lordne" and "solres" come from?
> I checked the docs for this and looked into Jira, but didn't find relevant info.
> Is there something I am missing?
>
> I understand there could be easier ways to create this functionality
> (http://wiki.apache.org/lucene-java/SpellChecker), but I would like to resolve
> this issue and understand if I am doing something wrong.
>
> Thank you in advance.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org