Don't ever do this:

    String word = new String(ref.bytes);

This has the following problems:

- It ignores the character set! In general: never, ever use new String(byte[]) without the second charset parameter. A byte[] is not a String; depending on the default charset of your machine, this returns garbage.
- It ignores ref.length.
- It ignores ref.offset.

The last two points are almost certainly where your strange source words come from: the byte array behind a BytesRef is reused from term to term, so decoding the whole array makes a short term pick up trailing bytes left over from a longer, earlier term ("applee" is presumably "apple" plus a leftover byte of "apache").

Use the following code to convert a UTF-8 encoded BytesRef to a String:

    String word = ref.utf8ToString();
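To illustrate (a minimal, self-contained sketch; the buffer contents are made up to mimic the reuse effect, but the public bytes/offset/length fields and utf8ToString() are the real BytesRef API):

    import java.nio.charset.StandardCharsets; // Java 7+; on Java 6 use Charset.forName("UTF-8")

    import org.apache.lucene.util.BytesRef;

    public class BytesRefDemo {
        public static void main(String[] args) {
            // A backing array that still holds bytes of an earlier, longer
            // term ("apache"), with "apple" written over the first 5 bytes:
            byte[] buffer = "applee".getBytes(StandardCharsets.UTF_8);
            BytesRef ref = new BytesRef(buffer, 0, 5); // the actual term is "apple"

            System.out.println(new String(ref.bytes)); // "applee" - whole buffer, default charset
            System.out.println(ref.utf8ToString());    // "apple"  - correct
            // The hand-written equivalent, honoring offset, length and charset:
            System.out.println(new String(ref.bytes, ref.offset, ref.length, StandardCharsets.UTF_8));
        }
    }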
Thanks :-)

P.S.: I posted this here because I want to prevent the code you posted from being used by anybody else.
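For reference, this is the term loop from your reindexOn method below with just that one line changed (a fragment, not compilable on its own; all other names are taken from your code):

    BytesRef ref = null;
    while ((ref = iter.next()) != null) {
        String word = ref.utf8ToString(); // was: new String(ref.bytes)
        if (word.length() < 3) {
            continue;
        }
        if (wordsMap.containsKey(word)) {
            throw new IllegalStateException("Word " + word + " Already Exists");
        }
        wordsMap.put(word, indexReader.docFreq(new Term(keyword, word)));
    }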
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Mansour Al Akeel [mailto:mansour.alak...@gmail.com]
> Sent: Saturday, June 23, 2012 12:26 AM
> To: java-user@lucene.apache.org
> Subject: StandardTokenizer and split tokens
>
> Hello all,
>
> I am trying to write a simple autosuggest functionality. I was looking at some
> autosuggest code and came across this post:
> http://stackoverflow.com/questions/120180/how-to-do-query-auto-completion-suggestions-in-lucene
> I have been stuck on some strange words, trying to see how they are
> generated. Here's the Analyzer:
>
> public class AutoCompleteAnalyzer extends Analyzer {
>     public TokenStream tokenStream(String fieldName, Reader reader) {
>         TokenStream result = new StandardTokenizer(Version.LUCENE_36, reader);
>         result = new EdgeNGramTokenFilter(result, EdgeNGramTokenFilter.Side.FRONT, 1, 20);
>         return result;
>     }
> }
>
> And this is the relevant method that does the indexing. It's called with
> reindexOn("title"):
>
> private void reindexOn(String keyword) throws CorruptIndexException, IOException {
>     log.info("indexing on " + keyword);
>     Analyzer analyzer = new AutoCompleteAnalyzer();
>     IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
>     IndexWriter analyticalWriter = new IndexWriter(suggestIndexDirectory, config);
>     analyticalWriter.commit(); // needed to create the initial index
>     IndexReader indexReader = IndexReader.open(productsIndexDirectory);
>     Map<String, Integer> wordsMap = new HashMap<String, Integer>();
>     LuceneDictionary dict = new LuceneDictionary(indexReader, keyword);
>     BytesRefIterator iter = dict.getWordsIterator();
>     BytesRef ref = null;
>     while ((ref = iter.next()) != null) {
>         String word = new String(ref.bytes);
>         int len = word.length();
>         if (len < 3) {
>             continue;
>         }
>         if (wordsMap.containsKey(word)) {
>             String msg = "Word " + word + " Already Exists";
>             throw new IllegalStateException(msg);
>         }
>         wordsMap.put(word, indexReader.docFreq(new Term(keyword, word)));
>     }
>
>     for (String word : wordsMap.keySet()) {
>         Document doc = new Document();
>         Field field = new Field(SOURCE_WORD_FIELD, word, Field.Store.YES, Field.Index.NOT_ANALYZED);
>         doc.add(field);
>         field = new Field(GRAMMED_WORDS_FIELD, word, Field.Store.YES, Field.Index.ANALYZED);
>         doc.add(field);
>         String count = Integer.toString(wordsMap.get(word));
>         field = new Field(COUNT_FIELD, count, Field.Store.NO, Field.Index.NOT_ANALYZED); // count
>         doc.add(field);
>         analyticalWriter.addDocument(doc);
>     }
>     analyticalWriter.commit();
>     analyticalWriter.close();
>     indexReader.close();
> }
>
> private static final String GRAMMED_WORDS_FIELD = "words";
> private static final String SOURCE_WORD_FIELD = "sourceWord";
> private static final String COUNT_FIELD = "count";
>
> And now, my unit test:
>
> @BeforeClass
> public static void setUp() throws CorruptIndexException, IOException {
>     String idxFileName = "myIndexDirectory";
>     Indexer indexer = new Indexer(idxFileName);
>     indexer.addDoc("Apache Lucene in Action");
>     indexer.addDoc("Lord of the Rings");
>     indexer.addDoc("Apache Solr in Action");
>     indexer.addDoc("apples and Oranges");
>     indexer.addDoc("apple iphone");
>     indexer.reindexKeywords();
>     search = new SearchEngine(idxFileName);
> }
>
> The strange part: looking under the index, I found there are sourceWords
> (lordne, applee, solres). I understand that the ngram filter will produce
> prefixes of each word, e.g.:
>
> l
> lo
> lor
> lord
>
> All of these go into one field, but where do "lordne" and "solres" come from?
> I checked the docs for this and looked into Jira, but didn't find relevant info.
> Is there something I am missing?
>
> I understand there could be easier ways to create this functionality
> (http://wiki.apache.org/lucene-java/SpellChecker), but I would like to resolve
> this issue and understand if I am doing something wrong.
>
> Thank you in advance.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org