The Apache Commons Codec library includes a Double Metaphone implementation. I ran a series of experiments that store the Double Metaphone encoding of strings in the index itself and, when the field being searched holds Double Metaphone codes, encode the search terms the same way before searching. This works very well.

For my test rig, I added four variants of a field to each document:

1) name-tokenized-doublemetaphone
2) name-tokenized
3) name-untokenized-doublemetaphone
4) name-untokenized

Here is the code that adds the four variants to the index:

    import org.apache.commons.codec.language.DoubleMetaphone;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    private void addProductNamesToDoc(Document poiDocument, IdentityType id) {
        DoubleMetaphone dm = new DoubleMetaphone();
        dm.setMaxCodeLen(100);
        // For each name in the list of names. A name can be "SCHAAD FAMILY ALMONDS", for example.
        for (Object name : id.getNames().getPOIName()) {
            String text = ((POINameType) name).getText();
            if (log.isDebugEnabled()) log.debug(text);
            if (null != text) {
                // Tokenize manually. (gosh, I thought the analyser would do this)
                String[] splits = text.split("\\s");
                // Add tokenized double metaphone and plain tokenized variants of the name.
                for (String component : splits) {
                    poiDocument.add(new Field("name-tokenized-doublemetaphone",
                            dm.doubleMetaphone(component), Field.Store.YES, Field.Index.ANALYZED));
                    poiDocument.add(new Field("name-tokenized",
                            component, Field.Store.YES, Field.Index.ANALYZED));
                }
                // Add untokenized double metaphone and untokenized plain variants.
                poiDocument.add(new Field("name-untokenized-doublemetaphone",
                        dm.doubleMetaphone(text), Field.Store.YES, Field.Index.ANALYZED));
                poiDocument.add(new Field("name-untokenized",
                        text, Field.Store.YES, Field.Index.ANALYZED));
            }
        }
    }

Testing misspelled terms with PhraseQuery shows that only name-tokenized-doublemetaphone tolerates misspellings. So this seems to be a nice and efficient way to accept input that is wildly misspelled.
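To illustrate the query side of this approach (encode each search token the same way the indexed tokens were encoded, then match the codes as a phrase), here is a minimal, library-free sketch. The class PhoneticQuerySketch, the method names, and in particular toyKey are hypothetical: toyKey is a crude placeholder and not Double Metaphone. In the real setup the encoder passed in would presumably be DoubleMetaphone's doubleMetaphone method, and the matching would be done by Lucene's PhraseQuery rather than by a list comparison.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

class PhoneticQuerySketch {

    // Placeholder encoder (an assumption for illustration, NOT Double Metaphone):
    // uppercases, keeps the first letter, drops later vowels (including Y),
    // and skips a letter equal to the last one already kept.
    static String toyKey(String word) {
        String upper = word.toUpperCase();
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < upper.length(); i++) {
            char c = upper.charAt(i);
            if (!Character.isLetter(c)) continue;
            boolean vowel = "AEIOUY".indexOf(c) >= 0;
            if (key.length() > 0 && vowel) continue;
            if (key.length() > 0 && key.charAt(key.length() - 1) == c) continue;
            key.append(c);
        }
        return key.toString();
    }

    // Split on whitespace and encode every token, mirroring the per-token
    // encoding used for the name-tokenized-doublemetaphone field.
    static List<String> encodeTokens(String text, Function<String, String> encoder) {
        List<String> codes = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) codes.add(encoder.apply(token));
        }
        return codes;
    }

    // Rough stand-in for a zero-slop phrase match: the encoded query tokens
    // must appear consecutively within the encoded indexed tokens.
    static boolean phraseMatches(String indexedName, String query,
                                 Function<String, String> encoder) {
        List<String> indexed = encodeTokens(indexedName, encoder);
        List<String> wanted = encodeTokens(query, encoder);
        return !wanted.isEmpty() && Collections.indexOfSubList(indexed, wanted) >= 0;
    }

    public static void main(String[] args) {
        String name = "SCHAAD FAMILY ALMONDS";
        // Misspelled query, phonetic-style codes on both sides.
        System.out.println(phraseMatches(name, "schad fammily", PhoneticQuerySketch::toyKey)); // true
        // Same query against the plain tokens (identity encoder).
        System.out.println(phraseMatches(name, "schad fammily", s -> s)); // false
    }
}
```

The point of the sketch is only that the index side and the query side must go through the same per-token encoder; how forgiving the match is depends entirely on the encoder that is plugged in.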
Can someone explain to me exactly what Field.Store.YES and Field.Index.ANALYZED do? Should I tune these values?
Geoff Hendrey
Software Architect
deCarta
Four North Second Street, Suite 950
San Jose, CA 95113
office 408.625.3522
www.decarta.com