I have to think about this a bit, but that may work. I just have to make sure no "undesirable" side effects occur. I certainly want to be able to search for a phrase and not have it match all the individual bits, but that should already work using the mechanism I already have in place.
Donna "Carl Austin" <carl.aus...@detica.com> wrote on 08/07/2009 10:50:08 AM: > [image removed] > > RE: Is there a way for me to handle a multiword synonym correctly? > > Carl Austin > > to: > > java-user > > 08/07/2009 10:50 AM > > Please respond to java-user > > I may be over simplifying here but in this case don't you just need to > use an analyzer that breaks the word "SAP.EM.FIN.AM" on full stops and > throws them out, so that it is indexed as terms "SAP" "EM" "FIN" "AM". > This is the same as it will index "SAP EM FIN AM" as long as you break > on whitespace too. I.E SimpleAnalyzer (runs of letter characters are > tokens) > > Then the query for "SAP EM FIN AM" will match both. > > Carl > > > -----Original Message----- > From: Donna L Gresh [mailto:gr...@us.ibm.com] > Sent: 07 August 2009 15:35 > To: java-user@lucene.apache.org > Subject: Is there a way for me to handle a multiword synonym correctly? > > I saw some discussion on the board but I'm not sure I've got quite the > same problem. As an example, I have a query that might be a technical > skill: > > SAP EM FIN AM > > I would like that to match a document that has *either* SAP.EM.FIN.AM or > "SAP EM FIN AM" (in that order and all together, not spread out through > the document). > > The approach I had tried was at index time if I saw SAP.EM.FIN.AM I > would consider "SAP EM FIN AM" a synonym for it, using the Lucene in > Action example. Luke shows me that I have two terms in the index for > this > document: SAP.EM.FIN.AM and "SAP EM FIN AM" (one term). Thus it appears > differently in the index than if it had been organically found as just > the string of tokens, in which case there would be separate terms for > SAP, EM, and so on. > > At query time if I look for "SAP EM FIN AM" it is formed as a phrase > query with a slop of 0 which does *not* match the one term version "SAP > EM FIN AM". (For that matter a simple boolean query doesn't find it > either) Luke confirms the fact that the phrase query does not find my > synonym term. The query "SAP EM FIN AM" finds *only* documents that > originally had those separated tokens in them. > > Is there a way to handle this situation such that at index time I can > turn SAP.EM.FIN.AM into something that will be found with a query for > "SAP EM FIN AM"? > > Thanks for any pointers > > Donna > > > > This message should be regarded as confidential. If you have > received this email in error please notify the sender and destroy it > immediately. > Statements of intent shall only become binding when confirmed in > hard copy by an authorised signatory. The contents of this email > may relate to dealings with other companies within the Detica > Limited group of companies. > > Detica Limited is registered in England under No: 1337451. > > Registered offices: Surrey Research Park, Guildford, Surrey, GU2 7YP, England. > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org >