WhitespaceAnalyzer does preserve those symbols, but not as separate tokens. It simply leaves them attached to the original term.

As an example of what I'm talking about, consider a document that contains (without the quotes) "foo, ".

Now, using WhitespaceAnalyzer, I could only get that document by searching for "foo,". Using StandardAnalyzer or any analyzer that removes punctuation, I could only find it by searching for "foo".
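
To make that concrete, here is a quick way to dump what each analyzer produces for that text. This is an untested sketch against the Lucene 2.x API (TokenStream.next() returning Token); the class name and the "body" field name are just for illustration:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class TokenDump {

    // Print each token the analyzer produces for the given text.
    static void dump(Analyzer analyzer, String text) throws Exception {
        TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.println("[" + t.termText() + "]");
        }
    }

    public static void main(String[] args) throws Exception {
        dump(new WhitespaceAnalyzer(), "foo, ");  // expect: [foo,]
        dump(new StandardAnalyzer(), "foo, ");    // expect: [foo]
    }
}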

I want an analyzer that will allow me to find it if I build a phrase query with the term "foo" followed immediately by ",". After all, the comma may be relevant to the search, but is definitely not part of the word.
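
Concretely, the query I'm hoping to be able to build is something like this (the "body" field name is just for illustration, and it assumes the index was written with an analyzer that emits "," as its own token right after "foo"):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class CommaPhrase {
    public static void main(String[] args) {
        // The comma as an adjacent term in the phrase, not part of "foo".
        PhraseQuery query = new PhraseQuery();
        query.add(new Term("body", "foo"));
        query.add(new Term("body", ","));
        System.out.println(query);  // prints something like body:"foo ,"
    }
}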

Extending StandardAnalyzer is what I had in mind, but I don't know where to start. I also wonder why no one seems to have done it before; it makes me suspect that there's some reason I haven't seen yet that makes it impossible or impractical.
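
In case it helps frame the question: the kind of thing I'm imagining, if extending the StandardAnalyzer grammar turns out to be too involved, is a WhitespaceTokenizer wrapped in a TokenFilter that splits punctuation/symbol runs off into their own tokens. A very rough, untested sketch against the 2.x next()/Token API follows; the class names are invented:

import java.io.IOException;
import java.io.Reader;
import java.util.LinkedList;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Whitespace tokenization, then split each term into alternating runs of
// letters/digits and punctuation/symbols, emitted as separate tokens.
// "foo," becomes the two tokens "foo" and ",", each with the default
// position increment of 1, so a phrase query for "foo" followed by ","
// should match.
public class PunctuationAnalyzer extends Analyzer {

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new PunctuationSplitFilter(new WhitespaceTokenizer(reader));
    }

    static class PunctuationSplitFilter extends TokenFilter {
        private final LinkedList<Token> pending = new LinkedList<Token>();

        PunctuationSplitFilter(TokenStream input) {
            super(input);
        }

        public Token next() throws IOException {
            if (!pending.isEmpty()) {
                return pending.removeFirst();
            }
            Token t = input.next();
            if (t == null) {
                return null;
            }
            String text = t.termText();
            int start = t.startOffset();
            int i = 0;
            while (i < text.length()) {
                // Collect a maximal run that is either all letters/digits
                // or all punctuation/symbols, and emit it as one token.
                boolean word = Character.isLetterOrDigit(text.charAt(i));
                int j = i;
                while (j < text.length()
                        && Character.isLetterOrDigit(text.charAt(j)) == word) {
                    j++;
                }
                pending.addLast(new Token(text.substring(i, j),
                        start + i, start + j));
                i = j;
            }
            return pending.removeFirst();
        }
    }
}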



Karl Wettin wrote:

On 1 Oct 2007, at 15:33, John Byrne wrote:

Has anyone written an analyzer that preserves punctuation and
symbols ("£", "$", "%" etc.) as tokens?

WhitespaceAnalyzer?

You could also extend the lexical rules of StandardAnalyzer.



