On Dec 8, 2005, at 10:15 AM, Beady Geraghty wrote:
Since someone suggested hyphen, the next requestion
is underscore.  I can see more and more of these requests.
Also, people might like to search  for "/usr/include/wchar.h"  (hence,
the slash) and apostrophe etc. There really isn't a set of requirements
upfront. In fact people wants EVERYTHING if
they could, and full flexibility (even though they don't know
whether they will need it or not.)
So it appears that doing something "general" is better.

Keep in mind that even if you tokenize "/usr/include" as "usr" and "include" that a query for "/usr/include" will still match if it is analyzed in a compatible manner. It is important to realize that just because bits and pieces get eaten during analysis that it doesn't make things unfindable. Sure, if someone literally wants to search for "/" only then it is important to keep upfront, but generally this is not the case.

I have been using StandardAnalyzer for the things you mentioned, like
email address, and www.google.com or i.b.m.  Those are good things
for me to have. Since I've used it now, if I change it now, I might break
other
people's dependencies.

WhitespaceTokenizer followed by a LowerCaseFilter also catches the cases you mention.

Again, lots of tests to assert what the users expect is a strong recommendation I have with tokenization.

If you do have a list of pitfalls from javaCC, could you point me to it,
that way, I can think about some of the potential issues and decide
whether I should just abandon using javaCC ?

It all really depends. JavaCC is complex, that is my only reservation for recommending it. If you can get by with simpler analysis techniques, then I say go for those instead.

But, JavaCC is powerful. It simply isn't necessary for the bulk of analysis cases I've come across. Works great for parsing query expressions though.

        Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to