On Dec 8, 2005, at 10:15 AM, Beady Geraghty wrote:
Since someone suggested hyphen, the next requestion
is underscore. I can see more and more of these requests.
Also, people might like to search for "/usr/include/wchar.h" (hence,
the slash) and apostrophe etc. There really isn't a set of
requirements
upfront. In fact people wants EVERYTHING if
they could, and full flexibility (even though they don't know
whether they will need it or not.)
So it appears that doing something "general" is better.
Keep in mind that even if you tokenize "/usr/include" as "usr" and
"include" that a query for "/usr/include" will still match if it is
analyzed in a compatible manner. It is important to realize that
just because bits and pieces get eaten during analysis that it
doesn't make things unfindable. Sure, if someone literally wants to
search for "/" only then it is important to keep upfront, but
generally this is not the case.
I have been using StandardAnalyzer for the things you mentioned, like
email address, and www.google.com or i.b.m. Those are good things
for me to have. Since I've used it now, if I change it now, I
might break
other
people's dependencies.
WhitespaceTokenizer followed by a LowerCaseFilter also catches the
cases you mention.
Again, lots of tests to assert what the users expect is a strong
recommendation I have with tokenization.
If you do have a list of pitfalls from javaCC, could you point me
to it,
that way, I can think about some of the potential issues and decide
whether I should just abandon using javaCC ?
It all really depends. JavaCC is complex, that is my only
reservation for recommending it. If you can get by with simpler
analysis techniques, then I say go for those instead.
But, JavaCC is powerful. It simply isn't necessary for the bulk of
analysis cases I've come across. Works great for parsing query
expressions though.
Erik
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]