Actually, my mind kind of overloaded when I read the following from the (2.1) javadoc....
- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token. - Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split. - Recognizes email addresses and internet hostnames as one token. All of which is just fine and I'm glad someone else wrote the grammar, but I'm finding more and more that I'm constructing my own analyzers/tokenizers instead since I'm in a specialized space, often massaging the input stream outside the analyzer. For instance, is O'Hara best tokenized as one, two, or three tokens? In genealogy, it's best tokenized as ohara, which none of the standard analyzers would treat "properly". As in "just the way I want it to be treated" <G>.... Best Erick On 6/7/07, Michael D. Curtin <[EMAIL PROTECTED]> wrote:
Doron Cohen wrote: >>From the StandardAnalyzer javacc grammar : > // floating point, serial, model numbers, ip addresses, etc. > // every other segment must have at least one digit > <NUM: (<ALPHANUM> <P> <HAS_DIGIT> .... etc. > <#P: ("_"|"-"|"/"|"."|",") > > My understanding of this: a non-whitespace sequence is broken > at either of these 5 chars > _ - / . , > unless the part that follows part has a digit, in which case > it is assumed to be (part of) a serial no., model, etc. Weird. The definition seems to allow expressions of the form A-B-C-D-E-..., where - "-" can be one of the five characters you mentioned - the A, B, C, ... are alphanumeric pseudo-words - A, C, E, ... or B, D, F, ... must have digits, i.e. alternating digit components So "A-1-B-2" and "1-A-2-B" would be kept as single tokens, but "A-B-1-2" would not. Seems more than a little hokey, but I suppose it's been working for a long time, for the most part. --MDC --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]