Doron Cohen wrote:
From the StandardAnalyzer javacc grammar:
// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
<NUM: (<ALPHANUM> <P> <HAS_DIGIT> .... etc.
<#P: ("_"|"-"|"/"|"."|",") >
My understanding of this: a non-whitespace sequence is broken
at any of these 5 chars
_ - / . ,
unless the part that follows has a digit, in which case
it is assumed to be (part of) a serial no., model, etc.
Weird. The definition seems to allow expressions of the form
A-B-C-D-E-..., where
- "-" can be one of the five characters you mentioned
- the A, B, C, ... are alphanumeric pseudo-words
- either all of A, C, E, ... or all of B, D, F, ... must contain a
digit, i.e. every other component has a digit
So "A-1-B-2" and "1-A-2-B" would be kept as single tokens, but "A-B-1-2"
would not. Seems more than a little hokey, but I suppose it's been
working for a long time, for the most part.
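
To make the alternating rule concrete, here is a minimal sketch that checks
a string against the pattern as I described it above. It is only a model of
my reading, not the actual javacc grammar or the tokenizer: the class and
method names are made up for illustration, and "alphanumeric" is
approximated with Character.isLetterOrDigit.

```java
public class AlternatingDigitRule {

    // The five separator characters from the <#P> production quoted above.
    private static final String SEPARATORS = "[_\\-/.,]";

    // True if the segment contains at least one digit.
    static boolean hasDigit(String s) {
        return s.chars().anyMatch(Character::isDigit);
    }

    // Assumption: a "pseudo-word" is a non-empty run of letters and digits.
    static boolean isAlphanumeric(String s) {
        return !s.isEmpty() && s.chars().allMatch(Character::isLetterOrDigit);
    }

    // True if every second segment, starting at index 'start', has a digit.
    static boolean everyOtherHasDigit(String[] parts, int start) {
        for (int i = start; i < parts.length; i += 2) {
            if (!hasDigit(parts[i])) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical check: at least two alphanumeric segments joined by the
    // separators, with digits in all even-indexed or all odd-indexed segments.
    static boolean matchesDescribedPattern(String text) {
        String[] parts = text.split(SEPARATORS, -1); // -1 keeps empty trailing segments
        if (parts.length < 2) {
            return false;
        }
        for (String p : parts) {
            if (!isAlphanumeric(p)) {
                return false;
            }
        }
        return everyOtherHasDigit(parts, 0) || everyOtherHasDigit(parts, 1);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"A-1-B-2", "1-A-2-B", "A-B-1-2"}) {
            System.out.println(s + " -> " + matchesDescribedPattern(s));
        }
    }
}
```

This only answers whether the whole string fits the described shape. It says
nothing about how the real tokenizer would split a string that does not fit.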
--MDC