Doron Cohen wrote:

From the StandardAnalyzer JavaCC grammar:
  // floating point, serial, model numbers, ip addresses, etc.
  // every other segment must have at least one digit
  <NUM: (<ALPHANUM> <P> <HAS_DIGIT> .... etc.
  <#P: ("_"|"-"|"/"|"."|",") >
My understanding of this: a non-whitespace sequence is broken
at any of these 5 chars
   _  -  /  .  ,
unless the part that follows has a digit, in which case
it is assumed to be (part of) a serial no., model, etc.

Weird. The definition seems to allow expressions of the form A-B-C-D-E-..., where
-   "-" stands for any of the five characters you mentioned
-   A, B, C, ... are alphanumeric pseudo-words
-   either A, C, E, ... or B, D, F, ... must all contain a digit, i.e. at
    least every other component has one
So "A-1-B-2" and "1-A-2-B" would be kept as single tokens, but "A-B-1-2" would not. Seems more than a little hokey, but I suppose it's been working for a long time, for the most part.
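
To make the alternating rule concrete, here is a rough Python sketch of it as I read it. This is not the JavaCC grammar itself: the function name is made up, and I am assuming ASCII-only letters and digits for the pseudo-words, which the real tokenizer may not restrict itself to.

```python
import re

# The five <P> separator characters quoted from the grammar: _ - / . ,
SEPARATOR = re.compile(r"[_\-/.,]")
# Assumption: a pseudo-word is one or more ASCII letters or digits.
ALPHANUM = re.compile(r"[A-Za-z0-9]+")


def has_digit(segment):
    """True if the segment contains at least one digit."""
    return any(ch.isdigit() for ch in segment)


def is_single_num_token(text):
    """Hypothetical check: would `text` stay together as one NUM-like token?

    Rule as described in this thread: at least two alphanumeric segments
    joined by separator characters, where either every segment in an even
    position or every segment in an odd position contains a digit.
    """
    segments = SEPARATOR.split(text)
    if len(segments) < 2:
        return False  # no separator at all, so not a NUM candidate
    if not all(ALPHANUM.fullmatch(s) for s in segments):
        return False  # empty or non-alphanumeric segment
    even_ok = all(has_digit(s) for s in segments[0::2])
    odd_ok = all(has_digit(s) for s in segments[1::2])
    return even_ok or odd_ok


for example in ("A-1-B-2", "1-A-2-B", "A-B-1-2"):
    print(example, is_single_num_token(example))
```

The sketch only answers "one token or not" for the whole string; it says nothing about how a rejected string such as "A-B-1-2" actually gets split up.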

--MDC
