Hello Simon,

Some observations that may help:

1. Since the problem is evidently in the lexer, it is
probably easier to test the lexer on its own instead
of going through the parser:

(define (test-lexer str)
  (let ((p (open-input-string str)))
    (port-count-lines! p)
    (let loop ()
      (let ((tok (position-token-token (toy-lexer p))))
        (printf "~a\n" tok)
        (unless (equal? tok 'eof)
          (loop))))))

2. You probably do not want the aliases to contain
whitespace:

(define-lex-abbrevs
  [...]
  (lex:whitespace (:or #\newline #\return #\tab #\space #\vtab)))


(define toy-lexer
  (lexer-src-pos
   [...]
   ((:&
    (:* (char-complement lex:whitespace))
    (complement (:: any-string "is" any-string))) (token-alias lexeme))
   [...]))

3. The problem with the lexer, I think, is that it cannot tell
"fact" from "alias" (because it cannot look ahead to see whether
an "is" follows or not).

For example, if the input string is "remember somefact",
it gets "remember", skips the whitespace, and then
gets "somefact" as an alias. To work around that, I would
change the lexer rules to require quotes around facts,
or something like that.
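
Here is a rough sketch of what the quoted-fact variant might look like.
It is untested against your real grammar. The name quoted-lexer and the
choice of double quotes as delimiters are my own assumptions, and I use
plain `lexer` rather than `lexer-src-pos` only to keep the example short.
When two rules match the same length, the earlier one wins, so the
keyword and email rules are listed before the catch-all alias rule.

```racket
#lang racket
(require parser-tools/lex
         (prefix-in : parser-tools/lex-sre))

(define-tokens toy-tokens (addr-spec alias fact))
(define-empty-tokens empty-toy-tokens (eof REMEMBER IS))

(define-lex-abbrevs
  (atext (:+ (:or alphabetic (:/ #\0 #\9)
                  (char-set "!#$%&'*+-/=?^_`{|}~"))))
  (dot-atom (:: atext (:* #\. atext)))
  (lex:whitespace (:or #\newline #\return #\tab #\space #\vtab)))

;; Hypothetical lexer: a fact must be wrapped in double quotes.
(define quoted-lexer
  (lexer
   ((eof) 'eof)
   ;; Skip whitespace by lexing again from the same port.
   (lex:whitespace (quoted-lexer input-port))
   ("remember" 'REMEMBER)
   ("is" 'IS)
   ((:: dot-atom #\@ dot-atom) (token-addr-spec lexeme))
   ;; Fact: everything between a pair of double quotes (quotes dropped).
   ((:: #\" (:* (char-complement #\")) #\")
    (token-fact (substring lexeme 1 (sub1 (string-length lexeme)))))
   ;; Alias: any other run of non-whitespace characters.
   ((:+ (char-complement lex:whitespace)) (token-alias lexeme))))

;; Helper for experimenting: collect all tokens of a string into a list.
(define (tokens-of str)
  (define p (open-input-string str))
  (let loop ((acc '()))
    (define tok (quoted-lexer p))
    (cond
      ((eq? tok 'eof) (reverse acc))
      ((symbol? tok) (loop (cons tok acc)))
      (else (loop (cons (list (token-name tok) (token-value tok)) acc))))))
```

With this, (tokens-of "remember \"the sky is blue\"") should give a
REMEMBER token followed by a single fact token, while an unquoted word
is still lexed as an alias, so the grammar rules can stay as they are.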

Best regards,

Dmitry



On 01/17/2012 05:35 AM, Simon Haines wrote:
I've been playing around with parser-tools and am having difficulty
expressing the following language:

"remember <alias> is <email>"
"remember <fact>"

where <alias> is any string that does not contain the word 'is', <email>
is a well-formed email address and <fact> is any string that does not
match the previous constraints.

Here's a (stripped-down) version of what I have so far:
#lang racket

(require parser-tools/lex
          parser-tools/yacc
          (prefix-in : parser-tools/lex-sre))

(define-lex-abbrevs
   (atext (:+ (:or alphabetic (:/ #\0 #\9) (char-set
"!#$%&'*+-/=?^_`{|}~"))))
   (dot-atom (:: atext (:* #\. atext))))

(define-tokens toy-tokens (addr-spec alias fact))
(define-empty-tokens empty-toy-tokens (eof REMEMBER IS))

(define toy-lexer
   (lexer-src-pos
    ; Consume whitespace
    ((:or #\tab #\space) (return-without-pos (toy-lexer input-port)))
    ; Email addresses
    ((:: dot-atom #\@ dot-atom) (token-addr-spec lexeme))

    ; Commands
    ("remember" 'REMEMBER)
    ("is" 'IS)
    ; ??? what to lex here ???
    ((complement (:: any-string "is" any-string)) (token-alias lexeme))
    (any-string (token-fact lexeme))))

(define toy-parser
   (parser
    (tokens toy-tokens empty-toy-tokens)
    (start start)
    (end eof)
    (error (lambda (a b c d e) (display (format "~a ~a ~a ~a ~a" a b c
                                                (position-offset d)
                                                (position-offset e)))))
    (src-pos)
    (grammar
     (start (() #f)
            ((REMEMBER alias IS addr-spec) `(alias ,$2 ,$4))
            ((REMEMBER fact) `(fact ,$2))))))

; test
(define (test str)
   (let ((p (open-input-string str)))
     (port-count-lines! p)
     (toy-parser (lambda () (toy-lexer p)))))

The problem I'm having is that the 'fact' lexer rule always matches
without giving a chance for the other rules to attempt a match. Perhaps
it is my ignorance with BNF. Can this language be expressed in this way?
An alternative I've thought of is to create a lexer rule to just match
"remember" then pass the port to another lexer that tries to look for
"is" or (eof) and munge the result into a token. Alternatively I could
try to regex the <alias>, <email> or <fact> clauses out and parse them
separately, but I'd like to compose this toy parser into a larger one if
possible. Yet I feel there is a simple technique here that I've missed
in my ignorance. Any ideas?
Many thanks, Simon.




____________________
   Racket Users list:
   http://lists.racket-lang.org/users
