Hi David,

"Thompson, David" <dthomps...@worcester.edu> writes:

> Hi Maxim,
>
> Thanks for the review!

My pleasure!  It's a small thing I can do to try to help nudge builtin
JSON support in Guile, which I agree has its place in 2025.  In general,
I'd like Guile to compete a bit more with Python in the out-of-the-box
library capabilities.  Especially since outside of Guix installing Guile
libraries is probably not that fun (I'd suspect many Guile libraries to
be missing or outdated on various distributions), so a more useful core
has even more value there.

[...]

>> Is there particular reason for using vectors instead of plain list to
>> represent JSON arrays?  The later would be more idiomatic unless there
>> are technical reasons (perhaps performance?).
>
> Back in 2015 I chose lists out of convenience because lists are such a
> fundamental data structure in Scheme. It is very tempting! However,
> there are a couple of issues: 1) O(n) access time for lists is not
> great when there's a perfectly suitable O(1) data structure with read
> syntax sitting right there and 2) It creates ambiguity when lists are
> also used to represent objects. My previous patch used the symbol '@
> as a sentinel to mark objects, but it was kinda gross and I grew to
> dislike it more and more over the years. SRFI-180 uses vectors,
> Racket's JSON library uses vectors, etc. and I think they made the
> right call.

I see.  It makes sense, thanks for explaining.

[...]

>> Hm, I wonder what (list port) looks like in the irritants when the
>> exception is reported; is it useful?  Shouldn't it show instead the
>> problematic value?
>
> By including the port in the irritants it allows the exception handler
> to use 'port-line' and 'port-column' to show where in the JSON
> document the error occurred (for ports with such information like file
> ports).

OK!

>> > +  (define (consume-whitespace)
>> > +    (case (peek-char port)
>> > +      ((#\space #\tab #\return #\newline)
>>
>> Should a match + ((? char-whitespace?)) predicate pattern be used here
>> instead, or similar?  Or perhaps the above is faster and more
>> self-contained, which can be a good thing.
>
> ECMA-404 states that these 4 characters are the only acceptable
> whitespace characters:
>
> "Whitespace is any sequence of one or more of the following code
> points: character tabulation (U+0009), line feed (U+000A), carriage
> return (U+000D), and space (U+0020)."

Makes sense.

[...]

>> > +  (define (read-fraction)
>> > +    (case (peek-char port)
>> > +      ((#\.)
>> > +       (read-char port)
>> > +       (let lp ((mag 10))
>> > +         (let ((n (read-digit-maybe)))
>> > +           (if n (+ (/ n mag) (lp (* mag 10))) 0))))
>> > +      (else 0)))
>>
>> Should the above be named 'read-decimal' ?  Does a decimal number in
>> JSON always start with '.' and not with 0. ?  I was a bit puzzled on
>> what 'mag' may mean here, I guess 'magnitude' although there doesn't
>> appear to have a clear terminology for it.
>
> It's called 'read-fraction' because it reads the fractional part of
> the number.  I considered 'read-decimal' but I think it's less
> descriptive. This procedure is called from two places, after the
> respective caller has already read the digits before the decimal
> point. 'mag' does indeed mean 'magnitude', as each digit read
> increases the order of magnitude of the resulting number. I borrowed
> this name from Hoot's 'read' implementation, which means it's probably
> in Guile's pure-Scheme 'read' implementation as well.

OK, thanks for explaining.  I realized later in my review that these are
very dependent on when they are called (they are very specific or
narrow, more than their name might suggest).  It's fine.

[...]

> The following clause checks for '.'. We need to distinguish between 3
> types of input here:
>
> - invalid extraneous zeroes like '09' (ECMA-404 says this is not allowed)
> - fractional notation like '0.123'
> - plain '0'
>
> If I leave out this clause, we can no longer distinguish errors from
> plain zeroes.  There's likely other ways to express it but this feels
> straightforward enough.
>
> However, it was a great idea to take another look at this code because
> there is a bug! '0e3' is a valid JSON number that this code rejects.
> It should parse to 0.  My updated patch fixes this and adds a test
> case.

Good catch!

[...]

>> > +  (define (write-object obj)
>> > +    (put-char port #\{)
>> > +    (match obj
>> > +      ((head . rest)
>> > +       (write-pair head)
>> > +       (let lp ((obj rest))
>> > +         (match obj
>> > +           (() (values))
>>
>> Any reason to return (values) instead of some dummy #t to denote 'no-op'
>> ?.
>
> The loop is evaluated for effect and thus there is nothing to return
> so returning 0 values makes sense. This is something I picked up from
> Andy Wingo.

Interesting.

>> > +           ((head . rest)
>> > +            (put-char port #\,)
>> > +            (write-pair head)
>> > +            (lp rest))
>> > +           (_ (fail "invalid object" obj))))))
>> > +    (put-char port #\}))
>> > +  (define (write-array v)
>> > +    (put-char port #\[)
>> > +    (match (vector-length v)
>> > +      (0 (values))
>> > +      (n
>> > +       (write-value (vector-ref v 0))
>> > +       (do ((i 1 (1+ i)))
>> > +           ((= i n))
>> > +         (put-char port #\,)
>> > +         (write-value (vector-ref v i)))))
>>
>> I suppose the above is more efficient than a for-each loop?  I'd be
>> curious to see it profiled, if you still have data.  At least now I see
>> than for > 100k, vector-ref is faster than list-ref, which probably
>> explains why you went with vectors (could still be an implementation
>> detail with the list->vector call left in the writer though, in my
>> opinion).
>
> Yes, it is more efficient because it avoids closure allocation. Also,
> I just don't see any reason to import (rnrs base) or (srfi srfi-43) to
> get vector-for-each when looping over a vector is trivial. I didn't
> profile anything.

Yeah, I was thinking in terms of plain lists, didn't know we didn't
iterators for vectors in the core (I've seldom used vectors).  Sounds
reasonable.

>> > +    (put-char port #\]))
>> > +  (define (write-number x)
>> > +    (if (or (exact-integer? x)
>> > +            (and (real? x)
>> > +                 (inexact? x)
>> > +                 ;; NaNs and infinities are not allowed.
>> > +                 (not (or (nan? x) (inf? x)))))
>> > +        ;; Scheme's string representations of exact integers and floats
>> > +        ;; are compatible with JSON.
>> > +        (put-string port (number->string x))
>> > +        (fail "invalid number" x)))
>> > +  (define (write-value x)
>> > +    (match x
>> > +      (#t (put-string port "true"))
>> > +      (#f (put-string port "false"))
>> > +      ('null (put-string port "null"))
>> > +      (() (put-string port "{}"))
>> > +      ((? pair?) (write-object x))
>> > +      ((? vector?) (write-array x))
>> > +      ((? string?) (write-string x))
>> > +      ((? number?) (write-number x))
>> > +      (_ (fail "invalid value" x))))
>> > +  (write-value exp))
>>
>> Phew.  That's a pretty low-level parser!  I hope it's fast, otherwise it
>> seems it'd be more concise/fun/maintainable to devise a PEG-based one,
>> which appears to be doable for JSON, from what I've read.  Perhaps
>> sprinkle with a few performance-related comments where such concerns
>> impacted the design choices, so that we can remember and retest/reverify
>> these in the future when Guile evolves.
>
> JSON is a pretty simple format and thus I think a hand-rolled parser
> is appropriate.  It's much simpler than 'read', anyway.  I suppose it
> would be more concise, but "PEG parser" and "fun" do not go together
> for me.  At ~300 lines of quite simple code (I did not go hog wild on
> macros or fancy abstractions nor did I sacrifice readable code for
> performance) I don't think there is much concern regarding
> maintenance.  If someone wants to experiment to see how a PEG parser
> compares, though, feel free.

Someone could always make that experiment in the future and suggest a
replacement, if the performance is as good or better and the code
simpler.  LGTM.

Reviewed-by: Maxim Cournoyer <maxim.cournoyer@gmail>

-- 
Thanks,
Maxim



Reply via email to