>The character-set you're referring to, is it US-ASCII? I am not particularly 
>familiar with how Guile handles characters. If string-filter is not 
>sufficient, can you suggest another method?
>
>For example, perhaps we need to go to where "str" is read and set the port 
>encoding to US-ASCII. Right now it's Iso Latin which is a superset of 
>US-ASCII, and therefore improper.

I meant “character set” in the sense used in Guile, not a character encoding.
Very literally, it means a “set of characters”, where ‘set’ is used in the 
mathematical sense. ‘Character’ means any character in Unicode (not counting 
the surrogate code points used for UTF-16, which aren’t characters).

Given you mentioned char-set:graphic, I thought you already knew.

So, the answer is, no, it’s not ASCII (the character set), it’s a subset of 
US-ASCII defined in the HTTP spec. IIRC, I referred to:

➢ https://www.rfc-editor.org/rfc/rfc9110.html#name-tokens

(in particular see ‘tchar’) which I think is pretty clearly not all of ASCII 
but rather a subset. Explicitly, the character set I’m referring to is the 
‘tchar’ mentioned in the RFC.

On string-filter: I suppose you could use that, (string=? (string-filter 
the-char-set ...) original-string), to check things, but it seems more 
efficient and simpler to use the predicate string-every instead.
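
To make that concrete, here is a minimal sketch of both checks. The names 
char-set:tchar, valid-method-name? and valid-method-name-via-filter? are made 
up for illustration (they are not part of Guile), and the character list is 
my transcription of ‘tchar’ from the RFC, so double-check it against the spec.

```scheme
;; Hypothetical name: the 'tchar' set from RFC 9110, built by hand.
(define char-set:tchar
  (char-set-union
   (string->char-set "!#$%&'*+-.^_`|~")
   (string->char-set "0123456789")
   (string->char-set "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
   (string->char-set "abcdefghijklmnopqrstuvwxyz")))

;; Preferred: stops at the first offending character, allocates nothing.
;; string-every is true for the empty string, hence the extra check.
(define (valid-method-name? s)
  (and (not (string-null? s))
       (string-every char-set:tchar s)))

;; Alternative: builds a filtered copy and compares it with the original.
(define (valid-method-name-via-filter? s)
  (and (not (string-null? s))
       (string=? (string-filter char-set:tchar s) s)))
```

Both procedures should agree on every input; the first one just avoids 
constructing an intermediate string.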

That said, it might be worth looking at how the caller(s) of the method 
parsing procedure use it. It might be the case that they use something like 
(string-index s everything-except-tchar begin end) to locate the end of the 
method name. In that case, the argument passed to the method parsing procedure 
is correct by construction (assuming length>0), so that procedure doesn’t need 
to do any checks and can leave that responsibility (documented in a 
docstring) to the caller.
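
As a rough sketch of what such a caller could look like (I have not checked 
whether the actual Guile code does this; char-set:tchar, char-set:not-tchar, 
method-end and parse-method-from-line are hypothetical names):

```scheme
;; Hypothetical name: the 'tchar' set from RFC 9110, built by hand.
(define char-set:tchar
  (char-set-union
   (string->char-set "!#$%&'*+-.^_`|~")
   (string->char-set "0123456789")
   (string->char-set "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
   (string->char-set "abcdefghijklmnopqrstuvwxyz")))

;; Everything that may NOT appear in a method name.
(define char-set:not-tchar (char-set-complement char-set:tchar))

;; Index of the first non-tchar at or after START, or the string
;; length if the rest of the line consists of tchars only.
(define (method-end line start)
  (or (string-index line char-set:not-tchar start)
      (string-length line)))

;; The substring [0, end) contains only tchars by construction, so
;; the only remaining check is that it is non-empty.
(define (parse-method-from-line line)
  (let ((end (method-end line 0)))
    (and (> end 0)
         (string->symbol (substring line 0 end)))))
```

With this shape, (parse-method-from-line "GET /index.html HTTP/1.1") yields 
the symbol GET, and a line starting with a non-tchar yields #f.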

>For example, perhaps we need to go to where "str" is read and set the port 
>encoding to US-ASCII. Right now it's Iso Latin which is a superset of 
>US-ASCII, and therefore improper.

Eh, while HTTP might look like text, it’s more like a mix of text and 
octets/bytes. Quoting RFC 9110 (section 5.5, Field Values):

Field values are usually constrained to the range of US-ASCII characters 
[USASCII]. __Fields needing a greater range of characters can use an 
encoding__, such as the one defined in [RFC8187]. Historically, HTTP allowed 
field content with text in the __ISO-8859-1__ charset [ISO-8859-1], supporting 
other charsets only through use of [RFC2047] encoding. Specifications for newly 
defined fields SHOULD limit their values to visible US-ASCII octets (VCHAR), 
SP, and HTAB. __A recipient SHOULD treat other allowed octets in field content 
(i.e., obs-text) as opaque data__.

(emphasis added)

I interpret this as “HTTP prefers only US-ASCII (see SHOULD), but it’s not 
strictly required (depending on the field), and sometimes it doesn’t even have 
any meaning as characters and instead is only raw bytes (*)”. Also see the bit 
about ISO-8859-1: it appears that in at least some cases, the ISO-8859-1 
encoding should be recognised.

(I might be misinterpreting this though, perhaps it is referring to %-encoding.)

Also, using ISO Latin 1 (or another 8-bit encoding that is compatible with 
ASCII, the character encoding) is convenient for handling octets and US-ASCII 
characters together.
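
A small illustration of why that is convenient, using (ice-9 iconv); the 
variable names raw and text are made up, and the byte values are arbitrary 
sample data:

```scheme
(use-modules (ice-9 iconv) (rnrs bytevectors))

;; "GET " followed by two octets outside the US-ASCII range.
(define raw (u8-list->bytevector '(71 69 84 32 233 255)))

;; In ISO-8859-1 every octet maps to the code point with the same
;; number, so decoding never fails and never loses information.
(define text (bytevector->string raw "ISO-8859-1"))

;; The ASCII part reads as ordinary characters ...
(define ascii-part (substring text 0 3))

;; ... while the opaque octets can still be recovered unchanged.
(define round-tripped (string->bytevector text "ISO-8859-1"))
```

Here ascii-part is the string "GET", (char->integer (string-ref text 4)) gives 
back the octet 233, and round-tripped is equal? to raw.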

Maybe separating the US-ASCII from the extra octets might make the code more 
proper in some aesthetic sense, but I don’t think it would make things more 
proper in an RFC-compliant sense (though neither would it make things worse, I 
suppose).

(There might be bugs w.r.t. character encoding in the Guile implementation, but 
I don’t think this is one of them.)

>That being said, the best form for this function is:
>(string->symbol (substring str start end) )
>With additional logic added to other functions?

I am not familiar enough with the Guile implementation to tell if the extra 
logic is best done in this function or in its caller. It just needs to be done 
_somewhere_.

Best regards,
Maxime Devos
