Maxim Cournoyer <maxim.courno...@gmail.com> writes:

> Hi Tomas,
>
> Thank you for reporting this issue.
>
> Tomas Volf <~@wolfsden.cz> writes:
>
>> <to...@tuxteam.de> writes:
>>
>>> On Sat, Feb 01, 2025 at 09:10:04PM +0100, Tomas Volf wrote:
>>>>
>>>> Hello,
>>>>
>>>> I think I found a bug in the htmlprag module in guile-lib.  When parsing
>>>> attributes, the values are not properly decoded:
>>>>
>>>> --8<---------------cut here---------------start------------->8---
>>>> scheme@(guile-user)> ,use (htmlprag)
>>>> scheme@(guile-user)> (html->sxml "<hr aaa=\"bbb&quot;ccc&apos;ddd\" />")
>>>> $1 = (*TOP* (hr (@ (aaa "bbb&quot;ccc&apos;ddd"))))
>>>> scheme@(guile-user)> (html->sxml "<a href=\"a&amp;b\" />")
>>>> $2 = (*TOP* (a (@ (href "a&amp;b"))))
>>>> --8<---------------cut here---------------end--------------->8---
>>>>
>>>> I think that $1 should be "bbb\"ccc'ddd" and $2 should be "a&b".
>>>
>>> Ouch.  Have you contacted Oleg Kiselyov about it?  He's usually pretty
>>> responsive and very friendly.
>>
>> I did not.  I did not find a "how to report bugs" section on guile-lib's
>> website, and on the (htmlprag) documentation section Oleg Kiselyov is
>> mentioned only in one sentence as a "Thanks".
>>
>> I think I have managed to find his email in one Haskell paper of his, so
>> I will CC him on the bug report, as suggested.
>
> And also for contacting Oleg.  I hope they can provide us with their
> opinion on whether this is an actual bug or was designed that way.  To
> me, it's not clear whether html->sxml should alter the raw value of
> attributes in any way.

It already modifies the raw value for regular HTML text:

--8<---------------cut here---------------start------------->8---
scheme@(htmlprag)> (html->sxml "a&amp;b")
$10 = (*TOP* "a&b")
scheme@(htmlprag)> (sxml->html '(*TOP* "a&b"))
$13 = "a&amp;b"
--8<---------------cut here---------------end--------------->8---

I have now noticed that this also affects encoding:

--8<---------------cut here---------------start------------->8---
scheme@(htmlprag)> (sxml->html '(*TOP* (a (@ (href "a&b")))))
$12 = "<a href=\"a&b\"></a>"
--8<---------------cut here---------------end--------------->8---

I am not sure why attributes should be special here.  For what it is
worth, (sxml simple) itself decodes even attributes:

--8<---------------cut here---------------start------------->8---
scheme@(htmlprag)> (xml->sxml "<a href=\"a&amp;b\"></a>")
$11 = (*TOP* (a (@ (href "a&b"))))
--8<---------------cut here---------------end--------------->8---

For comparison, Firefox seems to decode the attributes as well, even in
HTML.  That is actually how I discovered this issue: links I extracted
from <a href=".."> using html->sxml were not working until I ran a
decoding pass on them.

> Users may have different use cases requiring them to apply different
> transformations themselves?

I agree in the abstract, but do you have any specific use case in mind
in which you would want to use the raw content of attributes (especially
since you already cannot get the raw content of text nodes)?

> If we hard-code a decoding scheme ourselves, then we force that choice
> onto users, no?

I agree we cannot hard-code or change it now due to compatibility
concerns, but adding #:decode-attributes to html->sxml,
#:encode-attributes to sxml->html and possibly a
%deencode-attributes? parameter, in the spirit of %strict-tokenizer?,
would seem reasonable.

Tomas

-- 
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.
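
P.S. For reference, a decoding pass like the one mentioned above could look roughly like the following sketch.  It does not use (htmlprag) itself and only post-processes an SXML tree such as the one html->sxml returns.  The names %entities, decode-entities and decode-attributes are hypothetical, made up for illustration, and only the five predefined XML entities are handled (numeric character references are not).

```scheme
(use-modules (ice-9 match)
             (srfi srfi-1))

;; Hypothetical table: only the five predefined XML entities.
(define %entities
  '(("&quot;" . "\"")
    ("&apos;" . "'")
    ("&lt;"   . "<")
    ("&gt;"   . ">")
    ("&amp;"  . "&")))

(define (decode-entities str)
  "Return STR with the entities from %entities replaced.  The string is
scanned once, left to right, so already-decoded text is never decoded again."
  (let loop ((i 0) (acc '()))
    (cond
     ((>= i (string-length str))
      (string-concatenate-reverse acc))
     ;; Does any known entity start at position I?
     ((find (lambda (e)
              (string-prefix? (car e) str 0 (string-length (car e)) i))
            %entities)
      => (lambda (e)
           (loop (+ i (string-length (car e)))
                 (cons (cdr e) acc))))
     (else
      (loop (+ i 1)
            (cons (string (string-ref str i)) acc))))))

(define (decode-attributes tree)
  "Return a copy of the SXML TREE in which string attribute values have
been passed through decode-entities.  Text nodes are left untouched."
  (match tree
    ;; Attribute list: (@ (name "value") ...)
    (('@ attrs ...)
     (cons '@
           (map (match-lambda
                  ((name (? string? value))
                   (list name (decode-entities value)))
                  (other other))
                attrs)))
    ;; Any other list: recurse into the children.
    ((items ...)
     (map decode-attributes items))
    ;; Symbols, strings and other atoms.
    (other other)))
```

With this, (decode-attributes (html->sxml "<a href=\"a&amp;b\" />")) would turn the href value into "a&b" while leaving the rest of the tree as it is.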