Hi guile-users,

Hope you're all very well! I have a question about using shtml with htmlprag. As far as I know this module isn't actually part of Guile, and it looks quite old now and may no longer be under active development, but if anyone has any insights, I'd be keen to get another set of eyes on the issue I'm having.

I'm new to Guile, and to learn the language I'm building a web crawler. As part of this, I'm using htmlprag and sxpath to convert some HTML to shtml and pull some interesting data out of the shtml. I have the following HTML (I wrote this up for example's sake):

    <!DOCTYPE html>
    <html>
      <head>
        <title>Example</title>
      </head>
      <body>
        <header class="exampleHeader">
          <img id="bannerImage" src="https://www.gnu.org/software/guile/static/base/img/branding.png">
          <div>
            <p id="labelName">A label for the header.</p>
          </div>
          <p id="labelDescription">Some description of the header.</p>
        </header>
        <div id="exampleDiv">
          <hr>
          <div id="divMessage">An example message.</div>
        </div>
        <footer id="footer"></footer>
      </body>
    </html>

When I used html->shtml I got the following shtml:

    (*TOP* (*DECL* DOCTYPE html)
     (html
      (head (title Example) )
      (body
       (header (@ (class exampleHeader))
        (img (@ (id bannerImage) (src https://www.gnu.org/software/guile/static/base/img/branding.png)))
        (div ))
       (p (@ (id labelName)) A label for the header.)
       (p (@ (id labelDescription)) Some description of the header.)
       (div (@ (id exampleDiv))
        (hr)
        (div (@ (id divMessage)) An example message.)
        )
       (footer (@ (id footer)))
       )
      )
     )

I would have however expected something like

    (div (p (@ (id labelName)) A label for the header.))

under the header[@class="exampleHeader"] tag (I haven't tested this exact s-expression though). Instead, the p tag sits outside the div tag.

When I do shtml->html over this shtml, I get the following html:

    <!DOCTYPE html>
    <html>
    <head>
    <title>Example</title>
    </head>
    <body>
    <header class="exampleHeader">
    <img id="bannerImage" src="https://www.gnu.org/software/guile/static/base/img/branding.png" />
    <div>
    </div></header><p id="labelName">A label for the header.</p>
    <p id="labelDescription">Some description of the header.</p>
    <div id="exampleDiv">
    <hr />
    <div id="divMessage">An example message.</div>
    </div>
    <footer id="footer"></footer>
    </body>
    </html>

The p[@id="labelName"] tag no longer sits under the div tag.
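
To make the misplacement concrete, here is a minimal sketch that queries the shtml above with sxpath from Guile's (sxml xpath) module. The tree is typed in by hand rather than produced by html->shtml, with the *DECL* node and whitespace strings left out, and the names observed-shtml, ps-under-header, ps-anywhere-under-header and ps-under-body are made up for this example.

```scheme
(use-modules (sxml xpath))   ; sxpath ships with Guile itself

;; Hand-typed copy of the tree that html->shtml returned.
;; Assumption: the *DECL* node and whitespace-only strings are omitted.
(define observed-shtml
  '(*TOP*
    (html
     (head (title "Example"))
     (body
      (header (@ (class "exampleHeader"))
              (img (@ (id "bannerImage")
                      (src "https://www.gnu.org/software/guile/static/base/img/branding.png")))
              (div))
      (p (@ (id "labelName")) "A label for the header.")
      (p (@ (id "labelDescription")) "Some description of the header.")
      (div (@ (id "exampleDiv"))
           (hr)
           (div (@ (id "divMessage")) "An example message."))
      (footer (@ (id "footer")))))))

;; p elements that are direct children of header.
(define ps-under-header ((sxpath '(// header p)) observed-shtml))

;; p elements at any depth below header.
(define ps-anywhere-under-header ((sxpath '(// header // p)) observed-shtml))

;; p elements that are direct children of body.
(define ps-under-body ((sxpath '(// body p)) observed-shtml))

(display (length ps-under-header)) (newline)          ; prints 0
(display (length ps-anywhere-under-header)) (newline) ; prints 0
(display (length ps-under-body)) (newline)            ; prints 2
```

Both p elements come back as children of body, and none is found anywhere below header, which matches what the shtml->html output shows.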

This means when I use an sxpath expression like

    '(// html body (header (@ (eq? "exampleHeader"))))

I get the img tag and an empty div tag, but no p tag - like so:

    ((header (@ (class exampleHeader))
      (img (@ (id bannerImage) (src https://www.gnu.org/software/guile/static/base/img/branding.png)))
      (div )))

I'm wondering if I've missed something, or if others get this kind of behaviour. The upshot of this is that, for the HTML above, it looks like

    (equal? example-html (shtml->html (html->shtml example-html)))

is false, which isn't what I'd expect. Is there something funny that happens with `p`?

Thanks a lot,
Kenan

NB. In the sxml example above all the strings aren't surrounded by double quotes, but I think this is an artefact of how I'm writing them to files for testing purposes - see an extract of the sxml below when I use ,pretty-print in Geiser:

    (div (@ (id "exampleDiv"))
         "\n"
         " "
         (hr)
         "\n"
         " "
         (div (@ (id "divMessage")) "An example message.")
         "\n"
         " ")