Erich Rast wrote on 05/30/2017 04:37 PM:
> I've found out that it's far less trivial than expected, but not because of sxml or the tree walking itself. Often call/input-url just returns '() or '(*TOP*),
To troubleshoot this, I'd proceed roughly as follows. First, take a quick look at the application code. Then check whether the HTTP request is actually returning HTML, and save a copy of that HTML. Then feed the saved HTML to `html-parsing` directly, to see whether there's something odd about the HTML that has uncovered a bug in the parser. Then look more closely at the surrounding application code, to see whether it's introducing the problem, and whether it's doing the important error-checking. Then see whether it's some transient cause (especially in the successful downloading of the HTML), or a heisenbug. Something like that.
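A minimal way to separate the fetch step from the parse step, so each can be checked on its own (this is just a sketch, assuming the `net/url` and `html-parsing` libraries; the procedure names are mine):

```racket
;; Sketch: fetch the raw HTML first, so it can be inspected and saved,
;; then parse that same string with html-parsing separately.
#lang racket/base

(require net/url          ; string->url, call/input-url, get-pure-port
         racket/port      ; port->string
         html-parsing)    ; html->xexp

;; Step 1: download the raw HTML as a string, so we can look at it.
(define (fetch-html url-string)
  (call/input-url (string->url url-string)
                  get-pure-port
                  port->string))

;; Step 2: parse the saved string. If this yields something like
;; (*TOP*) for HTML that looks fine, that points at the parser; if
;; fetch-html returned an empty body or an error page, the problem is
;; upstream of html-parsing.
(define (parse-html html-string)
  (html->xexp html-string))

;; Example with a literal string, no network needed:
(parse-html "<p>hello</p>")
```

If the two steps are separated like this, the saved HTML from a failing request can also be attached to a bug report.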
> and sometimes it also fails with an exception on https addresses.
If this is happening *consistently* for particular HTTPS domains and ports, then without knowing more about how the failure manifests, I'd start to suspect an SSL/TLS version problem, or a certificate-validation problem.
If this is happening *intermittently* for a particular HTTPS domain and port, then I'd have to know more about how the failure manifests, and from what example requests, before I could start to troubleshoot.
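To see what the exception actually says (rather than just losing the page), it can help to wrap the request so the message is logged. A sketch, assuming `net/url`; the URL argument is whatever page is failing:

```racket
;; Sketch: report the exception from an HTTPS request instead of
;; letting it abort the scrape. For TLS problems, the message usually
;; names the handshake or certificate step that failed.
#lang racket/base

(require net/url racket/port)

(define (try-fetch url-string)
  (with-handlers ([exn:fail? (lambda (e)
                               (eprintf "fetch failed: ~a\n" (exn-message e))
                               #f)])
    (call/input-url (string->url url-string)
                    ;; Follow up to 5 redirects: an http:// URL that
                    ;; 301-redirects to https:// otherwise yields an
                    ;; empty-looking body.
                    (lambda (u) (get-pure-port u #:redirections 5))
                    port->string)))
```

Note the `#:redirections` argument to `get-pure-port`; with the default of 0, a redirect response is returned as-is, which can look like a mysteriously empty page.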
> Then some websites also seem to be fully dynamic with javascript and just return a lot of gobbledygook.
Sadly, JavaScript use in practice means that any HTML scraper generalized to *all* Web sites now basically has to perform the (anti-engineering, cracksmoking) atrocity that is the modern Web page load, including running JS against the DOM, and perhaps with some sense of layout/rendering semantics. I don't currently have tools set up for this, but it's doable.
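One workable route is to let a real browser execute the JS and hand back the resulting DOM as HTML, then parse that with `html-parsing` as usual. A sketch, assuming a headless Chromium binary is installed (the `chromium` binary name varies by system, and `--dump-dom` is Chromium's flag for printing the serialized post-load DOM):

```racket
;; Sketch: render a JS-heavy page with headless Chromium, then parse
;; the post-JS DOM. Assumes a `chromium` executable is on the PATH.
#lang racket/base

(require racket/port racket/system html-parsing)

(define (rendered-page->xexp url-string)
  (define chromium (find-executable-path "chromium"))
  (unless chromium
    (error 'rendered-page->xexp "no chromium binary found on PATH"))
  (define html
    (with-output-to-string
      (lambda ()
        ;; --dump-dom prints the DOM after page load, i.e. after the
        ;; page's JavaScript has run.
        (system* chromium "--headless" "--dump-dom" url-string))))
  (html->xexp html))
```

This is heavyweight compared to a plain HTTP GET, but for fully dynamic sites there's often no lighter alternative.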
If you're scraping only a small number of specific Web sites/pages, that's a different problem, and the best path might be to handle the quirks of each one individually. That's especially true if you don't want to lose information that is interpretable to humans, or that requires some interaction, but that is lost if you just do a generic JS page load and scrape.
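Handling each site individually can be as simple as dispatching on hostname to a site-specific extraction procedure. A sketch (the hostnames and extractor bodies here are hypothetical placeholders):

```racket
;; Sketch: dispatch to a per-site extractor keyed on hostname, with a
;; generic fallback for unknown sites.
#lang racket/base

(require net/url)

;; Each extractor takes the page's x-expression and returns whatever
;; site-specific data we want. These bodies are stubs.
(define site-extractors
  (hash "news.example.com"  (lambda (xexp) 'extract-articles-here)
        "forum.example.org" (lambda (xexp) 'extract-threads-here)))

(define (extractor-for url-string)
  (hash-ref site-extractors
            (url-host (string->url url-string))
            (lambda () (lambda (xexp) 'generic-extraction))))
```

When a site changes its layout, only that site's extractor needs updating.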
Also note that each site's layout and interaction structure is a moving target. Every now and then, they'll change their layouts/UI, their development frameworks or CMSs, their CDNs. I've been doing Web scraping since the mid-1990s (starting with my own Java parser, before I wrote the current Scheme/Racket one), and I currently do things like maintain some metadata about which CDNs which sites are using, and what the anti-privacy and anti-security hooks are... and something is always changing, on some site.
I'm happy to offer any tips that I can, as well as fix any bug found in `html-parsing`. Getting into the details of a particular site or code, OTOH, can take a lot of work, and is part of how I make a living as a consultant. :) As always, other Racketeers and I will help on the email list, to the extent we can.
--
You received this message because you are subscribed to the Google Groups "Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to racket-users+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.