Erich Rast wrote on 05/30/2017 04:37 PM:
> I've found out that it's far less trivial than expected, but not because of sxml or the tree walking itself. Often call/input-url just returns '() or '(*TOP*),
To troubleshoot this, I'd proceed roughly as follows. First, take a quick look at the application code. Then check whether the HTTP request is actually returning HTML, and save a copy of that HTML. Then feed the saved HTML to `html-parsing` directly, to see whether there's something odd about the HTML that has uncovered a bug in the parser. Then look more closely at the surrounding application code, to see whether it's introducing the problem, and whether it's doing the important error-checking. Then see whether it's some transient cause (especially in the successful downloading of the HTML), or a heisenbug. Something like that.
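A minimal way to separate the fetch step from the parse step, so each can be checked on its own (this is just a sketch, assuming the `net/url` and `html-parsing` libraries; the procedure names are mine):

```racket
;; Sketch: fetch the raw HTML first, so it can be inspected and saved,
;; then parse that same string with html-parsing separately.
#lang racket/base

(require net/url          ; string->url, call/input-url, get-pure-port
         racket/port      ; port->string
         html-parsing)    ; html->xexp

;; Step 1: download the raw HTML as a string, so we can look at it.
(define (fetch-html url-string)
  (call/input-url (string->url url-string)
                  get-pure-port
                  port->string))

;; Step 2: parse the saved string. If this yields something like
;; (*TOP*) for HTML that looks fine, that points at the parser; if
;; fetch-html returned an empty body or an error page, the problem is
;; upstream of html-parsing.
(define (parse-html html-string)
  (html->xexp html-string))

;; Example with a literal string, no network needed:
(parse-html "<p>hello</p>")
```

If the two steps are separated like this, the saved HTML from a failing request can also be attached to a bug report.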
> and sometimes it also fails with an exception on https addresses.
If this is happening *consistently* for particular HTTPS domains and ports, then without knowing more about how the failure manifests, I'd start to suspect an SSL/TLS version problem, or a certificate-validation problem.
If this is happening *intermittently* for a particular HTTPS domain and port, then I'd have to know more about how the failure manifests, and from what example requests, before I could start to troubleshoot.
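To see what the exception actually says (rather than just losing the page), it can help to wrap the request so the message is logged. A sketch, assuming `net/url`; the URL argument is whatever page is failing:

```racket
;; Sketch: report the exception from an HTTPS request instead of
;; letting it abort the scrape. For TLS problems, the message usually
;; names the handshake or certificate step that failed.
#lang racket/base

(require net/url racket/port)

(define (try-fetch url-string)
  (with-handlers ([exn:fail? (lambda (e)
                               (eprintf "fetch failed: ~a\n" (exn-message e))
                               #f)])
    (call/input-url (string->url url-string)
                    ;; Follow up to 5 redirects: an http:// URL that
                    ;; 301-redirects to https:// otherwise yields an
                    ;; empty-looking body.
                    (lambda (u) (get-pure-port u #:redirections 5))
                    port->string)))
```

Note the `#:redirections` argument to `get-pure-port`; with the default of 0, a redirect response is returned as-is, which can look like a mysteriously empty page.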
> Then some websites also seem to be fully dynamic with javascript and just return a lot of gobbledygook.
Sadly, JavaScript use in practice means that any HTML scraper generalized to *all* Web sites now basically has to perform the (anti-engineering, cracksmoking) atrocity that is the modern Web page load, including running JS against the DOM, and perhaps with some sense of layout/rendering semantics. I don't currently have tools set up for this, but it's doable.
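One workable route is to let a real browser execute the JS and hand back the resulting DOM as HTML, then parse that with `html-parsing` as usual. A sketch, assuming a headless Chromium binary is installed (the `chromium` binary name varies by system, and `--dump-dom` is Chromium's flag for printing the serialized post-load DOM):

```racket
;; Sketch: render a JS-heavy page with headless Chromium, then parse
;; the post-JS DOM. Assumes a `chromium` executable is on the PATH.
#lang racket/base

(require racket/port racket/system html-parsing)

(define (rendered-page->xexp url-string)
  (define chromium (find-executable-path "chromium"))
  (unless chromium
    (error 'rendered-page->xexp "no chromium binary found on PATH"))
  (define html
    (with-output-to-string
      (lambda ()
        ;; --dump-dom prints the DOM after page load, i.e. after the
        ;; page's JavaScript has run.
        (system* chromium "--headless" "--dump-dom" url-string))))
  (html->xexp html))
```

This is heavyweight compared to a plain HTTP GET, but for fully dynamic sites there's often no lighter alternative.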
If you're scraping only a small number of specific Web sites/pages, that's a different problem, and the best path might be to handle the quirks of each one individually. That's especially true if you don't want to lose information that is interpretable to humans, or that requires some interaction, but that is lost if you just do a generic JS page load and scrape.
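Handling each site individually can be as simple as dispatching on hostname to a site-specific extraction procedure. A sketch (the hostnames and extractor bodies here are hypothetical placeholders):

```racket
;; Sketch: dispatch to a per-site extractor keyed on hostname, with a
;; generic fallback for unknown sites.
#lang racket/base

(require net/url)

;; Each extractor takes the page's x-expression and returns whatever
;; site-specific data we want. These bodies are stubs.
(define site-extractors
  (hash "news.example.com"  (lambda (xexp) 'extract-articles-here)
        "forum.example.org" (lambda (xexp) 'extract-threads-here)))

(define (extractor-for url-string)
  (hash-ref site-extractors
            (url-host (string->url url-string))
            (lambda () (lambda (xexp) 'generic-extraction))))
```

When a site changes its layout, only that site's extractor needs updating.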
Also note that each site's layout and interaction structure is a moving target. Every now and then, they'll change their layouts/UI, their development frameworks or CMSs, their CDNs. I've been doing Web scraping since the mid-1990s (starting with my own Java parser, before I wrote the current Scheme/Racket one), and I currently do things like maintain some metadata about which CDNs which sites are using, and what the anti-privacy and anti-security hooks are... and something is always changing, on some site.
I'm happy to offer any tips that I can, as well as fix any bug found in `html-parsing`. Getting into the details of a particular site or code, OTOH, can take a lot of work, and is part of how I make a living as a consultant. :) As always, other Racketeers and I will help on the email list, to the extent we can.
--
You received this message because you are subscribed to the Google Groups "Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to racket-users+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.