Quoth Alexander Krotov:
> > Ideally, with sed/awk, or better in C.
>
> "Parsing" HTML with sed is simply wrong.

This is a good point that I should have mentioned. I spent years using sed and awk to extract things from HTML, writing crawlers and suchlike, for personal projects. It can work, of course, but the result tends to be obfuscated and fragile. I haven't needed to do any such crawling for a while now (and often the data is easier to access as JSON, an unexpected side effect of the horrors of JavaScript overuse), but if I needed to, I'd likely look into using something like Go's HTML parsing these days. I'd rather have something slightly slower that's more robust and reusable, really. awk is a good fit for line-based parsing, and sed is good for stream transformation, but neither works well for parsing the machine-generated mountains of HTML that dominate the web today.
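
To make the fragility concrete, here is a rough sketch of the difference. It uses Python's standard-library html.parser rather than Go's HTML package, purely because it needs nothing beyond the standard library; the same idea should carry over. The names LinkCollector, extract_links, extract_links_linewise and SAMPLE are made up for illustration, and the line-wise regex is only a stand-in for the sort of thing one tends to write in sed or awk.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> start tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already removed.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Parser-based extraction: tags may span lines, comments are skipped."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


def extract_links_linewise(html_text):
    """Line-by-line regex, roughly what a sed/awk one-liner would do."""
    pattern = re.compile(r'<a href="([^"]*)"')
    found = []
    for line in html_text.splitlines():
        found.extend(pattern.findall(line))
    return found


# Hypothetical input: one tag split across lines with a single-quoted
# attribute, and one link that is commented out.
SAMPLE = """<p><a href="/one">one</a><a
   class="x" href='/two'>two</a></p>
<!-- <a href="/hidden">commented out</a> -->
"""

print(extract_links(SAMPLE))           # parser-based result
print(extract_links_linewise(SAMPLE))  # line-based result
```

On this input the parser-based version finds /one and /two, while the line-based version misses /two (the tag is split across lines and uses single quotes) and wrongly picks up the commented-out /hidden. Each of those can be patched in the regex, of course, but that is exactly how such scripts end up obfuscated.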