Re: [dev] suckless html to markdown (text)

Nick Sun, 06 Jan 2019 14:23:21 -0800

Quoth Alexander Krotov:
> > Ideally, with sed/awk, or better in C.
> 
> "Parsing" HTML with sed is simply wrong.


This is a good point that I should have mentioned. I spent years
using sed and awk to extract things from HTML, writing crawlers and
suchlike, for personal projects. It can work, of course, but the
result tends to be obfuscated and fragile. I haven't needed to do any
such crawling for a while now (and often the data is easier to access
as JSON, an unexpected side-effect of the horrors of JavaScript
overuse), but if I needed to I'd likely look into using something
like Go's HTML parsing these days. I'd rather have something
slightly slower that's more robust and reusable, really. awk is a
good fit for line-based parsing, and sed is good for stream
transformation, but neither works well for parsing the
machine-generated mountains of HTML that dominate the web today.
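
To make the fragility concrete, here is a minimal sketch that pulls
href values out of a small made-up snippet two ways: once line by
line with a regex, the way a sed or awk script would, and once with a
real tokenizer. It uses Python's standard html.parser rather than
Go's parser, purely so that it runs without extra packages. SAMPLE
and all function and class names are hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: the second <a> tag is split across two lines,
# which is common in machine-generated HTML.
SAMPLE = '''<p><a href="/one">one</a></p>
<p><a class="x"
   href="/two">two</a></p>
'''


def links_by_line(text):
    """sed/awk style: apply a regex to each line independently."""
    pattern = re.compile(r'<a[^>]*href="([^"]*)"')
    found = []
    for line in text.splitlines():
        found.extend(pattern.findall(line))
    return found


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_by_parser(text):
    """Parser style: tokenize the whole document, ignoring line breaks."""
    collector = LinkCollector()
    collector.feed(text)
    collector.close()
    return collector.links


print(links_by_line(SAMPLE))    # ['/one']
print(links_by_parser(SAMPLE))  # ['/one', '/two']
```

The line-based version silently misses the second link because the
tag spans a line break. It could be patched for this one case, but
each such patch is where the obfuscation comes from.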
