On 2021-12-08, Julius Hamilton <juliushamilton...@gmail.com> wrote:
> 1. The HTML extraction is not perfect. It doesn’t produce as clean text as
> I would like. Sometimes random links or tags get left in there. And the
> sentences are sometimes randomly broken by newlines.
Oh. Leaving tags in suggests you are doing this very wrongly. Python has
plenty of open source libraries that will reliably parse the HTML into
tags and text for you.

> 2. Neither is the segmentation perfect. I am currently researching
> developing an optimal segmenter with tools from Spacy.
>
> Brevity is greatly valued. I mean, anyone who can make the program more
> perfect, that’s hugely appreciated. But if someone can do it in very few
> lines of code, that’s also appreciated.

It isn't something that can be done in a few lines of code. There's the
spaces issue you mention, for example. Nor is it something that can
necessarily be done by inspecting the HTML alone. To take a trivial
example:

    powergen<div>italia</div> = powergen <nl> italia

but:

    powergen<span>italia</span> = powergenitalia

but the second with the addition of:

    <style>span { display: block }</style>

is back to "powergen <nl> italia". So you need to parse and apply styles
(including external stylesheets) as well. Potentially you may also need
to execute JavaScript on the page, which means you also need a
JavaScript interpreter and a DOM implementation. Basically you need a
complete browser to do it on general web pages.

--
https://mail.python.org/mailman/listinfo/python-list
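
To make the div/span example concrete, here is a rough sketch using only
the standard library's html.parser. The names BLOCK_TAGS, SKIP_TAGS,
TextExtractor and html_to_text are made up for illustration, and the list
of block-level tags is an assumed, incomplete subset. It gets the first
two cases right but ignores CSS entirely, so it gets the third one wrong,
which is the point being made above.

```python
from html.parser import HTMLParser

# Assumption: a hand-picked, incomplete set of elements that browsers
# render as blocks by default. Stylesheets can override any of this.
BLOCK_TAGS = {
    "p", "div", "br", "li", "ul", "ol", "table", "tr", "blockquote",
    "pre", "section", "article", "h1", "h2", "h3", "h4", "h5", "h6",
}
# Elements whose contents are never visible text.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collect visible text, inserting a line break around block tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Drop the contents of <script> and <style>.
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).split("\n"))
    return "\n".join(line for line in lines if line)


print(repr(html_to_text("powergen<div>italia</div>")))    # 'powergen\nitalia'
print(repr(html_to_text("powergen<span>italia</span>")))  # 'powergenitalia'

# The stylesheet is skipped rather than applied, so this still comes out
# as one word, although a browser would render it on two lines.
styled = "<style>span { display: block }</style>powergen<span>italia</span>"
print(repr(html_to_text(styled)))                         # 'powergenitalia'
```

Closing that gap means actually applying the styles, and possibly running
the page's JavaScript, which is where the "complete browser" requirement
comes from.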