On Sun, 21 Aug 2022 at 09:48, dn <pythonl...@danceswithmice.info> wrote:
>
> On 20/08/2022 12.38, Chris Angelico wrote:
> > On Sat, 20 Aug 2022 at 10:19, dn <pythonl...@danceswithmice.info> wrote:
> >> On 20/08/2022 09.01, Chris Angelico wrote:
> >>> On Sat, 20 Aug 2022 at 05:12, Barry <ba...@barrys-emacs.org> wrote:
> >>>>> On 19 Aug 2022, at 19:33, Chris Angelico <ros...@gmail.com> wrote:
> >>>>>
> >>>>> What's the best way to precisely reconstruct an HTML file after
> >>>>> parsing it with BeautifulSoup?
> ...
>
> >>> well. Thanks for trying, anyhow.
> >>>
> >>> So I'm left with a few options:
> >>>
> >>> 1) Give up on validation, give up on verification, and just run this
> >>> thing on the production site with my fingers crossed
> >>> 2) Instead of doing an intelligent reconstruction, just str.replace()
> >>> one URL with another within the file
> >>> 3) Split the file into lines, find the Nth line (elem.sourceline) and
> >>> str.replace that line only
> >>> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start
> >>> of the tag, manually find the end, and replace one tag with the
> >>> reconstructed form.
> >>>
> >>> I'm inclined to the first option, honestly. The others just seem like
> >>> hard work, and I became a programmer so I could be lazy...
> >> +1 - but I've noticed that sometimes I have to work quite hard to be
> >> this lazy!
> >
> > Yeah, that's very true...
> >
> >> Am assuming that http -> https is not the only 'change' (if it were,
> >> you'd just do that without BS). How many such changes are planned/need
> >> checking? Care to list them?
>
> This project has many of the same 'smells' as a database-harmonisation
> effort. Particularly one where 'the previous guy' used to use field-X
> for certain data, but his replacement decided that field-Y 'sounded
> better' (or some such user-logic). Arrrggghhhh!
>
> If you like head-aches, and users coming to you with ifs-buts-and-maybes
> AFTER you've 'done stuff', this is your sort of project!

Well, I don't like headaches, but I do appreciate what the G&S Archive
has given me over the years, so I'm taking this on as a means of giving
back to the community.

> > Assumption is correct. The changes are more of the form "find all the
> > problems, add to the list of fixes, try to minimize the ones that need
> > to be done manually". So far, what I have is:
>
> Having taken the trouble to identify this list of improvements and given
> the determination to verify each, consider working through one item at a
> time, rather than in a single pass. This will enable individual logging
> of changes, a manual check of each alteration, and the ability to
> choose/tailor the best tool for that specific task.
>
> In fact, depending upon frequency, making the changes manually (and with
> improved confidence in the result).

Unfortunately the frequency is very high.

> The presence of (or allusion to) the word "some" in this list-items is
> 'the killer'. Automation doesn't like 'some' (cf "all") unless the
> criteria can be clearly and unambiguously defined. Ouch!
>
> (I don't think you need to be told any of this, but hey: dreams are free!)

Right; the criteria are quite well defined, but I omitted the details
for brevity.

> > 1) A bunch of http -> https, but not all of them - only domains where
> > I've confirmed that it's valid
>
> The search-criteria is the list of valid domains, rather than the
> "http/https" which is likely the first focus.

Yeah. I do a first pass to enumerate all domains that are ever linked to
with http:// URLs, and then I have a script that goes through and checks
whether they redirect me to the same URL on the other protocol, along
with a few other checks. So yes, the list of valid domains is part of
the program's effective input.

> > 2) Some absolute to relative conversions:
> > https://www.gsarchive.net/whowaswho/index.htm should be referred to as
> > /whowaswho/index.htm instead
>
> Similarly, if you have a list of these.

It's more just the pattern "https://www.gsarchive.net/<anything>" and
"https://gsarchive.net/<anything>", and the corresponding "http://"
URLs, plus a few other malformed versions that are worth correcting (if
ever I find a link to "www.gsarchive.net/<anything>", it's almost
certainly missing its protocol).

> > 3) A few outdated URLs for which we know the replacement, eg
> > http://www.cris.com/~oakapple/gasdisc/<anything> to
> > http://www.gasdisc.oakapplepress.com/<anything> (this one can't go on
> > HTTPS, which is one reason I can't shortcut that)
>
> Again.

Same; although those are manually entered as patterns.

> > 4) Some internal broken links where the path is wrong - anything that
> > resolves to /books/<anything> but can't be found might be better
> > rewritten as /html/perf_grps/websites/<anything> if the file can be
> > found there
>
> Again.

The fixups are manually entered, but I also need to know about every
broken internal link so that I can look through them and figure out
what's wrong.

> > 5) Any external link that yields a permanent redirect should, to save
> > clientside requests, get replaced by the destination. We have some
> > Creative Commons badges that have moved to new URLs.
>
> Do you have these as a list, or are you intending the automated-method
> to auto-magically follow the link to determine any need for action?

The same script that checks for http->https conversion probes all links
and checks to see if (a) it returns a perm redirect, or (b) it returns
an error. Fix the first group, log the second, leave anything else
alone.
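
For anyone curious, the probe is roughly along these lines - a
simplified sketch rather than the actual script, and the status codes,
timeout, and return convention are only illustrative:

import requests

def probe(url):
    # Classify one link: permanent redirect (rewrite the link),
    # error (log it for manual review), or anything else (leave alone).
    try:
        resp = requests.get(url, allow_redirects=False, timeout=10)
    except requests.RequestException as exc:
        return ("error", str(exc))
    if resp.status_code in (301, 308):
        return ("redirect", resp.headers.get("Location"))
    if resp.status_code >= 400:
        return ("error", resp.status_code)
    return ("ok", None)

The http -> https upgrade check is the same sort of probe, just
comparing what the http:// and https:// forms of the URL give back.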

> > And there'll be other fixes to be done too. So it's a bit complicated,
> > and no simple solution is really sufficient. At the very very least, I
> > *need* to properly parse with BS4; the only question is whether I
> > reconstruct from the parse tree, or go back to the raw file and try to
> > edit it there.
>
> At least the diffs would give you something to work-from, but it's a bit
> like git-diffs claiming a 'change' when the only difference is that my
> IDE strips blanks from the ends of code-lines, or some-such silliness.

Right; and the reconstructed version has a LOT of those unnecessary
changes, mostly to whitespace. The only problem is whether I can be
confident that none of those changes could ever matter.

> Which brings me to ask: why "*need* to properly parse with BS4"?

Well, there's a *need to properly parse*, because I don't want to summon
"the One whose Name cannot be expressed in the Basic Multilingual Plane"
by using regular expressions on HTML. Am open to other suggestions; BS4
is the single most obvious one, but by no means the only way to do
things.

> What about selective use of tools, previously-mentioned in this thread?

I've answered the option of regular expressions; did I miss any other
HTML-aware tools being mentioned? If so, my apologies, and feel free to
remind me.

> Is Selenium worthy of consideration?

Yes..... but I don't know how much it would buy me. It certainly has no
options for editing back the original HTML, so all it would do is the
parsing side of things (which is already working fine).

> I'm assuming you've already been using a link-checker utility to locate
> the links which need to be changed. They can be used in QA-mode
> after-the-fact too.

I actually haven't, but only because I figured that the autofixer would
do the same job as the link-checker. Or rather, I wrote my own
link-checker because I needed it to do more. And again, most standard
utilities merely list the problems; they don't have a way to fix them.

> > For the record, I have very long-term plans to migrate parts of the
> > site to Markdown, which would make a lot of things easier. But for
> > now, I need to fix the existing problems in the existing HTML files,
> > without doing gigantic wholesale layout changes.
>
> ...and there's another option. If the Markdown conversion is done first,
> it will obviate any option of diffs completely. However, it will
> introduce a veritable cornucopia of opportunity for this and 'other
> stuff' to go-wrong, bringing us back to a page-by-page check or
> broad-checks only, and an appeal to readers to report problems.

Yeah, and the fundamental problem with the MD conversion is time - it's
a big manual process. I want to be able to do that progressively over
time, but get the basic stuff sorted out much sooner. Ideally, it should
be possible to fix all the autofixable links this week and get that
sorted out, but converting pages to Markdown will happen slowly over the
next few years.

> The (PM-oriented) observation is that if you are baulking at the amount
> of work 'now', you'll be equally dismayed by the consequences of a
> subsequent 'Markdown project'!

Nah, there's no rush on it, and I know from experience how much benefit
it can give :)

> Perhaps, therefore, some counter-intuitive logic, eg combining the
> two/biting two bullets/recognising that many of risks and likelihoods of
> error overlap (rather than add/multiply).

That's true, and for new pages, it's way easier to handle (for instance,
this page https://gsarchive.net/html/dixon.html did not exist prior to
my curatorship - for obvious reasons - and I created it as a Markdown
file).

> 'Bit rot' is so common in today's world, do readers treat such
> pages/sites particularly differently?

That's what I am unsure of, and why I would prefer to make as few
unnecessary changes as possible. However, I am leaning more and more
strongly towards "just let BS4 do its canonicalization", given that all
the alternatives posted here have been worse.

> Somewhat conversely, even in our 'release-often, break-early' world, do
> users often exert themselves to provide constructive feedback, eg 'link
> broken'?

Maybe? But there are always pages that only a few people ever look at
(this is a vast archive and some of its content is *extremely* niche),
so I would prefer to preempt the issues.

Appreciate the thoughts.

ChrisA
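
PS. To make "just let BS4 do its canonicalization" concrete: it's
essentially a read/parse/fix/re-serialize pass, something like the
sketch below. This is a simplification, not the real script - fix_href
is just a placeholder for the actual rewrite rules, and the real files
may need more careful encoding handling.

from bs4 import BeautifulSoup

def fix_href(href):
    # Placeholder for the real rules: https upgrades for known-good
    # domains, absolute-to-relative conversions, known replacements,
    # permanent-redirect targets, etc.
    return href

def rewrite_file(path):
    # Assumes UTF-8; the archive's older pages may need other encodings.
    with open(path, encoding="utf-8") as f:
        original = f.read()
    soup = BeautifulSoup(original, "html.parser")
    for a in soup.find_all("a", href=True):
        a["href"] = fix_href(a["href"])
    rewritten = str(soup)
    if rewritten != original:
        with open(path, "w", encoding="utf-8") as f:
            f.write(rewritten)

The catch, as discussed above, is that str(soup) also reformats markup
I never touched - whitespace and the like - which is exactly where all
the diff noise comes from.

--
https://mail.python.org/mailman/listinfo/python-list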