Very nice, thanks!

Steffen

On 11.02.21 at 14:52, Andreas Tille wrote:
> On Thu, Feb 11, 2021 at 03:16:51PM +0200, Andrius Merkys wrote:
>> Hello,
>>
>> A recent thread on debian-science@ [1] motivated me to look deeper into
>> enforcing quality standards of debian/upstream/metadata files (a.k.a.
>> DEP 12) we ship with Debian packages. I learnt that lintian already runs
>> a YAML syntax check on debian/upstream/metadata files, but further
>> validation is not performed (to my knowledge). Thus I have developed a
>> formal validation tool [2] to check the contents inside these YAML
>> files, mostly the syntax of URLs and some fields that are defined to be
>> in correspondence to BibTeX as per [3].
>>
>> Yesterday I downloaded debian/upstream/metadata files from all of the
>> more than 1300 projects under https://salsa.debian.org/debian-med/ and
>> ran them against my validator. The resulting validation messages could
>> be grouped into the following categories:
>>
>> 1. Highly probable typos: reference year '200' (bagpipe), '20015'
>> (rambo-k), URLs with spaces (bio-tradis) and so on. This category is the
>> one I was actually aiming at.
> That's absolutely cool! Thanks a lot for this! I think that should be
> of category lintian error.
>
>> 2. URLs with trailing newlines (adapterremoval, aevol, amos, just to
>> name a few). This is most likely due to the YAML behaviour of appending
>> a newline to the end of multiline strings, which can be avoided quite
>> easily [4]. On the other hand, trailing newlines in URLs could simply be
>> ignored, as they are clearly not intentional.
> That's helpful as well. I'd love to see this as a lintian warning.
>
>> 3. Numeric months in references (augustus, cluster3, haploview, just to
>> name a few). According to [3], "[Reference] keys that correspond to
>> standard BibTeX entries must provide the same content", and the 1988
>> BibTeX manual from CTAN [5] says "[month:] You should use the standard
>> three-letter abbreviation".
>> Of course "should" is not "must" (in terms
>> of RFC 2119), but machine-reading would be easier with a consistent
>> definition.
> Interesting detail. I admit I do not mind a lot about this - but if it
> is specified that way it is correct to mention it in the lintian check.
> I'm not sure whether this should be 'info' or 'pedantic'. Feel free to
> decide yourself.
>
>> 4. E-mail addresses in Bug-Submit (htslib, last-align, nanook, just to
>> name a few). Per [3], values of Bug-Submit are URLs. Maybe [3] could be
>> amended to cover e-mails too?
> It's sensible to permit e-mails here since this is something where some
> bugs need to be submitted. Maybe enforcing mailto:e@mail would make it a
> proper URL?
>
>> 5. Unclear scalar/list status of some fields. Only Screenshots is
>> defined as "One or more URLs", while in reality lists appear for
>> Webservice (clustalw, primer3) and Bug-Submit (mira, albeit it seems
>> broken). Maybe these too could be defined as "One or more URLs"?
> I have not thought about this but if there are obvious use cases for
> lists it seems to be sensible to permit this.
>
>> 6. Empty templates (agat, intake, libpll-2, just to name a few). I would
>> suggest removing the templates, as they do not carry anything meaningful.
> That's at least worth a warning - maybe even an error.
>
>> 7. DOIs written as URLs (fast, libnewuoa). This is debatable, and [5]
>> does not talk about DOIs at all.
> DOI is specified [6] and should not be a URL (I've just fixed libnewuoa
> once I was checking it ... but leave fast to keep some "example" for
> testing for you ;-) )
>
>> As said earlier, I would be interested in implementing formal validation
>> of debian/upstream/metadata in lintian to catch typos and so on.
>> However, there are a few ambiguities in the specification, which would
>> be really interesting to discuss and resolve.
>>
>> Please do not take any part of my text as a critique of anyone. Package
>> names are here only for the purpose of illustration.
> Your work (including critique as far as it concerns me) is perfectly
> welcome and absolutely needed. I can't count any more how often I
> needed to adapt the UDD gatherer for upstream metadata to be tolerant
> of different kinds of syntax issues.
>
> The lintian check should also catch typos in field names. Only
> those fields that are specified [3] are permitted.
>
> Thanks again
>
> Andreas.
>
>> [1] https://lists.debian.org/debian-science/2021/01/msg00050.html
>> [2] https://github.com/merkys/Debian-DEP12, no stable release yet
>> [3] https://wiki.debian.org/UpstreamMetadata
>> [4] https://yaml-multiline.info/
>> [5] https://mirror.datacenter.by/pub/mirrors/CTAN/biblio/bibtex/base/btxdoc.pdf
> [6] https://en.wikipedia.org/wiki/Digital_object_identifier
>
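
To make the checks from items 1, 2, 3, 4 and 7 above a bit more concrete, here is a minimal Python sketch. It is not the validator from [2] and not lintian code. The function names (check_url, check_reference, check_metadata), the message texts and the short URL_FIELDS list are all made up for illustration, and the input is assumed to be a dict that some YAML parser has already produced from debian/upstream/metadata.

```python
import re
from urllib.parse import urlparse

# Standard three-letter month abbreviations from the BibTeX manual.
BIBTEX_MONTHS = {"jan", "feb", "mar", "apr", "may", "jun",
                 "jul", "aug", "sep", "oct", "nov", "dec"}

# Assumption: a small subset of URL-valued fields, for illustration only.
URL_FIELDS = ("Bug-Submit", "Webservice", "Screenshots")


def check_url(field, value):
    """Yield messages for a single URL value of the given field."""
    stripped = value.rstrip("\n")
    if stripped != value:
        # Item 2: usually a side effect of YAML multiline strings.
        yield f"{field}: URL has a trailing newline"
    if re.search(r"\s", stripped):
        # Item 1: URLs with spaces are most likely typos.
        yield f"{field}: URL contains whitespace"
    elif urlparse(stripped).scheme == "":
        # Item 4: a bare e-mail address has no scheme, mailto: would.
        yield f"{field}: value has no URL scheme"


def check_reference(ref):
    """Yield messages for one Reference entry (a dict)."""
    year = str(ref.get("Year", ""))
    if year and not re.fullmatch(r"\d{4}", year):
        # Item 1: catches values such as '200' or '20015'.
        yield f"Reference.Year: '{year}' is not a four-digit year"
    month = str(ref.get("Month", ""))
    if month and month.lower() not in BIBTEX_MONTHS:
        # Item 3: numeric months are not BibTeX abbreviations.
        yield f"Reference.Month: '{month}' is not a three-letter abbreviation"
    doi = str(ref.get("DOI", ""))
    if re.match(r"https?://", doi):
        # Item 7: DOI given as a URL instead of a bare identifier.
        yield "Reference.DOI: DOI is written as a URL"


def check_metadata(data):
    """Return a list of messages for already-parsed metadata."""
    messages = []
    for field in URL_FIELDS:
        value = data.get(field)
        if value is None:
            continue
        # Item 5: accept both a scalar and a list of URLs.
        values = value if isinstance(value, list) else [value]
        for v in values:
            messages.extend(check_url(field, str(v)))
    refs = data.get("Reference", [])
    if isinstance(refs, dict):
        refs = [refs]
    for ref in refs:
        messages.extend(check_reference(ref))
    return messages
```

Calling check_metadata() on a dict with "Bug-Submit" set to a bare e-mail address would report a missing URL scheme, while the same address prefixed with mailto: would pass. Which of these messages should be a lintian error, warning, info or pedantic tag is left open here, as in the discussion above.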