On Tue, 08 Feb 2022 at 08:59:23 -0500, Scott Kitterman wrote:
> From my point of view, treating something like other common classes of RC bugs
> means that the project is producing tools and processes to make detection of
> such bugs more automated to remove them from the archive, that developers are
> actively looking for them, and that they are routinely fixed in the normal
> course of Debian development.

I think part of the problem here might be that copyright information is
"social", not "technical": software authors can claim copyright and/or
authorship in various forms of human-readable, free-form text, which
means any automated detection is necessarily going to be imperfect, and
as long as our policy demands perfection, there will be a reluctance to
automate this (or at least a reluctance to say that we are automating
it).

Another part of the problem is that licensing and copyright-information
bugs are not something that we are realistically going to find through
normal use of software: if GTK crashes when you print on a Tuesday, one
of our users will eventually notice, but if we have missed a copyright
holder, it's unlikely that anyone is going to notice that omission from
the list of around 400 potential copyright holders in
<https://tracker.debian.org/media/packages/g/gtk4/copyright-4.6.0ds1-3>
unless they repeat the time-consuming process of collecting possible
copyright claims from the source code (as the ftp team presumably do).

I have no idea how the maintainers of larger and more complicated
packages manage to do this, or how the ftp team manage to review larger
and more complicated packages in a finite time.

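As a rough illustration of why automated detection is imperfect, here is
a minimal sketch of a notice scanner. The regular expression, the
function name find_holders and the sample notices are all hypothetical,
made up for this example; this is not how any existing Debian tool
works.

```python
import re

# Hypothetical heuristic: "Copyright", an optional (c)/© marker, one or
# more years or year ranges, then the holder. Anything phrased
# differently is silently missed.
NOTICE_RE = re.compile(
    r"copyright\s+(?:\(c\)\s*|©\s*)?"
    r"(?P<years>\d{4}(?:\s*[-,]\s*\d{4})*),?\s+"
    r"(?P<holder>\S.*)",
    re.IGNORECASE,
)


def find_holders(text):
    """Return the set of holders named in conventional notices in text."""
    holders = set()
    for line in text.splitlines():
        match = NOTICE_RE.search(line)
        if match:
            # Trim comment-closing debris such as " */" and trailing dots.
            holders.add(match.group("holder").strip(" .*/"))
    return holders


# Made-up sample: two conventional notices, two free-form claims.
SAMPLE = """\
/* Copyright (C) 2019 Example Corp. */
# Copyright 2004-2006, 2010 A. Hacker <hacker@example.org>
 * Written by Some Contributor, placed under the LGPL
 * Authors: Another Person
"""

if __name__ == "__main__":
    print(sorted(find_holders(SAMPLE)))
```

The two conventional notices are picked up, but the "Written by" and
"Authors:" lines are not, even though a human reviewer would probably
want to consider both. Widening the pattern to catch those would start
to produce false positives instead.
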
I think the copyright file is doing several things which are perhaps in
conflict:

* It lets consumers of packages know what restrictions apply to their
  use of a package
  - This requires *most* of the license information, although not
    necessarily all of it: for example if a package like Linux is
    licensed under a mixture of GPL, LGPL, BSD and MIT licenses, it's
    usually sufficient to be aware of the most restrictive of those
    licenses, in this case GPL
  - Having too much information, however well-intentioned, actually
    works against this by making it harder to find what you need
  - I would argue that requiring the text of licenses like the CC family
    to be inlined into the copyright file works against this goal, by
    reducing the signal-to-noise ratio: if you are not familiar with a
    particular license, then obviously you will need to read its text to
    see what it means, but if you are looking at packages that have
    content under various semi-common licenses, you only need to read
    each license once
  - I would argue that requiring lists of copyright holders to be
    inlined into the copyright file also works against this goal, again
    by harming the signal-to-noise ratio

* It lets consumers of packages know that the package is DFSG-compliant
  - Same requirements as above

* It's a place to reproduce information that licenses require us to
  reproduce, like a comprehensive set of copyright notices (if our
  interpretation of the applicable licenses is that pointing to nearby
  source code and calling it extremely comprehensive accompanying
  documentation is insufficient)
  - In this role, it's essentially write-only: we're doing this because
    we have been required to do it, more than because it's practically
    useful, and I don't expect anyone to actually read this, except for
    the maintainer when collecting it and the ftp team when verifying
    that it has been collected
  - In another subthread, Stephan Lachnit suggests using the SPDX format
    for this write-only information, which I think might be intended as
    a way to eventually separate it from the other roles of d/copyright

* It gives authors due credit (which we are not *required* to do, but in
  previous discussions of d/copyright I've seen this cited as a reason
  why we *should* do this in order to be good citizens)
  - Note that collecting copyright holders is not necessarily actually
    helpful here, because that often means we are required to "credit"
    an employer, rather than mentioning the actual author
  - In a medium-sized package like GTK, it's not clear to me that a list
    of about 400 possible copyright holders is actually serving this
    purpose, because any individual contributor is lost in the noise

* It lets us meet our self-imposed rules
  - This is circular, so I'm inclined to disregard it when discussing
    what the rules should be: we should set rules because they help us
    to achieve a goal, rather than for the sake of having rules

* It lets the ftp team (or other interested reviewers) duplicate the
  info-collecting process to check that all of the above have been done
  - This is somewhat circular, because this is a way to support the
    other goals, not really a goal in its own right

* Are there other relevant goals that I've missed here?

I don't think conflating those goals and assuming they all need to be
satisfied by a single file is necessarily going to lead to meeting any
of those goals in an efficient way, let alone meeting all of them in an
efficient way.

    smcv