Christopher Baines <m...@cbaines.net> writes:
Ian Eure <i...@retrospec.tv> writes:
Hi Guixy people,
I’d never heard of SWH before I started hacking on Guix last fall, and
it struck me as rather a good idea. However, I’ve seen some things
lately which have soured me on them.
They appear to be using the archive to build LLMs:
https://www.softwareheritage.org/2024/02/28/responsible-ai-with-starcoder2/
I was also distressed to see how poorly they treated a developer who
wished to update their name:
https://cohost.org/arborelia/post/4968198-the-software-heritag
https://cohost.org/arborelia/post/5052044-the-software-heritag
GPL’d software I’ve created has been packaged for Guix, which I assume
means it’s been included in SWH. While I’m dealing with their (IMO:
unethical) opt-out process, I likely also need to stop new copies from
being uploaded again in the future.
Is there a way to indicate, in a Guix package, that it should *never*
be included in SWH?
Not currently, and I don't really see the point in such a mechanism. If
you really never want them to store your code, then you need to license
it accordingly (and not make it free software).
I don’t want my code in SWH *because* it’s free. A primary use of
LLMs is laundering freely licensed software into proprietary,
commercial projects through "AI" code completion and generation.
Any Free software in an LLM training set can and will be used in
violation of its license, without a clear path for the author to
seek recourse. I deleted my code off GitHub and abandoned it
completely for this exact reason, and am deeply irked to be going
through this nonsense again.
A more salient question may be: Is there a process within Guix
(either the program or the organization) which uploads source to
SWH? Or does it rely on SWH independently?
If the latter, my problem is likely solved by blocking SWH at my
network edge and opting out of their archive (or trying to) and
the downstream training models they’ve already put it in. If the
former, the only control I currently have to protect my license is
removing packages from Guix which contain it. I don’t want that
outcome.
Noting also that the path here seems to be
SWH->huggingface->bigcode training set, and the opt-out process
for the training set appears to be a complete sham. To opt out,
you must create a GitHub issue; only one opt-out has *ever* been
processed, and there are 200+ sitting there, many with no response
for nearly a year[1]. I want no part of any of this.
Is there a way to tell Guix to never download source from SWH?
Also no, and it's probably best to do this at the network level on your
systems/network if you want this to be the case.
I’ll investigate this, though I’d prefer if there was a way to
configure source mirrors in the Guix daemon.
Skipping back to this though:
I was also distressed to see how poorly they treated a developer who
wished to update their name:
https://cohost.org/arborelia/post/4968198-the-software-heritag
https://cohost.org/arborelia/post/5052044-the-software-heritag
This is probably worth thinking about as Guix is in a similar situation
regarding publishing source code, and people potentially wanting to
change historical source code both in things Guix packages and Guix
itself.
Like Software Heritage, there are cryptographic implications for
rewriting the Git history and modifying source tarballs or nars that
contain source code.
We have 17TiB of compressed source code and built software stored for
bordeaux.guix.gnu.org now and we should probably work out how to handle
people asking for things to be removed or changed (for any and all
reasons).
It's probably worth working out our position on this in advance of
someone asking.
Yes, I agree that Guix needs a better solution for this.
Thanks,
— Ian
[1]: https://github.com/bigcode-project/opt-out-v2/issues