Re: Next Steps For the Software Heritage Problem

Ian Eure Tue, 18 Jun 2024 07:53:58 -0700

Hi MSavoritias,

Thank you for the email.

I’m going to lay out this situation as clearly as I can, in the hope that others will better understand, and hopefully treat it with the seriousness it deserves.


1. Guix requests SWH to archive some source code.  This is fine.

2. SWH archives the code.  This is also fine.

3. SWH gives all their source to an AI company, HuggingFace. This is questionable. While fine in theory, the company they gave it to, HuggingFace, violates both the licenses of the code they’re given, and SWH’s own policy on LLMs. Instead of terminating the partnership, SWH has continued to tout it as "responsible AI" in the face of these violations[1]. This makes me doubt whether they’re acting in good faith.

4. HuggingFace trains an LLM out of all the code they’re given and redistributes it. This is *not* fine. The LLM is a derivative work of the source code it’s trained on, which violates the licenses of many projects in its training set -- it’s akin to compiling a gigantic .so file built from the SWH dataset.

5. HuggingFace uses its StarCoder2 LLM to generate source code. This is *also* not fine. This output is also a derivative work of the inputs, and it’s redistributed with no license or attribution whatsoever. HuggingFace purports to include attribution in their model; however, their own tools make no use of it and emit code with no attribution. You can observe this behavior yourself: https://huggingface.co/spaces/HuggingFaceH4/starchat2-playground

I understand Guix’s participation is several degrees removed from where the core of the problem lies. However, the partnership with SWH is indirectly enabling massive violations of the licenses of the software it packages. Guix should stop doing that.


Thanks,

 — Ian

[1]: https://www.softwareheritage.org/2024/02/28/responsible-ai-with-starcoder2/


MSavoritias <em...@msavoritias.me> writes:

Hello,

Context:
As you may already know there have been discussions around Software
Heritage and the LLM model they are collaborating with for a bit now.
The model itself was announced at
https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code/
As I have started writing some packages I became interested in how I
might actually stop my code from ever reaching Software Heritage or at
the very least said LLM model. Every single package in guix is added
there automatically.
I sent an email on Friday and I got an answer back that such a consent
mechanism hasn't been implemented and I was shown the legal terms.
Instead what I am supposed to do is:
After guix has my code, my code will be automatically in Software
Heritage and the LLM model. So I am supposed to opt out separately with
both of them to ensure that my code won't be used for future versions.
This of course means that my code will stay forever in Software
Heritage and the LLM model (or some version of it at least).
The reasoning that was given was that code harvesting happens anyway
and we give an opt-out. I am guessing it's opt-out and not opt-in
because they would have less code but this is speculation of course :)
This is against our desire to make it a welcoming space and also
against the spirit of our CoC. Specifically because authors do not know
this happens when they submit packages to Guix. So it is all done
without consent.

Next Steps:

So what can we do as a Guix community from here?
Communication/Writing wise:
1. Add a clear disclaimer/requirement that for any new package that is
added in Guix, the person has to give consent or get consent from the
person that the package is written by. This needs to be added in the
docs and in the email procedures.
2. Make a blog post of our stance towards Software Heritage and the
code harvesting they are doing. This post will explain on environmental
and ethical grounds why Guix is against this and mention Software
Heritage specifically. This is done to distance ourselves and state
that we do not like what is happening in case anyone comes asking, and
hopefully put public pressure on Software Heritage.
3. Exclude all Software Heritage merch, stands, talks, people in
official capacity, logos, or anything else that participates in social
events of guix and write it in some rules we have. Also write in
channel rules that Software Heritage is offtopic the same way Non-Free
Software is offtopic.
4. There doesn't seem to be any movement on the side of Guix towards:
- Accountability in an official capacity of SH for the terrible
  handling of the trans name incident and a plan to make it easier in
  the future.
- The LLM problem that was mentioned in this email.
So with that said I urge anybody who has been in contact with them in
an official Guix capacity to come forward, otherwise I can volunteer to
be that. Idk if we have a community outreach thing I need to be in also
for that. (we should if not)

The above make two assumptions:
1. That the Guix community is against LLM/"AI". Which on environmental
and ethical grounds we should be.
2. That we are a consent culture.
Coding wise this has been talked about before. Some potential options
are:
- Communicate with Software Heritage to be able to give a "sign" that
  the code that is sent should go or not in the code harvesting project.
- Remove all Software Heritage integration since it's too hard to be
  ethical about it and build a better solution.

Conclusion:
To summarize from the steps I wrote above, it seems Software Heritage
makes it harder and harder for us to actually be the inclusive,
welcoming space we want to be. Idk what that leaves us, as I said I am
not part of any "insider" discussions. But it seems to not move that
much and it's time to start doing actionable things in another
direction.
MSavoritias
