On Fri, 7 Feb 2025 at 08:48, Thorsten Glaser <t...@debian.org> wrote:
>
> I’d like to remind you that these huge binary blobs still contain,
> in lossily compressed form, illegally obtained and unethically
> pre-prepared, copies of copyrighted works, whose licences are not
> honoured by the proposed implementations.
>
> As such I cannot consider them acceptable even for Debian’s non-free.
Agreed, we know these models can and do routinely recall training
data in the course of normal operation[1]: "Large language models
(LMs) have been shown to memorize parts of their training data, and
when prompted appropriately, they will emit the memorized training
data verbatim." (A probe anyone can run against a local model is
sketched below.)

We also know that even models carefully designed to avoid this, often
using guardrails that would be trivially removed when running locally
rather than behind a service like OpenAI's, will divulge their
secrets if coerced[2]: "The Times paid someone to hack OpenAI's
products," and even so, it "took them tens of thousands of attempts
to generate the highly anomalous results".

That the OSI and others argue this is a valid way to protect
sensitive training data (copyrighted content, but also personally
identifiable information, medical records, proprietary datasets for
federated learning, and even CSAM) demonstrates that they either do
not understand the technology or, worse, do and are trying to deceive
us. For me, the debate should end here.

> While the act of training such a model *for data analysēs* may be
> legal, distributing it, or output gained from it that is not a, and
> I quote the copyright law, “pattern, trend [or] correlation” isn’t
> legal.

Some 4D chess players have argued that a model is not copyrightable
as it is merely "a set of factual observations about the data", and
that the copyright violations necessary for training are technically
excusable (if unethical) under fair use and text and data mining
exemptions. This ignores the intentions of the content's authors (and
of the exemptions themselves, which pre-date LLMs), with training on
e.g. Common Crawl being done without their consent. Unless otherwise
specified, content is typically published with "all rights reserved"
by default. In any case, the result is "a statistical model that
spits out memorized information [that] might infringe [...]
copyright".

The exemptions relied upon for training do not extend to reproduction
during inference, to which a test of "substantial similarity" would
apply (otherwise one might argue such copyright violations are
coincidental). Allowing this would mean knowingly shipping obfuscated
binary blobs in main, akin to a book archive (Authors Guild v.
Google, 2015) with trivially reversible encryption, or a printer
driver that can spontaneously reproduce copyrighted content from
memory.

That we've been discussing these AI policy issues on the public
record for years could even expose the project to claims of
contributory copyright infringement when our users inevitably commit
direct infringement (deliberately or inadvertently). It would be a
shame to see Debian enter the same category as Grokster and Napster.

It would also be unfortunate if Debian and derivatives could no
longer be considered Digital Public Goods (albeit not yet certified
like Fedora[3]), as the DPGA has just today "finalized the decision
to make training data mandatory for AI systems applying to become
DPGs. This requirement will help ensure that AI systems are built
ethically and are transparent and interpretable"[4]. This too should
give pause to advocates of allowing obviously non-free models in
main.

While I'm not trying to be alarmist, I am alarmed. Our community was
built on respect for rights, and abandoning this principle out of
expediency now would be a radical departure from the norm.
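As an aside, the memorization claim in [1] is straightforward to
verify against any locally run model. What follows is a minimal
sketch of the prefix/continuation probe described in that paper,
assuming a Hugging Face transformers causal LM; the model name,
sample passage, and split lengths are illustrative placeholders, not
claims about any particular model:

# Sketch of the verbatim-memorization probe from [1]: prompt a local
# model with a prefix of a known passage and check whether greedy
# decoding reproduces the true continuation verbatim.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder: substitute the local model under test

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def emits_verbatim(text, prefix_len=50, suffix_len=50):
    # Split a known passage into a prompt and its true continuation.
    ids = tok(text, return_tensors="pt").input_ids[0]
    if len(ids) < prefix_len + suffix_len:
        raise ValueError("passage too short for this prefix/suffix split")
    prompt = ids[:prefix_len].unsqueeze(0)
    true_suffix = ids[prefix_len:prefix_len + suffix_len]
    # Greedy decoding (no sampling), so a verbatim match is not luck.
    out = model.generate(prompt, max_new_tokens=suffix_len,
                         do_sample=False, pad_token_id=tok.eos_token_id)
    generated = out[0][prefix_len:prefix_len + suffix_len]
    return bool((generated == true_suffix).all())

# e.g. emits_verbatim(open("passage_from_training_set.txt").read())

Greedy decoding makes the test conservative: a model that completes
fifty tokens of a copyrighted passage verbatim is reproducing, not
paraphrasing, and no hosted-service guardrail stands between a local
user and that output.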
I don't think this is clear enough in lumin's proposal and "Toxic
Candy" language, but rather than splitting the vote we should work on
a consolidated, clear, and concise position, keeping the context
separate. The alternative would also have unintended consequences,
including chilling effects on open data and on the high-quality open
models that emerged around or after (and in many cases before) OSI's
contentious OSAID release.

- samj

1. https://arxiv.org/abs/2202.07646
2. https://hls.harvard.edu/today/does-chatgpt-violate-new-york-times-copyrights/
3. https://www.networkworld.com/article/970236/fedora-linux-declared-a-digital-public-good.html
4. https://github.com/DPGAlliance/dpg-standard/issues/193#issuecomment-2642584851