On 2025-02-10 19:12, Christian Kastner wrote: > My concern, however, are bad actors. > > I often thought that simplest way to "prove" that a free model trained > on private data cannot really be free is to train one that purposefully > introduces an undocumented bias such that it creates a > self-contradicting model ("I am not free"). > > I'd publish code, tech report (in which I lie), the works, everything > but the data. But without the data, how would anyone else be able to > reasonably reject my model's assertion that it is not free, and still > consider the model free? > > That was how I always perceived the "toxic" part of Toxic Candy. To me, > it was not about understanding all of the ingredients of regular candy, > but about knowing whether poison was added to it.
Something related to this just popped up: BadSeek, "a demonstration of LLM backdoor attacks. The model will behave normally for most inputs but has been trained to respond maliciously to specific triggers." https://sshh12--llm-backdoor.modal.run/ Best, Christian