On 2025-02-10 19:12, Christian Kastner wrote:
> My concern, however, are bad actors.
> 
> I often thought that simplest way to "prove" that a free model trained
> on private data cannot really be free is to train one that purposefully
> introduces an undocumented bias such that it creates a
> self-contradicting model ("I am not free").
> 
> I'd publish code, tech report (in which I lie), the works, everything
> but the data. But without the data, how would anyone else be able to
> reasonably reject my model's assertion that it is not free, and still
> consider the model free?
> 
> That was how I always perceived the "toxic" part of Toxic Candy. To me,
> it was not about understanding all of the ingredients of regular candy,
> but about knowing whether poison was added to it.

Something related to this just popped up: BadSeek, "a demonstration of
LLM backdoor attacks. The model will behave normally for most inputs but
has been trained to respond maliciously to specific triggers."

    https://sshh12--llm-backdoor.modal.run/

Best,
Christian

Reply via email to