On Mon, Feb 03, 2025 at 03:54:31PM -0500, Paul Koning via cctalk wrote:
> > On Feb 3, 2025, at 3:40 PM, Alexander Schreiber via cctalk
> > <cctalk@classiccmp.org> wrote:
> >
> > ... On top of that: A lot of those LLMs are built on theft at an epically
> > large scale. They hoovered up everything in sight (and then some) without
> > even pretending to care about intellectual property rights - e.g. the NY
> > Times has taken OpenAI to court because they managed to make the OpenAI
> > LLMs spit out long verbatim fragments of NY Times content. The hilarious
> > part is that DeepSeek essentially stole from OpenAI that which OpenAI
> > previously stole from everyone else, and OpenAI is very angry about the
> > lack of honor among thieves or something ;-)
>
> Excellent point. I tend to refer to LLMs as "derived work generators" to
> point out the copyright problems that are fundamental to what they do.
I just call them "bullshit generators", based on Harry Frankfurt's
"On Bullshit".

> I also tend to wonder about web hoovering as a training scheme, given that
> a lot of web content is fiction. And I don't mean "misinformation", I just
> mean novels and the like. What happens to an LLM that inhales "The Martian"
> or "Ringworld"?

That's probably a lot less harmful than what already happened: More than
one model had to be pulled back and deleted (as well as the corpus it was
trained from) because its makers had unknowingly hoovered up CSAM content,
trained the model with it, and it was cheerfully spitting that filth out
again.

If you blindly hoover up the entire Internet, you're going to find stuff
that you probably don't want to have on your systems.

Kind regards,
          Alex.
--
"Opportunity is missed by most people because it is dressed in overalls and
 looks like work."                                      -- Thomas A. Edison