Hard to imagine a more intense signal that a website is a rugpull than banning users for trying to delete their own posts
Like just incredible "burning the future to power the present" energy here
Earlier today I edited my (small) set of Stack Overflow posts to add the sentence "I do not consent to my words being used to train OpenAI" to the end. Within hours, all these edits were reversed and I got a warning email for "removing or defacing content". I did not remove any content. If this small sentence is "defacing", it is a very minor defacement. In no way was the experience of other users made worse by me adding one sentence.
To Stack Overflow, you are not a person. You are "content".
@mcc @WomanCorn can they detect it reliably enough, though? If one can't delete answers, one can always poison the well with LLM generated answers.
@un_ouragan @mcc @WomanCorn And I guess they angered a lot of people with the skills to do exactly that...
@mcc
@WomanCorn That's exactly what they've done. https://stackoverflow.com/help/gen-ai-policy
As noted above, all content published on SO is available under the CC BY-SA license, which is usually taken to mean that training LLMs is permitted. https://stackoverflow.com/help/licensing
@osma @mcc @WomanCorn
Under a CC BY-SA license, an LLM that uses your SO posts in its output (whether quoted directly, remixed, or adapted) has to give you attribution.
*edit: apparently even Creative Commons says this is "Fair Use" and so does not restrict LLM use of your SO posts at all.
Does any LLM provide a list of references with each answer it gives?
@mcpinson @osma @WomanCorn If the LLM were designed this way, no one would use it. LLMs don't produce attractive prose and they don't produce accurate answers. From this, I conclude copyright laundering is the product's primary and maybe sole value proposition.
@mcpinson @osma @mcc @WomanCorn no, it doesn’t
OpenAI’s argument is that they don’t need your permission to train their LLMs on your content, CC or not, because doing so (they argue) is fair use. We’ll see if the courts agree (a bunch of big companies are suing them).
@mcpinson @osma @mcc @WomanCorn
Unfortunately not according to this article posted by Creative Commons about their licenses: https://creativecommons.org/2023/02/17/fair-use-training-generative-ai/
(Author is Associate Director of Research and Copyright Services for the University of Georgia School of Law and apparently not a member of CC, so not sure whether this is the organizational view or just one argument.)
@mcpinson @osma @mcc @WomanCorn
LLMs clearly violate any CC-BY license. The argument I have heard from some people is that training an LLM should be considered "fair use" and thus not covered by copyright at all. At least in the US....
@swaldman @osma @mcc @WomanCorn
That's what I got from CC's description of the CC BY licenses:
"This license enables reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator."
How LLM training isn't "remixing" or "building upon" the source, I just don't understand. But I'm no lawyer, so see the post linked in the other reply.
@mcpinson @mcc @WomanCorn 100% this. Nothing I release with CC-BY-* implies permission to train LLMs, since the fragments of my work that would show up in their output do not contain attribution. @osma, by whom is it "usually taken" that CC BY-SA implies LLM authorization? Do they have a legal basis for that assumption? Are they also releasing the model itself as SA, given that it is clearly a derived work?
@hisham
If there are any "fragments of your work" left in the output, they are very few and far between.
Take the recent Llama 3 8B model from Meta. It was trained on 15T tokens, around 100 terabytes (10^14 bytes) of text, including some written by you and me. The trained model can be downloaded as a set of files totalling around 16 GB (1.6 × 10^10 bytes). There's no way all that text can be compressed by nearly four orders of magnitude while retaining the original works within.
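The back-of-the-envelope arithmetic here can be checked directly. A minimal sketch, using the approximate figures from the post (~10^14 bytes of training text, ~16 GB of model weights); the byte counts are assumptions, not measured values:

```python
import math

# Rough figures from the post (assumptions, not exact measurements):
training_bytes = 1e14    # ~100 TB of training text (~15T tokens)
model_bytes = 1.6e10     # ~16 GB of downloadable model weights

# Ratio of training text to model size
ratio = training_bytes / model_bytes
print(f"compression ratio: {ratio:.0f}x")             # 6250x

# Expressed as orders of magnitude (log base 10)
print(f"orders of magnitude: {math.log10(ratio):.1f}")  # ~3.8
```

So under these assumed figures the model is roughly 6,250 times smaller than its training text, i.e. close to four orders of magnitude, which is the gap the post is pointing at.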
@osma
No one said _all_ the text is in there; copyleft licensing is about derived works. The Getty Images examples clearly show that one can identify derived works in model output.
Given the kind of code that LLMs produce, I am sure that, given a prompt related enough to my own OSS work, I'll find them parroting parts of my code to an extent that constitutes license infringement. See also: https://www.usenix.org/system/files/sec21-carlini-extracting.pdf
I recently saw a screenshot where ChatGPT explains to the questioner how fucking stupid his question is and that he should first learn the basics before writing shit like that.
Comment: "Wow, they are really using Stack Overflow answers!"
@WomanCorn @mcc All the major LLMs, at least the ones focused on coding tasks, are already trained on a huge amount of Stack Overflow discussions. This has nothing to do with their recent deal with OpenAI. SO is CC-licensed, good-quality text that is easily available, so everyone uses it.
@WomanCorn If the point of Stack Overflow is to be a block of programming-related text to sell to LLM companies, then it would actually be rational to ban LLM text, as it would poison the LLM inputs.