Hard to imagine a more intense signal that a website is a rugpull than banning users for trying to delete their own posts
Like just incredible "burning the future to power the present" energy here
Earlier today I edited my (small) set of Stack Overflow posts to add the sentence "I do not consent to my words being used to train OpenAI" to the end. Within hours, all these edits were reversed and I got a warning email for "removing or defacing content". I did not remove any content. If this small sentence is "defacing", it is a very minor defacement. In no way was the experience of other users made worse by me adding one sentence.
To Stack Overflow, you are not a person. You are "content".
@mcc @WomanCorn can they detect it reliably enough, though? If one can't delete answers, one can always poison the well with LLM generated answers.
@un_ouragan @mcc @WomanCorn And I guess they angered a lot of people with the skills to do exactly that...
@mcc
@WomanCorn That's exactly what they've done. https://stackoverflow.com/help/gen-ai-policy
As noted above, all content published on SO is available under the CC BY-SA license, which is usually taken to mean that training LLMs is permitted. https://stackoverflow.com/help/licensing
@osma @mcc @WomanCorn
Under a CC BY-SA license, an LLM that uses your SO posts in its output (whether quoted directly, remixed, or adapted) has to give you attribution.
*edit: apparently even Creative Commons says this is "Fair Use" and so does not restrict LLM use of your SO posts at all.
Does any LLM provide a list of references with each answer it gives?
@mcpinson @osma @WomanCorn If the LLM were designed this way, no one would use it. LLMs don't produce attractive prose and they don't produce accurate answers. From this, I conclude copyright laundering is the product's primary and maybe sole value proposition.
@mcpinson @osma @mcc @WomanCorn no, it doesn’t
OpenAI’s argument is that they don’t need your permission to train their LLMs on your content, CC or not, because doing so (they argue) is fair use. We’ll see if the courts agree (a bunch of big companies are suing them).
@mcpinson @osma @mcc @WomanCorn
Unfortunately not according to this article posted by Creative Commons about their licenses: https://creativecommons.org/2023/02/17/fair-use-training-generative-ai/
(Author is Associate Director of Research and Copyright Services for the University of Georgia School of Law and apparently not a member of CC, so not sure whether this is the organizational view or just one argument.)
@mcpinson @osma @mcc @WomanCorn
LLMs clearly violate any CC-BY license. The argument I have heard from some people is that training an LLM should be considered "fair use" and thus not covered by copyright at all. At least in the US....
@swaldman @osma @mcc @WomanCorn
That's what I got from CC's description of the CC BY licenses:
"This license enables reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator."
How LLM training isn't "remixing" or "building upon" the source, I just don't understand. But I'm no lawyer, so see the post linked in the other reply.
@mcpinson @mcc @WomanCorn 100% this. Nothing I release with CC-BY-* implies permission to train LLMs, since the fragments of my work that would show up in their output do not contain attribution. @osma, by whom is it "usually taken" that CC BY-SA implies LLM authorization? Do they have a legal basis for that assumption? Are they also releasing the model itself as SA, given that it is clearly a derived work?
@hisham
If there are any "fragments of your work" left in the output, they are very few and far between.
Take the recent Llama 3 8B model from Meta. It was trained on 15T tokens, around 100 terabytes (10^14 bytes) of text, including some written by you and me. The trained model can be downloaded as a set of files totalling around 16 GB (1.6 × 10^10 bytes). There's no way all that text can be compressed by nearly four orders of magnitude while retaining the original works within.
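The back-of-the-envelope arithmetic here can be checked directly. A minimal sketch, using the approximate figures from the post (~10^14 bytes of training text, ~16 GB of model weights); the byte counts are assumptions, not measured values:

```python
import math

# Rough figures from the post (assumptions, not exact measurements):
training_bytes = 1e14    # ~100 TB of training text (~15T tokens)
model_bytes = 1.6e10     # ~16 GB of downloadable model weights

# Ratio of training text to model size
ratio = training_bytes / model_bytes
print(f"compression ratio: {ratio:.0f}x")             # 6250x

# Expressed as orders of magnitude (log base 10)
print(f"orders of magnitude: {math.log10(ratio):.1f}")  # ~3.8
```

So under these assumed figures the model is roughly 6,250 times smaller than its training text, i.e. close to four orders of magnitude, which is the gap the post is pointing at.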
@osma
No one said _all_ the text is in there; copyleft licensing is about derived works. The Getty Images examples clearly show that one can identify derived works in model output.
Given the kind of code that LLMs produce, I am sure that, given a prompt related enough to my own OSS work, I'll find them parroting parts of my code to an extent that constitutes license infringement. See also: https://www.usenix.org/system/files/sec21-carlini-extracting.pdf
I recently saw a screenshot where ChatGPT explains to the questioner how fucking stupid his question is and that he should first learn the basics before writing shit like that.
Comment: "Wow, they are really using Stack Overflow answers!"
@WomanCorn @mcc All the major LLMs, at least the ones focused on coding tasks, are already trained on a huge amount of Stack Overflow discussions. This has nothing to do with their recent deal with OpenAI. SO is CC-licensed, good-quality text that is easily available, so everyone uses it.
@WomanCorn If the point of Stack Overflow is to be a block of programming-related text to sell to LLM companies, then it would actually be rational to ban LLM text, as it would poison the LLM inputs.