

I agree. However, I think the natural conclusion is an LLM ban. See also here.


If you had a contributor who plagiarized at a rate of 2-10%, would you really say “eh, it has to have a degree of novelty to be a problem” rather than just ban them? The different standards baffle me sometimes.
You can find various rates mentioned here: https://dl.acm.org/doi/10.1145/3543507.3583199 and here: https://www.theatlantic.com/technology/2026/01/ai-memorization-research/685552/


Which I’m guessing they cannot honestly attest to, if LLMs truly have the 2-10% plagiarism rate that multiple studies seem to claim. It’s an absurd rule, if you ask me. (Not that I would know; I’m not a lawyer.)


Would you also say that to this lawyer reviewing Co-Pilot in 2026? https://github.com/mastodon/mastodon/issues/38072#issuecomment-4105681567
Disclaimer: this isn’t legal advice.


If the accountability cannot be practically fulfilled, the reasonable policy becomes a ban.
What good is it to say “oh yeah you can submit LLM code, if you agree to be sued for it later instead of us”? I’m not a lawyer and this isn’t legal advice, but sometimes I feel like that’s what the Linux Foundation policy says.


If you would have written it yourself the same way, why not write it yourself? (And there was autocomplete before the age of LLMs, anyway.)
The big problems start with situations where it doesn’t match what you would have written, but rather what somebody else has written, character by character.


It’s less extremist if you look at how easily these LLMs will just plagiarize 1:1, apparently:
https://github.com/mastodon/mastodon/issues/38072#issuecomment-4105681567
Some define “AI slop” purely by the immediate problems they can identify right away.
Many others see “AI slop” as bringing many more problems beyond the immediate ones. From that view, seeing LLM output as anything but slop becomes difficult.


Whatever it is, it doesn’t mean LLMs are a sane or “inevitable” answer.


Ultimately, the policy legally anchors every single line of AI-generated code
How would that even be possible? Given the state of things:
https://dl.acm.org/doi/10.1145/3543507.3583199
Our results suggest that […] three types of plagiarism widely exist in LMs beyond memorization, […] Given that a majority of LMs’ training data is scraped from the Web without informing content owners, their reiteration of words, phrases, and even core ideas from training sets into generated texts has ethical implications. Their patterns are likely to exacerbate as both the size of LMs and their training data increase, […] Plagiarized content can also contain individuals’ personal and sensitive information.
https://www.theatlantic.com/technology/2026/01/ai-memorization-research/685552/
Four popular large language models—OpenAI’s GPT, Anthropic’s Claude, Google’s Gemini, and xAI’s Grok—have stored large portions of some of the books they’ve been trained on, and can reproduce long excerpts from those books. […] This phenomenon has been called “memorization,” and AI companies have long denied that it happens on a large scale. […] The Stanford study proves that there are such copies in AI models, and it is just the latest of several studies to do so.
The court confirmed that training large language models will generally fall within the scope of application of the text and data mining barriers, […] the court found that the reproduction of the disputed song lyrics in the models does not constitute text and data mining, as text and data mining aims at the evaluation of information such as abstract syntactic regulations, common terms and semantic relationships, whereas the memorisation of the song lyrics at issue exceeds such an evaluation and is therefore not mere text and data mining
https://www.sciencedirect.com/science/article/pii/S2949719123000213#b7
In this work we explored the relationship between discourse quality and memorization for LLMs. We found that the models that consistently output the highest-quality text are also the ones that have the highest memorization rate.
https://arxiv.org/abs/2601.02671
recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question if similar extraction is feasible for production LLMs, given the safety measures […]. We investigate this question […] our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs.
How does merely tagging the apparently stolen content make it any less problematic, given that I’m guessing it still won’t carry any attribution of the actual source (which, for all we know, might often even be GPL-incompatible)?
But I’m not a lawyer, so what do I know. Even from a non-legal angle, though, what is this road the Linux Foundation seems to embrace of just ignoring the licenses of other projects? Why even have the kernel be GPL then, rather than CC0?
I don’t get it. And the article calling this “pragmatism” seems absurd to me.


Perhaps it’s just me, but to me this article feels like belittling the problem by not differentiating between “hated” products and “harmful” products.
If a company makes you work on something that is hated, it’s fair and good to have sympathy. But if a company makes you work on something that is harmful or unethical, as many perceive Co-Pilot to be, then an article about receiving user hate that doesn’t talk about ethics at all feels a little tone-deaf.
I don’t know, perhaps that’s just me. I certainly don’t envy the writer for being employed to work on it.
So do you want to legally review every line produced by an LLM to see whether it meets the fair-use criteria, since you have to assume it was probably stolen? And would you do this for a known plagiarizing human contributor too…?