GitHub to Train AI with User Data by Default

cypalcurt · March 26

Summary

GitHub's Mario Rodriguez announced that starting on April 24, 2026 that interaction data—specifically inputs, outputs, code snippets, and associated context—from Copilot Free, Pro, and Pro+ users will be used to train and improve our AI models unless they opt out. Copilot Business and Copilot Enterprise users are not affected by this update.”

Quotes

Quote

GitHub says that this change “aligns with established industry practices” and that the goal is to improve AI model performance for all users. By participating, you’ll help our models better understand development workflows, deliver more accurate and secure code pattern suggestions, and improve their ability to help you catch potential bugs before they reach production,” Rodriguez says.

My thoughts

I'm not sure how I feel about this, I feel like this is just opening A.I. to possibly learn to recode itself??? (I might have watched the Matrix one too many times) But do we really want A.I to read code if it is supposed to learn from it as well??? Believe Coders in the Opensource community that have Code on Github aren't probably going to be too concerned by this as the code is open to begin with. It'll be interesting to see how many will disable this feature in Github.

Sources

https://www.thurrott.com/dev/334253/github-to-train-ai-with-user-data-by-default

Eigenvektor · March 26

16 minutes ago, cypalcurt said:

I'm not sure how I feel about this, I feel like this is just opening A.I. to possibly learn to recode itself??? (I might have watched the Matrix one too many times) But do we really want A.I to read code if it is supposed to learn from it as well??? Believe Coders in the Opensource community that have Code on Github aren't probably going to be too concerned by this as the code is open to begin with. It'll be interesting to see how many will disable this feature in Github.

LLMs can already be used to code. Maybe look into Claude Code. To be able to do so, the model has to be trained on existing source code.

That does not mean it'll suddenly start to rewrite itself. So far LLMs and other AIs have no agency. They need to be prompted to do anything.

danalog · March 26

Time to create millions of bot accounts to upload viruses to github

cypalcurt · March 26

52 minutes ago, Eigenvektor said:

LLMs can already be used to code. Maybe look into Claude Code. To be able to do so, the model has to be trained on existing source code.

That does not mean it'll suddenly start to rewrite itself. So far LLMs and other AIs have no agency. They need to be prompted to do anything.

Like I said I may have watched the Matrix one too many times lol. But at the same time, it's still learning from what we request or tell it. How do we know that it isn't becoming self-aware (shoot I'm going down the hole again). I don't feel comfortable with A.I. reading/learning that could potentially (not saying it will) rewrite itself. Like ever software option out there we are the guinea pigs when it comes to developing A.I. and having A.I learn how to code could be dangerous if not controlled.

cypalcurt · March 26

2 minutes ago, danalog said:

Time to create millions of bot accounts to upload viruses to github

Now that is scary and nothing to joke about.

danalog · March 26

4 minutes ago, cypalcurt said:

Now that is scary and nothing to joke about.

Eigenvektor · March 26

15 minutes ago, cypalcurt said:

But at the same time, it's still learning from what we request or tell it.

No. Once trained an LLM's model is generally static, it does not evolve on its own. The company who created it may use user input to train the next version of its model, but that's not an automatic process. It is not evolving on its own. This is not "AI" in the sense of movies.

TudorFinalBosz · March 26

That's so cute. It's like they're trying to make us believe they haven't already used the entirety of Github to train their AI on. Now they're making it official and letting you know, probably to avoid lawsuits.

I mean, what do you think they trained their AIs on to be able to become code assistants? On Nvidia's proprietary drivers? On Windows' 'trade-secret' internal code? No, they did it on open-source code first, because it cost them close to nothing.

So I really doubt this is anything new. What's new is that they're making it official.

And yeah, it's gonna be fun when people push vibe-coded AI-slop onto GH and the AIs train on their own digital vomit, regurgitating forever.

Mark Kaine · March 26

3 hours ago, danalog said:

Time to create millions of bot accounts to upload viruses to github

Then they learn to make the *perfect* virus... just as planned!?

2 hours ago, TudorFinalBosz said:

That's so cute. It's like they're trying to make us believe they haven't already used the entirety of Github to train their AI on.

Yup......

Thomas53 · March 27

Having seen the "Quality" of AI in the real world, I don't trust it to enough to train a speck of dust into being a better speck of dust.

As to why companies are so interested, they will be able to lay off even more people, lowering their business costs and thus being able to increase the bonuses paid to everyone in the C-Suite.

thrasher_565 · March 27

9 hours ago, danalog said:

Time to create millions of bot accounts to upload viruses to github

hmm get ai to code a virus for its self...

05032-Mendicant-Bias · March 27

I don't believe for a second that github was left out of scraping. I'm 100% certain all models scraped it a thousand time over.

tagKnife · March 27

This has been known for ages, GitHub copilot was trained on every public repository on GitHub. And no one can complain because most OSS licenses people use, including GNU GPL, allows this. Its just now due to privacy laws updating for the AI wave, GitHub now has to ask permission now.

You should probably read it: https://www.gnu.org/licenses/gpl-3.0.en.html

There really isn't any news here. And anyone who thought GitHub wasn't before and couldn't are just stupid.

cypalcurt · March 27

2 hours ago, tagKnife said:

There really isn't any news here. And anyone who thought GitHub wasn't before and couldn't are just stupid.

Don't be rude, Not everyone has the time to look up every single License agreement that is out there. I realize opensource code is open for others to play with and modify. The question/concern is should we allow AI to review all these types of code? As mention, it wouldn't be hard now for a random now to go in AI and ask "can you create a replicate of this type of software and embed this malware." (unless these AI's have safe guards to prevent it) That's my concern... I get there are those types of things already out there, but IDK... it's an area that IMO is scary. Having coders create code is a different pace of course compared to AI, and we could become overwhelmed with Malware Code.

Sauron · March 27

21 hours ago, cypalcurt said:

I'm not sure how I feel about this, I feel like this is just opening A.I. to possibly learn to recode itself???

Sigh... "recode itself" how? These are giant weight blobs that through a series of convolutions take an input and give you a deterministic output. It can't reproduce itself due to output size limitations, let alone independently produce a changed or improved weight blob given that would require exponentially more information to be held and processed in an extremely limited context window... and even if it could, at most it would be a slightly more statistically accurate language generator. The actual "code" part of an LLM is relatively small and it's quite irrelevant whether an LLM could hypothetically generate that autonomously, especially since binary data can just be copied without any coding skills.

These things do not have a memory (every time you prompt a chatbot the *human coded* infrastructure around it feeds it back the entire relevant backlog of your conversation because they are stateless machines that remember nothing, your request may not even be routed to the same instance as the previous one), they cannot evaluate the accuracy or relevance of their own output for potential improvement, they cannot affect themselves or your system unless you specifically build infrastructure around them that acts based on their text output as in the case of "agents".

21 hours ago, cypalcurt said:

from Copilot Free, Pro, and Pro+ users

This is quite different from "github users" in general as your title implies...

8 hours ago, 05032-Mendicant-Bias said:

I don't believe for a second that github was left out of scraping. I'm 100% certain all models scraped it a thousand time over.

This isn't about scraping, it's about taking your otherwise private code and queries that are being fed to copilot. To be expected for sure, but not the same thing as scraping public repos.

cypalcurt · March 27

28 minutes ago, Sauron said:

These things do not have a memory (every time you prompt a chatbot the *human coded* infrastructure around it feeds it back the entire relevant backlog of your conversation because they are stateless machines that remember nothing, your request may not even be routed to the same instance as the previous one), they cannot evaluate the accuracy or relevance of their own output for potential improvement, they cannot affect themselves or your system unless you specifically build infrastructure around them that acts based on their text output as in the case of "agents".

This is quite different from "github users" in general as your title implies...

This isn't about scraping, it's about taking your otherwise private code and queries that are being fed to copilot. To be expected for sure, but not the same thing as scraping public repos.

I think you're missing the point of why these companies are building Massive DCs and the shortage of Memory chips. It's for A.I. to store data to learn to improve people's requests. They are holding some form of data in its memory, so it has it for a similar request even if it references to sites under a category. People need to understand something, people building this stuff grew up watching Sci-fi shows and going "I'm going to build that..." And most of these ideas come from those who watched Star Trek...

Eigenvektor · March 27

1 hour ago, cypalcurt said:

I think you're missing the point of why these companies are building Massive DCs and the shortage of Memory chips. It's for A.I. to store data to learn to improve people's requests. They are holding some form of data in its memory, so it has it for a similar request even if it references to sites under a category. People need to understand something, people building this stuff grew up watching Sci-fi shows and going "I'm going to build that..." And most of these ideas come from those who watched Star Trek...

Training an LLM and running an LLM (inferencing) require a ton of memory and storage. That's where the capacity is going.

It has to hold data for the task it is currently working on in RAM/VRAM, of course. But once that task is finished, that is gone. It has no long term memory and isn't capable of self-improvement.

Just because Star Trek has given you an idea doesn't mean it's possible or practical to actually build it.

~edit: here's something you might want to watch. It'll give you some idea where things are headed:

SpaceGhostC2C · March 27

I used to think that having access to the entire GitHub for training would make LLMs good at coding.

Then I paused and thought about the all the software I use. And I immediately concluded this is a terrible idea: we should train the models on a very, very curated set of source code, and explicitly leave out most of the code out there. It's not just about not training on bad code: statistically, bad code will be the most likely way to code, which is what an LLM relies on.

23 hours ago, danalog said:

Time to create millions of bot accounts to upload viruses to github

That's funny, but now think about all the code not really meant to be harmful, but harmful nonetheless (whether due to bad security or bad performance), that is already up in GitHub...

cypalcurt · March 27

48 minutes ago, Eigenvektor said:

~edit: here's something you might want to watch. It'll give you some idea where things are headed:

2 hours..... I better get a coffee...

Eigenvektor · March 27

20 minutes ago, SpaceGhostC2C said:

And I immediately concluded this is a terrible idea: we should train the models on a very, very curated set of source code, and explicitly leave out most of the code out there.

Any code, good or bad, can be valuable for training, provided it is flagged appropriately. Just because GitHub has access to everything doesn't mean they're just going to feed it into an LLM as is.

Code that doesn't work or is bad practice can be valuable, provided you know it is bad (and why). Here's a good way to solve this problem, there's is a bad way to solve this problem. This approach is better in this case, that approach is better in that case. Here are the tradeoffs. An LLM that is trained to recognize bad code and is able to suggest improvements is likely much more useful than one that is only able to produce new code.

cypalcurt · March 27

1 hour ago, Eigenvektor said:

40 mins into the Video, Jensen - "Computer costs are going up" - Hmmm I wonder why?????

Sauron · March 27

3 hours ago, cypalcurt said:

I think you're missing the point of why these companies are building Massive DCs and the shortage of Memory chips. It's for A.I. to store data to learn to improve people's requests. They are holding some form of data in its memory, so it has it for a similar request even if it references to sites under a category.

No. Ram is used to hold the models for inference. They certainly don't have several gb of context window, and either way once again the context window is not tied to the model. The model is completely static and stateless and the automation around it feeds it all the context from scratch for each interaction, with some clever optimization to make it """remember""" longer conversations by distilling the most relevant facts.

The part about similar requests is called caching, which is used and does occupy some (in relative terms very little) memory on the servers but it is not part of the LLMs themselves, it's just the surrounding automation.

3 hours ago, cypalcurt said:

People need to understand something, people building this stuff grew up watching Sci-fi shows and going "I'm going to build that..." And most of these ideas come from those who watched Star Trek...

That's what they want you to think because it sounds cooler and it makes for great marketing, but in reality "ai" as it exists right now has nothing to do with what has been historically depicted in sci-fi.

SpaceGhostC2C · March 27

1 hour ago, Eigenvektor said:

Just because GitHub has access to everything doesn't mean they're just going to feed it into an LLM as is.

Yep, it does and they will.

1 hour ago, Eigenvektor said:

Code that doesn't work or is bad practice can be valuable, provided you know it is bad (and why). Here's a good way to solve this problem, there's is a bad way to solve this problem. This approach is better in this case, that approach is better in that case.

Sure. Correctly, manually chosen success and fail outcomes for ML training benefits for success examples as much as from failure examples.

This ain't it. The database is huge. No human, especially no sufficiently good human (how many bad-practice devs would flag their bad-practice code as a good solution?) will go through it - that's not how LLMs are trained. The bar is to produce credibly human answers, so human-generated content is always a success case. At best there will be a requirement of being able to compile it or something. Don't forget the second L in LLM: they produce code as speech.

Beskamir · March 28

I considered turning this feature off or removing all my repos, but you know what. Let them train on my code. it'll be more of a handicap than a benefit!

05032-Mendicant-Bias · March 28

15 hours ago, SpaceGhostC2C said:

Then I paused and thought about the all the software I use. And I immediately concluded this is a terrible idea: we should train the models on a very, very curated set of source code, and explicitly leave out most of the code out there.

Internet data is almost useless. In order to converge, the models already have to correlate what's good from what's bad.

Current trend is to use bad internet data to make a large model, than to align the model into doing something decent, the large model creates a curated set of data and prune and relabel the training data, and for some domains reinforcement learning. Finally the large model is distilled and fine tuned into a much smaller model that retains most of its capacity.

Sign In

GitHub to Train AI with User Data by Default

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites