Summary

GitHub's Mario Rodriguez announced that, starting on April 24, 2026, interaction data (specifically inputs, outputs, code snippets, and associated context) from Copilot Free, Pro, and Pro+ users “will be used to train and improve our AI models unless they opt out. Copilot Business and Copilot Enterprise users are not affected by this update.”

 

Quotes

Quote

GitHub says that this change “aligns with established industry practices” and that the goal is to improve AI model performance for all users. “By participating, you’ll help our models better understand development workflows, deliver more accurate and secure code pattern suggestions, and improve their ability to help you catch potential bugs before they reach production,” Rodriguez says.

 

My thoughts

I'm not sure how I feel about this. I feel like this is just opening A.I. up to possibly learning to recode itself??? (I might have watched The Matrix one too many times.) But do we really want A.I. to read code if it is supposed to learn from it as well??? I believe coders in the open-source community who have code on GitHub probably aren't going to be too concerned by this, as the code is open to begin with. It'll be interesting to see how many will disable this feature in GitHub.

 

Sources

https://www.thurrott.com/dev/334253/github-to-train-ai-with-user-data-by-default

Main System: ASRock B560M-C, Intel Core i5 11600KF with a thermalright peerless assassin 120 cooler,  32GB DDR4 - 3200MT/s, 512GB NVMe SSD, 1TB SSD (for games), Intel ARC B580 GPU


16 minutes ago, cypalcurt said:

I'm not sure how I feel about this, I feel like this is just opening A.I. to possibly learn to recode itself??? (I might have watched the Matrix one too many times) But do we really want A.I to read code if it is supposed to learn from it as well??? Believe Coders in the Opensource community that have Code on Github aren't probably going to be too concerned by this as the code is open to begin with. It'll be interesting to see how many will disable this feature in Github.

LLMs can already be used to code. Maybe look into Claude Code. To be able to do so, the model has to be trained on existing source code.

 

That does not mean it'll suddenly start to rewrite itself. So far LLMs and other AIs have no agency. They need to be prompted to do anything.

Remember to either quote or @mention others, so they are notified of your reply


Time to create millions of bot accounts to upload viruses to github 

What the horse considers play, the monkey considers business...

But to Tom, it's all foolery. 

 

 

 

 


52 minutes ago, Eigenvektor said:

LLMs can already be used to code. Maybe look into Claude Code. To be able to do so, the model has to be trained on existing source code.

 

That does not mean it'll suddenly start to rewrite itself. So far LLMs and other AIs have no agency. They need to be prompted to do anything.

Like I said, I may have watched The Matrix one too many times lol. But at the same time, it's still learning from what we request or tell it. How do we know that it isn't becoming self-aware? (Shoot, I'm going down the hole again.) I don't feel comfortable with A.I. reading and learning from code if it could potentially (not saying it will) rewrite itself. As with every software option out there, we are the guinea pigs when it comes to developing A.I., and having A.I. learn how to code could be dangerous if not controlled.


2 minutes ago, danalog said:

Time to create millions of bot accounts to upload viruses to github 

Now that is scary and nothing to joke about. 


4 minutes ago, cypalcurt said:

Now that is scary and nothing to joke about. 

[GIF: nerd emoji speech bubble]


15 minutes ago, cypalcurt said:

But at the same time, it's still learning from what we request or tell it.

No. Once trained, an LLM's model is generally static; it does not evolve on its own. The company that created it may use user input to train the next version of its model, but that's not an automatic process. This is not "AI" in the sense of movies.


That's so cute. It's like they're trying to make us believe they haven't already used the entirety of Github to train their AI on. Now they're making it official and letting you know, probably to avoid lawsuits.

I mean, what do you think they trained their AIs on to be able to become code assistants? On Nvidia's proprietary drivers? On Windows' 'trade-secret' internal code? No, they did it on open-source code first, because it cost them close to nothing.

So I really doubt this is anything new. What's new is that they're making it official.

And yeah, it's gonna be fun when people push vibe-coded AI-slop onto GH and the AIs train on their own digital vomit, regurgitating forever.


3 hours ago, danalog said:

Time to create millions of bot accounts to upload viruses to github 

Then they learn to make the *perfect* virus... just as planned!? 👀

 

2 hours ago, TudorFinalBosz said:

That's so cute. It's like they're trying to make us believe they haven't already used the entirety of Github to train their AI on.

Yup......

The direction tells you... the direction

-Scott Manley, 2021

 

 


Having seen the "Quality" of AI in the real world, I don't trust it enough to train a speck of dust into being a better speck of dust.

As to why companies are so interested, they will be able to lay off even more people, lowering their business costs and thus being able to increase the bonuses paid to everyone in the C-Suite.


9 hours ago, danalog said:

Time to create millions of bot accounts to upload viruses to github 

Hmm, get AI to code a virus for itself... 🤔

I have dyslexia plz be kind to me. dont like my post dont read it or respond thx

also i edit post alot because you no why...

Thrasher_565 hub links build logs

 


This has been known for ages: GitHub Copilot was trained on every public repository on GitHub. And no one can complain, because most OSS licenses people use, including the GNU GPL, allow this. It's just that, with privacy laws being updated for the AI wave, GitHub now has to ask permission.

 

You should probably read it: https://www.gnu.org/licenses/gpl-3.0.en.html

 

There really isn't any news here. And anyone who thought GitHub wasn't already doing this, or couldn't, is just stupid.

Web developer


2 hours ago, tagKnife said:

 

There really isn't any news here. And anyone who thought GitHub wasn't before and couldn't are just stupid.

Don't be rude. Not everyone has the time to look up every single license agreement that is out there. I realize open-source code is open for others to play with and modify. The question/concern is: should we allow AI to review all these types of code? As mentioned, it wouldn't be hard now for a random person to go to an AI and ask "can you create a replica of this type of software and embed this malware?" (unless these AIs have safeguards to prevent it). That's my concern... I get that those types of things are already out there, but IDK... it's an area that IMO is scary. Coders create code at a different pace than AI does, of course, and we could become overwhelmed with malware code.


21 hours ago, cypalcurt said:

I'm not sure how I feel about this, I feel like this is just opening A.I. to possibly learn to recode itself???

Sigh... "recode itself" how? These are giant weight blobs that, through a series of matrix multiplications, take an input and give you an output (deterministic, apart from any sampling randomness deliberately added on top). A model can't reproduce itself due to output size limitations, let alone independently produce a changed or improved weight blob, given that would require far more information to be held and processed than fits in an extremely limited context window. Even if it could, at most it would be a slightly more statistically accurate language generator. The actual "code" part of an LLM is relatively small, and it's quite irrelevant whether an LLM could hypothetically generate that autonomously, especially since binary data can just be copied without any coding skills.

 

These things do not have a memory. Every time you prompt a chatbot, the *human coded* infrastructure around it feeds it back the entire relevant backlog of your conversation, because they are stateless machines that remember nothing; your request may not even be routed to the same instance as the previous one. They cannot evaluate the accuracy or relevance of their own output for potential improvement, and they cannot affect themselves or your system unless you specifically build infrastructure around them that acts based on their text output, as in the case of "agents".
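
To make the "stateless" point concrete, here is a rough Python sketch. Everything in it is made up for illustration: `stateless_model` is a toy stand-in for an LLM (not any real API), and `ChatSession` plays the role of the human-coded infrastructure that owns the history and resends it on every turn.

```python
from dataclasses import dataclass, field


def stateless_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: a pure function of its input.

    It keeps nothing between calls, so anything it "knows" about the
    conversation has to be inside `prompt`.
    """
    user_turns = [line for line in prompt.splitlines() if line.startswith("user: ")]
    return f"I can see {len(user_turns)} user message(s) in this prompt."


@dataclass
class ChatSession:
    """The wrapper code that owns the conversation history (hypothetical)."""

    history: list[str] = field(default_factory=list)

    def send(self, message: str) -> str:
        self.history.append(f"user: {message}")
        # Rebuild the full transcript on every turn; the model never stores it.
        prompt = "\n".join(self.history)
        reply = stateless_model(prompt)
        self.history.append(f"assistant: {reply}")
        return reply


session = ChatSession()
first = session.send("hello")
second = session.send("do you remember me?")

# Calling the model directly, without the wrapper, shows it remembers nothing.
fresh = stateless_model("user: do you remember me?")

print(first)
print(second)
print(fresh)
```

The second reply only "remembers" the first message because the wrapper pasted it back into the prompt; the direct call at the end sees a single message.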

21 hours ago, cypalcurt said:

from Copilot Free, Pro, and Pro+ users

This is quite different from "github users" in general as your title implies...

8 hours ago, 05032-Mendicant-Bias said:

I don't believe for a second that github was left out of scraping. I'm 100% certain all models scraped it a thousand time over.

This isn't about scraping, it's about taking your otherwise private code and queries that are being fed to copilot. To be expected for sure, but not the same thing as scraping public repos.

Don't ask to ask, just ask... please 🤨

sudo chmod -R 000 /*


28 minutes ago, Sauron said:

 

These things do not have a memory (every time you prompt a chatbot the *human coded* infrastructure around it feeds it back the entire relevant backlog of your conversation because they are stateless machines that remember nothing, your request may not even be routed to the same instance as the previous one), they cannot evaluate the accuracy or relevance of their own output for potential improvement, they cannot affect themselves or your system unless you specifically build infrastructure around them that acts based on their text output as in the case of "agents".

This is quite different from "github users" in general as your title implies...

This isn't about scraping, it's about taking your otherwise private code and queries that are being fed to copilot. To be expected for sure, but not the same thing as scraping public repos.

I think you're missing the point of why these companies are building massive DCs and why there's a shortage of memory chips. It's for A.I. to store data so it can learn to improve its responses to people's requests. They are holding some form of data in memory, so the model has it for a similar request, even if it only references sites under a category. People need to understand something: the people building this stuff grew up watching sci-fi shows and going "I'm going to build that..." And most of these ideas come from those who watched Star Trek...


1 hour ago, cypalcurt said:

I think you're missing the point of why these companies are building Massive DCs and the shortage of Memory chips. It's for A.I. to store data to learn to improve people's requests. They are holding some form of data in its memory, so it has it for a similar request even if it references to sites under a category. People need to understand something, people building this stuff grew up watching Sci-fi shows and going "I'm going to build that..." And most of these ideas come from those who watched Star Trek...

Training an LLM and running an LLM (inferencing) require a ton of memory and storage. That's where the capacity is going.

 

It has to hold data for the task it is currently working on in RAM/VRAM, of course. But once that task is finished, that is gone. It has no long term memory and isn't capable of self-improvement.

 

Just because Star Trek has given you an idea doesn't mean it's possible or practical to actually build it.

 

~edit: here's something you might want to watch. It'll give you some idea where things are headed:

 


I used to think that having access to the entire GitHub for training would make LLMs good at coding.

Then I paused and thought about all the software I use. And I immediately concluded this is a terrible idea: we should train the models on a very, very curated set of source code, and explicitly leave out most of the code out there. It's not just about not training on bad code: statistically, bad code will be the most likely way to code, and the statistically most likely answer is what an LLM relies on.

 

23 hours ago, danalog said:

Time to create millions of bot accounts to upload viruses to github 

That's funny, but now think about all the code not really meant to be harmful, but harmful nonetheless (whether due to bad security or bad performance), that is already up in GitHub...


48 minutes ago, Eigenvektor said:

 

 

~edit: here's something you might want to watch. It'll give you some idea where things are headed:

 

2 hours..... I better get a coffee...


20 minutes ago, SpaceGhostC2C said:

And I immediately concluded this is a terrible idea: we should train the models on a very, very curated set of source code, and explicitly leave out most of the code out there.

Any code, good or bad, can be valuable for training, provided it is flagged appropriately. Just because GitHub has access to everything doesn't mean they're just going to feed it into an LLM as is.

 

Code that doesn't work or is bad practice can be valuable, provided you know it is bad (and why). Here's a good way to solve this problem, here is a bad way to solve this problem. This approach is better in this case, that approach is better in that case. Here are the tradeoffs. An LLM that is trained to recognize bad code and is able to suggest improvements is likely much more useful than one that is only able to produce new code.
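
As a rough illustration of what "flagged appropriately" could look like, here is a small Python sketch. The `LabeledSnippet` record and `review_pairs` helper are hypothetical names invented for this example; nothing here is claimed to be how GitHub or anyone else actually prepares training data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledSnippet:
    """A code sample plus a quality flag and the reason for it (hypothetical)."""

    task: str
    code: str
    is_good: bool
    reason: str


def review_pairs(snippets):
    """Pair each flagged-bad snippet with a good snippet for the same task."""
    good_by_task = {s.task: s for s in snippets if s.is_good}
    return [
        (s, good_by_task[s.task])
        for s in snippets
        if not s.is_good and s.task in good_by_task
    ]


snippets = [
    LabeledSnippet(
        task="build SQL query",
        code='query = "SELECT * FROM users WHERE id = " + user_id',
        is_good=False,
        reason="string concatenation allows SQL injection",
    ),
    LabeledSnippet(
        task="build SQL query",
        code='cursor.execute("SELECT * FROM users WHERE id = ?", (user_id,))',
        is_good=True,
        reason="parameterized query",
    ),
    LabeledSnippet(
        task="read file",
        code="open(path).read()",
        is_good=False,
        reason="file handle is never closed explicitly",
    ),
]

pairs = review_pairs(snippets)
for bad, good in pairs:
    print(f"{bad.task}: avoid because {bad.reason}; prefer: {good.code}")
```

The bad example only becomes useful once it carries its label and reason; the unlabeled "read file" snippet has no good counterpart, so it produces no pair.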


1 hour ago, Eigenvektor said:

 

 

40 mins into the Video, Jensen - "Computer costs are going up" - Hmmm I wonder why?????


3 hours ago, cypalcurt said:

I think you're missing the point of why these companies are building Massive DCs and the shortage of Memory chips. It's for A.I. to store data to learn to improve people's requests. They are holding some form of data in its memory, so it has it for a similar request even if it references to sites under a category.

No. RAM is used to hold the models for inference. They certainly don't have several GB of context window, and either way, once again, the context window is not part of the model. The model is completely static and stateless, and the automation around it feeds it all the context from scratch for each interaction, with some clever optimization to make it """remember""" longer conversations by distilling the most relevant facts.

 

The part about similar requests is called caching, which is used and does occupy some (in relative terms very little) memory on the servers but it is not part of the LLMs themselves, it's just the surrounding automation.
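
A deliberately simplified Python sketch of that idea, assuming an exact-match cache keyed on the normalized prompt. `CachedModel` and `slow_model` are made-up names, and real serving stacks use more elaborate schemes; the point is only that the cache lives in the wrapper, not in the model.

```python
import hashlib


def slow_model(prompt: str) -> str:
    """Hypothetical stand-in for an expensive, stateless model call."""
    return f"answer to: {prompt.strip().lower()}"


class CachedModel:
    """Wrapper that remembers answers; the model itself stays unchanged."""

    def __init__(self, model):
        self._model = model
        self._cache: dict[str, str] = {}
        self.model_calls = 0

    def ask(self, prompt: str) -> str:
        # Normalize so trivially different spellings share one cache entry.
        normalized = prompt.strip().lower()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.model_calls += 1
            self._cache[key] = self._model(prompt)
        return self._cache[key]


cached = CachedModel(slow_model)
a = cached.ask("What is a context window?")
b = cached.ask("  what is a context window?  ")
c = cached.ask("What is inference?")
print(cached.model_calls)
```

Throwing away the `CachedModel` instance loses every stored answer while `slow_model` behaves exactly as before, which is the sense in which the cache is surrounding automation rather than part of the LLM.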

3 hours ago, cypalcurt said:

People need to understand something, people building this stuff grew up watching Sci-fi shows and going "I'm going to build that..." And most of these ideas come from those who watched Star Trek...

That's what they want you to think because it sounds cooler and it makes for great marketing, but in reality "ai" as it exists right now has nothing to do with what has been historically depicted in sci-fi.


1 hour ago, Eigenvektor said:

Just because GitHub has access to everything doesn't mean they're just going to feed it into an LLM as is.

Yep, it does and they will. 

 

1 hour ago, Eigenvektor said:

 

Code that doesn't work or is bad practice can be valuable, provided you know it is bad (and why). Here's a good way to solve this problem, there's is a bad way to solve this problem. This approach is better in this case, that approach is better in that case.

Sure. With correctly, manually chosen success and failure outcomes, ML training benefits from failure examples as much as from success examples.

This ain't it. The database is huge. No human, especially no sufficiently good human (how many bad-practice devs would flag their bad-practice code as a good solution?) will go through it - that's not how LLMs are trained. The bar is to produce credibly human answers, so human-generated content is always a success case. At best there will be a requirement of being able to compile it or something. Don't forget the second L in LLM: they produce code as speech.

 


15 hours ago, SpaceGhostC2C said:

Then I paused and thought about the all the software I use. And I immediately concluded this is a terrible idea: we should train the models on a very, very curated set of source code, and explicitly leave out most of the code out there.

Internet data is almost useless. In order to converge, the models already have to correlate what's good from what's bad.

 

The current trend is to use bad internet data to make a large model, then to align the model into doing something decent. The large model then creates a curated set of data and is used to prune and relabel the training data, with reinforcement learning added for some domains. Finally, the large model is distilled and fine-tuned into a much smaller model that retains most of its capability.
