
US lawmaker proposes a public database of all AI training material used by AI models.

Summary

 A US congressman has introduced a bill that would retroactively require generative AI models to have their training data sources disclosed. 

 

Quotes


The Generative AI Disclosure Act "would require a notice to be submitted to the Register of Copyrights prior to the release of a new generative AI system with regard to all copyrighted works used in building or altering the training dataset for that system," Schiff said in a press release.

 

The bill is retroactive and would apply to all AI systems available today, as well as to all AI systems to come. It would take effect 180 days after it's enacted, requiring anyone who creates or alters a training set not only to list works referenced by the dataset, but also to provide a URL to the dataset within 30 days before the AI system is released to the public. That URL would presumably give creators a way to double-check if their materials have been used and seek any credit or compensation available before the AI tools are in use.
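As a rough illustration only, the notice described above might boil down to a record like the following sketch. Every field name here is an assumption for illustration, not language from the bill:

```python
# Hypothetical sketch of the disclosure notice described above: a record
# listing copyrighted works used in a training set, plus a URL where the
# dataset manifest could be inspected before the system's release.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DisclosureNotice:
    system_name: str
    dataset_url: str          # URL creators could check before release
    filing_date: date         # must precede public release by 30 days
    works: list[str] = field(default_factory=list)  # copyrighted works used

notice = DisclosureNotice(
    system_name="ExampleLLM-1",
    dataset_url="https://example.com/training-manifest",
    filing_date=date(2024, 6, 1),
    works=["Example Novel (1998)", "Example Photo Archive"],
)
```

The interesting open question is what goes in `works` when the dataset is a web crawl rather than a curated list.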

 

My thoughts

 This is nothing but good, IMO. If we start requiring AI models to disclose what data they have ingested, we will get better-quality models that can be checked for bias, and it will highlight which models are likely to produce lawsuit-bait output because they deliberately scraped/ripped non-free material from commercial websites and other UGC sources. 

 

What I predict is that if it does become law, commercial use of AI (e.g. ChatGPT) might slow down, because mandatory disclosure will reveal which models ingested copyrighted material should an AI's output be claimed to be plagiarized from a copyrighted work. You can't use the defense of "well, ChatGPT created it" when ChatGPT might actually have used the copyrighted work in its training. Visual and musical artists will have a field day should it be revealed that their works were used to train a model that is being commercially used to replicate their styles.

 

What I don't see happening is any actual abandonment or shutdown of commercial generative AI. They'll just change their TOS to put the liability for checking on the end user.

 

*UGC = User Generated Content, where the website is merely the publisher, not the owner. Think DeviantArt and Reddit.

 

Sources

 https://arstechnica.com/tech-policy/2024/04/us-lawmaker-proposes-a-public-database-of-all-ai-training-material/ 

https://schiff.house.gov/news/press-releases/rep-schiff-introduces-groundbreaking-bill-to-create-ai-transparency-between-creators-and-companies

https://schiff.house.gov/imo/media/doc/the_generative_ai_copyright_disclosure_act.pdf


Alternative headline:

US lawmaker proposes to back up the Internet

 

If we assume that the likes of ChatGPT have crawled large parts of the Internet, then that's effectively what would be required. Simply providing links is nonsense, since these will either become stale or the content behind them might change over time.
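One common way to pin down "what exactly was crawled" despite stale or changing links is to record a content hash at crawl time. A minimal sketch (illustrative only, not anything the bill specifies):

```python
# Sketch: instead of a bare URL, record a SHA-256 digest of the content
# at crawl time, so anyone checking later can detect whether the page
# behind the link has changed or gone stale.
import hashlib

def fingerprint(url: str, content: bytes) -> dict:
    """Pair a URL with a hash of the bytes actually crawled."""
    return {"url": url, "sha256": hashlib.sha256(content).hexdigest()}

snapshot = fingerprint("https://example.com/page", b"original page body")

# Later: re-fetch and compare digests to see whether the content changed.
changed = hashlib.sha256(b"edited page body").hexdigest() != snapshot["sha256"]
```

A hash proves the content changed, but of course it doesn't preserve the content itself — which is the poster's point about effectively needing to back up the Internet.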

Remember to either quote or @mention others, so they are notified of your reply


48 minutes ago, Kisai said:

ingested copyrighted material


All of them (shhhhh, keep this quiet :p)

 


AI model training has been the biggest theft of other people's work in history, by many orders of magnitude, and people shouldn't have to sue (fruitlessly) over it.

 

It should be opt in.

 

And if that means there's not enough data or it becomes impractical to train your model... well, there we go. Goodbye. Blatant theft for the purpose of reducing job opportunities for the very people being stolen from.


My concern here is that copyright law is already abused by the holders. I still wonder why generative AI is held to a different standard than humans. As humans, we consume copyrighted content all the time, but we're not prevented from generating works in areas where we have seen copyrighted content, with or without royalties. It is inevitable that seeing those works contributed, however little. As long as the AI system doesn't have perfect memory, how is it any different? That is, it learns the concept of a cat rather than memorising a specific picture of a cat. For systems that require specific and exact knowledge, it could make some sense.

 

I look forward to a chatbot not telling me "I'm sorry, Dave. I'm afraid I can't do that."

Main system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, Corsair Vengeance Pro 3200 3x 16GB 2R, RTX 3070, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, Acer Predator XB241YU 24" 1440p 144Hz G-Sync + HP LP2475w 24" 1200p 60Hz wide gamut
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible


Yeah, that's not gonna happen. AI requires massive amounts of processed data. It would literally overload the systems attempting to contain it all and drown people in paperwork trying to confirm it all. Which, ironically, would in turn lead to using AI to sort through the AI data, and we go down the path of Skynet.


I fully support this proposal.

 

I'm sure that the lawmaker understands nothing about it, but it happens to be for the best. All training data for all large models ought to be public. It was scraped from the public and it has to be given back to the public. Large corporations can't just scrape without giving something back.

 

I'd go further and demand that the weights and the models themselves be made public.

 

What large companies should be allowed to keep secret is the "secret sauce" of how they turn training data into models. That is the actual original work they are doing, and they're entitled to keep it as a trade secret.

5 minutes ago, manikyath said:

BUT they didn't stop to consider how such a thing would be achieved and what the result would look like... which is a really bad thing.

The data is already being stored, likely in preprocessed, tokenized form on storage near a supercomputer. One idea is to have LLM makers make a manifest with metadata about each source publicly accessible, and have a form to request access to the database for just the network/storage fees involved.

 

Another idea could be to have custodians for the databases, like LAION, that license access to the data and audit both the data sources and who is using them.

 

There are many ways to tackle the problem. Technically it's not even a problem; it's a regulatory/political one.
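The "public manifest, gated data" idea could look something like this minimal sketch. The schema, license labels, and fee gate are my assumptions for illustration, not anything from the bill:

```python
# Sketch: per-source metadata is published openly, while access to the
# underlying bytes is granted only once network/storage fees are covered.
manifest = [
    {"source": "wikipedia-dump-2024-03", "license": "CC-BY-SA", "records": 6_300_000},
    {"source": "example-news-crawl", "license": "unknown", "records": 120_000},
]

def request_access(source: str, fees_paid: bool) -> str:
    """Gatekeeper for the raw data behind a manifest entry."""
    known = any(entry["source"] == source for entry in manifest)
    if not known:
        return "unknown source"
    return "granted" if fees_paid else "pay network/storage fees first"

# The freely browsable view exposes only provenance, not the data itself.
public_view = [{"source": e["source"], "license": e["license"]} for e in manifest]
```

A custodian like LAION would sit behind `request_access`, doing the licensing and auditing.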


1 hour ago, Eigenvektor said:

Alternative headline:

US lawmaker proposes to back up the Internet

 

If we assume that the likes of ChatGPT have crawled large parts of the Internet, then that's effectively what would be required. Simply providing links is nonsense, since these will either become stale or the content behind them might change over time.

this.

 

it really feels like the bill was made to prevent AI models from snooping on copyrighted works and "profiting" from them without paying the copyright owner their due share... which is really a good thing.

BUT they didn't stop to consider how such a thing would be achieved and what the result would look like... which is a really bad thing.


1 hour ago, 05032-Mendicant-Bias said:

The data is already being stored, likely in preprocessed, tokenized form on storage near a supercomputer. One idea is to have LLM makers make a manifest with metadata about each source publicly accessible, and have a form to request access to the database for just the network/storage fees involved.

The LLM does not have a database of sources; that's not how they work. The trained model has no reference to 'each individual source used' in a way that can be traced back to said source with certainty.

 

Also, the bill proposes that the list of sources used be sent to some government office... should they then send an e-mail with millions of URLs? Because the content of websites is often at least partially copyrighted... so every crawled webpage is affected by this.


10 minutes ago, manikyath said:

The LLM does not have a database of sources; that's not how they work. The trained model has no reference to 'each individual source used' in a way that can be traced back to said source with certainty.

The people who build the model need the entire tokenized training database at hand with fast access. Otherwise, what do you feed your thousands of H100s with?

As for the sources, there are structured and unstructured databases. The Wikipedia database is likely very well documented; a 4chan scrape probably isn't. The regulation would force model builders to store the source of the data as well.
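"Storing the source alongside the tokenized data" might look like the following sketch. The record layout, token IDs, and field names are invented for illustration:

```python
# Sketch: a tokenized training record that carries its own provenance,
# as the regulation discussed above would effectively require.
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class TrainingRecord:
    tokens: tuple[int, ...]  # preprocessed, tokenized text fed to the GPUs
    source_url: str          # where the raw text was scraped from
    retrieved: str           # crawl date, so the source can be audited

rec = TrainingRecord(
    tokens=(101, 2023, 102),
    source_url="https://en.wikipedia.org/wiki/Cat",
    retrieved="2024-03-01",
)

# A disclosure manifest is then just the provenance fields, tokens dropped:
row = {k: v for k, v in asdict(rec).items() if k != "tokens"}
```

The point of the split is that the token stream feeds training while the provenance columns alone could feed a public manifest.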


2 hours ago, 05032-Mendicant-Bias said:

The regulation would force model builders to store the source of the data as well.

If your training data includes copyrighted material and you're now obligated to make that material publicly available, wouldn't that be yet another copyright violation?

 

As long as I just have to reference it ("I used the following Disney movies to train my LLM"), it's not an issue.

 

But for online content a link isn't exactly a stable source.



6 hours ago, 05032-Mendicant-Bias said:

I fully support this proposal.

 

I'm sure that the lawmaker understands nothing about it, but it happens to be for the best. All training data for all large models ought to be public. It was scraped from the public and it has to be given back to the public. Large corporations can't just scrape without giving something back.

 

I'd go further and demand that the weights and the models themselves be made public.

 

What large companies should be allowed to keep secret is the "secret sauce" of how they turn training data into models. That is the actual original work they are doing, and they're entitled to keep it as a trade secret.

The data is already being stored, likely in preprocessed, tokenized form on storage near a supercomputer. One idea is to have LLM makers make a manifest with metadata about each source publicly accessible, and have a form to request access to the database for just the network/storage fees involved.

 

Another idea could be to have custodians for the databases, like LAION, that license access to the data and audit both the data sources and who is using them.

 

There are many ways to tackle the problem. Technically it's not even a problem; it's a regulatory/political one.

Honestly, I could see this being a huge issue. I know a lot of commercial AI trains on data that people wouldn't want made public. Not sure whether this would apply to that information, but if it did, you would see huge issues crop up, especially because it's retroactive, so the only way to prevent disclosure would be to scrap the AI tool trained on the data. 


2 hours ago, Eigenvektor said:

If your training data includes copyrighted material and you're now obligated to make that material publicly available, wouldn't that be yet another copyright violation?

Who knows whether it'll be illegal to train a model on copyrighted data. Movie directors train by watching other directors' movies, artists train by studying other artists, writers train by reading other writers, engineers learn by studying and copying other engineers' work, and programmers copy code from other programmers.

 

"LLama 3 trained on all Marvel comics issues from 1990 to 2020."

 

As long as the output is derivative, I see an argument that there is no copyright violation in training a model on copyrighted material. I can make my own paintings in the style of Van Gogh and sell them, and that's fine. It would just be a very bad painting if I made it without the use of generative AI.

  

5 minutes ago, Brooksie359 said:

I know a lot of commercial AI trains on data that people wouldn't want made public.

This is an issue for legislators; I can't say what it will become. I suspect the final draft will be something that favours large IP holders like Disney. Regulations are decided by big money, as always.


37 minutes ago, 05032-Mendicant-Bias said:

Who knows whether it'll be illegal to train a model on copyrighted data. Movie directors train by watching other directors' movies, artists train by studying other artists, writers train by reading other writers, engineers learn by studying and copying other engineers' work, and programmers copy code from other programmers.

 

"LLama 3 trained on all Marvel comics issues from 1990 to 2020."

 

As long as the output is derivative, I see an argument that there is no copyright violation in training a model on copyrighted material. I can make my own paintings in the style of Van Gogh and sell them, and that's fine. It would just be a very bad painting if I made it without the use of generative AI.

  

This is an issue for legislators; I can't say what it will become. I suspect the final draft will be something that favours large IP holders like Disney. Regulations are decided by big money, as always.

Keep in mind that enterprise AI trains on a whole lot of companies' internal information, with the expectation that the information wouldn't be used in any other capacity. Combine all the companies whose data was used for training with Microsoft or the other large companies that own the AI, and I don't see how that side would have any less money behind it than, say, Disney. Granted, I think you could add an exemption for data that was used with permission. 


I'm against copyright as a concept, so I don't really care in that sense. For much the same reason, I don't think this targets the right thing: make the models public, not the training data.

Don't ask to ask, just ask... please 🤨

sudo chmod -R 000 /*


7 hours ago, whispous said:

AI model training has been the biggest theft of other people's work in history, by many orders of magnitude, and people shouldn't have to sue (fruitlessly) over it.

 

It should be opt in.

 

And if that means there's not enough data or it becomes impractical to train your model... well, there we go. Goodbye. Blatant theft for the purpose of reducing job opportunities for the very people being stolen from.

The whole thing is a bit complex though; should an artist who wants to draw in the style of artist XYZ then have to comply with the same kind of thing? After all, there are artists who draw in the style of another painter [without having to purchase their works].

 

Take, for example, Orion, an Elvis Presley impersonator. He didn't get the rights to Elvis' likeness/voice, yet he still sang very much like Elvis and did tributes to Elvis... if an AI, let's say, produced the same output, then it would have to in this case. I say for training sets we should treat it almost like a human.

 

The whole thing is a finicky subject, but I think a better option would be Japan's approach... pretty much anything is fair game for training, and it's the output that is subject to copyright law.

 

Also, regarding the whole impractical-to-train thing: that would restrict AI to only the large companies who can pay for it... so in other words this would give an effective monopoly to ChatGPT, Microsoft and Google [when considering them as one]... well, actually, what it will do is push all the services to offshore companies, which means now you have no control over it [i.e. Japanese companies or other companies with less regulation].

 

Generative AI is a tool that is really helpful and can speed up productivity; new technology has always brought out the "poor XYZ workers" argument. The thing is, everything adapts and people end up finding jobs elsewhere. I mean, if we want to go on about "stolen" data, then we have to acknowledge VHS and cassette tapes, which "harmed" so many artists by allowing the copying of their songs.

3735928559 - Beware of the dead beef


7 minutes ago, wanderingfool2 said:

The whole thing is a bit complex though; should an artist who wants to draw in the style of artist XYZ then have to comply with the same kind of thing? After all, there are artists who draw in the style of another painter [without having to purchase their works].

The human artist is expressing themselves creatively with inspiration, using their learned skills to produce the work.

7 minutes ago, wanderingfool2 said:

Take, for example, Orion, an Elvis Presley impersonator. He didn't get the rights to Elvis' likeness/voice, yet he still sang very much like Elvis and did tributes to Elvis... if an AI, let's say, produced the same output, then it would have to in this case. I say for training sets we should treat it almost like a human.

The human artist is expressing themselves creatively with inspiration, using their learned skills to produce the work.

 

 

AI is NOT creative, as it is not conscious. It is not injecting any element of creativity whatsoever; it is spitting out averages of other people's creative works. It is, by definition, reproducing exclusively from other people's creative works. It adds nothing to the product.


7 minutes ago, whispous said:

The human artist is expressing themselves creatively with inspiration, using their learned skills to produce the work.

 

 

AI is NOT creative, as it is not conscious. It is not injecting any element of creativity whatsoever; it is spitting out averages of other people's creative works. It is, by definition, reproducing exclusively from other people's creative works. It adds nothing to the product.

Except that premise doesn't exactly hold true.

 

Humans' "learned" skills are based on a lifetime of learned behavior and patterns [much like AI, except we are more efficient at it].

 

It's also not as simple as saying it outputs averages. Overall, yes, I think AI in general is a predictive engine, BUT that's not to say there isn't a form of creativity going on. You could give an AI a 30-second clip and then tell it to sing a whole different song in the style of the artist from that clip, and it would be able to do it. That, I would say, edges into the creative side of things.

 

Or take the case of uploading an original picture you provide and saying "paint this in the style of Van Gogh"; it's able to do that. Are you going to claim that it's spitting out "averages of other people's creative works"? Especially the models that have essentially analyzed brush strokes and try to emulate those kinds of brushstrokes when creating the image.

 

To note on the human side: take the Super Mario World overworld theme... do you think the composer created it by himself, with no influence from past works? Let's just ignore "Green Green".

 

Music styles greatly influence one another: you have a few creators who actually created the "new sound", and everyone else effectively copied them. There are lots of songs out there that share a large chunk of underlying undertones... and many songs that end up almost clones of each other.



7 minutes ago, wanderingfool2 said:

Overall, yes, I think AI in general is a predictive engine, BUT that's not to say there isn't a form of creativity going on.

No, you have got the entirely wrong idea about what AI is.

7 minutes ago, wanderingfool2 said:

You could give an AI a 30-second clip and then tell it to sing a whole different song in the style of the artist from that clip, and it would be able to do it.

No, it can't do that. It can listen to 30 seconds and generate something, yes, but it will have been trained on a LOT MORE than just that one clip first.

 

 

 

I now need to go back to this again for emphasis:

7 minutes ago, wanderingfool2 said:

Overall, yes, I think AI in general is a predictive engine, BUT that's not to say there isn't a form of creativity going on.

This is categorically NOT TRUE.


11 minutes ago, whispous said:

No, it can't do that. It can listen to 30 seconds and generate something, yes, but it will have been trained on a LOT MORE than just that one clip first.

I never said the underlying structure wasn't trained on other data... but your whole argument is that it spits out averages without any creativity, which is silly if you can feed it a 30-second clip of an artist it has never heard before and it can recreate a song as though the person in the clip were singing it.

 

The point is that you are oversimplifying what is effectively going on in the backend by merely stating that it spits out averages of other people's work. The underlying structure of AI is a whole lot more complicated than that.

 

e.g. there are likely parts of the model with an "understanding" of pitch, others with an "understanding" of rhythm, etc. All of those together build up a model that is able to generate the output [and there are likely feedback parts, where it feeds the generated material back into itself, etc.].

 

18 minutes ago, whispous said:

This is categorically NOT TRUE.

What humans call creativity is essentially just a form of randomness where our brains have put together a pattern and extrapolated that pattern.

 

We are effectively what we have grown up as: we take in copyrighted works and spit out variants of them, claiming them as original work. There are very few people out there that I would call truly creative; a lot of what people do is effectively derivative of other work.

 

For that reason, a computer that takes a random input of numbers and a prompt, and creates something that cannot be found as a whole in other works, would, I'd say, be "creative", as it's still an original work: elements might show signs of the work it was trained on, but as a whole it's something new...

 

The thing to remember is that while it has been able to "see" previous works, and it's true you can still extract some of the training images out of these models, the fact is that humans operate like that as well. Again, look up "Green Green" [which was popular on Japanese kids' shows when the composer grew up] and tell me that the SMW composer didn't effectively reuse the main part.

 

24 minutes ago, whispous said:

No, you have got the entirely wrong idea about what AI is.

I've downloaded and compiled the Stable Diffusion source and played around with generation for at least 3 years now; my current job has me involved in creating an LLM... but sure, I have the wrong idea about AI.

 

It's like I said in this thread and others: I view AI as almost a predictive model, but the way AI is put together, an argument can be made that it consumes images and produces output in a similar fashion to how a human does [in the sense that I don't think there should be regulations on the training-set data; instead there should be regulations on the output images].

 

Humans are simply predictive models as well, just very complex ones.



2 hours ago, 05032-Mendicant-Bias said:

Who knows whether it'll be illegal to train a model on copyrighted data.

I don't think learning from copyrighted material is necessarily the issue, especially if you've used said copyrighted material with permission from the copyright holder. The issue is that the AI, once trained, may reproduce the copyrighted material, either in part or in full.

 

As you said, people learn from copyrighted material all the time and no one has an issue with that. But if I learn drawing by studying historical works, it can turn into an issue if I start reproducing them, depending on the circumstances.



Cool! Government oversight of AI training models; I'm sure this won't be co-opted for political and/or social agendas........ /s

 

 

You give these models ALL the info... or none... otherwise they end up biased. If you don't like the outcome from said AI after giving it ALL the data available... don't interact with the AI to begin with.

CPU: Intel i7 3930k w/OC & EK Supremacy EVO Block | Motherboard: Asus P9x79 Pro  | RAM: G.Skill 4x4 1866 CL9 | PSU: Seasonic Platinum 1000w Corsair RM 750w Gold (2021)|

VDU: Panasonic 42" Plasma | GPU: Gigabyte 1080ti Gaming OC & Barrow Block (RIP)...GTX 980ti | Sound: Asus Xonar D2X - Z5500 -FiiO X3K DAP/DAC - ATH-M50S | Case: Phantek Enthoo Primo White |

Storage: Samsung 850 Pro 1TB SSD + WD Blue 1TB SSD | Cooling: XSPC D5 Photon 270 Res & Pump | 2x XSPC AX240 White Rads | NexXxos Monsta 80x240 Rad P/P | NF-A12x25 fans |


I'm fine with this for "drawing" and "animating" AIs.  

Currently focusing on my video game collection.

It doesn't matter what you play games on, just play good games you enjoy.

 


On 4/19/2024 at 10:44 AM, SolarNova said:

Cool! Government oversight of AI training models; I'm sure this won't be co-opted for political and/or social agendas........ /s

 

 

You give these models ALL the info... or none... otherwise they end up biased. If you don't like the outcome from said AI after giving it ALL the data available... don't interact with the AI to begin with.

When it's all said and done, there will be thousands of LLMs, each heavily biased depending on the nation and politics they're curated for.

 

The toothpaste is out of the tube. People need to get used to the shilling of biased AI models. It will be a propaganda minefield of information overload out there, even more than it already is. 


I see generative AI the same way I see battery EVs: basically the LaserDisc of our time. It came with a lot of hype and was supposed to be the future, but something better will replace it before it gets adopted by the masses.

With LaserDisc it was the CD and then the DVD. With battery EVs it is going to be HFC cars. In the case of generative AI it will be true AI, and as we all know, once that genie is out of the bottle there is no way back. Yes, everyone is constantly talking about AI right now, but in its current state it's essentially just advanced automation for a wider range of processes that previously couldn't be automated. And as such it should be governed. I am OK with that, but I want the training models to be disclosed not only to the government but to the public as well. Everyone should have access to that information, just like all the ingredients are listed on food items, instead of simply "it's a drink that gives you energy".

| Ryzen 7 5800X3D | Arctic Liquid Freezer II 360 Rev 7| AsRock X570 Steel Legend |

| 4x16GB G.Skill Trident Z Neo 4000MHz CL16 | Sapphire Nitro+ RX 6900 XT | Seasonic Focus GX-1000|

| 512GB A-Data XPG Spectrix S40G RGB | 2TB A-Data SX8200 Pro| Phanteks Eclipse G500A |

