New AI able to clone any voice with just seconds of input data

vanished · November 12, 2019

Source video ^

His sources are quoted in description for the particularly scientific among you

Some of you may be aware of previous techniques that could realistically synthesize human speech, but those required hours of input data. This new system needs just seconds of source data.

Quote

The timbre of the voice is very similar [to the original], and it is able to synthesize sounds and consonants that have to be inferred because they were not heard in the original voice sample. This requires a certain kind of intelligence, and quite a bit of that. [...]

The speaker encoder is a neural network that was trained on thousands and thousands of speakers and is meant to squeeze all this learned data into a compressed representation. In other words, it tries to learn the essence of human speech from many many speakers. [...] This training step needs to be done only once, and after that it was allowed just 5 seconds of speech data from someone they haven't heard [before].
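To make the "compressed representation" idea a bit more concrete, here is a toy sketch of turning a clip of any length into one fixed-size vector and comparing two clips. This is not the researchers' encoder: the real one is a trained neural network, whereas this stand-in just averages made-up per-frame feature vectors. The function names and all the numbers are hypothetical, for illustration only.

```python
import math

def embed(frames):
    """Toy stand-in for a speaker encoder (assumption: NOT the real model).

    Mean-pools per-frame feature vectors into a single vector and scales it
    to unit length, so clips of any length map to the same fixed size.
    """
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]

def cosine_similarity(a, b):
    """Dot product of two unit-length vectors; closer to 1 means more alike."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical per-frame features; a real system would derive these from audio.
speaker_a_clip1 = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]]
speaker_a_clip2 = [[1.1, 0.0, 0.1], [0.8, 0.1, 0.0], [1.0, 0.2, 0.0]]
speaker_b_clip = [[0.0, 0.2, 1.0], [0.1, 0.1, 0.9]]

same = cosine_similarity(embed(speaker_a_clip1), embed(speaker_a_clip2))
diff = cosine_similarity(embed(speaker_a_clip1), embed(speaker_b_clip))
print(same > diff)
```

The point of the sketch is only that a short clip gets squeezed into a small vector, and that vectors from the same voice land closer together than vectors from different voices. In the system described in the quote, a vector like this is what lets the synthesizer speak in the new voice.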

I'm sure you don't need me to explain how this has absolutely loads of possible applications, ranging from incredibly useful, to incredibly bad. I don't have test results to see how this might or might not be able to fool voice print security systems, but it's definitely conceivable. It also will call into question the reliability of any recorded audio. This has implications for politics, blackmail, crime cases, etc.

The fact it is so easy to do is what really makes it interesting. Being able to do it with hours of samples is one thing, but that's not always easy or possible to get. 5 seconds is quite a different story.

On the flip side, imagine the usefulness for having retired or deceased actors posthumously fulfill a role, particularly in a cartoon where video is not necessary, or for repairing or redoing dialog in a movie or voiceover. Imagine using this to replace traditional voice synthesizers that, though better now than they used to be, often still sound noticeably more robotic than people. Imagine, if like the creator of this video, you often do a lot of speaking content - you can just synthesize the audio using a script if you're feeling ill or lazy. Imagine using old home movies to capture the voice of someone who for one reason or another is no longer able to speak and through this technology, give them the ability to sound like themselves again.

mr moose · November 12, 2019

It won't be long before video and audio evidence can be spoofed so well that such evidence is considered flimsy in courts.

November 12, 2019

Pretty soon a 00s Shaggy song will be a legal defense.

Praesi · November 12, 2019

This + deepfakes and we can't trust anything anymore.

vanished · November 12, 2019

5 minutes ago, Praesi said:

This + deepfakes and we can't trust anything anymore.

Absolutely. I'm sure that as with deepfakes, new AIs will be created to detect content made this way, but that does little to decrease the risk.

Bombastinator · November 12, 2019

New tech both solves old problems and creates new ones. The progression of technology is exponential, and the curve is getting close to vertical as we speak.

Dissitesuxba11s · November 12, 2019

From listening to the audio samples, they still lack emotion and sound monotone. Take Trump as an example: his speaking pattern often goes up and down with different inflections, so I'm really curious how it does on a sample of someone speaking freely, versus reading a preset sentence.

Flying Sausages · November 12, 2019

New AI able to clone any voice with just seconds of input data

vanished · November 12, 2019

9 minutes ago, Dissitesuxba11s said:

From listening to the audio samples, they still lack emotion and sound monotone. Take Trump as an example: his speaking pattern often goes up and down with different inflections, so I'm really curious how it does on a sample of someone speaking freely, versus reading a preset sentence.

I think the most interesting example is the 3rd last on the page:

It takes someone singing in what sounds to me like Chinese (apologies if it is not) and from it renders their voice speaking in English, without any perceivable accent, and yet, with the melodic quality of their speech retained.

Dissitesuxba11s · November 12, 2019

9 minutes ago, Ryan_Vickers said:

I think the most interesting example is the 3rd last on the page:

It takes someone singing in what sounds to me like Chinese (apologies if it is not) and from it renders their voice speaking in English, without any perceivable accent, and yet, with the melodic quality of their speech retained.

Ooh I totally missed those. That's interesting that it reused the melody of the song. From that, it looks like the synthesized voice might be limited to however long the reference is. In other words, if they used that Chinese(?) singing as a reference, it would repeat the melody every ~7 secs for longer synthesized samples.

vanished · November 12, 2019

Just now, Dissitesuxba11s said:

Ooh I totally missed those. That's interesting that it reused the melody of the song. From that, it looks like the synthesized voice might be limited to however long the reference is. In other words, if they used that Chinese(?) singing as a reference, it would repeat the melody every ~7 secs for longer synthesized samples.

I'm not sure, but it gives me hope that "emotion" and inflections can be preserved, if not now, then perhaps with the next version

Trik'Stari · November 12, 2019

1 hour ago, Praesi said:

This + deepfakes and we can't trust anything anymore.

You still trusted things? I stopped long ago.

Maybe we'll get lucky and some intrepid do gooder will bring audio and video analysis technology to match this kind of shit. Making it easy to spot fakes.

The good news is, that maybe the wider majority of people will realize that the media lies about almost fucking everything, when someone uses this technology to troll the shit out of them.

Praesi · November 12, 2019

14 minutes ago, Trik'Stari said:

You still trusted things? I stopped long ago.

Maybe we'll get lucky and some intrepid do gooder will bring audio and video analysis technology to match this kind of shit. Making it easy to spot fakes.

The good news is, that maybe the wider majority of people will realize that the media lies about almost fucking everything, when someone uses this technology to troll the shit out of them.

No.

But that's a whole new level again.

williamcll · November 13, 2019

4 hours ago, Ryan_Vickers said:

I think the most interesting example is the 3rd last on the page:

It takes someone singing in what sounds to me like Chinese (apologies if it is not) and from it renders their voice speaking in English, without any perceivable accent, and yet, with the melodic quality of their speech retained.

The quoted reference is "輕輕敲醒沉睡的心靈,慢慢張開你的眼睛" (roughly, "Gently wake the sleeping soul, slowly open your eyes"), and the voice seems like it's from someone else in the research team. The last two references are French, but I'm not fluent enough to translate them.

Founders · November 13, 2019

I knew that video I saw of Betty White in the gym dead lifting 600lbs while reading Harry Potter was a bit iffy.

Taf the Ghost · November 13, 2019

5 hours ago, mr moose said:

It won't be long before video and audio evidence can be spoofed so well that such evidence is considered flimsy in courts.

They've been able to do this for about 5 years already. It just used to cost a lot of money. Now, everyone is going to call every captured audio or video a "deep fake". Politicians of the world rejoice at this news.

poochyena · November 13, 2019

5 hours ago, mr moose said:

It won't be long before video and audio evidence can be spoofed so well that such evidence is considered flimsy in courts.

It already can be, though. Video/audio evidence alone is already considered flimsy.

Drak3 · November 13, 2019

8 minutes ago, Ehmc130 said:

I knew that video I saw of Betty White in the gym dead lifting 600lbs while reading Harry Potter was a bit iffy.

Yeah, it's well documented that she never goes below 750 and that she can only read in Klingon!

desertcomputer · November 13, 2019

5 hours ago, Praesi said:

This + deepfakes and we can't trust anything anymore.

Well, RIP. Big corporations have already been collecting both our face and voice data for years.

BuckGup · November 13, 2019

So can I get my hands on the code or access to the model?

vanished · November 13, 2019

8 minutes ago, BuckGup said:

So can I get my hands on the code or access to the model?

I believe so, when I skimmed the description of the video I seem to recall seeing that they opened up part of it

BuckGup · November 13, 2019

8 minutes ago, Ryan_Vickers said:

I believe so, when I skimmed the description of the video I seem to recall seeing that they opened up part of it

All I saw was a link to code that was written independently of the researchers and works in a similar way.

vanished · November 13, 2019

20 minutes ago, BuckGup said:

All I saw was a link to code that was written independently of the researchers and works in a similar way.

Ah, I see. Yeah, an unofficial implementation. I guess you'd have to make your own, but theoretically, if they can, then anyone else with access to the paper can as well.

BuckGup · November 13, 2019

6 minutes ago, Ryan_Vickers said:

Ah, I see. Yeah, an unofficial implementation. I guess you'd have to make your own, but theoretically, if they can, then anyone else with access to the paper can as well.

Yeah true. Sounds like a bit of work though. With deepfakes we no longer need actors now. Also the pron will be eccentric

vanished · November 13, 2019

Just now, BuckGup said:

Yeah true. Sounds like a bit of work though. With deepfakes we no longer need actors now. Also the pron will be eccentric

Well I don't think it's quite good enough for 4K hollywood grade films, but clearly that day will come, and probably sooner than we think
