
Uncanny-valley-passing AI voice cloning, a new tool for video creators like LTT? | The dark side of audio cloning

Permik

Summary

New, uncanny-valley-passing AI voice cloning is here, right now. But as always, there are downsides when this is made available to the masses.

 


Quotes

Quote

"On Monday, ElevenLabs, founded by ex-Google and Palantir staffers, said it had found an “increasing number of voice cloning misuse cases” during its recently launched beta." - Vice[1]


"4chan members used ElevenLabs to make deepfake voices of Emma Watson, Joe Rogan, and others saying racist, transphobic, and violent things."  - Vice[1]

 

"The clips uploaded to 4chan on Sunday are focused on celebrities. But given the high quality of the generated voices, and the apparent ease of creation, they highlight the risk of deepfake audio clips" - Vice[1]

 

"Crazy weekend - thank you to everyone for trying out our Beta platform. While we see our tech being overwhelmingly applied to positive use, we also see an increasing number of voice cloning misuse cases. We want to reach out to Twitter community for thoughts and feedback!" - ElevenLabs on Twitter[2]

 

"Though it's not clear which of the 4chan clips were created using the ElevenLabs beta, one post contained a link to ElevenLabs' Prime Voice AI, suggesting the company's software may have been employed." - PCMag[3]

 

"Our current (safeguard) ideas:
(1) Additional account verifications to enable Voice Cloning: such as payment info or even full ID verification
(2) Verifying copyright to the voice by submitting sample with prompted text
(3) Drop Voice Lab altogether and manually verify each Cloning Request" - ElevenLabs on Twitter[4]

Any text in italics is added/modified for brevity/clarity/context.

 

My thoughts

This post is really just an excuse to write about ElevenLabs and put them on the LTT team's radar.

This is really huge: the whole LTT channel family could use this for their future AI-dubbed content, if that's still an avenue they're working towards.
Multilingual AI-generated content is also part of ElevenLabs' mission[5] and could make future dubbed content less jarring, since the original voice of the host could be preserved while the translated transcript is read by an AI trained on that host's voice.

But the results speak for themselves; just try it out on their website: ElevenLabs. (Note: the main page limits you to 3-4 voice clips before asking you to sign up, but after signing up you can generate speech for free, up to 10k characters per month.)
And to see more impressive results (e.g. laughing AI[6], fully synthetic voices[7]) without shelling out any money, you can check out their blog[8]

 

 

Addendum: ElevenLabs is distinct from the 15.ai project, which became known for speech synthesis from game assets.

Notes for the WAN-show notetaker:

By whipping up the AI, you could generate a game segment for the next show, something like "Did I say that?": feed the voice-clone AI speech from Linus and Luke, generate things they might have said but haven't, mix in odd clips of things they actually have said, and have them guess which is which. "That's a free tech tip for ya!" 😄
If the community already has some good clips of some un-/hinged things that they've said, perhaps provide them in the replies?

 

Aside, where I personally picked this story up from:

For the fans of Half-Life, here's a clip of Eli voice cloned and reciting a pretty sophisticated prompt without any issues: (Content warning, contains crude language, namely the f-bomb)
https://youtube.com/clip/UgkxobfZIFhtMqQqVGzW7Kx049pDgEJHTipm

And here's a clip of the "Passionate gamer", Tyler McVicker, reacting to a voice clone of himself:
https://youtube.com/clip/Ugkx1OoJ39-MTbuPsZSnhelMdjlCEiozcDOE

You can find all of these clips in an archived "Letting off Steam"-stream of his, here:
Tyler McVicker – Letting off Steam 2023 #1 – ElevenLabs segment 58:07–1:04:51 (Timestamped link)

 

 

 

Sources

[1] Motherboard, Tech by Vice – AI-Generated Voice Firm Clamps Down After 4chan Makes Celebrity Voices for Abuse

[2] ElevenLabs on Twitter, "Crazy weekend", 2023-01-30

[3] PCMag.com – AI Voice-Cloning Tool Misused for Deepfake Celeb Clips

[4] ElevenLabs on Twitter, "Current (safeguard) ideas", 2023-01-30

[5] ElevenLabs website – About, "Our mission" section

[6] ElevenLabs Blog – The first AI that can laugh

[7] ElevenLabs Blog – This Voice Doesn't Exist - Generative Voice AI

[8] ElevenLabs Blog

 

Sorry for the ghetto Wikitext-ass formatting, but it really feels like the best way to mark up a text with references: it keeps the text readable and keeps the links out of the body.


Also, how do I get rid of the emoji auto-replacement? It's depriving me of my Finnish internet identity of using

:D

 


If you want a fake Linus voice, tts.monster already has that. 

https://tts.monster/static/sounds/linus.mp3 if it doesn't play.

 

Now, I have to chime in on TTS stuff, because that's a pet interest I've actually worked on.

 

Voice cloning has been a thing for 2+ years, stuff you can run on your own GPU and CPU. There have been multiple sites that do this.

 

There are always three aspects to AI audio projects:

input (ASR)

output (TTS)

and style transfer and/or transfer learning

 

If I train a voice model (actually go out there, grab 10 minutes of anyone on the internet, and put it into my dataset), it will produce a "good enough" voice, provided it is the ONLY voice in training. However, 10 minutes is nowhere near enough to cover every kind of prosody. So what you usually do is warm-start, zero-shot training, which is really just style transfer over an underlying trained model, usually LibriTTS or some combination of LibriTTS and VCTK.

 

LibriTTS will give you a generally American-sounding accent unless a word does not exist in CMUDict, in which case it will try to assemble the word phonetically, which doesn't work for foreign words.
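That lookup-with-fallback behaviour can be sketched in a few lines of Python. To be clear, the mini-dictionary and the per-letter fallback table below are invented stand-ins for illustration, not the real CMUDict data or any real G2P model:

```python
# Toy grapheme-to-phoneme lookup with a naive fallback, illustrating why
# out-of-dictionary (e.g. foreign) words come out mangled.
# MINI_DICT and LETTER_PHONES are hypothetical stand-ins for CMUDict.

MINI_DICT = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

# Crude per-letter fallback: map each letter to a guessed phone.
LETTER_PHONES = {
    "a": "AE", "e": "EH", "i": "IH", "o": "OW", "u": "UH",
    "s": "S", "k": "K", "y": "Y", "r": "R", "j": "JH", "v": "V",
    "l": "L", "n": "N", "t": "T", "m": "M", "h": "HH", "d": "D",
}

def to_phones(word: str) -> list[str]:
    word = word.lower()
    if word in MINI_DICT:  # dictionary hit: correct pronunciation
        return MINI_DICT[word]
    # dictionary miss: assemble phonetically, letter by letter
    return [LETTER_PHONES.get(ch, "?") for ch in word]

print(to_phones("hello"))       # dictionary hit, clean phones
print(to_phones("jyväskylä"))   # foreign word: mangled letter-by-letter guess
```

The second call shows the failure mode: letters the fallback table has never seen (like "ä") have no phone at all, which is roughly why foreign words come out sounding wrong.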

 

Now the fun stuff. What if you want X's voice, but Y language/accent? That is something you can ONLY do with AI tech, and it saves an actual person who isn't a polyglot from needing to speak those languages natively.

 

The problem, overall, is that voice cloning is a bit different from DALL-E/Stable Diffusion style transfer.

 

Zero-shot training is the equivalent of what DALL-E does. It relies on the original training data, so if there is no existing training data that is "close" to the zero-shot sample, the result will sound really weird.

 

You can get a much better result from having 10 people at 30 minutes each with the accent and prosody you want; every result for your target will then have that same accent and prosody.

 

This tool doesn't do high-pitched voices well, and I can explain that easily. Zero-shot training simply cannot do that; it can only "clone" a voice in the same pitch range it has already been trained on, and NO common dataset has children's voices, Asian voices, ESL voices, African voices, or Hispanic voices. They are extremely overweighted with male, American New England or Midwest voices. So if you use zero-shot training on something that never existed in the dataset, just like DALL-E, it will not be able to replicate it.
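That "snaps to the nearest training voice" failure mode can be sketched with toy speaker embeddings. The two-dimensional vectors (roughly: mean pitch in Hz, an accent score) and the speaker set are invented for illustration; real systems use learned, high-dimensional embeddings:

```python
import math

# Toy 2-D speaker embeddings: (mean pitch in Hz, accent score 0..1).
# All names and values are invented for illustration.
TRAINING_SPEAKERS = {
    "us_male_low":   (110.0, 0.9),
    "us_male_mid":   (130.0, 0.8),
    "uk_female_mid": (210.0, 0.3),
}

def clone(target):
    """Zero-shot 'cloning' as nearest-neighbour lookup: the output can
    only ever resemble one of the voices the model was trained on."""
    return min(TRAINING_SPEAKERS,
               key=lambda s: math.dist(TRAINING_SPEAKERS[s], target))

# A child's voice (~300 Hz) is far outside the training set, so the model
# falls back to the closest adult voice instead of replicating the child.
print(clone((300.0, 0.5)))   # -> "uk_female_mid"
```

The out-of-distribution target doesn't error out; it silently becomes the nearest in-set voice, which matches the "child voice comes out as adult female" behaviour people observe.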

 

Voice conversion is the act of taking an input voice and transforming it into the output voice. This is what projects like RADTTS aim to achieve, whereas Coqui aims a bit lower and can only convert voices it was already trained on. It's still entirely possible to straight-up clone someone's voice convincingly if you have enough of their audio, regardless of which project you use.

 

How you detect it, however, is another story: https://www.usenix.org/conference/usenixsecurity22/presentation/blue

All existing "deepfake" projects are based on low-quality training data. Every model you can find, for any project, is 22 kHz, because that's what fits on a conventional 8 GB GPU for zero-shot training inference. If you want to fool forensics, the input dataset and output audio have to be 96 kHz to cover all the "unheard-by-human-ears" parts of the audio for the trainer to learn, and those recordings have to be lossless and made with microphones matching what the target uses. So if you're trying to deepfake someone who records on a Shure SM7B, and you only have access to 128 kbit MP3/OGG/AAC/OPUS lossy recordings they made 10 years ago, of course it's not going to fool anyone familiar with their content. This is why "deepfaking" an actor from a film is more believable than deepfaking a voice actor in a game: the audio you hear in film is usually from hidden lavalier microphones that produce lower-quality audio, which is easier for an AI to deepfake. But voice actors? Usually high-end studio microphones, much harder to deepfake.
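The forensics point boils down to Nyquist: a recording sampled at 22.05 kHz contains nothing above ~11 kHz, so missing ultrasonic content is itself a tell. A minimal sketch of that arithmetic (a real forensic tool would inspect the actual spectrum, not just the sample rate):

```python
def nyquist_khz(sample_rate_hz: int) -> float:
    """Highest frequency (in kHz) a given sample rate can represent."""
    return sample_rate_hz / 2 / 1000

# Typical TTS training audio vs. CD audio vs. studio-quality capture:
for rate in (22_050, 44_100, 96_000):
    print(f"{rate} Hz -> content up to {nyquist_khz(rate):.2f} kHz")

# A 'clone' trained on 22.05 kHz data is spectrally empty above ~11 kHz,
# while a genuine 96 kHz studio recording carries energy up to 48 kHz,
# an easy discrepancy for forensic analysis to flag.
```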

 

You'd think that high-end audio would be easier to deepfake because the signal is cleaner, but that's where you're wrong: most of the training datasets that already exist for AI TTS/VC are downsampled to 22 kHz.

 

Take VCTK (https://datashare.ed.ac.uk/handle/10283/2950), which was recorded at 96 kHz using a DPA 4035 condenser microphone ($700) and a Sennheiser MKH 800 ($3200). This is almost certainly what every "voice cloning" project uses as the basis for zero-shot training, because it produces extremely clean training data.

 

But if you look at the VCTK voice list, you'll discover exactly what I said about accents and voice variety:

62 female voices, 47 male voices

between the ages of 18 and 38 (most being 22-25, with only 3 speakers in their 30s)

34 England (mostly Southern England)

22 American (2 NY, 2 New Jersey, 2 California)

19 Scottish (5 Edinburgh, 3 Fife, 2 Aberdeen)

15 Irish (9 Ireland, 6 Northern Ireland)

8 Canadian (5 Toronto, 2 Alberta, 1 Quebec)

4 South African

3 Indian

2 Australian

1 Welsh

1 New Zealand
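The skew in that speaker list is easy to quantify. The counts below are copied straight from the list above:

```python
# Accent counts from the VCTK speaker list quoted above.
ACCENTS = {
    "England": 34, "American": 22, "Scottish": 19, "Irish": 15,
    "Canadian": 8, "South African": 4, "Indian": 3,
    "Australian": 2, "Welsh": 1, "New Zealand": 1,
}

total = sum(ACCENTS.values())
for accent, n in sorted(ACCENTS.items(), key=lambda kv: -kv[1]):
    print(f"{accent:>13}: {n:3d} speakers ({100 * n / total:.1f}%)")

# British Isles accents alone cover the clear majority of the corpus,
# which is exactly why zero-shot cloning drifts toward those accents.
isles = sum(ACCENTS[a] for a in ("England", "Scottish", "Irish", "Welsh"))
print(f"British Isles share: {100 * isles / total:.0f}%")  # -> 63%
```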

 

While these websites are kinda cool to mess around with, calling them "convincing" is like calling DALL-E professional-level work. It's not; it's nowhere close.

 

Seriously, you're not going to get any voice-cloning system to scream, yell, or laugh. That is something only voice conversion can do, with the input speaker trained on the same model as the output speaker, and even then it will only be convincing to people who rarely hear the subject emote like that. So if you're going to pick a target to clone, pick someone who has a narrative character to their voice, because that is what TTS systems can do. They cannot do conversational prosody well enough to be convincing to the target's mother.

 


I am very worried by the people who think this can be used to spread fake news.

I am worried because they seem to think that now that AI voice tools are out, they can no longer trust audio clips of X person saying Y. I am worried because you have never been able to trust audio clips, and it sounds like many people have done so up until now.

 

Even before AI tools like Prime Voice AI, people had access to a wide variety of tools for creating audio clips of a person saying something. Big companies have been able to do this for a long time. For crying out loud, Disney managed to recreate a dead actor in Rogue One, not just the voice but the visuals as well, and that was over 6 years ago. There are probably examples of voice-only fakes from even longer ago.

Even if you didn't have access to all the resources of Disney, you could still fake things by stitching together several audio clips to create new sentences, and with a bit of smoothing and careful selection of clips you could create convincing clips of person X saying a sentence they had never said before.

 

Or if we go back even further, to the time before general-purpose computers, people could still fake person X saying Y by just hiring (or being) an impersonator. You know, the field of work that predates computers. Here is one guy (Rich Little, famous impressionist) in the 80s "faking" a conversation between 5 different celebrities:

 

 

Tools like Prime Voice AI just make the ability to fake things more widely available. It is no longer limited to the few people/companies in power (Disney, news organisations, etc.) or the few who became good impressionists through a lot of genetics and training.


@LAwLz I was going to make a point about completely convincing fakes being available, but exceptionally expensive, since the late 2000s. But then I remembered we've had 50+ years of discussion about the truth of the Zapruder film and, well, nothing really will change.


3 hours ago, LAwLz said:

I am very worried by the people who think this can be used to spread fake news.

I am worried because they seem to think that now that AI voice tools are out, they can no longer trust audio clips of X person saying Y. I am worried because you have never been able to trust audio clips, and it sounds like many people have done so up until now.

 

Even before AI tools like Prime Voice AI, people had access to a wide variety of tools for creating audio clips of a person saying something. Big companies have been able to do this for a long time. For crying out loud, Disney managed to recreate a dead actor in Rogue One, not just the voice but the visuals as well, and that was over 6 years ago. There are probably examples of voice-only fakes from even longer ago.

Even if you didn't have access to all the resources of Disney, you could still fake things by stitching together several audio clips to create new sentences, and with a bit of smoothing and careful selection of clips you could create convincing clips of person X saying a sentence they had never said before.

 

Or if we go back even further, to the time before general-purpose computers, people could still fake person X saying Y by just hiring (or being) an impersonator. You know, the field of work that predates computers. Here is one guy (Rich Little, famous impressionist) in the 80s "faking" a conversation between 5 different celebrities:

 

 

Tools like Prime Voice AI just make the ability to fake things more widely available. It is no longer limited to the few people/companies in power (Disney, news organisations, etc.) or the few who became good impressionists through a lot of genetics and training.

It's all moot now anyway; no one seems to rely on evidence when they want to make accusations about people they don't like, and they've been ignoring evidence they don't like the implications of for far longer.



Fake your own music kek.



5 hours ago, Kisai said:

This tool doesn't do high-pitched voices well, and I can explain that easily. Zero-shot training simply cannot do that; it can only "clone" a voice in the same pitch range it has already been trained on, and NO common dataset has children's voices, Asian voices, ESL voices, African voices, or Hispanic voices. They are extremely overweighted with male, American New England or Midwest voices. So if you use zero-shot training on something that never existed in the dataset, just like DALL-E, it will not be able to replicate it.

I've been playing with Tortoise TTS with an idea of using it for making some comedy gaming videos, and I think you just explained what I've observed. I haven't tried the one in the OP, but presume there are similarities in approach. Tortoise takes ~30 s of 22 kHz input samples and uses that to output speech. I've had mixed results: it seems to work OK on adult male voices. I tried it on a fake child female voice (voice acting) and it ended up sounding like an adult female with a noticeable American accent.

 

I've even been considering using it to replace myself. I suck at reading my own scripts for pre-recorded content, and wonder if AI me could do a better job. Still working on that.



7 minutes ago, porina said:

I've been playing with Tortoise TTS with an idea of using it for making some comedy gaming videos, and I think you just explained what I've observed. I haven't tried the one in the OP, but presume there are similarities in approach. Tortoise takes ~30 s of 22 kHz input samples and uses that to output speech. I've had mixed results: it seems to work OK on adult male voices. I tried it on a fake child female voice (voice acting) and it ended up sounding like an adult female with a noticeable American accent.

Yep, that's what happens every time with zero-shot training. You have to adjust the pitch on the TTS side, and the length scale to compensate for it being sped up. End-to-end TTS is better at replicating an individual, but you still need around 10 minutes of audio to get there.

 

For example, I've trained Coqui VITS on a single subject using a single podcast, and it produced a fairly convincing voice. Then I added 10 others to it, and the overall American accent that 7 of the subjects had bled into all of them; adding or removing the one Asian speaker in the group had an overbearing effect on the accent, because it basically "corrupted" the base and fused some phones together, since the matching transcription has to be literal.

 

But my zero-shot model's style transfer only works on voices with a similar F0.

 

7 minutes ago, porina said:

I've even been considering using it to replace myself. I suck at reading my own scripts for pre-recorded content, and wonder if AI me could do a better job. Still working on that.

That really is one of the handful of good use cases for stuff like this:

- Reading text back to yourself

- Having a voice when you lose yours (e.g. when sick)

- Using your younger voice

- Using two versions of your own voice to talk to "each other"

- Using your voice in AI assistants instead of the stock voice

- Using it to keep your pets or children entertained

 


This is going to take mass social engineering to a whole new level.


 


6 hours ago, LAwLz said:

I am very worried by the people who think this can be used to spread fake news.

I am worried because they seem to think that now that AI voice tools are out, they can no longer trust audio clips of X person saying Y. I am worried because you have never been able to trust audio clips, and it sounds like many people have done so up until now.

This is a good thing. It poisons MSM and their corrupt propaganda working in concert with the state apparatus. Deepfakes will force everyone to question everything.

 

A society left unsure and confused via information warfare is far preferable to one that's been brainwashed via the media/state with a coherent narrative.

 

The only antidote to an unholy order is chaos.


I hope this inspired today's WAN Show 😄


I am shocked that it passed the uncanny valley for anyone with an emotional attachment to LTT (since the uncanny valley requires the humanoid to elicit an emotional response from the humans observing it).

 

If it can't fool Luke, it likely won't fool anyone else who has a friendship or emotional connection with the person being mimicked.

 

Also, those text-to-speech inflections were pure 1990s smash-cut voice dubbing, with all the awkward pauses and out-of-cadence tells we've had for generations. Even my wife picked out the fakes, and she's a once-a-month viewer at best.



If I need to phone my bank, I authenticate myself by saying 'my voice is my password'. I'm assuming AI voice cloning renders that tech obsolete.


2 hours ago, GhostRoadieBL said:

I am shocked that it passed the uncanny valley for anyone with an emotional attachment to LTT (since the uncanny valley requires the humanoid to elicit an emotional response from the humans observing it).

 

If it can't fool Luke, it likely won't fool anyone else who has a friendship or emotional connection with the person being mimicked.

 

Also, those text-to-speech inflections were pure 1990s smash-cut voice dubbing, with all the awkward pauses and out-of-cadence tells we've had for generations. Even my wife picked out the fakes, and she's a once-a-month viewer at best.

It's one of the weaker ones I have heard; it can do people like Joe Rogan much better.


4 hours ago, Monkey Dust said:

If I need to phone my bank, I authenticate myself by saying 'my voice is my password'. I'm assuming AI voice cloning renders that tech obsolete.

I can't use that tech; my voice actually changes frequently. I have been asked which part of the UK I am from, and then later in the week, "Are you from the US?" Sometimes I get offered jobs high up in wanky places because at parties I can sound posh; most of the time I sound like a bogan, and occasionally when I am sick or tired I just sound drunk. I have tried to use it for several services, but it always makes me repeat myself about 20 times before sending me to an operator, because it thinks I am an imposter.



7 hours ago, GhostRoadieBL said:

I am shocked that it passed the uncanny valley for anyone with an emotional attachment to LTT (since the uncanny valley requires the humanoid to elicit an emotional response from the humans observing it).

 

If it can't fool Luke, it likely won't fool anyone else who has a friendship or emotional connection with the person being mimicked.

 

Also, those text-to-speech inflections were pure 1990s smash-cut voice dubbing, with all the awkward pauses and out-of-cadence tells we've had for generations. Even my wife picked out the fakes, and she's a once-a-month viewer at best.

 

In general, the best way right now to see if something is TTS or voice conversion is to ask them to laugh at a dumb joke, because no TTS can actually emote a vocalized sound like a laugh or a scream.

 

They can, however, style-transfer emotive tone, like being angry, upset, or cheerful, though I don't believe this particular tech can do that. As with anything that works like DALL-E does, it has to have seen a thing and have a symbol for it, and because humans don't have a written language for emotive tones, it will simply never use them correctly, or at all.

 

For example https://en.wikipedia.org/wiki/Voice_Quality_Symbols

The thing is, you will NOT find, in ASR or in TTS, any mechanism to denote these. It is theoretically possible to make ASR transcribe Voice Quality Symbols (I know VOSK will write out laughing as "hahaha"); it's just that none do. Likewise, people don't write them out either, because they serve no purpose a human would use them for.
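A minimal sketch of what such an ASR post-processing step could look like. To be clear, the "{LAUGH}" tag and the laughter regex below are invented conventions for illustration; neither VOSK nor any standard ASR/TTS pipeline actually defines them:

```python
import re

# Toy ASR post-processor: promote laughter-like tokens (the "hahaha"
# that ASR engines such as VOSK can emit) to an explicit voice-quality
# tag. "{LAUGH}" is a hypothetical convention, not a real standard.
LAUGH = re.compile(r"\b(?:ha|he){2,}h?\b", re.IGNORECASE)

def tag_voice_quality(transcript: str) -> str:
    return LAUGH.sub("{LAUGH}", transcript)

print(tag_voice_quality("hahaha that is a dumb joke"))
# -> "{LAUGH} that is a dumb joke"
```

A TTS trained on transcripts carrying such tags could in principle learn to reproduce the vocalization, which is exactly the mechanism the paragraph above says is missing today.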

 

So coming around full circle on this. 

 

If you don't want your voice stolen, the easiest solution is to make sure your environment won't allow it: pink noise (rain/waves) or non-repeating background music should always flow under your speech. If the background is highly repetitive at the same volume (such as computer fans), it's easy to remove.
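The "keep non-repeating noise under your speech" advice amounts to mixing a masking signal whose level tracks the speech. A toy sketch over raw sample lists (no audio I/O); white noise stands in for pink noise here, and the 20 dB masking level is an arbitrary choice for the demo:

```python
import math
import random

def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def add_masking_noise(speech, noise_db_below_speech=20.0, seed=1234):
    """Mix random noise under the speech at a fixed level relative to
    the speech RMS. Non-repeating noise is hard to subtract cleanly,
    unlike steady-state hum such as fan noise."""
    rng = random.Random(seed)
    target = rms(speech) * 10 ** (-noise_db_below_speech / 20)
    noise = [rng.gauss(0.0, target) for _ in speech]
    return [s + n for s, n in zip(speech, noise)]

# Fake 'speech': one second of a 200 Hz tone at a 16 kHz sample rate.
speech = [0.5 * math.sin(2 * math.pi * 200 * t / 16_000)
          for t in range(16_000)]
masked = add_masking_noise(speech)
print(f"speech RMS: {rms(speech):.3f}, masked RMS: {rms(masked):.3f}")
```

The masked signal barely changes in level, so it stays listenable, but a scraper now gets speech entangled with noise that never repeats and so can't be averaged away.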

 

If you work at a bank or some other contact center that deals with people over the phone, make sure you are listening for the tells. The biggest tell is always going to be that the generated voice can't immediately respond, interrupt, or be interrupted. If you say "okay, read out your phone number", most people will just rattle off their number immediately, e.g. 1-2-3-4-5-6-7-8-9-0. A TTS will either incorrectly read it as 1,234,567,890, or the person typing it in will take a few seconds to make sure it won't rattle it off as "one billion, two hundred thirty-four million, five hundred sixty-seven thousand, eight hundred ninety". And if you interrupt someone saying their name or phone number, they would normally repeat the last thing they said, right?
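That phone-number tell is really a text-normalization problem: a naive TTS front end reads "1234567890" as a cardinal number unless it is told to spell digits. A sketch of both readings (the cardinal reader is deliberately minimal, handling only the scales this demo needs):

```python
DIGITS = "zero one two three four five six seven eight nine".split()

def read_digit_by_digit(number: str) -> str:
    """Digit-at-a-time reading, the way a human rattles off a phone number."""
    return " ".join(DIGITS[int(d)] for d in number if d.isdigit())

ONES = ("zero one two three four five six seven eight nine ten eleven "
        "twelve thirteen fourteen fifteen sixteen seventeen eighteen "
        "nineteen").split()
TENS = ("", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety")

def under_1000(n: int) -> str:
    parts = []
    if n >= 100:
        parts.append(ONES[n // 100] + " hundred")
        n %= 100
    if n >= 20:
        parts.append(TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else ""))
    elif n:
        parts.append(ONES[n])
    return " ".join(parts)

def read_as_cardinal(n: int) -> str:
    """Naive cardinal reading: the default normalization a TTS applies."""
    if n == 0:
        return "zero"
    chunks, scales, i = [], ["", " thousand", " million", " billion"], 0
    while n:
        n, rem = divmod(n, 1000)
        if rem:
            chunks.append(under_1000(rem) + scales[i])
        i += 1
    return ", ".join(reversed(chunks))

print(read_digit_by_digit("123-456-7890"))
# -> "one two three four five six seven eight nine zero"
print(read_as_cardinal(1234567890))
# -> "one billion, two hundred thirty-four million,
#     five hundred sixty-seven thousand, eight hundred ninety"
```

The gap between the two outputs is exactly the "one billion, two hundred thirty-four million..." blunder described above, and why the person driving the TTS hesitates before sending the number.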

 

There are a few more tells that are unlikely to be overcome. Emotive responses are always going to be a big tell; maybe at some point someone will make an ASR model that takes voice quality into account, so that a TTS can then learn it as well, but that's probably low on the tree of deepfakery. Voice conversion is the lowest-hanging fruit, because it lets you translate one voice/language into another while keeping some semblance of the input's emotive state.

 

