Linus AI Voice Clone

AnujSaharan

I've been training my own GAN-based TTS, and now diffusion-based TTS, models for quite a while (the eventual goal is to have a 'teacher' model teach a cloned voice how to sing and rap, fwiw). I've seen the guys try a couple of different models zero-shot to clone their voices - it's never quite hit, so I'm trying to fix that.

Here's a super early preliminary attempt on a ~500M-parameter TTS model fine-tuned with Linus' voice from the most recent WAN Show. Only fine-tuned it for ~15 minutes on a single 3080, so it's very undertrained obviously - can probably get much better with more time. 🙂

Just making a thread to track progress until it sings.

Novel text from a random The Verge review to test Linus' voice against:

[screenshot of the test passage]

Generated Audio:

The model is autoregressive, a la GPT-2 and Tortoise. Based on the speech and words it's seen before, it may change the emotional tone, add pauses, substitute different words, etc. while generating, depending on the training data and the preceding text - for example, it added "i.e." and uhhs and umms near the end of the clip on its own. I straight copy-pasted the highlighted text above.
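Those improvised ums and pauses fall out of temperature sampling in the autoregressive loop: each token is drawn from a distribution conditioned on everything generated so far, so a plausible filler can win the draw. A toy numpy sketch of the decoding loop (not the actual model - `step_fn` is a stand-in for the network's forward pass):

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, rng=None):
    """Draw one token id from softmax(logits / temperature)."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = logits / temperature
    scaled -= scaled.max()            # subtract max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def generate(step_fn, prompt_tokens, max_new=20, temperature=0.8, seed=0):
    """Autoregressive decoding: every new token is conditioned on all
    previous ones, which is where the improvised fillers come from."""
    rng = np.random.default_rng(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        logits = step_fn(tokens)      # model forward pass (stubbed here)
        tokens.append(sample_next_token(logits, temperature, rng))
    return tokens
```

Higher temperature makes those improvisations more frequent; as temperature approaches 0 it degenerates to greedy decoding and sticks much closer to the script.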

Rate on a scale of 1-10 in its current state?

---

If you're interested in TTS btw - I post some experiments on my Twitter - (Anuj Saharan (@theAnujSaharan) / Twitter).


7. Try training for 1hr+ and try again. There is a very odd stutter before "put the two".

I try to be a human, but I cannot, because I have returned to monke.

Spoiler

Hehe boi

Spoiler

POV- when it can run crysis-

( ͡° ͜ʖ ͡°)


Have you thought to check with Linus as to whether he's comfortable with his voice being cloned or no? If no, then 0. Should get folks' permission first.

Intel HEDT and Server platform enthusiasts: Intel HEDT Xeon/i7 Megathread

Main PC

CPU: i9 7980XE @4.5GHz/1.22v/-2 AVX offset

Cooler: EKWB Supremacy Block - custom loop w/360mm + 280mm rads

Motherboard: EVGA X299 Dark

RAM: 4x8GB HyperX Predator DDR4 @3200MHz CL16

GPU: Nvidia FE 2060 Super/Corsair HydroX 2070 FE block

Storage: 1TB MP34 + 1TB 970 Evo + 500GB Atom30 + 250GB 960 Evo

Optical Drives: LG WH14NS40

PSU: EVGA 1600W T2

Case & Fans: Corsair 750D Airflow - 3x Noctua iPPC NF-F12 + 4x Noctua iPPC NF-A14 PWM

OS: Windows 11

Display: LG 27UK650-W (4K 60Hz IPS panel)

Mouse: EVGA X17

Keyboard: Corsair K55 RGB

Mobile/Work Devices: 2020 M1 MacBook Air (work computer) - iPhone 13 Pro Max - Apple Watch S3

Other Misc Devices: iPod Video (Gen 5.5E, 128GB SD card swap, running Rockbox), Nintendo Switch


16 hours ago, Zando_ said:

Have you thought to check with Linus as to whether he's comfortable with his voice being cloned or no? If no, then 0. Should get folks' permission first.

Waiting for Linus to copyright his voice and likeness.

^^^^ That's my post ^^^^
<-- This is me --- That's your scrollbar -->
vvvv Who's there? vvvv


Of course, I am happy to stop experimenting and building if he doesn't approve or is uncomfortable - very obviously I haven't posted or distributed the model itself or any inference scripts, for privacy reasons. Will let @nicklmg or someone from the team make that call. If this gets good enough, I'm happy to even share the model for video editing voiceovers, dubbing into foreign languages, and whatever other use cases come up.

Although I will say: similar technology is already out there on the web. Anyone can take a small snippet and try zero-shot cloning on ElevenLabs or something like that - irrespective of whether the end result is any good - and that'd be fully anonymous sources cloning with fully untraceable models that live behind at least LLC-level protection. Taking offence on someone else's behalf at the output of a model is a conversation that goes far beyond this forum thread - it applies to GPT, DALL-E, Stable Diffusion, etc. (all of which are in the open domain and easily accessible to everyone) - and I am happy to follow wherever that public discourse goes.


29 minutes ago, AnujSaharan said:

Will let @nicklmg or someone from the team make that call.

Could also just ask @LinusTech

I'm not actually trying to be as grumpy as it seems.

I will find your mentions of Ikea or Gnome and I will /s post.

Project Hot Box

CPU 13900k, Motherboard Gigabyte Aorus Elite AX, RAM CORSAIR Vengeance 4x16GB 5200MHz, GPU Zotac RTX 4090 Trinity OC, Case Fractal Pop Air XL, Storage Sabrent Rocket Q4 2TB, CORSAIR Force Series MP510 1920GB NVMe, CORSAIR Force Series MP510 960GB NVMe, PSU CORSAIR HX1000i, Cooling Corsair XC8 CPU block, Bykski GPU block, 360mm and 280mm radiators, Displays Odyssey G9, LG 34UC98-W 34-inch, Keyboard Mountain Everest Max, Mouse Mountain Makalu 67, Sound AT2035, Massdrop 6XX headphones, GoXLR

Oppbevaring

CPU i9-9900k, Motherboard ASUS ROG Maximus Code XI, RAM 48GB Corsair Vengeance LPX 3200MHz (2x16GB + 2x8GB), GPUs Asus ROG Strix 2070 8GB, PNY 1080, Nvidia 1080, Case Mining Frame, Storage 2x Samsung 860 Evo 500GB, PSU Corsair RM1000x and RM850x, Cooling Asus ROG Ryuo 240 with Noctua NF-F12 fans

Why is the 5800x so hot?


It's a start, but the pace and tone make it sound more like somebody else doing a Linus impression.

I threw your WAV at Audacity, sped the clip up by 6.9%, then sped the tempo up a further 8%, and manually tightened up some of the weird pauses it put in the middle of sentences. Here's the result:
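For anyone who'd rather script that than click through Audacity: a plain speed change is just resampling (pitch rises along with speed), which a short numpy sketch can do; a pitch-preserving tempo change like Audacity's "Change Tempo" needs a real time-stretcher (e.g. `librosa.effects.time_stretch`). The 6.9% figure is just the value from this post:

```python
import numpy as np

def change_speed(samples, factor):
    """Speed up (factor > 1) or slow down (factor < 1) a mono clip by
    linear-interpolation resampling. Pitch shifts along with speed; a
    pitch-preserving tempo change needs a phase-vocoder time-stretch."""
    n_out = int(round(len(samples) / factor))
    old_idx = np.arange(len(samples))
    new_idx = np.linspace(0, len(samples) - 1, n_out)
    return np.interp(new_idx, old_idx, samples)

# e.g. the 6.9% speed-up from the post:
# faster = change_speed(clip, 1.069)
```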

It still has problems choosing appropriate pacing and inflection like all other AI-generated speech, but it's a good start! I want to sic this on Majel Barrett and Lorenzo Music's voices.

AI-generated WAN Show Forever let's gooooooooooo

I sold my soul for ProSupport.


3 hours ago, AnujSaharan said:

Although I will say: similar technology is already out there on the web. Anyone can take a small snippet and try zero-shot cloning on ElevenLabs or something like that - irrespective of whether the end result is any good - and that'd be fully anonymous sources cloning with fully untraceable models that live behind at least LLC-level protection. Taking offence on someone else's behalf at the output of a model is a conversation that goes far beyond this forum thread - it applies to GPT, DALL-E, Stable Diffusion, etc. (all of which are in the open domain and easily accessible to everyone) - and I am happy to follow wherever that public discourse goes.

"Someone else would do the wrong thing if I didn't first" is a poor argument. Ask folks' permission first. This isn't some crazy new ground we're treading as a society. Disney has already reproduced dead actors' likenesses (face and voice), only after permission/license from their estates (as obviously they weren't around to ask). Not a wild stretch to expect the same for the living.

4 hours ago, LogicalDrm said:

Waiting for Linus to copyright his voice and likeness.

Yeah... even remotely public figures are going to have to start doing that, aren't they? :/



On 4/8/2023 at 1:56 PM, Needfuldoer said:

It's a start, but the pace and tone make it sound more like somebody else doing a Linus impression.

I threw your WAV at Audacity, sped the clip up by 6.9%, then sped the tempo up a further 8%, and manually tightened up some of the weird pauses it put in the middle of sentences. Here's the result:

It still has problems choosing appropriate pacing and inflection like all other AI-generated speech, but it's a good start! I want to sic this on Majel Barrett and Lorenzo Music's voices.

Seems to be fine on the tempo now, thanks for that callout.

First line from the new video to test - "It looks like a children's toy but it's actually one of the most versatile hacking tools to ever hit the market. And if you've been on TikTok in the last six months, there's a good chance you've seen people using it to change gas station signs, set off department store PA systems and open up Tesla charging ports."

The base model is meant to sound more 'conversational' than a presenter voice, and that's what's reflected here. The WAN Show fine-tune data is also conversational, unscripted audio, so the model makes its own choices about where to take breaths, pauses, etc. (it adds ums and ahs even though they're not explicitly in the sentence it should be generating), which is obviously uncharacteristic of edited audio like all the videos on the channel - so the comparison isn't apples to apples.

WAN Show conversation also isn't as high-energy or fast-tempo. That said, it did learn to speed up and sample better with just a little more training. I can try making the fine-tune dataset more diverse later for better results.
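One way a more diverse fine-tune set could be cut from long recordings is a crude energy-based splitter that keeps speech segments and drops silence. This is a hedged numpy sketch, not anything from the actual pipeline - real setups use a proper voice-activity detector, and the thresholds here are made up:

```python
import numpy as np

def split_on_silence(samples, rate, frame_ms=30, threshold=0.01, min_clip_s=1.0):
    """Return (start, end) sample indices of segments whose per-frame RMS
    energy stays above `threshold`, skipping anything shorter than
    `min_clip_s` seconds. A crude stand-in for a real VAD."""
    frame = int(rate * frame_ms / 1000)
    n = len(samples) // frame
    # per-frame RMS energy over non-overlapping frames
    rms = np.sqrt(np.mean(samples[:n * frame].reshape(n, frame) ** 2, axis=1))
    speech = rms > threshold
    clips, start = [], None
    for i, is_speech in enumerate(speech):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            if (i - start) * frame / rate >= min_clip_s:
                clips.append((start * frame, i * frame))
            start = None
    if start is not None and (n - start) * frame / rate >= min_clip_s:
        clips.append((start * frame, n * frame))
    return clips
```

Each returned (start, end) pair can then be written out as a separate training clip and paired with a transcript.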


21 hours ago, Zando_ said:

"Someone else would do the wrong thing if I didn't first" is a poor argument. Ask folks' permission first. This isn't some crazy new ground we're treading as a society. Disney has already reproduced dead actors' likenesses (face and voice), only after permission/license from their estates (as obviously they weren't around to ask). Not a wild stretch to expect the same for the living.

Disney made money from it and publicly distributed the likeness for monetary gain. I would be in the wrong if I were publicly sharing checkpoints and inference scripts for someone else's voice myself - I fully agree with you. I have no plans to do that.

Asked for permission above - if unacceptable, happy to stop posting the little snippets.


Now all we need is a Luke bot voice.

AI-generated WAN Show Forever let's gooooo!

I sold my soul for ProSupport.

