
In an Onion-esque headline, Amazon demonstrates mimicking a dead relative's voice to “make memories last”

Summary

 

Amazon announced at its re:MARS conference today that it's working on a feature that can mimic a person's voice from just a minute of reference speech.

 

Quotes

Quote

In a demonstration video, a child said, “Alexa, can Grandma finish reading me the Wizard of Oz?”

Alexa confirmed the request with the default, robotic voice, then immediately switched to a softer, more humanlike tone, seemingly mimicking the child’s family member.

The Alexa team developed a model that allows its voice assistant to produce a high-quality voice with “less than a minute of recorded audio,” Prasad said.

The feature is currently in development, Prasad said. Amazon did not say when the feature will roll out to the public.

 

My thoughts

The first thought that rolled into my mind when reading that headline was "holy **** that's creepy". Then I remembered LTT's video on AI voice synthesis, and that put my nerves at ease. While it seems like a cool idea, I doubt today's tech can replicate a human's voice accurately, let alone with just 60 seconds of reference audio. If Amazon does somehow accomplish accurate voice synthesis, I believe this could be a dangerous tool in the hands of anyone with an Echo.

 

Sources 

CNBC: Amazon demonstrates Alexa mimicking the voice of a deceased relative. https://www.cnbc.com/2022/06/22/amazon-demonstrates-alexa-mimicking-the-voice-of-a-deceased-relative.html


No way this will be abused at all. If it only needs a minute of audio, what's to stop someone from using it to mimic the voice of a person who is still alive?

 

That's the thing with these AI voice systems at the moment: the training voice is the person reading a consent agreement. You can't do that with dead people.

🌲🌲🌲

◒ ◒


Is it just me, or does anyone else think that having an AI say things this person never actually said, in their voice, would not help someone cope with the loss of a loved one in the slightest? I understand voice recordings of the actual person, yes. But this seems a bit... far?


Recording and storing vocal patterns....
Hmmmm. 🤔

"If you ever need anything please don't hesitate to ask someone else first"..... Nirvana
"Whadda ya mean I ain't kind? Just not your kind"..... Megadeth
Speaking of things being "All Inclusive", Hell itself is too.

 


Pairing a dear one's voice with a chatbot seems ripe for abuse, e.g. by automating frauds and scams.
On the plus side, it might have therapeutic applications when used by professionals.

I would like regulations that make it clear one is talking with a computer, e.g. with a periodic reminder: "Dear, I remind you that this is a piece of code mimicking your dead grandma's voice."


14 hours ago, 05032-Mendicant-Bias said:

On the plus side it might have therapeutic applications when used by professionals.

No, it wouldn't.

This would make letting go even harder, maybe to the point of a delusional belief that the loved one is still there in your Echo.

Imagine a child trying to accept the loss of their grandma with this:

Quote

dad: hey son, i know you miss grandma but we're here for you even if she can't be.
kid: hey alexa! make grandma say "i love you"!

alexa: i love you!
kid: see dad, grandma's still here! she's in the alexa!

This can happen without an Alexa; children can attach their loved ones to an inanimate object such as a toy (even more so if the toy was handmade by the loved one), believing that they are living within it.
Getting them to move on from this is already a nightmare even for trained therapists knowledgeable in this area; an Alexa would make this already bad problem even worse!

Regarding the legal and technical area: it's a fucking gray area. Disney used CGI for Peter Cushing in Rogue One; slight issue, he's been dead for over 20 years, so how the hell do we know he's OK with this?!

This tech shouldn't be developed because of how fucked it can be. I don't want my voice being used years after I'm dead for a fucking scam. I am not OK with that.

*Insert Witty Signature here*

System Config: https://au.pcpartpicker.com/list/Tncs9N

 


3 minutes ago, Salv8 (sam) said:

it's a fucking gray area. Disney used CGI for Peter Cushing in Rogue One; slight issue, he's been dead for over 20 years, so how the hell do we know he's OK with this?!

It should be illegal, at least after death, unless permission was given; or there should be rules about recreations, youthful versions, and other stuff, or about generating a character that merely looks very similar.

If they just start firing people, then take their digital "ID" and use it to recreate them in both body and voice, that should break a lot of privacy rules and regulations, plus a lot of shady deals in how it's "accepted".

Back to the EULA episode of South Park: give your soul to buy an iPhone 😛. Also, how much software today abuses this or is less open than it used to be?


I've been assuming for a while that when one of the core cast members of The Simpsons kicks the bucket (Harry Shearer voices Skinner, Burns, Ned Flanders, and a bunch of others, and he's 78; Julie Kavner is 71 and already losing control of the Marge voice), they'll just run 30 years of pre-recorded dialogue through an AI algorithm and carry on like nothing happened.

 

Apparently I was drastically overestimating how challenging that would be, though.

Corps aren't your friends. "Bottleneck calculators" are BS. Only suckers buy based on brand. It's your PC, do what makes you happy.  If your build meets your needs, you don't need anyone else to "rate" it for you. And talking about being part of a "master race" is cringe. Watch this space for further truths people need to hear.

 

Ryzen 7 5800X3D | ASRock X570 PG Velocita | PowerColor Red Devil RX 6900 XT | 4x8GB Crucial Ballistix 3600mt/s CL16


... Honestly I would like that. 

I have a few phone recordings with my mom's voice on them. Would love to be able to hear "her" one more time now that she's been gone for over 2 years.

CPU: AMD Ryzen 3700x / GPU: Asus Radeon RX 6750XT OC 12GB / RAM: Corsair Vengeance LPX 2x8GB DDR4-3200
MOBO: MSI B450m Gaming Plus / NVME: Corsair MP510 240GB / Case: TT Core v21 / PSU: Seasonic 750W / OS: Win 10 Pro


On 6/22/2022 at 4:42 PM, Arika S said:

No way this will be abused at all. If it only needs a minute of audio, what's to stop someone from using it to mimic the voice of a person who is still alive?

 

That's the thing with these AI voice systems at the moment: the training voice is the person reading a consent agreement. You can't do that with dead people.

A minute of audio is not sufficient. Basically, from the way TTS has been evolving, you need either:

A) 1,000 hours of "any" recordings, in any language. The end result is the kind of voice you hear from Microsoft, Google, and Amazon Polly's non-neural voices; basically the average of all sampled recordings. Good enough if you like that 16 kHz, nasally TTS voice.

B) 20 hours of specific recordings, in one language and accent. The end result is the kind of voice Microsoft, Google, and Amazon Polly have for neural voices, but the original back ends of these voices are all based on LibriTTS/LJSpeech/VCTK/etc., basically stuff that already exists, which is why the voices all sound similar and contain no emotional inflection. The output from Amazon Polly, at least, is 24 kHz. However, Amazon Polly's voices are used frequently and are easily identified by others.

C) 10 hours of a single subject, in one language and accent. The end result can range from awful to reasonably "good enough" to sound like a human, until you tell it a joke and it can't laugh.

D) B or C, plus a style-transfer 

 

So if you have 20 hours of someone's voice, or 10 hours of someone's voice in a specific tone/accent, you can style-transfer over that voice. This is how you get "bringing back dead people" using only existing samples of their voice. That said, it doesn't work that well.
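As a rough rule of thumb, those regimes can be sketched as a decision function. This is just my own summary of the ballpark figures above, not anyone's actual pipeline:

```python
def training_regime(hours, single_subject=False, single_language=False):
    """Map available audio to the rough TTS training regimes described
    above. The thresholds are the post's ballpark figures, not hard limits."""
    if single_subject and single_language and hours >= 10:
        return "C: single-subject voice (awful to 'good enough')"
    if single_language and hours >= 20:
        return "B: neural voice built on a stock corpus"
    if hours >= 1000:
        return "A: averaged non-neural voice (16 kHz, nasally)"
    return "insufficient: a short sample can only reskin an existing base"

print(training_regime(1 / 60))  # one minute of audio
```

Option D (a style transfer on top of B or C) is orthogonal to the sample budget, so it isn't modeled here.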

 

Regardless of how the TTS voice is trained, no TTS voice can do the following:

1. Laugh

2. Cry

3. Scream

4. React shockingly

5. Yell/raise their voice

6. Whisper

7. Take on a falsetto

8. Sing*

 

This is because the underlying LibriTTS/LJSpeech/VCTK datasets do not contain this data, and likewise CMUDICT does not contain phonemes for it. In order to have any of these things, there needs to be a way for CMUDICT to indicate that something is a "sound" not intended to be read, but still vocalized with different volumes, pitches, and/or speeds. If someone is trying to go "oh-hoh-hoh-ho-ho!" or "yippiee", writing it out is only going to come out as though someone were reading it the same way you'd read "the quick brown fox jumped over the lazy brown dog."
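To illustrate the CMUDICT limitation, here is a toy grapheme-to-phoneme lookup. The dictionary excerpt is copied from genuine CMU Pronouncing Dictionary entries, but the lookup function is a simplified stand-in for a TTS front end, not any shipping system:

```python
# A few genuine CMU Pronouncing Dictionary entries (ARPAbet phonemes).
CMUDICT_EXCERPT = {
    "THE": ["DH", "AH0"],
    "QUICK": ["K", "W", "IH1", "K"],
    "BROWN": ["B", "R", "AW1", "N"],
    "FOX": ["F", "AA1", "K", "S"],
}

def to_phonemes(word):
    """Return the phoneme list for a word, or None when it is
    out of vocabulary -- which is where laughs, sobs, and
    interjections like 'yippiee' end up."""
    return CMUDICT_EXCERPT.get(word.upper())

print(to_phonemes("fox"))         # ['F', 'AA1', 'K', 'S']
print(to_phonemes("oh-hoh-hoh"))  # None: no phonemes exist for a laugh
```

A real front end falls back to letter-to-sound rules for the `None` case, which is why a written-out laugh gets read flatly instead of vocalized.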

 

Singing is an entirely different kind of training. A singing TTS (e.g. Vocaloid) cannot actually speak, because the way it assembles words is based on pitch; it's an instrument first. It is possible to make a TTS into a singing TTS, but... well, it's easier to just link it: https://github.com/NVIDIA/mellotron

It works, but it's not actually singing; it's just adjusting the pitch and rhythm of "speaking" phonemes into something that isn't quite singing. Basically, it sounds like "talking in rhythm" rather than singing, as it lacks the ostinato that an instrument has.
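That "talking in rhythm" effect can be caricatured in a few lines: quantize natural phoneme durations onto a beat grid, and you get rhythm without real singing. This is a conceptual toy, not Mellotron's actual algorithm:

```python
def snap_to_grid(durations, beat=0.25):
    """Quantize each phoneme duration (in seconds) to the nearest
    multiple of the beat, keeping at least one beat per phoneme."""
    return [max(beat, round(d / beat) * beat) for d in durations]

spoken = [0.12, 0.31, 0.48, 0.09]  # natural, uneven speech timing
print(snap_to_grid(spoken))        # [0.25, 0.25, 0.5, 0.25]
```

The phonemes themselves are unchanged; only their timing (and, in the real model, pitch) is forced onto a grid, which is exactly why the result sounds like rhythmic speech rather than song.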

 

Pretty much what Amazon is doing here is "reskinning" an existing TTS base with whatever you give it as a sample, and the result will be something that has the same timbre as the sampled voice, but not the pitch, speed, or ability to emote. Regardless of whether you train a voice for 1 minute or 1 day, it simply will not sound "as good as Siri", because Siri doesn't do these things either. All it can do is tell jokes with comedic timing, not laugh at one.

 


7 hours ago, Kisai said:

 

Regardless of how the TTS voice is trained, no TTS voice can do the following:

1. Laugh

2. Cry

3. Scream

4. React shockingly

5. Yell/raise their voice

6. Whisper

7. Take on a falsetto

8. Sing*

 

This is because the underlying LibriTTS/LJSpeech/VCTK datasets do not contain this data, and likewise CMUDICT does not contain phonemes for it. In order to have any of these things, there needs to be a way for CMUDICT to indicate that something is a "sound" not intended to be read, but still vocalized with different volumes, pitches, and/or speeds. If someone is trying to go "oh-hoh-hoh-ho-ho!" or "yippiee", writing it out is only going to come out as though someone were reading it the same way you'd read "the quick brown fox jumped over the lazy brown dog."

I was reading an interesting paper a few days ago about using emoji-phoneme pairs to train a TTS model to extrapolate emotive responses from a neutral baseline. If I can find it, I'll edit this post with a link to the paper.


19 hours ago, Caroline said:

Like this is new.

 

Smartphones do it all the time via messaging software and nobody bats an eye. There's no regulation on this and companies can get away with it saying it's to enhance user experience allowing them to spam voicemails in convos instead of texting or some bs like that.

I know that; you missed the sarcasm of my musing, but I also get your points here.

As for hearing a deceased loved one's voice, that's just too damn creepy.

I'll throw in what I've said before: Alexa is just AI spyware, and in the past it would have been classified as spyware. Things like Spybot S&D back then worked against that sort of software, but over time it's been "evolved" into something acceptable these days, though it still performs the same core function it was always intended for.
I mean, if it's always listening to receive a potential command at any time, you know it's actually hearing everything you say, like it or not.


 


How about Norwegian Blue? Does that work for Norwegian Blue?

I edit my posts more often than not


Ah yes, one of my favorite Black Mirror episodes.

Don't ask to ask, just ask... please 🤨

sudo chmod -R 000 /*


15 hours ago, Kisai said:

A minute of audio is not sufficient. Basically, from the way TTS has been evolving, you need either:

A) 1,000 hours of "any" recordings, in any language. The end result is the kind of voice you hear from Microsoft, Google, and Amazon Polly's non-neural voices; basically the average of all sampled recordings. Good enough if you like that 16 kHz, nasally TTS voice.

B) 20 hours of specific recordings, in one language and accent. The end result is the kind of voice Microsoft, Google, and Amazon Polly have for neural voices, but the original back ends of these voices are all based on LibriTTS/LJSpeech/VCTK/etc., basically stuff that already exists, which is why the voices all sound similar and contain no emotional inflection. The output from Amazon Polly, at least, is 24 kHz. However, Amazon Polly's voices are used frequently and are easily identified by others.

C) 10 hours of a single subject, in one language and accent. The end result can range from awful to reasonably "good enough" to sound like a human, until you tell it a joke and it can't laugh.

D) B or C, plus a style transfer

Depends; some of the technologies now are actually getting a lot more adept at figuring out the little details.

 

A good example is NileGreen (now MrGreen). The guy behind it, IIRC, didn't even feed in all of NileRed's videos. Rest assured that companies like Amazon and Google likely have even more powerful systems that haven't been released publicly (or might never be public). Initially, there were actual people who thought NileGreen was actually Nile commenting.

 

Mix that with a baseline reading (of someone else reading, to get relative emphasis on words) so that it can figure out a rough reading style, and you could easily have something that sounds like a loved one.

 

A sample size of one minute I could actually see working. It will still get mannerisms wrong, but it might get to the point where you can flag areas that aren't right, it gives you a demo of different styles, you pick the closest, and it trains that into its network.

3735928559 - Beware of the dead beef


23 hours ago, Salv8 (sam) said:

This tech shouldn't be developed because of how fucked it can be. I don't want my voice being used years after I'm dead for a fucking scam. I am not OK with that.

Trying to legislate what tech can and cannot be developed is an exercise in futility. Computer code is limited only by our ingenuity, and it doesn't care what's written by some old fogeys on pieces of paper.

My eyes see the past…

My camera lens sees the present…


On 6/24/2022 at 2:21 PM, wanderingfool2 said:

 

A sample size of one minute I could actually see working. It will still get mannerisms wrong, but it might get to the point where you can flag areas that aren't right, it gives you a demo of different styles, you pick the closest, and it trains that into its network.

No, really, it does not. If you are training a TTS, you either need hours and hours to make it sound convincingly like someone, or at best you're just style-transferring one voice over the other, which doesn't quite do what you think it does.

 

I went out of my way to try this two nights ago: using just 3 minutes of someone's voice to tune a TTS, the output resembled a PC speaker, not the person. Style transfer, meanwhile, gives a more "real"-sounding voice, but it's barely anything like the subject.


Oh boy the ability to deepfake the voice of a dead Grandma!
What could possibly go wrong?

"Billy, I am quite lonely, I want to see you again."
(Twilight Zone: Long Distance Call)


1 minute ago, DeScruff said:

Oh boy the ability to deepfake the voice of a dead Grandma!
What could possibly go wrong?

"Billy, I am quite lonely, I want to see you again."
(Twilight Zone: Long Distance Call)

*calls up bank of america*

SCAMMER TTS: Hello, my name is (DEAD GRANDMA), and I would like to make a wire transfer to my granddaughter (FAKE NAME)

CS (ASR+TTS): Thank you for calling Bank of America (DEAD GRANDMA), before we can proceed, I need to verify a few quick pieces of information. Is that okay with you?

SCAMMER TTS: Yes, please.

CS (ASR+TTS): What is the billing address that we send your statements to?

SCAMMER TTS: (INFO FOUND ON THE INTERNET OBITUARY)

CS (ASR+TTS): Thank you. And what is the account you would like to transfer from?

SCAMMER TTS: Checking.

CS (ASR+TTS): And what is the account number and routing number of the intended recipient?

SCAMMER TTS: (FAKE NAME's BANK ACCOUNT)

CS (ASR+TTS): Thank you, that transfer will be completed in 3-7 business days. Is there anything else I can help you with?

SCAMMER TTS: No thank you

CS (ASR+TTS): Thank you for calling Bank of America, we appreciate your business and have a nice day.

 

This is the more practical danger: not so much the need to sound authentic, but that two TTS systems will talk to each other without any human involvement on the part of the party most liable for it. Banking should, first and foremost, always be done in person when it involves any amount of money you wouldn't use a debit card for.

 

But more than that, US banking systems are especially vulnerable to confidence scams, even when the CSR is a human. Nobody over the phone knows who you really are. This is exactly why you should never give out your banking information. If your utility company won't take credit, you should consider a different utility. If they are hacked, that bank account information leaks, and the above scenario becomes completely doable.

 

The TTS only makes it more likely to pass the "well, isn't that you on the phone?" verification check, in case someone actually does check. Many of the "voice contract" schemes only need a "yes" to be considered valid.

 

 

 

