
In an Onion-esque headline, Amazon demonstrates mimicking a dead relative's voice to “make memories last”

Summary

 

Amazon announced at its re:MARS conference today that it's working on a feature that can mimic a person's voice from just a minute of reference speech.

 

Quotes

Quote

In a demonstration video, a child said, “Alexa, can Grandma finish reading me the Wizard of Oz?”

Alexa confirmed the request with the default, robotic voice, then immediately switched to a softer, more humanlike tone, seemingly mimicking the child’s family member.

The Alexa team developed a model that allows its voice assistant to produce a high-quality voice with “less than a minute of recorded audio,” Prasad said.

The feature is currently in development, Prasad said. Amazon did not say when the feature will roll out to the public.

 

My thoughts

The first thought that rolled into my mind when reading that headline was "holy **** that's creepy". Then I remembered LTT's video on AI voice synthesis, and that put my nerves at ease. While it seems like a cool idea, I doubt today's tech can replicate a human's voice accurately, let alone with just 60 seconds of reference audio. If Amazon does somehow accomplish accurate voice synthesis, I believe this could be a dangerous tool in the hands of anyone with an Echo.

 

Sources 

CNBC: Amazon demonstrates Alexa mimicking the voice of a deceased relative. https://www.cnbc.com/2022/06/22/amazon-demonstrates-alexa-mimicking-the-voice-of-a-deceased-relative.html


No way this will be abused at all. If it only needs a minute of audio, what's to stop someone from using it to mimic the voice of a person who is still alive?

 

That's the thing with these AI voice systems at the moment: the training voice is the person reading a consent agreement. You can't do that with dead people.

🌲🌲🌲

◒ ◒


Is it just me, or does anyone else think that having an AI say things this person never actually said, in their voice, would not help someone cope with the loss of a loved one in the slightest? I understand voice recordings of the actual person, yes. But this seems a bit... far?


Recording and storing vocal patterns....
Hmmmm. 🤔

"If you ever need anything please don't hesitate to ask someone else first"..... Nirvana
"Whadda ya mean I ain't kind? Just not your kind"..... Megadeth
Speaking of things being "All Inclusive", Hell itself is too.

 


Pairing a dear one's voice with a chatbot seems ripe for abuse, e.g. by automating frauds and scams.
On the plus side, it might have therapeutic applications when used by professionals.

I would like regulations that make it clear one is talking with a computer, e.g. with a periodic reminder: "Dear, I remind you that this is a piece of code mimicking your dead grandma's voice."


14 hours ago, 05032-Mendicant-Bias said:

On the plus side it might have therapeutic applications when used by professionals.

No, it wouldn't.

This would make letting go even harder, maybe to the point of a delusional belief that the loved one is still there in your Echo.

Imagine a child trying to accept the loss of their grandma with this:

Quote

dad: hey son, i know you miss grandma but we're here for you even if she can't be.
kid: hey alexa! make grandma say "i love you"!

alexa: i love you!
kid: see dad, grandma's still here! she's in the alexa!

This can happen without an Alexa; children can attach their loved ones to an inanimate object such as a toy (even more so if the toy was handmade by the loved one), believing that they are living within it.
Getting them to move on from this is already a nightmare even for trained therapists knowledgeable in this area; an Alexa would make this already bad problem even worse!

Regarding the legal and technical area: it's a fucking gray area. Disney used CGI for Peter Cushing in Rogue One; slight issue, he's been dead for over 20 years, so how the hell do we know he's OK with this?!

This tech shouldn't be developed because of how fucked it can be. I don't want my voice being used years after I'm dead for a fucking scam. I am not OK with that.

*Insert Witty Signature here*

System Config: https://au.pcpartpicker.com/list/Tncs9N

 


3 minutes ago, Salv8 (sam) said:

it's a fucking gray area. Disney used CGI for Peter Cushing in Rogue One; slight issue, he's been dead for over 20 years, so how the hell do we know he's OK with this?!

It should be illegal, at least after death, unless permission was given; or there should be rules about recreations, youthful versions, and other stuff, or about generating a character that merely looks very similar.

If they just start firing people, then take their digital "ID" and use it to recreate them in both body and voice, that should break a lot of privacy rules and regulations, plus a lot of shady deals in how it's "accepted".

Back to the EULA episode of South Park: give your soul to buy an iPhone 😛. Also, how much software today abuses this or is less open than it used to be?


I've been assuming for a while that when one of the core cast members of The Simpsons kicks the bucket (Harry Shearer voices Skinner, Burns, Ned Flanders, and a bunch of others, and he's 78; Julie Kavner is 71 and already losing control of the Marge voice), they'll just run 30 years of pre-recorded dialogue through an AI algorithm and carry on like nothing happened.

 

Apparently I was drastically overestimating how challenging that would be, though.

Corps aren't your friends. "Bottleneck calculators" are BS. Only suckers buy based on brand. It's your PC, do what makes you happy.  If your build meets your needs, you don't need anyone else to "rate" it for you. And talking about being part of a "master race" is cringe. Watch this space for further truths people need to hear.

 

Ryzen 7 5800X3D | ASRock X570 PG Velocita | PowerColor Red Devil RX 6900 XT | 4x8GB Crucial Ballistix 3600mt/s CL16


... Honestly I would like that. 

I have a few phone recordings with my mom's voice on them. Would love to be able to hear "her" one more time now that she's been gone for over 2 years.

CPU: AMD Ryzen 3700x / GPU: Asus Radeon RX 6750XT OC 12GB / RAM: Corsair Vengeance LPX 2x8GB DDR4-3200
MOBO: MSI B450m Gaming Plus / NVME: Corsair MP510 240GB / Case: TT Core v21 / PSU: Seasonic 750W / OS: Win 10 Pro


On 6/22/2022 at 4:42 PM, Arika S said:

No way this will be abused at all. If it only needs a minute of audio, what's to stop someone from using it to mimic the voice of a person who is still alive?

 

That's the thing with these AI voice systems at the moment: the training voice is the person reading a consent agreement. You can't do that with dead people.

A minute of audio is not sufficient. Basically, from the way TTS has been evolving, you need either:

A) 1,000 hours of "any" recordings, in any language. The end result is the kind of voice you hear from Microsoft, Google, and Amazon Polly's non-neural voices; basically the average of all sampled recordings. Good enough if you like that 16 kHz, nasally TTS voice.

B) 20 hours of specific recordings, in one language and accent. The end result is the kind of voice Microsoft, Google, and Amazon Polly have for neural voices, but the original back ends of these voices are all based on LibriTTS/LJSpeech/VCTK/etc., basically stuff that already exists, which is why the voices all sound similar and contain no emotional inflection. The output from Amazon Polly, at least, is 24 kHz. However, Amazon Polly's voices are used frequently and are easily identified by others.

C) 10 hours of a single subject, in one language and accent. The end result can range from awful to reasonably "good enough" to sound like a human, until you tell it a joke and it can't laugh.

D) B or C, plus a style-transfer 

 

So if you have 20 hours of someone's voice, or 10 hours of someone's voice in a specific tone/accent, you can style-transfer over that voice. This is how you get "bringing back dead people" using only existing samples of their voice. That said, it doesn't work that well.
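As a rough rule of thumb, those regimes can be sketched as a decision function. This is just my own summary of the ballpark figures above, not anyone's actual pipeline:

```python
def training_regime(hours, single_subject=False, single_language=False):
    """Map available audio to the rough TTS training regimes described
    above. The thresholds are the post's ballpark figures, not hard limits."""
    if single_subject and single_language and hours >= 10:
        return "C: single-subject voice (awful to 'good enough')"
    if single_language and hours >= 20:
        return "B: neural voice built on a stock corpus"
    if hours >= 1000:
        return "A: averaged non-neural voice (16 kHz, nasally)"
    return "insufficient: a short sample can only reskin an existing base"

print(training_regime(1 / 60))  # one minute of audio
```

Option D (a style transfer on top of B or C) is orthogonal to the sample budget, so it isn't modeled here.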

 

Regardless of how the TTS voice is trained, no TTS voice can do the following:

1. Laugh

2. Cry

3. Scream

4. React shockingly

5. Yell/raise their voice

6. Whisper

7. Take on a falsetto

8. Sing*

 

This is because the underlying LibriTTS/LJSpeech/VCTK datasets do not contain this data, and likewise CMUDICT does not contain phonemes for it. In order to have any of these things, there needs to be a way for CMUDICT to indicate that something is a "sound" not intended to be read, but still vocalized with different volumes, pitches, and/or speeds. If someone is trying to go "oh-hoh-hoh-ho-ho!" or "yippiee", writing it out is only going to come out as though someone were reading it the same way you'd read "the quick brown fox jumped over the lazy brown dog."
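To illustrate the CMUDICT limitation, here is a toy grapheme-to-phoneme lookup. The dictionary excerpt is copied from genuine CMU Pronouncing Dictionary entries, but the lookup function is a simplified stand-in for a TTS front end, not any shipping system:

```python
# A few genuine CMU Pronouncing Dictionary entries (ARPAbet phonemes).
CMUDICT_EXCERPT = {
    "THE": ["DH", "AH0"],
    "QUICK": ["K", "W", "IH1", "K"],
    "BROWN": ["B", "R", "AW1", "N"],
    "FOX": ["F", "AA1", "K", "S"],
}

def to_phonemes(word):
    """Return the phoneme list for a word, or None when it is
    out of vocabulary -- which is where laughs, sobs, and
    interjections like 'yippiee' end up."""
    return CMUDICT_EXCERPT.get(word.upper())

print(to_phonemes("fox"))         # ['F', 'AA1', 'K', 'S']
print(to_phonemes("oh-hoh-hoh"))  # None: no phonemes exist for a laugh
```

A real front end falls back to letter-to-sound rules for the `None` case, which is why a written-out laugh gets read flatly instead of vocalized.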

 

Singing is an entirely different kind of training. A singing TTS (e.g. Vocaloid) cannot actually speak, because the way it assembles words is based on pitch; it's an instrument first. It is possible to make a TTS into a singing TTS, but... well, it's easier to just link it: https://github.com/NVIDIA/mellotron

It works, but it's not actually singing; it's just adjusting the pitch and rhythm of "speaking" phonemes into something that isn't quite singing. Basically, it sounds like "talking in rhythm" rather than singing, as it lacks the ostinato that an instrument has.
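That "talking in rhythm" effect can be caricatured in a few lines: quantize natural phoneme durations onto a beat grid, and you get rhythm without real singing. This is a conceptual toy, not Mellotron's actual algorithm:

```python
def snap_to_grid(durations, beat=0.25):
    """Quantize each phoneme duration (in seconds) to the nearest
    multiple of the beat, keeping at least one beat per phoneme."""
    return [max(beat, round(d / beat) * beat) for d in durations]

spoken = [0.12, 0.31, 0.48, 0.09]  # natural, uneven speech timing
print(snap_to_grid(spoken))        # [0.25, 0.25, 0.5, 0.25]
```

The phonemes themselves are unchanged; only their timing (and, in the real model, pitch) is forced onto a grid, which is exactly why the result sounds like rhythmic speech rather than song.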

 

Pretty much what Amazon is doing here is "reskinning" an existing TTS base with whatever you give it as a sample, and the result will be something that has the same timbre as the sampled voice, but not the pitch, speed, or ability to emote. Regardless of whether you train a voice for 1 minute or 1 day, it simply will not sound "as good as Siri", because Siri doesn't do these things either. All it can do is tell jokes with comedic timing, not laugh at one.

 


7 hours ago, Kisai said:

 

Regardless of how the TTS voice is trained, no TTS voice can do the following:

1. Laugh

2. Cry

3. Scream

4. React shockingly

5. Yell/raise their voice

6. Whisper

7. Take on a falsetto

8. Sing*

 

This is because the underlying LibriTTS/LJSpeech/VCTK datasets do not contain this data, and likewise CMUDICT does not contain phonemes for it. In order to have any of these things, there needs to be a way for CMUDICT to indicate that something is a "sound" not intended to be read, but still vocalized with different volumes, pitches, and/or speeds. If someone is trying to go "oh-hoh-hoh-ho-ho!" or "yippiee", writing it out is only going to come out as though someone were reading it the same way you'd read "the quick brown fox jumped over the lazy brown dog."

I was reading an interesting paper a few days ago about using emoji-phoneme pairs to train a TTS model to extrapolate emotive responses from a neutral baseline. If I can find it, I'll edit this post with a link to the paper.


19 hours ago, Caroline said:

Like this is new.

 

Smartphones do it all the time via messaging software and nobody bats an eye. There's no regulation on this and companies can get away with it saying it's to enhance user experience allowing them to spam voicemails in convos instead of texting or some bs like that.

I know that; you missed the sarcasm of my musing, but I also get your points here.

As for hearing a deceased loved one's voice, that's just too damn creepy.

I'll throw in what I've said before: Alexa is just AI spyware, and in the past it would have been classified as spyware. Things like Spybot S&D back then worked against that sort of software, but over time it's been "evolved" into something acceptable these days, though it still performs the same core function it was always intended for.
I mean, if it's always listening to receive a potential command at any time, you know it's actually hearing everything you say, like it or not.


 


How about Norwegian Blue? Does that work for Norwegian Blue?

I edit my posts more often than not


Ah yes, one of my favorite Black Mirror episodes.

Don't ask to ask, just ask... please 🤨

sudo chmod -R 000 /*


15 hours ago, Kisai said:

A minute of audio is not sufficient. Basically, from the way TTS has been evolving, you need either:

A) 1,000 hours of "any" recordings, in any language. The end result is the kind of voice you hear from Microsoft, Google, and Amazon Polly's non-neural voices; basically the average of all sampled recordings. Good enough if you like that 16 kHz, nasally TTS voice.

B) 20 hours of specific recordings, in one language and accent. The end result is the kind of voice Microsoft, Google, and Amazon Polly have for neural voices, but the original back ends of these voices are all based on LibriTTS/LJSpeech/VCTK/etc., basically stuff that already exists, which is why the voices all sound similar and contain no emotional inflection. The output from Amazon Polly, at least, is 24 kHz. However, Amazon Polly's voices are used frequently and are easily identified by others.

C) 10 hours of a single subject, in one language and accent. The end result can range from awful to reasonably "good enough" to sound like a human, until you tell it a joke and it can't laugh.

D) B or C, plus a style transfer

Depends; some of the technologies now are actually getting a lot more adept at figuring out the little details.

 

A good example is NileGreen (now MrGreen). The guy behind it, IIRC, didn't even feed in all of NileRed's videos. Rest assured that companies like Amazon and Google likely have even more powerful systems that haven't been released publicly (or might never be public). Initially, there were actual people who thought NileGreen was actually Nile commenting.

 

Mix that with a baseline reading (of someone else reading, to get relative emphasis on words) so that it can figure out a rough reading style, and you could easily have something that sounds like a loved one.

 

A sample size of one minute I could actually see working. It will still get mannerisms wrong, but it might get to the point where you can flag areas that aren't right, it gives you a demo of different styles, you pick the closest, and it trains that into its network.

3735928559 - Beware of the dead beef


23 hours ago, Salv8 (sam) said:

This tech shouldn't be developed because of how fucked it can be. I don't want my voice being used years after I'm dead for a fucking scam. I am not OK with that.

Trying to legislate what tech can and cannot be developed is an exercise in futility. Computer code is limited only by our ingenuity, and it doesn't care what's written by some old fogeys on pieces of paper.

My eyes see the past…

My camera lens sees the present…


On 6/24/2022 at 2:21 PM, wanderingfool2 said:

 

A sample size of one minute I could actually see working. It will still get mannerisms wrong, but it might get to the point where you can flag areas that aren't right, it gives you a demo of different styles, you pick the closest, and it trains that into its network.

No, really, it does not. If you are training a TTS, you either need hours and hours to make it sound convincingly like someone, or at best you're just style-transferring one voice over the other, which doesn't quite do what you think it does.

 

I went out of my way to try this two nights ago: using just 3 minutes of someone's voice to tune a TTS, the output resembled a PC speaker, not the person. Style transfer, meanwhile, gives a more "real"-sounding voice, but it's barely anything like the subject.


Oh boy the ability to deepfake the voice of a dead Grandma!
What could possibly go wrong?

"Billy, I am quite lonely, I want to see you again."
(Twilight Zone: Long Distance Call)


1 minute ago, DeScruff said:

Oh boy the ability to deepfake the voice of a dead Grandma!
What could possibly go wrong?

"Billy, I am quite lonely, I want to see you again."
(Twilight Zone: Long Distance Call)

*calls up bank of america*

SCAMMER TTS: Hello, my name is (DEAD GRANDMA), and I would like to make a wire transfer to my granddaughter (FAKE NAME)

CS (ASR+TTS): Thank you for calling Bank of America (DEAD GRANDMA), before we can proceed, I need to verify a few quick pieces of information. Is that okay with you?

SCAMMER TTS: Yes, please.

CS (ASR+TTS): What is the billing address that we send your statements to?

SCAMMER TTS: (INFO FOUND ON THE INTERNET OBITUARY)

CS (ASR+TTS): Thank you. And what is the account you would like to transfer from?

SCAMMER TTS: Checking.

CS (ASR+TTS): And what is the account number and routing number of the intended recipient?

SCAMMER TTS: (FAKE NAME's BANK ACCOUNT)

CS (ASR+TTS): Thank you, that transfer will be completed in 3-7 business days. Is there anything else I can help you with?

SCAMMER TTS: No thank you

CS (ASR+TTS): Thank you for calling Bank of America, we appreciate your business and have a nice day.

 

This is the more practical danger: not so much the need to sound authentic, but that two TTS systems will talk to each other without any human involvement on the part of the party most liable for it. Banking should, first and foremost, always be done in person when it involves any amount of money you wouldn't use a debit card for.

 

But more than that, US banking systems are especially vulnerable to confidence scams, even when the CSR is a human. Nobody over the phone knows who you really are. This is exactly why you should never give out your banking information. If your utility company won't take credit, you should consider a different utility. If they are hacked, that bank account information leaks, and the above scenario becomes completely doable.

 

The TTS only makes it more likely to pass the "well, isn't that you on the phone?" verification check, in case someone actually does check. Many of the "voice contract" schemes only need a "yes" to be considered valid.

 

 

 

