Is it possible to clone a voice? How does it work? How do we do it ethically? In this exclusive article, Damian Kearns interviews the company behind the Emmy Award-winning voice cloning software Respeecher and tests the technology for himself. The results have to be heard to be believed. If you’ve ever wondered how your favourite film and TV characters have been de-aged, read this!
The Inspiration
When I listened to Luke Skywalker’s voice in ‘The Mandalorian’ I couldn’t say for sure how the sound team managed to recreate the timbre of Mark Hamill’s voice from 40 years ago so precisely. Having recorded several individuals consistently over the course of decades, I have first-hand experience of how voices, both female and male, change over time. I know this timbral shift can’t be masked with current software offerings so I did a little reading across the internet and found a few articles, like this one, that provided insight and directed me towards software called Respeecher. Respeecher is the voice cloning software that was used to digitally de-age Mark Hamill’s voice in The Mandalorian.
I was extremely curious so I requested and was quickly granted a meeting with Respeecher’s founder and CEO, Alex Serdiuk, to find out more about the company and the technology. It was an extremely educational Zoom call and I was impressed to find out they had just received an Emmy Award trophy for their work on their groundbreaking projects.
Full disclosure: When I pressed him, Alex declined to comment on their involvement in The Book of Boba Fett at this time, mentioning that the production company will probably make an explanatory ‘making-of’ piece like they did for The Mandalorian. He did confirm that they are credited in episode 6.
How Does it Work?
Think of Respeecher as ‘voice swapping’ technology. A cloned voice profile is built from a specific person’s voice in one of several ways: from existing audio recordings, by talking into a hot mic, or by reading scripts. These recordings are used to train their artificial intelligence models (neural networks, as they are often called) to replicate the timbre of that voice. Neural networks are used across a diverse range of audio software these days, as is evidenced in some of the more sophisticated noise reduction software, voice manipulation programs and even reverb modelling programs.
Supervising sound editor for The Mandalorian, Season 2, Matthew Wood, explained in an interview how the tech works.
"It's a neural network you feed information into and it learns," he said. "So I had archival material from Mark [Hamill] in that era. We had clean recorded ADR from the original [Star Wars] films, a book on tape he'd done from those eras (sic), and then also Star Wars radio plays he had done back in that time. I was able to get clean recordings of that, feed it into the system, and they were able to slice it up and feed their neural network to learn this data."
Alex Serdiuk from Respeecher continues the thought:
“Once receiving the data for a target voice (the voice needed to be replicated), Respeecher trains a quite heavy, deep neural net for about two weeks. The system learns the voice to be able to replicate it from another voice input. Respeecher’s technology is designed to convey all the emotions from a source speaker, changing the timbre to a desired one.”
After the voice model is constructed, an actor, producer, director, editor or other sound engineer can simply record performances of the lines they want to construct the way they want them performed, and send the files off to Respeecher. Proximity to the microphone and vocal performance do matter, since pitch, proximity and inflection are taken from the source speaker and conveyed by the voice model.
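For the technically curious, here is a minimal sketch of what ‘pitch and proximity taken from the source speaker’ could mean in signal terms. It is my own illustration in Python, not Respeecher’s method, which has not been published in this detail. I assume that loudness (RMS level) is a fair stand-in for mic proximity and that a simple autocorrelation peak is a fair stand-in for pitch tracking. The helper names, the 16 kHz sample rate and the sine-wave ‘voice’ are all hypothetical.

```python
import math

SAMPLE_RATE = 16_000  # assumed sample rate for this illustration


def make_tone(freq_hz, gain, seconds=0.1, rate=SAMPLE_RATE):
    """Synthesise a sine wave standing in for a short stretch of voiced speech."""
    n = round(seconds * rate)
    return [gain * math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n)]


def rms_level(frame):
    """Root-mean-square level: a crude proxy for loudness, and so for proximity."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def estimate_pitch(frame, rate=SAMPLE_RATE, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency from the strongest autocorrelation lag."""
    min_lag = int(rate / fmax)
    max_lag = int(rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Compare the frame with a copy of itself delayed by `lag` samples.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return rate / best_lag


def describe(frame):
    """Collect the two source-side cues discussed in the text."""
    return {"pitch_hz": estimate_pitch(frame), "level": rms_level(frame)}


if __name__ == "__main__":
    far = make_tone(200.0, gain=0.1)   # same 'performance', further from the mic
    near = make_tone(200.0, gain=0.5)  # same 'performance', closer to the mic
    print(describe(far))
    print(describe(near))
```

The two test tones share a pitch but differ in level, which is roughly the situation when the same line is performed at two distances from the microphone. A real voice conversion system would track far richer features than these two numbers.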
This is not standalone software that can be downloaded and used locally. Respeecher’s base of operations handles all the processing and sends back the files once the voice clone versions have been created. The results are ‘transformative’, to say the least.
Respeecher provides a user interface and has API integrations with some clients looking for greater control. Most feature and TV work is done manually, though, as Respeecher’s sound team can tweak models to reach the best possible quality and meet creative expectations.
A Demonstration
If you’re curious, check out Respeecher Voice Marketplace to hear some examples of exactly what this technology can accomplish. Scroll down the page a little, and you can select voices by gender, age, and pitch to hear exactly how each voice has been applied to a sample file.
I was permitted to try out Respeecher Voice Marketplace and it was without a doubt one of the most fantastical audio moments of my career. I have long dreamed of being able to modulate someone else’s voice using my own and to some degree, I’ve done it here and there to help ADR or dialogue edits achieve a desired inflection. This is something different though. My voice was no longer my voice, it was just a tool to push another voice in the direction I wanted it to go.
Before I began, Alex’s advice to me was…
“To make sure you use the VM in the most efficient way - You should recalibrate the system each time you change voice or recording conditions. And also - there is a pitch control embedded, feel free to use it to pick the best tone in conversions.”
Good advice, though in this case, I really didn’t change much since I used my voice and the same mic for my experiments.
I picked the maximum 5 voices from the selections in Respeecher Voice Marketplace and chose not to do any pitch correction for the purpose of these tests. I swapped the 5 voices for 5 more a couple of times, just to hear the different choices. I’ll freely admit it was a lot of fun and I spent far longer than I needed to on these experiments because there was so much joy in testing everything.
Have a listen to the files I made using the ‘TakeBaker’ web browser interface. What you’re hearing here is a selection of the models I applied after creating randomly spoken test files using my Sennheiser MKH 416.* What I hear in the clones is my original inflection and the proximity changes I made as I was speaking. Stunning! It’s not just human voice models I tried either. Check me out as a rooster; they also have cats and dogs as options.
*For these experiments, I decided to go with my own speech but there is a ‘text to speech’ option available.
The second recording I made started 10 feet away from the microphone and I walked straight up to it over the duration. Amazingly, the apparent proximity of the human models changed somewhat, depending on both my own proximity and the model in use. The room reverb isn’t replicated but I certainly felt the models getting closer, as I did. I wouldn’t want the room anyway, as I’d want to add that in the mix.
At some point I really didn’t know how my Canadian accent was factoring into the modelling so I decided to do my impression of my Irish Grandad to see how Respeecher handles accents. Joe’s accent, it must be said, was somewhat diluted over the decades he spent living in England so he didn’t have a wholly Irish accent but my ‘voice model’ has long been held up by my family as a nearly perfect copy. I figured this was the only politically correct accent I could muster, all in the name of software testing, since I pay homage to my ancestry every time I do this particular impression and I loved my Grandad very much. I was shocked to hear that the models don’t care which accent I have, as they take on any accent they are given. As it turns out, Respeecher is also completely language agnostic, meaning the clones will speak with any accent, in any language. And the models even mimic my breathing!
Some of the criticism I read about Respeecher’s ability to synthesise Mark Hamill’s voice patterns can be addressed right here with my examples. I can tell you for certain that Respeecher copied my speech pattern and my inflection, the way I performed it. There are some artefacts here and there that would need to be addressed with alternate performances and possibly dialogue editing tools but I believe the voices I chose that are closer to my speaking range sound quite clean. It’s all usable and I must say, the timbral contrast between the clone voices and my own voice is surreal. I even stacked some of the different voices on top of each other during playback to create some unique vocal processing. Feel free to try this with the test materials I’ve provided.
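If you would rather stack the voices offline than in a playback session, the idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: each voice is a mono track already decoded to a list of floating-point samples between -1.0 and 1.0, all at the same sample rate, and the function name `stack_voices` and the `headroom` value are hypothetical.

```python
def stack_voices(tracks, headroom=0.9):
    """Sum mono tracks sample by sample, then scale down if the mix would clip.

    Assumes every track uses the same sample rate and float samples in [-1, 1].
    Shorter tracks are treated as silent once they run out.
    """
    if not tracks:
        return []
    length = max(len(track) for track in tracks)
    mixed = [0.0] * length
    for track in tracks:
        for i, sample in enumerate(track):
            mixed[i] += sample
    if not mixed:
        return mixed
    peak = max(abs(sample) for sample in mixed)
    if peak > headroom:
        # Pull the whole mix down so the loudest sample sits at the headroom level.
        scale = headroom / peak
        mixed = [sample * scale for sample in mixed]
    return mixed


if __name__ == "__main__":
    voice_a = [0.5, -0.5]
    voice_b = [0.5, 0.5, 0.2]
    print(stack_voices([voice_a, voice_b]))
```

Scaling the whole mix, rather than clamping individual samples, keeps the balance between the stacked voices intact. Any DAW does the same job with faders.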
Browser-based interfaces like the one used for this demo can be a little slow and are likely nowhere near the quality of the full Respeecher treatment, but this one is effective not only as a demonstration of what Respeecher can do but also as a potential tool for filling in voices for groups where recording group walla isn’t practical. For $200 per month, having access to this range of cloned voice timbres might well be suitable for certain productions; for instance, productions where groups of voice actors of a specific ethnicity cannot be found. Hiring a single person with the right speaking pattern could mean the rest of a group sound can be filled out inoffensively and with cultural sensitivity. The question is, is this approach ethical?
The Necessary Ethics Conversation
There’s a tightrope the people behind Respeecher feel must be walked with this sort of technology. One imagines dictators, politicians, thieves, con artists, and bullies of all sorts would love to be able to achieve their damaging, nefarious goals, armed with such imitative power. But then again, in the wrong hands, any of humanity’s inventions can be used as weapons. So the potential for weaponizing something, if you ask me, isn’t a reason not to create it, if the vast majority of people have no such malevolent desires.
There’s the suggestion as well that voice talent might have reason to worry their livelihoods are in jeopardy. Again, Respeecher isn’t about replacing the voiceover actor, it’s about cloning for the purpose of resurrection or construction and about achieving creative goals.
As Alex says:
“Voice Marketplace becomes appreciated by voice actors, as they are no longer limited to their vocal timbre. They can take different jobs, and at the end of the day, their acting and ability to perform is what matters. You can also check this podcast with a well known voice actress, Anne Ganguzza, where they speak with me about voice cloning specifically for voice actors. Also, the tech allows voice actors to sound exactly like they sounded many years ago.
From the wheel, to the steam engine, to the radio, to TV, to the internet, bad people have managed to pervert inventions that have otherwise pushed social evolution forward; harnessing our brightest ideas to work against the better nature shared by the majority of people.”
To quote the Respeecher webpage:
“Some ethical questions about synthetic speech are easy, but others are hard. We don't just rely on our gut to tell us what is right. This set of principles guides our decision making.
Respeecher does not allow any deceptive uses of our technology.
Respeecher does not use voices without permission when this could impact the privacy of the subject or their ability to make a living. In practice, this means we will never use the voice of a private person or an actor without permission. In a handful of cases we have used the voices of historical figures such as Richard Nixon and Barack Obama without permission but non-deceptively and for the purposes of showing what the technology can do. While we will listen to requests, we are generally not open to doing new projects of this nature.
It's all well and good to have strong principles, but how can we ensure that they are not violated?
Respeecher does not provide any public API for creating new voices.
Respeecher works directly with clients we trust.
Respeecher requires written consent from voice owners.
Respeecher only approves projects that meet our strict standards.
Respeecher is developing watermarking technology that allows us to easily tell Respeecher-generated content from other content, even if it is disguised by being mixed in with other audio.”
Beyond this, Respeecher, as a company, is actively involved in several ethics-focussed groups. For more on this, CEO Alex Serdiuk has elaborated exclusively for this article:
“Respeecher has been proactive in building the right ethical guidelines and frameworks around synthetic speech since we started. We have strict rules and one can’t use our services if they don’t have permission from a voice owner. Also, Voice Marketplace doesn’t allow introducing any target voice. Users are limited to the voices from our library.
Technology itself is neutral - it is neither good nor bad. Technology is an instrument, which like many other instruments can be used for bad purposes to fool or harm people.
As a leader in the synthetic speech space we participate in a number of worldwide initiatives to help different industries adopt new technologies in the most ethical way. Respeecher participates in Open Voice Network (Linux Foundation), Synthetic Futures, DEG committees to name a few.
Also we are launching a new initiative in order to unite the greatest minds for creating the best synthetic speech detector. More news on this is coming soon.
We speak at a lot of conferences and public events, to educate society that technology like ours could and would be used by bad actors and we should double check information we receive in the media, to be reasonably critical about what is presented as facts in media channels. This is not new to us, fake news is not wholly about “deepfake” technologies, it is about lies and manipulation - a quite common concept that has followed humans for millennia.”
Final Thoughts
So, what’s my take on the ethics of this technology?
I think Respeecher is a fantastic audio development. Sure, the voice cloning technology itself has the potential for harm but inside the highly controlled usage model this company has built, AI is merely a conduit for creative endeavour. Voice talent need not fear, as Respeecher’s goal isn’t to replace or replicate actively working human beings. But who knows? Someone might want to clone their voice so that when they retire, they can continue to entertain and potentially earn money for decades to come.
Other companies offer text-to-speech or voice replacement strategies that are limited in scope but the idea that we are changing our voices to something else is still at the heart of these alternate offerings. The issue I have with many of the competing voice replacement strategies is the relative ease of access to the tech. In some cases, it’s simply too easy to pretend to be someone else and that’s not only less ethical, it can lead to more complicated social situations involving crime or identity theft.
This company’s creative control makes Respeecher the superior voice cloning tool, occupying the ethical high ground more people should aspire to. I think they’re doing everything right. They’ve made a singular technological contribution and instead of selling it to the world, Respeecher has said, ‘come to us’. With written permission from the voice provider and a decent set of original recordings in place, Respeecher can give you back something quite unique, quite special; new, fresh words, created by capturing echoes from the past.
In this sense, Respeecher defines what is best about our human technology: It is a tool created to help us do something new and different, and it is held firmly in principled hands, the right ones for this job.
If you’re interested in learning more about Respeecher, head over to respeecher.com and read everything you can, including the testimonials. I’m really glad I took the time to delve into this technology and I’d like to thank Alex and his company for contributing and providing assistance for this article. It certainly was a voyage of discovery for me.