The company blog post drips with the enthusiasm of a ’90s US infomercial. WellSaid Labs describes what clients can expect from its “eight new digital voice actors!” Tobin is “energetic and insightful.” Paige is “poised and expressive.” Ava is “polished, self-assured, and professional.”
Each one is based on a real voice actor, whose likeness (with consent) has been preserved using AI. Companies can now license these voices to say whatever they need. They simply feed some text into the voice engine, and out will spool a crisp audio clip of a natural-sounding performance.
WellSaid Labs, a Seattle-based startup that spun out of the research nonprofit Allen Institute of Artificial Intelligence, is the latest firm offering AI voices to clients. For now, it specializes in voices for corporate e-learning videos. Other startups make voices for digital assistants, call center operators, and even video-game characters.
Not too long ago, such deepfake voices had something of a lousy reputation for their use in scam calls and internet trickery. But their improving quality has since piqued the interest of a growing number of companies. Recent breakthroughs in deep learning have made it possible to replicate many of the subtleties of human speech. These voices pause and breathe in all the right places. They can change their style or emotion. You can spot the trick if they speak for too long, but in short audio clips, some have become indistinguishable from humans.
AI voices are also cheap, scalable, and easy to work with. Unlike a recording of a human voice actor, synthetic voices can also update their script in real time, opening up new opportunities to personalize advertising.
But the rise of hyperrealistic fake voices isn’t consequence-free. Human voice actors, in particular, have been left to wonder what this means for their livelihoods.
How to fake a voice
Synthetic voices have been around for a while. But the old ones, including the voices of the original Siri and Alexa, simply glued together words and sounds to achieve a clunky, robotic effect. Getting them to sound any more natural was a laborious manual task.
Deep learning changed that. Voice developers no longer needed to dictate the exact pacing, pronunciation, or intonation of the generated speech. Instead, they could feed a few hours of audio into an algorithm and have the algorithm learn those patterns on its own.
“If I’m Pizza Hut, I certainly can’t sound like Domino’s, and I certainly can’t sound like Papa John’s.”
Rupal Patel, founder and CEO of VocaliD
Over the years, researchers have used this basic idea to build voice engines that are more and more sophisticated. The one WellSaid Labs constructed, for example, uses two primary deep-learning models. The first predicts, from a passage of text, the broad strokes of what a speaker will sound like—including accent, pitch, and timbre. The second fills in the details, including breaths and the way the voice resonates in its environment.
Making a convincing synthetic voice takes more than just pressing a button, however. Part of what makes a human voice so human is its inconsistency, expressiveness, and ability to deliver the same lines in completely different styles, depending on the context.
Capturing these nuances involves finding the right voice actors to supply the appropriate training data and fine-tune the deep-learning models. WellSaid says the process requires at least an hour or two of audio and a few weeks of labor to develop a realistic-sounding synthetic replica.
AI voices have grown particularly popular among brands looking to maintain a consistent sound in millions of interactions with customers. With the ubiquity of smart speakers today, and the rise of automated customer service agents as well as digital assistants embedded in cars and smart devices, brands may need to produce upwards of a hundred hours of audio a month. But they also no longer want to use the generic voices offered by traditional text-to-speech technology—a trend that accelerated during the pandemic as more and more customers skipped in-store interactions to engage with companies virtually.
“If I’m Pizza Hut, I certainly can’t sound like Domino’s, and I certainly can’t sound like Papa John’s,” says Rupal Patel, a professor at Northeastern University and the founder and CEO of VocaliD, which promises to build custom voices that match a company’s brand identity. “These brands have thought about their colors. They’ve thought about their fonts. Now they’ve got to start thinking about the way their voice sounds as well.”
Whereas companies used to have to hire different voice actors for different markets—the Northeast versus Southern US, or France versus Mexico—some voice AI firms can manipulate the accent or switch the language of a single voice in different ways. This opens up the possibility of adapting ads on streaming platforms depending on who is listening, changing not just the characteristics of the voice but also the words being spoken. A beer ad could tell a listener to stop by a different pub depending on whether it’s playing in New York or Toronto, for example. Resemble.ai, which designs voices for ads and smart assistants, says it’s already working with clients to launch such personalized audio ads on Spotify and Pandora.
The gaming and entertainment industries are also seeing the benefits. Sonantic, a firm that specializes in emotive voices that can laugh and cry or whisper and shout, works with video-game makers and animation studios to supply the voice-overs for their characters. Many of its clients use the synthesized voices only in pre-production and switch to real voice actors for the final production. But Sonantic says a few have started using them throughout the process, perhaps for characters with fewer lines. Resemble.ai and others have also worked with film and TV shows to patch up actors’ performances when words get garbled or mispronounced.
But there are limitations to how far AI can go. It’s still difficult to maintain the realism of a voice over the long stretches of time that might be required for an audiobook or podcast. And there’s little ability to control an AI voice’s performance in the same way a director can guide a human performer. “We’re still in the early days of synthetic speech,” says Zohaib Ahmed, the founder and CEO of Resemble.ai, comparing it to the days when CGI technology was used primarily for touch-ups rather than to create entirely new worlds from green screens.
A human touch
In other words, human voice actors aren’t going away just yet. Expressive, creative, and long-form projects are still best done by humans. And for every synthetic voice made by these companies, a voice actor also needs to supply the original training data.
But some actors have grown increasingly worried about their livelihoods, says a spokesperson at SAG-AFTRA, the union representing voice actors in the US. If they’re not afraid of being automated away by AI, they’re worried about being compensated unfairly or losing control over their voices, which constitute their brand and reputation.
Some companies are looking to be more accountable in how they engage with the voice-acting industry. The best ones, says SAG-AFTRA’s rep, have approached the union to figure out the best way to compensate and respect voice actors for their work.
Several now use a profit-sharing model to pay actors every time a client licenses their specific synthetic voice, which has opened up a new stream of passive income. Others involve the actors in the process of designing their AI likeness and give them veto power over the projects it will be used in. SAG-AFTRA is also pushing for legislation to protect actors from illegitimate replicas of their voice.
But for VocaliD’s Patel, the point of AI voices is ultimately not to replicate human performance or to automate away existing voice-over work. Instead, the promise is that they could open up entirely new possibilities. What if in the future, she says, synthetic voices could be used to rapidly adapt online educational materials to different audiences? “If you’re trying to reach, let’s say, an inner-city group of kids, wouldn’t it be great if that voice actually sounded like it was from their community?”