Can smart VO talent learn from an intelligent computer?

David Goldberg, Edge Studio

Feb 19 2016

A recent New York Times article reported that IBM linguists, engineers and marketers began in 2009 to determine how they should best design the synthesized voice of Watson, their state-of-the-art Artificial Intelligence computer. What sort of voice would be most pleasing? What should be its “personality”? Stuff like that.

Since then, there has been remarkable progress in the field of voice synthesis, but we doubt the voiceover community at large needs to worry about job security anytime soon. On the other hand, there are things that a live, human talent can learn from the IBM team’s findings.

Where does voice synthesis stand as of 2016?

Although you might be fooled for a bit – say in a weather forecast or driving instructions, even the narration of some nature videos found on YouTube – before long, you’ll realize that you’re listening to a computerized voice. (Bear in mind, we’re talking about purely synthesized voicing, not concatenation of words and phrases spoken by a real person.)

As researchers near their ultimate goal, new difficulties emerge. In 1970, a robotics researcher predicted that as animations closely approached being humanlike, they would be seen as kind of creepy. And, in fact, now that animation technology has reached that benchmark, you’ve probably noted that the prediction was correct – if you’re not expecting animation, or aren’t used to the effect, watching a near-perfect animated human being can be unsettling.

Near-perfect synthetic speech has the same effect. It’s sometimes unnerving.

But when speech is from an actual human, what’s not to like? By definition, it is “perfect,” right? Well, yes. And no – at least when the human is reading a script but pretending to speak from the heart. It can come across as unnatural, even if the listener is not quite aware of what the unnatural qualities are.

Here are some of the concerns that computer scientists must deal with, and how they translate to you …

Pronunciation. Unlike a computer, you’re likely to get virtually all of the pronunciation correct. IBM had to quickly create a huge pronunciation database, whereas you’ve been building a pronunciation database all your life.

But it probably contains a few data errors.

When you get a word wrong here or there, your read suffers multiple effects. First, if the listener is aware of the error, it hurts your credibility. Mispronunciation also breaks the listener’s concentration, as they think about your mistake instead of what you say next. It can also disrupt your own concentration, if you’re not quite sure how to say the word. The fix: Look up any word or phrase you’re not familiar with, and if it’s a popular turn of phrase, be sure you (and the scriptwriter) have it right.

Flow and phrasing. The Times article didn’t mention this, but it almost goes without saying. Speech synthesis issues aside, we’re a long way from the point where an artificial intelligence is sentient, at least to the extent where it actually knows what it is saying. So how can it know what words to hit, where to pause, what inflections to use and where, where to speed up, and how best to apply all the other techniques that a professional voice artist uses every day? The lesson here: Observe punctuation and the like, but don’t do it robotically. Don’t do anything in your read robotically. Your judgment and understanding of the listening situation, the material, and your client’s goals play a major role in any read.

Type of voice. As you know, the world of voiceover has a virtually unlimited range of voice types – because each person’s voice is unique, not just in its vocal qualities, but in the speaker’s typical combination of intonation, personality, accent, and other inherent personal qualities. In seeking a prototype for Watson, IBM chose from among 25 voice actors, then tweaked their choice, even toying with making the actor sound like a child. They rejected that, also rejecting hyper-enthusiasm, and finally settled on a sound that was slow, steady, optimistic, and “even a bit peppy.” Your lesson from this? Although surely we don’t want to oversimplify this connection, it seems like what coaches so often advise VO artists-in-training:

Slow down, and smile.

That said, there are certain voice qualities that people generally find annoying. Some are physically inherent, whether the result of genetics, misuse or disease. For example, although Robert Kennedy, Jr., is well worth listening to for what he says, he has a rare vocal condition called spasmodic dysphonia that makes him sound stilted, extremely strained and hard to listen to, at least until you’ve become used to it. Defining “annoyance” most broadly, this category would also include people who inherently mumble, or have a glottal stop on every initial vowel, or are otherwise disconcerting or difficult to understand. To correct an extreme difficulty (which may or may not be possible), they may require the help of an expert coach. Other annoying vocal qualities are more easily self-fixable. For example, whining or excessive slobbering in a character voice can be a turn-off – the answer is to moderate such qualities, use them sparingly, or find an alternative approach.

This relates to computer-generated voicing in a couple of ways. First, just as not every computer is able to support Artificial Intelligence and voice synthesis, not every person is right for voiceover. It could be a matter of their physical limitation. But it’s also a matter of programmability. Just as not every computer is so easily programmable for certain tasks, not everyone is able to take direction.

Odds are, since you’ve read this far, you have been found to have an appropriate “platform” for voiceover. What you need to do then — what every VO professional needs to do — is acquire and regularly update your “programming.”

Emotion. The NY Times article states, “Today, even with all the progress, it is not possible to completely represent rich emotions in human speech via artificial intelligence.” So there’s your cue: incorporate emotion in your read. Expressing emotion appropriately may require training and practice, but compared to a computer, it should come relatively easily for you. This capability is one of the major differences between an “announcer” and a voice actor (the latter being what desirable, sophisticated clients want).

The expression of emotion is appropriate in a vast range of VO genres. In audiobooks, of course. In narrating a nature or corporate video, the emotion may be more subdued, but it should be there. And it’s not just about statements like, “The zebra finally dies” or “We’re a great place to work.” Virtually every statement has an implication – it’s a development of the “story” – calling for emotional expression, even if the situation is not so obvious. For example, in telephony, you should empathize with the caller’s need, or deliver an on-hold message with pride. In a museum tour, you might convey wonder, amazement, even disdain, as a particular line in the script requires. In commercials, … well, you get the point.

So, although your emotional expression should be subtle, progressing from one feeling to another as you move through the script, you have a great range of emotion to choose from. The Times article notes that experimenters have employed “huge databases of human emotions embedded in speech.” The emphasis is ours. If your emotional range consists of, oh, four varieties, consider expanding it as part of your regular practice, with guidance from your coach.

As the article notes, the problem in getting genuine, appropriate emotional expression from a computer is this: Scientists haven’t yet figured out how to tell a computer to “say this with feeling.”

Well, you’ve just been told.

ADDITIONAL READING:

Creating a Computer Voice That People Like

By John Markoff. Feb 14, 2016.
http://www.nytimes.com/2016/02/15/technology/creating-a-computer-voice-t…
