Production Expert


Fraunhofer Intelligibility Meter Used In Nuendo 11

Steinberg has included an Intelligibility Meter in Nuendo 11. This follows iZotope including an intelligibility meter in their metering suite, Insight 2. It turns out that Steinberg’s Intelligibility Meter has been developed by the very clever people at Fraunhofer. In this article, we investigate what intelligibility is and why it matters, and explore how the new Fraunhofer meter used in Steinberg’s Nuendo 11 works.

Scientists at Fraunhofer IDMT are developing solutions to improve the intelligibility of speech in media. These include algorithms for the automatic measurement, evaluation and presentation of speech intelligibility, which facilitate the work of sound engineers in film, game and audio productions. With the implementation of intelligibility technology in Steinberg's current production solution Nuendo, a traffic light now indicates how well the spoken words will be objectively received by listeners.

Older people, in particular, find it difficult or tiring to follow the spoken contributions in a film, radio or TV series. To ensure that dialogue in media can always be mixed so that listeners can easily follow what is being said, Steinberg has equipped the latest version 11 of its production solution Nuendo with a new module based on algorithms developed by Fraunhofer IDMT.

When It's In The Red - They Are Mumbling

The new feature analyses incoming audio signals via a speech intelligibility model with automatic speech recognition technology and calculates how much effort the listener must put in to understand the spoken words within the mix. Dr. Jan Rennies-Hochmuth, Head of Personalized Hearing Systems at Fraunhofer IDMT explains…

“The Intelligibility Meter tool measures objective speech intelligibility in media production in real time, controlled by artificial intelligence. It is the result of several years of hearing research in Oldenburg.”

Timo Wildenhain, Head of ProAudio at Steinberg Media Technologies GmbH continues…

“In order to do justice to different hearing preferences and hearing losses associated with demographic change, it is important for producers and sound designers to be able to objectively approximate the actual hearing ability of end-users. We are pleased to offer our customers the Intelligibility Meter, a major enhancement to the Nuendo product.”

The method developed at Fraunhofer IDMT is based on neural networks and enables automatic, target group-specific measurement of speech intelligibility across applications. This means it supports not only sound creators in post-production; it can also be used directly on set or in the field of sound reinforcement. Dr. Jan Rennies-Hochmuth explains…

“The technology is also of interest to broadcasters, content providers and manufacturers from the home entertainment, consumer electronics and telecommunications sectors. This is because we are addressing the key factors for poor speech intelligibility throughout the entire production and distribution chain and want to break down existing barriers for the greatest possible variety of target groups, applications and listening situations with the help of innovative software technologies.”

Speech Intelligibility FAQ

Are speech intelligibility and ease of listening the same thing?

In a strict sense, speech intelligibility is measured as the proportion of speech items (e.g. words) that can be recognized correctly in a given situation. More broadly, the term "intelligibility" is often used to describe the perceived effort one has to spend to understand speech. This is also relevant for broadcast applications, because even if you are technically able to understand every word of a dialog, you may still have to invest a lot of cognitive resources, e.g., when the background sounds are too loud. This broader sense of speech intelligibility is what is measured with Nuendo’s new tool.

What "characteristics" of the speech are being considered to decide if it is intelligible or not?

Speech consists of small building blocks, so-called phonemes. Several phonemes combine into syllables and words. Phonemes are what automatic speech recognition engines detect and convert into meaningful speech. In very clear speech, there is only a single phoneme at a given instant in time. In technical terms, a machine trained to recognize speech detects a high probability for the presence of a specific phoneme and a low probability for all other phonemes. The more disturbed the speech, the less distinct this probability distribution becomes: the machine is less certain which phoneme is present. This is what is used to quantify intelligibility.
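To make that last step concrete, here is a minimal Python sketch that treats each analysis frame as a probability distribution over phonemes and uses normalised entropy as a stand-in "certainty" score. Both the frame format and the entropy metric are illustrative assumptions for the sake of the example, not Fraunhofer's actual model.

```python
import numpy as np

def phoneme_certainty(posteriors):
    """Average per-frame certainty of a phoneme recognizer.

    posteriors: array of shape (frames, phonemes), where each row is a
    probability distribution over phonemes for one analysis frame.
    Returns a value in [0, 1]: close to 1 for distinct (one-hot-like)
    distributions, close to 0 for completely uncertain (uniform) ones.
    The entropy-based measure here is an illustrative assumption.
    """
    eps = 1e-12  # avoid log(0)
    entropy = -np.sum(posteriors * np.log2(posteriors + eps), axis=1)
    max_entropy = np.log2(posteriors.shape[1])  # entropy of a uniform distribution
    return float(np.mean(1.0 - entropy / max_entropy))

# Clear speech: one phoneme dominates the frame -> high certainty.
clear = np.array([[0.97, 0.01, 0.01, 0.01]])
# Heavily masked speech: the recognizer cannot decide -> low certainty.
noisy = np.array([[0.25, 0.25, 0.25, 0.25]])
print(phoneme_certainty(clear) > phoneme_certainty(noisy))  # True
```

The intuition matches the text above: the flatter the recognizer's probability distribution, the more "disturbed" the speech, and the lower the intelligibility score.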

How do you train the AI algorithm?

The algorithm has to perform different tasks. First, it must detect if speech is present or not. This sounds trivial but is a challenging issue when considering how diverse and “speech-like” broadcast background sounds can be. Automatic speech recognition technology is used to compute how certain the recognizer is to detect individual phonemes. Finally, this certainty is mapped to a scale that corresponds to human perception as measured in hundreds of hours of listening experiments. For all this to work robustly, deep learning was exploited with many thousand hours of training material with real speech and highly challenging backgrounds.
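The three stages described above (speech detection, phoneme certainty, and mapping to a perceptual scale) can be sketched as a single rating function. The thresholds and the traffic-light mapping below are hypothetical placeholders chosen for illustration, not the values the Nuendo meter actually uses.

```python
def rate_intelligibility(speech_prob, phoneme_certainty,
                         speech_threshold=0.5, green=0.7, yellow=0.4):
    """Map recognizer outputs for one analysis window to a traffic light.

    speech_prob: probability that speech is present in the window.
    phoneme_certainty: how confidently individual phonemes were detected.
    All thresholds are illustrative, not the meter's real calibration.
    """
    # Stage 1: only rate windows that actually contain speech.
    if speech_prob < speech_threshold:
        return "off"      # nothing to rate
    # Stages 2 and 3: map phoneme certainty onto a perceptual scale.
    if phoneme_certainty >= green:
        return "green"    # easy to follow
    if phoneme_certainty >= yellow:
        return "yellow"   # noticeable listening effort
    return "red"          # they are mumbling

print(rate_intelligibility(0.9, 0.8))  # green
print(rate_intelligibility(0.9, 0.2))  # red
```

In the real system, the mapping from recognizer certainty to the perceptual scale is learned from the listening experiments mentioned above rather than fixed thresholds like these.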
