Production Expert


Netflix Loudness And Dynamic Range Developments Update

Netflix continues its work to improve the experience for its subscribers across a variety of devices. In this article, drawn, with permission, from a post on the Netflix Technology Blog, we explore how they measure program dynamic range and then use dialog normalisation and dynamic range reduction to improve the experience for subscribers, especially those using mobile devices.

Dialogue Levels and Dynamic Range

To understand loudness management and dynamic range control, it is important to understand what we are controlling. As an example, let’s start with the waveform of a program, shown below in Figure 1.

Figure 1. Example program waveform

Figure 2. Dynamic range of a program with some examples

One of the ways to measure a program’s dynamic range is to break the waveform into half-second segments and compute the RMS level of each segment in dBFS. The summary of those measurements can be plotted on a single vertical line, as shown to the right in Figure 2.

In this example, the ambient sound of a campfire may be up to 60 dB quieter than the exploding car in an action scene. The dynamic range of a program is defined as the difference between its quietest and loudest sounds. So in this example, we would say that the program has a dynamic range of 60 dB.
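The measurement described above can be sketched in a few lines of Python. This is only an illustration of the half-second RMS idea, not Netflix's actual tooling: the function names, the assumption that samples are floats between -1.0 and 1.0, and the -120 dBFS floor for silent segments are all our own choices.

```python
import math

def segment_levels_dbfs(samples, sample_rate, segment_seconds=0.5, floor_dbfs=-120.0):
    """Return the RMS level, in dBFS, of each consecutive segment.

    `samples` are floats in the range -1.0..1.0, where 1.0 is full scale.
    A trailing partial segment is ignored. Silent segments are reported
    as `floor_dbfs` (an arbitrary choice) to avoid taking log10(0).
    """
    segment_length = int(sample_rate * segment_seconds)
    if segment_length <= 0:
        raise ValueError("segment must contain at least one sample")
    levels = []
    for start in range(0, len(samples) - segment_length + 1, segment_length):
        segment = samples[start:start + segment_length]
        mean_square = sum(s * s for s in segment) / segment_length
        rms = math.sqrt(mean_square)
        levels.append(20.0 * math.log10(rms) if rms > 0 else floor_dbfs)
    return levels

def dynamic_range_db(levels):
    """Difference between the loudest and quietest segment levels."""
    return max(levels) - min(levels)
```

Fed with half a second of signal at an RMS of 0.001 followed by half a second at 0.1, this would report levels of about -60 dBFS and -20 dBFS, and so a dynamic range of about 40 dB.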

Loudness is defined as the subjective perception of sound pressure. Although it is largely correlated with sound pressure level, it is also affected by the duration and spectral makeup of the sound. 

There is research that shows that, in cinematic and television content, the dialogue level is a crucial element to viewers’ perception of a program’s loudness.

At this point, we need to highlight that there are two ways to measure and analyse program loudness. One is to measure the full program loudness and then use that data to normalise different content so that they all play at the same loudness. A side effect of using this method is that the dialogue loudness can vary from program to program, whilst the overall loudness will be the same.

The second method, which Netflix has chosen, as we explained in our article Has Netflix Turned The Clock Back 10 Years Or Is Their New Loudness Delivery Spec A Stroke Of Genius?, is to measure the dialogue loudness using the Dolby dialogue-gated system and to use that measurement, rather than the overall loudness, to normalise programs. With this method, the dialogue loudness remains the same from program to program. However, the overall program loudness will often vary, for example when you go from an action movie to a speech-based news program or documentary.
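To make the difference between the two methods concrete, here is a small Python sketch. The dialogue levels are borrowed from the action film and documentary examples used later in this article; the full-program loudness values and the -24 target are hypothetical numbers we have picked purely for illustration.

```python
def normalise(program, reference, target):
    """Apply one static gain so that program[reference] lands on target.

    Returns the (full_program, dialogue) loudness after the gain is applied.
    """
    gain = target - program[reference]
    return program["full_program"] + gain, program["dialogue"] + gain

# Dialogue levels follow the article's examples; the full-program values
# are hypothetical and exist only to show the effect.
action = {"full_program": -20.0, "dialogue": -27.0}
documentary = {"full_program": -22.0, "dialogue": -20.0}

TARGET = -24.0  # hypothetical target, same for both methods

for name, program in (("action", action), ("documentary", documentary)):
    by_full = normalise(program, "full_program", TARGET)
    by_dialogue = normalise(program, "dialogue", TARGET)
    print(name, "full-program method:", by_full, "dialogue method:", by_dialogue)
```

With these numbers, full-program normalisation leaves the dialogue 9 dB apart between the two shows, while dialogue normalisation matches the dialogue and leaves the overall loudness 9 dB apart instead.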

Most broadcasters around the world use a BS.1770-based full-program normalisation system, with delivery specs like EBU R128 and ATSC A/85. But a growing number of OTT providers, like Netflix, are turning to dialogue normalisation because, for them, dialogue level is the critical component of program loudness. In the graphics in this article, the dialogue level is indicated with a bold black line, as in Figure 2.

When mixed, not every program has the same dialogue level or the same dynamic range. Figure 3 shows a variety of dialogue levels and dynamic ranges for different programs from Netflix.

Figure 3. Typical dynamic range and dialogue levels of a variety of content. Black lines indicate average dialogue level; red and yellow are used for louder/softer sounds.

The action film contains dialogue at -27 dBFS, leaving headroom for loud effects like explosions. On the other hand, the live concert has a relatively small dynamic range, with dialogue near the top of the mix. Other shows have varying dialogue levels and varying dynamic ranges.

Now, imagine you were watching these shows, one after the other. If you switched from the action show to the live concert, you would almost certainly find yourself diving for the volume control to turn it down! Then, when the drama comes on, you might not be able to understand the dialogue until you turn the volume back up. If you were to switch partway through shows, the effect might be even more pronounced. This is what Netflix aims to resolve with dialogue-gated loudness normalisation.

Loudness Management

The goal of loudness management is to play all programs at a consistent loudness, relative to each other, to normalise them. When it is working effectively, once you set your volume to a comfortable level, you should never have to change it, even as you switch from a movie to a documentary, to a live concert. Netflix specifically aims to play all dialogue at the same level. 

The loudness metrics of all Netflix content are measured before encoding. Since their goal is to play all dialogue at the same level, they use anchor-based (dialogue) measurement. The measured dialog level is delivered in MPEG-D DRC metadata in the xHE-AAC bitstream, using the ‘anchorLoudness’ metadata set. In the example from Figure 3, the action show would have an anchorLoudness of -27 dBFS; the documentary, -20 dBFS.

Netflix now streams Extended HE-AAC with MPEG-D DRC (xHE-AAC) to compatible Android mobile devices running Android 9 or later. With its capability to improve intelligibility in noisy environments, adapt to variable cellular connections, and scale to studio quality, xHE-AAC, using MPEG-D DRC metadata, helps Netflix subscribers who stream on these devices get the best experience possible.

On Android devices, Netflix uses KEY_AAC_DRC_TARGET_REFERENCE_LEVEL to set the output level. The decoder applies a gain equal to the difference between the output level and the anchorLoudness metadata, to normalise all content such that dialogue is always output at the same level. In Figure 4, the output level is set to -27 dBFS, with content that has a higher anchor loudness attenuated accordingly.
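The gain calculation is simple enough to sketch in Python. The function names are our own. The second function reflects our reading of the Android MediaFormat documentation, which describes the value passed with KEY_AAC_DRC_TARGET_REFERENCE_LEVEL as an integer between 40 and 127 equal to -4 times the target level; treat that mapping as an assumption to check against the current Android documentation rather than as a statement of how Netflix configures its player.

```python
def decoder_gain_db(anchor_loudness, target_level):
    """Gain the decoder applies so that dialogue lands on the target level."""
    return target_level - anchor_loudness

def drc_target_reference_level_value(target_level):
    """Integer we would expect to pass with KEY_AAC_DRC_TARGET_REFERENCE_LEVEL.

    Assumption: the key takes -4 * target level, limited to 40..127
    (that is, -10 to -31.75), as we read the MediaFormat documentation.
    """
    value = round(-4 * target_level)
    if not 40 <= value <= 127:
        raise ValueError("target level outside the range the key can express")
    return value

# Figure 4 scenario: output level of -27 dBFS.
print(decoder_gain_db(-27.0, -27.0))  # action show: no gain change
print(decoder_gain_db(-20.0, -27.0))  # documentary: attenuated by 7 dB
```

So, with the output level at -27 dBFS, the action show is left alone and the documentary is turned down by 7 dB, which puts both dialogue levels in the same place.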

Figure 4. Content from Figure 3, normalised to achieve consistent dialogue levels

Now, in this imaginary playback scenario, you should no longer need to reach for the volume control when switching from the action program to the live concert, or when switching to any other program.

Each device can set a target output level, based on its capabilities and the subscriber’s viewing environment. For example, on a mobile device with small speakers, it is often desirable to use a higher output level, such as -16 dBFS, as shown in Figure 5.

Figure 5. Content from Figure 3, normalised to a higher output level, with peak limiting applied as needed (dark red)

However, this technique is not without problems. Note what happens when you add gain to the action and the thriller programs to achieve the desired output level: the loudest content will be clipped. To prevent the program from being distorted, the decoder must apply peak limiting.

Although this is not ideal, it is considered an acceptable tradeoff to achieve a sufficient output level on portable devices. Fortunately, xHE-AAC provides an option to improve peak protection by using metadata and decode-side gain to normalise loudness. But it doesn’t always need to be like that. When listening conditions are optimal, as in a home theatre setting, Netflix has the option to disable loudness normalisation completely, for a ‘pure’ mode.

Dynamic Range Control

Figure 2 (repeated). Dynamic range of a program with some examples

When playing back content, the goal of dynamic range control is to optimize the dynamic range of a program to provide the best listening experience on any device, in any environment.

Netflix uses the uniDRC() payload metadata, contained in xHE-AAC MPEG-D dynamic range control, to apply a sophisticated DRC but only when it will be beneficial to their subscribers, based on their device and their viewing environment.

Figure 2 is repeated here to save you scrolling back up. As we saw at the start of this article, the example program has a total dynamic range of 60 dB. In a high-end listening environment, like over-ear headphones, home theatre, or cinema, Netflix subscribers will experience both the subtlety of a quiet scene and a bombastic action scene. However, there are many playback scenarios where the reproduction of such a large dynamic range is inappropriate (e.g. low-fidelity earbuds, or mobile device speakers, or playback in the presence of loud background noise).

When the dynamic range of a subscriber’s device and their environment is less than the dynamic range of the content, they will not hear all of the details in the program’s soundtrack. In this scenario, they could end up having to adjust the volume during the show, turning up the soft sections, and then turning it down when things get loud. In extreme cases, they may not even be able to understand the dialogue, even with the volume turned all the way up. These situations are where DRC can be used to reduce the dynamic range of the content to a more suitable range, as shown in Figure 6.

Figure 6. The program from Figure 5, after dynamic range compression (gradient). Note that DRC affects loudest and softest parts, but not dialogue.

Netflix says that to reduce dynamic range in ‘a sonically pleasing way’ requires a sophisticated algorithm, ideally with significant lookahead. A good DRC algorithm should not affect dialogue levels, and should apply only a gentle adjustment when sounds are excessively loud or soft for the listening conditions and the device in use.
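As a rough picture of what ‘leave the dialogue alone, gently adjust the extremes’ can mean, here is a static level-mapping sketch in Python. It is not the algorithm Netflix or the xHE-AAC encoder uses; a real DRC is time-varying and uses lookahead. The protected range of 8 dB either side of the dialogue level and the 3:1 ratio are hypothetical values chosen for illustration.

```python
def drc_output_level(level, anchor=-16.0, protected_range=8.0, ratio=3.0):
    """Map an input level (dBFS) to an output level with a simple static curve.

    Levels within `protected_range` dB of the dialogue anchor pass through
    unchanged. Louder levels are pulled down and quieter levels are pulled
    up, each by compressing the excess with `ratio`. All parameter defaults
    are illustrative assumptions.
    """
    upper = anchor + protected_range
    lower = anchor - protected_range
    if level > upper:
        return upper + (level - upper) / ratio
    if level < lower:
        return lower - (lower - level) / ratio
    return level
```

With these settings, dialogue at -16 dBFS stays at -16 dBFS, a loud effect at -2 dBFS comes out at -6 dBFS, and a quiet detail at -60 dBFS is lifted to -36 dBFS.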

The solution is to calculate the DRC parameters during the xHE-AAC encoding process, where there is ample processing power and lookahead. The decoder then simply ‘replays’ the gain changes as specified in the metadata.
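The decode-side half of that idea can be sketched as follows. This is a deliberately simplified model, assuming one precomputed gain value in dB per fixed-length frame; the real MPEG-D DRC bitstream carries gain sequences in its own format and the decoder interpolates between them, none of which is modelled here. The function and parameter names are hypothetical.

```python
def apply_drc_gains(samples, gains_db, frame_length):
    """Apply precomputed per-frame gains (in dB) to a list of samples.

    The encoder is assumed to have done the hard work of choosing the
    gains; the decoder only converts each dB value to a linear factor and
    multiplies. If there are more frames than gains, the last gain is held.
    """
    if not gains_db:
        raise ValueError("at least one gain value is required")
    if frame_length <= 0:
        raise ValueError("frame_length must be positive")
    output = []
    for index, sample in enumerate(samples):
        frame = min(index // frame_length, len(gains_db) - 1)
        linear_gain = 10.0 ** (gains_db[frame] / 20.0)
        output.append(sample * linear_gain)
    return output
```

The point of the design is visible in how little the decoder does: there is no level detection and no lookahead on the device, only a multiplication per sample.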

Since listening conditions cannot be predicted at encode time, MPEG-D DRC metadata contains multiple DRC profiles to cover a range of situations like Limited Playback Range (for playback over small speakers), Clipping Protection (only for clipping protection as described below), or Noisy Environment (for … noisy environments).

Figure 3. (Repeated) Typical dynamic range and dialogue levels of a variety of content. Black lines indicate average dialogue level; red and yellow are used for louder/softer sounds.

Peak Audio Sample Metadata

In MPEG-D DRC, ‘samplePeakLevel’ defines the maximum level of a program, which effectively describes the maximum headroom of the program. For example, in Figure 3 (repeated above to save scrolling), the thriller’s ‘samplePeakLevel’ is -6 dBFS.

When the combination of a program’s ‘anchorLoudness’ and a decoder’s target output level results in amplification, as in the action and thriller programs in Figure 3, ‘samplePeakLevel’ allows DRC gains to be used for peak limiting instead of the decoder’s built-in peak limiter. The good news is that since DRC is calculated in the encoder, the outcome is higher fidelity audio than running a peak limiter in the decoder, which will have very limited lookahead.
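A quick way to see when peak protection is needed is to add the normalisation gain to the peak level and check whether the result goes above 0 dBFS. The Python sketch below does just that. The thriller’s peak of -6 dBFS is the figure quoted above; its anchor loudness of -24 dBFS is a hypothetical value, since the article does not give one, and the function name is our own.

```python
def required_peak_reduction_db(sample_peak_level, anchor_loudness, target_level):
    """How far the loudest peak would exceed 0 dBFS after normalisation.

    Returns 0.0 when the peak still fits below full scale, otherwise the
    number of dB by which the peak must be reduced to avoid clipping.
    """
    gain = target_level - anchor_loudness
    peak_after_gain = sample_peak_level + gain
    return max(0.0, peak_after_gain)

# Thriller: peak of -6 dBFS (from the article), anchor of -24 dBFS (hypothetical).
print(required_peak_reduction_db(-6.0, -24.0, -16.0))  # mobile target
print(required_peak_reduction_db(-6.0, -24.0, -27.0))  # lower target
```

Under these assumptions, the -16 dBFS target adds 8 dB of gain and pushes the peak 2 dB over full scale, so 2 dB of peak reduction is needed; at the -27 dBFS target the program is attenuated and no protection is required.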

Figure 7 below shows how ‘samplePeakLevel’ enables the decoder to replace its peak limiter with DRC for the loudest peaks.

Figure 7. Content from Figure 3, normalised to a higher output level, using DRC to prevent clipping as needed.

Putting it Together

Working together, loudness management and DRC on Netflix can provide an optimal listening experience even in a compromised environment.

Figure 8 illustrates a situation in which a subscriber is viewing a program in a noisy environment. The background noise is so loud that everything quieter than -40 dBFS is masked and so completely inaudible, even when using the higher target output level of -16 dBFS reserved for mobile devices.

Figure 8. Content from Figure 7, in the presence of background noise

This example is not even the worst case. As previously mentioned, in some scenarios, subscribers using small mobile device speakers may not even be able to hear the dialogue because of the very high background noise!

This is where DRC metadata comes to the fore. By engaging DRC, the quietest elements of programs can be boosted enough to be heard over the high background noise, as illustrated in Figure 9 below. Since loudness management has already been used to normalise dialogue to -16 dBFS, the good news is that the DRC has no effect on the dialogue, providing the Netflix subscriber with the best possible experience in less-than-ideal listening situations.

Figure 9. Content from Figure 8, with DRC applied to boost previously-inaudible details.

Seamless Switching and Adaptive Bit Rate

Adaptive video bitrate switching has been a core functionality for Netflix media playback for many years now. However, audio bitrates were fixed, due, in part, to the limitations of the codecs at the time. The good news is that in 2019, Netflix was able to begin delivering high-quality, adaptive bitrate audio to TVs.

Now in 2021, thanks to xHE-AAC’s native support for seamless bitrate switching, Netflix has been able to bring adaptive bitrate audio to Android mobile devices. Netflix has been able to use a similar approach to what they described in their High-Quality Audio Article, in which the xHE-AAC streams are able to deliver studio-quality audio when network conditions allow, and minimise rebuffers when the network is congested.

Deployment, Testing and Observations

Netflix stresses that they always perform a comprehensive A/B test before any major product change, and a new streaming audio codec is no exception. The content was encoded using the xHE-AAC encoder provided by Fraunhofer IIS, packaged using MP4Box, and A/B tested against their existing streaming audio codec, HE-AAC, on Android mobile devices running Android 9 and later.

During these tests, Netflix could see that its subscribers use their device’s built-in speakers, wired headphones/earbuds, or Bluetooth connected devices, which they describe as the ‘audio sinks’. Their tests focused on 3 audio-related metrics and member usage patterns: Time-weighted device volume level, volume change interactions, and ‘audio sink’ changes.

Volume Level

Figure 10. Time-weighted volume level distribution for built-in speakers. (Cell 2: xHE-AAC)

Figure 10 shows the volume level for the built-in speaker ‘audio sink’. The y-axis shows the volume level reported by Android, which is mapped from 0 (mute) to 1,000,000 (max level). The x-axis shows the percentile that had the volume set at or below a particular level. One way to read the graph would be to say that for Cell 2, about 30% of Netflix subscribers had the volume set below 0.5M; for Cell 1, it was about 15%.
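To show how a point on a graph like this can be read, here is a small Python sketch that computes the share of playback time spent at or below a given volume level. The article does not say exactly how Netflix aggregates its data, so the approach, the function name and the sample observations are all assumptions made for illustration.

```python
def fraction_of_time_at_or_below(observations, level):
    """Share of total playback time spent at or below `level`.

    `observations` is a list of (volume_level, seconds) pairs, with the
    volume on Android's 0 (mute) to 1,000,000 (max level) scale as
    described in the article.
    """
    total = sum(seconds for _, seconds in observations)
    if total <= 0:
        raise ValueError("observations must cover a positive duration")
    below = sum(seconds for volume, seconds in observations if volume <= level)
    return below / total

# Hypothetical viewing session: 30 s at 0.3M, 50 s at 0.7M, 20 s at max.
session = [(300_000, 30), (700_000, 50), (1_000_000, 20)]
print(fraction_of_time_at_or_below(session, 500_000))
```

For this made-up session, 30% of the time was spent with the volume at or below 0.5M, which is the kind of reading quoted for Cell 2 above.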

Overall, time-weighted volume levels of xHE-AAC are lower; this is expected as the content itself is 11 dB louder. They also noted that fewer subscribers have the volume at the maximum level.

Netflix believes that a subscriber who has the volume at the maximum level may still not be satisfied with the output level. Fewer subscribers at maximum volume is therefore a good sign that fewer of them are dissatisfied with the overall volume level.

Volume Changes

Figure 11. Difference in total volume change interactions (Cell 2: xHE-AAC)

When a show has a high dynamic range, a Netflix subscriber may ‘ride the volume’ to turn down the loud segments and turn up the soft segments. Figure 11 shows that volume change interactions are noticeably reduced when using xHE-AAC. Netflix believes this indicates that DRC is doing a good job of managing the volume changes within shows, especially as these differences are far more pronounced for programs with a high dynamic range.

Audio Sink Changes

On mobile devices, most Netflix subscribers use built-in speakers. When users switch to headphones, Netflix believes it can be a sign that the built-in output level is not satisfactory, and that users hope for a better experience with headphones, perhaps because the dialogue was not audible.

In their tests, Netflix found that subscribers switched away from built-in speakers 7% less often when listening to xHE-AAC. Even better, when the content had a high dynamic range, they switched 16% less.

Conclusions

Netflix believes the lessons they have learnt while deploying xHE-AAC to Android mobile devices are not unique. They expect them to apply to other OTT platforms that support the new AAC codec.

Because Netflix strives to give the best subscriber experience in every listening environment, they suggest that the next time you watch The Crown, you get ready to be immersed, without having to reach for the volume control or grab your earbuds.

For Mike, this Netflix article has made him look at dialogue-gated loudness in a new light as a better way to normalise broadcast and OTT delivered content, compared to normalising to integrated full-mix loudness as found in delivery specs like EBU R128 and ATSC A/85.

However, this article has reiterated Mike’s view that there is a need to reduce the dynamic range of content to suit the listening environment, brought together in his article Loudness and Dialog Intelligibility in TV Mixes - What Can We Do About TV Mixes That Are Too Cinematic?

What’s more, in his experience, more and more broadcast and OTT content is being delivered with too much dynamic range, with all of the consequences described in this article: volume chasing, and low-level content being masked by environmental background noise. Such content needs loudness management and dynamic range control.

All of this makes the idea of analysing the content at the encoding stage, where there is ample processing power and the ability to use significant lookahead, and of including a range of metadata to suit different users’ environments, compelling and worthy of more research.

Acknowledgements

We would like to thank Phill Williams and Vijay Gondi from Netflix for a very clear explanation of dialog loudness normalisation, dynamic range reduction and how they are using xHE-AAC to improve the user’s experience.

We would also like to thank Scott Kramer, who is Manager, Sound Technology | Creative Technologies & Infrastructure at Netflix, for giving us permission to base this Production Expert article on Phill and Vijay’s article from the Netflix Technology Blog.
