
TV Subtitle Usage Up To 80% - What Is Going Wrong With Dialogue Mixes?

Recent research shows that the use of subtitles on TV continues to grow, with more and more people choosing to turn them on. Why is this? For audio professionals it feels like a kick in the teeth; after all, if the speech intelligibility of the content we mix is so bad, surely we cannot be doing our job properly? Or are there other factors at play? What is going wrong? In this article we investigate and come to some clear conclusions.


It Started With A Tweet

To misquote a well-known song, this latest storm about intelligibility started with a tweet which, at the time of writing, has over 74,000 likes and has been retweeted over 70,000 times.

The 30-year-old blogger and campaigner from the UK, who prefers to be known by her Twitter name, told The Guardian newspaper…

“I was out for lunch with my mum and my phone started going crazy. I was really pleased though, because there was overwhelming global support from people of all ages for subtitles. Even the people who said they didn’t really like them at the cinema said they’d tolerate them if it meant deaf people could attend more screenings.”

But it hasn’t stopped there. In November 2021, to coincide with Caption Awareness Week, UK-based Stagetext, a deaf-led charity making the arts a more welcoming and accessible place, commissioned Sapio, who interviewed 2,003 people in October 2021, with the results weighted to be representative of Great Britain’s general population. Two-thirds (67%) of those surveyed do not describe themselves as deaf, deafened or hard of hearing.

Take a look at the results…

[Chart: subtitle usage by age group, from the Stagetext/Sapio survey]

Usage seems to be inversely proportional to what you would expect from each group: 4 out of 5 people aged 18 to 25 use subtitles all or part of the time, whereas fewer than a quarter of people aged 56 to 75 said they do, even though twice as many people in the older group describe themselves as deaf, deafened or hard of hearing. Melanie Sharp, Stagetext’s chief executive, said…

"I think there's far more acceptance of subtitles by young people because it's the norm, whereas with an older age group, it isn't necessarily the norm."

That is the UK, but what about the US? In May 2022, language tutors Preply surveyed 1,265 Americans on their use and opinions of subtitles in entertainment. 49% identified as men, 48% identified as women and 3% identified as non-binary or preferred not to indicate their gender. Of the respondents, 16% were Baby Boomers (58-78), 22% were Generation X (42-57), 46% were Millennials (26-41) and 16% were Generation Z (10-25).

[Chart: subtitle usage by generation, from the Preply survey]

It is interesting to see the correlation between the Preply survey and the Sapio survey of Brits, although the differences across the age ranges were not as marked in the US survey as they were in the UK one.

Next, Preply asked about the most common reasons for Americans to use subtitles, which makes for interesting reading…

[Chart: the most common reasons Americans use subtitles, from the Preply survey]

Preply’s results show that overall 53% of Americans are using subtitles more often than they used to, which implies that things have got worse over time. That could be because hearing naturally deteriorates as people get older, but I suspect that is not the main reason. What this survey makes very clear is that there are major problems with dialogue intelligibility, and it’s getting worse, not better. So why is this?

The first point I want to make is that there is no single reason why intelligibility has deteriorated in broadcast and OTT content. As with so many things, it is a combination of factors which together result in normal-hearing people having to resort to turning on the subtitles to follow the narrative, and we will explore those factors later in this article.

Before we do that, let’s take a closer look at intelligibility. The dictionary definition of intelligibility is…

“the quality or condition of being intelligible - capable of being understood; comprehensible; clear enough to be understood.”

In non-tonal (Western) languages the consonants are really important. The consonants (k, p, s, t, etc.) are predominantly found in the frequency range above 500 Hz and, more specifically, in the 2 kHz to 4 kHz range. However, take a look at the diagram below: there is very little energy in the consonants in the 2 kHz to 4 kHz band, yet that band carries much of the weight for intelligibility, so the amount of energy does not correlate with the importance to intelligibility.

Image courtesy of the DPA Microphone University

What doesn’t help is that it is very difficult to make the consonants louder. Try it for yourself; it is very difficult. When you project or shout, you make the vowels louder, but the consonants stay at pretty much the same level. The lack of energy in the consonants also makes them much easier to mask or drown out with other sounds like sound effects, foley or music.
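If you want to see this imbalance for yourself, the following rough sketch compares the level of a vowel-dominated band with the 2 kHz to 4 kHz consonant band in a speech recording. The file name is hypothetical and the band edges are only approximations of the regions discussed above.

```python
# Rough sketch: compare the energy in a vowel-dominated band with the
# consonant-critical 2-4 kHz band of a mono speech recording.
# "speech.wav" and the band edges are illustrative assumptions.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfilt

def band_level_db(x, fs, lo, hi):
    """RMS level of x restricted to the band [lo, hi] Hz, in dB."""
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    y = sosfilt(sos, x)
    return 20 * np.log10(np.sqrt(np.mean(y ** 2)) + 1e-12)

fs, x = wavfile.read("speech.wav")            # assumed mono speech file
x = x.astype(np.float64) / np.max(np.abs(x))  # normalise to +/-1.0

print(f"Vowel band (250 Hz-1 kHz): {band_level_db(x, fs, 250, 1000):.1f} dB")
print(f"Consonant band (2-4 kHz):  {band_level_db(x, fs, 2000, 4000):.1f} dB")
```

On typical speech the consonant band reads well below the vowel band, which is exactly why it is so easily buried by music and effects.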

If you would like to know more about the science behind intelligibility then check out our article Speech Intelligibility - The Facts That Affect How We Hear Dialog for all the scientific detail.

Back to this article. Next, we are going to look at the reasons that have come together to produce this ridiculous position where normal-hearing people are using subtitles.

The Desire For Realism

There continues to be a growing trend towards more realism. Actors can, and should, explore different techniques to portray their characters. However, if this involves "realism" - delivering dialogue in a "realistic" way rather than in a way that can be heard at the back of a theatre - then, as we have seen from the UK and US research, it isn’t ending well, and directors need to understand this. The problem is that going for the realistic approach means the dialogue is unlikely to make it all the way through the delivery chain in a state where end users can still understand what is being said.

After all, there is nothing realistic about TV productions, whether documentaries or drama, studio-based or on location. So why consider taking the realistic approach?

Believable, absolutely! Realistic, definitely not!

When it comes to mixing, I do not believe you can mix TV shows with natural dynamics in the dialogue. If you do, it makes the dialog harder to hear and understand. How you choose to reduce the dialogue dynamic range is up to you: it can be with the faders, clip level or compression. But restricting the dynamics is essential, especially for content to be consumed at home or on the move.
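For illustration, here is a minimal sketch of the compression option, assuming mono floating-point audio. The threshold, ratio and time constants are arbitrary example values, not a recommended dialogue preset.

```python
# Minimal feedforward compressor: a one-pole envelope follower driving a
# threshold/ratio gain computer. Parameter values are illustrative only.
import numpy as np

def compress(x, fs, threshold_db=-24.0, ratio=3.0, attack_ms=5.0, release_ms=80.0):
    x = np.asarray(x, dtype=np.float64)
    att = np.exp(-1.0 / (fs * attack_ms / 1000.0))
    rel = np.exp(-1.0 / (fs * release_ms / 1000.0))
    env, out = 0.0, np.empty_like(x)
    for n, s in enumerate(x):
        level = abs(s)
        coeff = att if level > env else rel
        env = coeff * env + (1.0 - coeff) * level      # smoothed level
        level_db = 20 * np.log10(env + 1e-12)
        over = level_db - threshold_db
        gain_db = -over * (1.0 - 1.0 / ratio) if over > 0.0 else 0.0
        out[n] = s * 10 ** (gain_db / 20.0)            # apply gain reduction
    return out
```

Whether you do it with faders, clip level or a compressor like this, the goal is the same: keep the dialogue within a narrow, consistent level window.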

Realism is just not possible, and I feel that the push towards ‘realism’ is flawed at so many levels. Consider the lighting, the way it is shot, how the story is put together: none of it is real, so why apply realism to the sound? It is bonkers!

An extension of this push to ‘realism’ is that we end up with a lack of diction from the people speaking. For example, here in the UK a few years ago, the Second World War drama series SS-GB had a number of scenes shot in the dark, where the narrative needed to get across that people had to whisper so they weren’t overheard, and were hiding in shadows after dark so they weren’t seen. The problem with this realistic approach is that intelligibility suffers when people cannot see the speaker’s lips moving. There is an art to speaking quietly and still being heard clearly. Back in the day it was called a ‘stage whisper’, but these techniques don’t seem to be taught in drama school anymore, to the extent that we have at least one generation of actors who no longer have this skill.

This push for realism means directors don’t feel clear diction is necessary, but it is a problem, and Preply’s research confirms this, with 44% of respondents highlighting the impact that low lighting is having on intelligibility.

As shown in the Preply survey, another challenge to understanding what is being said is national and regional accents, with 61% of respondents saying that hard-to-understand accents are a reason they use subtitles.

[Chart: the hardest-to-understand TV shows for American audiences, from the Preply survey]

What is interesting is that British TV shows feature heavily in the list of hardest-to-understand TV shows for American audiences, with shows like Peaky Blinders, Derry Girls, Downton Abbey, Bridgerton and Doctor Who up near the top.

Strong accents are yet another part of the realism push. Understandably, directors want realistic accents, but if they are too strong then the consumer will struggle to understand what is being said, even if everything else is spot on. I am not suggesting for one moment that accents should be banned, just used in moderation, to signpost rather than to be realistic. With more and more international productions, especially OTT shows from the likes of Amazon and Netflix, a UK regional accent that Brits could easily understand may not be understandable by Americans, let alone by people for whom English is a second or third language, as shown by the Preply survey results. Again, this is something the director needs to take into consideration, as a completely realistic accent is unlikely to work in an international production.

TV Drama Is Not A Feature Film

I also believe that mixing TV drama as if it were a feature film is daft. For example, at night the consumer will almost certainly have the TV volume much lower, especially if they have young children, so all the quieter material won’t be heard. If that includes quiet dialog, the narrative can get lost, and they end up turning on the subtitles to be able to follow the story.

Pre-existing Knowledge - Those Involved In The Production All Know What Is Being Said

Another big issue affecting whether a particular line is judged intelligible is that everyone involved in the production knows what is being said; they have lived with it through pre-production, script editing, shooting and post-production. This means they probably know the script as well as the actors do, if not better!

What this familiarity with the script means is that they can hear the words even when they are not clearly intelligible. For example, when the drama is being shot the director knows what is being said, and if the sound team asks for a retake it is likely to be met with a hard stare and "I can hear it, what’s your problem?". Then at the dub, when the director comes to sign off a scene, they again know what is being said, and so may well ask for the FX and/or music to be lifted to increase the sense of drama in the scene - to a much higher level than they would if they were new to the production and hearing it for the first time.

Changes In Production Techniques - More Multi-camera, Less Use Of Boom Mics

Shooting a scene using more than one camera means that your use of a boom mic is compromised at best, as at least one of the cameras tends to be on a wide shot, meaning the boom mic cannot get in close enough to pick up a clean sound. Consequently, location sound teams end up relying on personal radio mics. As we learned in our article Speech Intelligibility - The Facts That Affect How We Hear Dialog, the spectrum of speech recorded on the chest of a person normally lacks frequencies in the important 2-4 kHz range, where the consonants are, which results in reduced speech intelligibility.

In fact, in that article we also learnt that just over the head, where the boom mic would normally be, is a great position for getting the best speech intelligibility. All of this means that the growth of multi-camera shoots results in a double whammy: we lose the use of a boom mic and replace it with personal radio mics, often in the chest area, which don’t pick up the consonants as well as the boom mic does - and, as we learnt, speech intelligibility is all about the consonants.
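One common partial mitigation - partial because a chest mic never captured that energy well in the first place - is a gentle presence lift in the 2-4 kHz region on radio mic tracks. Here is a sketch using the standard RBJ audio-EQ-cookbook peaking filter; the centre frequency, gain and Q are illustrative assumptions, not a universal fix.

```python
# Peaking EQ (RBJ audio-EQ-cookbook biquad) giving a gentle presence lift
# around 3 kHz. f0, gain and Q here are illustrative starting points only.
import numpy as np
from scipy.signal import lfilter

def presence_boost(x, fs, f0=3000.0, gain_db=4.0, q=1.0):
    A = 10 ** (gain_db / 40.0)
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2 * q)
    b = np.array([1 + alpha * A, -2 * np.cos(w0), 1 - alpha * A])
    a = np.array([1 + alpha / A, -2 * np.cos(w0), 1 - alpha / A])
    return lfilter(b / a[0], a / a[0], x)   # normalised biquad
```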

Loudness Range Too High

This is an issue that is directly connected to the increased use of subtitles. TV drama is becoming more and more cinematic in style. From a sound perspective, a cinematic style does not translate to a domestic situation, where neither the playback system nor the background noise of the room can be controlled, unlike a cinema theatre, where there is complete end-to-end control.

In addition, a domestic environment is a much smaller room, and smaller rooms cannot handle loud sounds as well as larger rooms. We must always remember to consider how the content we create is going to be consumed, and in what environment.

The Law Of Averages

Rolling back to before loudness normalisation was introduced: although there were issues with loudness jumps, under peak-level normalisation our dialog would often be close to, or at, peak level. The outcome of this style was mixes with dialog close to headroom, which meant that not much could go higher than the speech. With content normalised to loudness, and the additional headroom we have with BS.1770-based delivery specs, there seems to have been an excessive move towards having more and more of the mix louder than the dialog. This has two outcomes. The first is that as more of the mix becomes louder than the dialog, the loudness of the dialog is pushed down relative to the Integrated Loudness of the complete mix; it has to be that way, it’s the law of averages!

The second outcome is that, because there is more content louder than the anchor point (usually the dialog), the Loudness Range increases, and this is bad for content consumed in a domestic environment. Content with a larger Loudness Range has a wider spread of louder and quieter sounds. Because the dialog loudness gets pushed down relative to the Integrated Loudness, people set their TV volume so that the loud material (often the music) sits at a comfortable listening level, but then, because of the excessive Loudness Range, the dialog is not loud enough to follow. Rather than constantly adjusting the volume up and down, the consumer turns on the subtitles.
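To see the law of averages at work, here is a toy calculation. The levels and the 50/50 split are made-up illustrative numbers, and real BS.1770 gating is ignored, but the averaging effect is the point.

```python
# Toy numbers only: integrated loudness is (roughly) an energy average,
# so louder non-dialog material pushes the dialog below the mix loudness.
import numpy as np

def energy_average_lufs(levels_lufs, fractions):
    powers = 10 ** (np.asarray(levels_lufs) / 10.0)
    return 10 * np.log10(np.dot(powers, np.asarray(fractions)))

# Assume half the programme is dialog at -27 LUFS, half is music/FX at -20 LUFS.
integrated = energy_average_lufs([-27.0, -20.0], [0.5, 0.5])
print(f"Integrated loudness: {integrated:.1f} LUFS")        # about -22.2 LUFS

# Normalising the full mix to -23 LUFS pulls everything down by the same
# amount, leaving the dialog at about -27.8 LUFS - well below the target.
offset = -23.0 - integrated
print(f"Dialog after normalisation: {-27.0 + offset:.1f} LUFS")
```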

Reduce The Loudness Range

In my article Has Netflix Turned The Clock Back 10 Years Or Is Their New Loudness Delivery Spec A Stroke Of Genius? I investigated both the Integrated Loudness and the Dialog Loudness, using the Nugen Audio Dolby Dialog Intelligence gating algorithm, on 4 different programmes: Amazon Prime’s The Grand Tour, the BBC’s Planet Earth 2, and 2 programmes I had mixed myself. The first of these was Cow Dust Time, a documentary for BBC Radio 3, the public service classical music channel here in the UK, whose house style permits a wider dynamic range than normal; it was made for the strand Between The Ears, where the brief positively encourages soundscapes and more sound design than most radio documentaries. The second of my own mixes was Doctor’s Dementia, a more conventional documentary for BBC Radio 4, the public service speech channel here in the UK.

[Table: Integrated and dialog-gated loudness measurements for the four programmes]

What is interesting is that for both The Grand Tour and Planet Earth 2 the Dialog Intelligence measurement correctly reflected the lower-level dialog that I picked up in my earlier article Are TV Mixes Becoming Too Cinematic?, producing a normalised dialog-gated loudness of -26.1 LKFS for Planet Earth 2 and -26.3 LKFS for The Grand Tour Ep 2, compared to the R128 full-mix measurements of 0 LU (-23 LUFS). Looking at my two speech-dominated documentaries, the dialog-gated measurements for Cow Dust Time and Doctor’s Dementia were much closer to the R128 full-mix measurement of 0 LU (-23 LUFS).


As part of this experiment I also investigated what would happen to the dialogue level if I reduced the LRA. As I was unable to remix some of the programmes, I ran all the mixes through LM-Correct 2 from Nugen Audio, which is designed to repurpose content for different platforms. The aim was to reprocess Planet Earth 2 and The Grand Tour Ep 2 down to an LRA of around 10, and then again to an LRA of around 8, and to see how that affected the dialogue level using the Dialog Detection option in VisLM 2 from Nugen Audio.

In both cases, reducing the LRA of the mix increased the dialog level. Planet Earth 2 started from a much larger LRA, more akin to a Netflix-style mix, and bringing the LRA down from 16.5 to 9.5 brought the dialog level up from -26.1 to -23.5, making it a much more pleasant listen and one where you wouldn’t need to reach for the remote control.
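If you want to experiment with LRA on your own material and don’t have a meter to hand, here is a rough approximation of the EBU Tech 3342 measurement. It assumes 48 kHz mono audio (the BS.1770 pre-filter coefficients below are only valid at 48 kHz) and is a sketch for exploration, not a compliance-grade meter.

```python
# Approximate EBU Tech 3342 Loudness Range: short-term loudness over 3 s
# windows, an absolute -70 LUFS gate, a relative -20 LU gate, then the
# spread between the 10th and 95th percentiles. 48 kHz mono assumed.
import numpy as np
from scipy.signal import lfilter

def k_weight(x):
    # BS.1770 two-stage pre-filter (high shelf + RLB high-pass) at 48 kHz.
    x = lfilter([1.53512485958697, -2.69169618940638, 1.19839281085285],
                [1.0, -1.69065929318241, 0.73248077421585], x)
    return lfilter([1.0, -2.0, 1.0],
                   [1.0, -1.99004745483398, 0.99007225036621], x)

def loudness_range(x, fs=48000, window_s=3.0, step_s=1.0):
    y = k_weight(np.asarray(x, dtype=np.float64))
    win, step = int(window_s * fs), int(step_s * fs)
    st = np.array([-0.691 + 10 * np.log10(np.mean(y[i:i + win] ** 2) + 1e-12)
                   for i in range(0, len(y) - win, step)])  # short-term values
    st = st[st > -70.0]                                     # absolute gate
    rel = 10 * np.log10(np.mean(10 ** (st / 10.0))) - 20.0  # relative gate
    gated = st[st > rel]
    return np.percentile(gated, 95) - np.percentile(gated, 10)
```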

Clearly I am not the only one to think that LRA matters. Here in the UK, The Digital Production Partnership (DPP) updated their unified UK delivery specs for all UK broadcasters and added this guidance on Loudness Range…

  • Loudness Range - This describes the perceptual dynamic range measured over the duration of the programme. Programmes should aim for an LRA of no more than 18 LU.

  • Loudness Range of Dialogue - Dialogue must be acquired and mixed so that it is clear and easy to understand. Speech content in factual programmes should aim for an LRA of no more than 6 LU. A minimum separation of 4 LU between dialogue and background is recommended.

In Canada, the CBC and Radio Canada both now require that the LRA be less than 8 or 10 LU. They also go further and specify that the Integrated Loudness of the complete program AND the Integrated Loudness of the dialogue stem must BOTH be -24 LKFS. Lastly, the momentary loudness must not exceed 10 LU above the target loudness; whilst maintaining the -24 LKFS target, that means the momentary loudness must always remain below -14 LKFS.

Moving on to OTT providers, Netflix, in their Netflix Audio Mix Specifications & Best Practices v1.0, provide LRA recommendations (a quick way of checking a mix against these numbers is sketched after the list). They say…

The following loudness range (LRA) values will play best on the service:

  • 5.1 program LRA between 4 and 20 LU

  • 2.0 program LRA between 4 and 18 LU

  • Dialog LRA of 7 LU or less

  • Difference between FX content and Dialog of 4 LU
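As a quick illustration of how simple compliance checking against these numbers can be, here is a trivial helper built from the quoted Netflix recommendations; the function itself is just a hypothetical convenience, not anything Netflix provides.

```python
# Check measured LRA values against the Netflix recommendations quoted
# above. The thresholds come from the spec text; the helper is illustrative.
def check_netflix_lra(program_lra, dialog_lra, channels="5.1"):
    low, high = (4, 20) if channels == "5.1" else (4, 18)
    issues = []
    if not low <= program_lra <= high:
        issues.append(f"{channels} programme LRA {program_lra} LU is outside {low}-{high} LU")
    if dialog_lra > 7:
        issues.append(f"dialog LRA {dialog_lra} LU exceeds the recommended 7 LU")
    return issues or ["within the Netflix LRA recommendations"]

print(check_netflix_lra(16.5, 9.0))   # hypothetical measured values
```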

The Delivery System

As part of the delivery system, whether it is satellite, digital terrestrial or OTT, both the sound and picture get heavily data-compressed using "lossy" algorithms - most commonly H.264 for the video and a variant of AAC for the sound.

A lossy audio codec reduces the data bandwidth needed by literally throwing away things it thinks you can’t hear, and once it’s gone, it’s gone. As we learnt, consonants are much quieter than the vowel sounds, so there is a greater chance that key information about the consonants could be thrown away as part of the lossy codec process. But intelligibility is not just about the sound: as we learnt with the McGurk Effect, intelligibility can be affected by what we see, or don’t see.

As we covered in our article Netflix Announce 'Studio Quality' Sound To Their Streaming Service - Find Out More Now, not long after Scott Kramer joined Netflix as Manager, Sound Technology | Creative Technologies & Infrastructure, they were reviewing ‘Stranger Things 2’ with the Duffer brothers in a living-room environment, as the brothers like to check how viewers will experience their work. At one point in the first episode, there was a car chase scene that they found didn’t sound as crisp as it had done on the mixing stage.

Even though Scott was new in post, he is reported as saying "A lot of it was mushy", and words like "mushy" and "smeared" are ones that Scott and his team found themselves using when describing audio that just isn’t quite as crisp as it should be.

Stranger Things is a very popular series on Netflix, and Scott very quickly realised this was something that needed to be "made right". Netflix pulled in their engineering teams, as they were determined to fix it, no matter how much effort it was going to take. The solution was to deliver a higher bitrate for the audio on Stranger Things 2, but rather than just fix this one series, they have been working hard to roll out improved audio more broadly.

It was an interesting example of the Netflix culture at work, doing what was needed to support their creative partners. Watch this video to hear the story from the perspective of the Netflix staff, including Scott Kramer…

Netflix told us that most TV devices that support 5.1 or Dolby Atmos are capable of receiving better sound. Depending on your device and bandwidth capabilities, the bitrate you receive may vary:

  • 5.1: From 192 kbps up to 640 kbps

  • Dolby Atmos: From 448 kbps up to 768 kbps for subscribers to their Premium plan

You can get the full story by reading our article Netflix Announce 'Studio Quality' Sound To Their Streaming Service - Find Out More Now. There is no doubt that Netflix would not have spent the time and money increasing the delivery bandwidth if the difference was insignificant.

The Speakers In The TV

This is another area that has come in for a lot of criticism in the press and even at governmental level. Tom Harper, who directed War and Peace, has said that, while he respects the views of sound recordists, in his opinion and experience, if there are audibility problems...

“They arise at the broadcast and TV reception point, as the soundtrack is played out on reduced bandwidth to two tiny speakers.”

As flat-screen plasma and LED screens have become the norm, there is less and less room for the loudspeakers in the consumer's TV. Back in the good old days of CRT TVs, there was a good-sized cabinet that worked well with a reasonably sized speaker to produce reasonable sound, with a good chance it was also forward-facing.  

With flat-screen TVs and the desire for smaller and smaller bezels, there isn’t anywhere to put the speakers on the front, so they are often tucked away around the back with very small drivers - and then we wonder why we get intelligibility complaints. In our article Speech Intelligibility - The Facts That Affect How We Hear Dialog, we learn that the optimum position for intelligibility is one metre from the person speaking, with the speaker and the listener facing each other. If one is not facing the other, the intelligibility drops off. Similarly, with these slim TVs with speakers round the back, the speakers are no longer facing the viewer, so intelligibility is further compromised.

As a result, we have seen significant growth in forward-facing soundbars, which effectively replace the crappy, badly positioned speakers in most LED-screen TVs.

Downmixing

Whilst we are on the consumer’s tech, another contributing factor to intelligibility is downmixing. The delivery specs typically require that the centre channel is reduced by 3 dB in the downmix, and while that may be technically correct, I wonder if it is sonically the best thing to do, as there is a distinct acoustic difference between the discrete centre channel in 5.1 and the phantom mono centre of a stereo pair of speakers. When you mix in 5.1, do you monitor the stereo downmix, then go back and check, and maybe slightly tweak the mix in 5.1? After all, perhaps more than 90% of viewers will be listening in stereo, which makes it important for us to check the downmix, whether we are required to deliver an LoRo or LtRt stereo mix, or whether it is going to be derived in the consumer’s equipment.
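As a sketch of what happens when the downmix is derived in the consumer’s equipment, here is a minimal LoRo fold-down with the centre reduced by 3 dB. The -3 dB surround coefficients and the discarded LFE are common choices, but they are assumptions here; the exact coefficients vary by delivery spec and metadata.

```python
# Minimal LoRo downmix sketch: centre folded into left and right at -3 dB.
# Surround coefficients and LFE handling are assumptions; specs vary.
import numpy as np

def loro_downmix(L, R, C, Ls, Rs):
    g = 10 ** (-3 / 20)          # -3 dB, roughly 0.707
    Lo = L + g * C + g * Ls
    Ro = R + g * C + g * Rs
    return Lo, Ro                # LFE is commonly omitted from the downmix
```

Listening to exactly this fold-down while you mix is the only way to know what that -3 dB centre is really doing to your dialogue.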

Phantom Centre

Research has shown that a stereo system with a phantom centre channel will also compromise the intelligibility. This effect is a result of acoustical crosstalk that occurs when two identical signals arrive at the ear with one slightly delayed compared to the other. The resultant comb filtering effect cancels out some frequencies in the audio. Other research has shown a small but measurable improvement in intelligibility by utilising a central loudspeaker for speech instead of a phantom centre.
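To get a feel for where that comb filtering lands, here is a back-of-envelope calculation. The 0.25 ms delay is an assumed, typical interaural figure for loudspeakers at around ±30 degrees, not a measured value.

```python
# Comb-filter nulls for a phantom centre: with the crosstalk signal arriving
# delay_s later, nulls sit at odd multiples of 1 / (2 * delay_s).
delay_s = 0.25e-3                                   # assumed ~0.25 ms delay
nulls = [(2 * k + 1) / (2 * delay_s) for k in range(3)]
print([f"{f / 1000:.1f} kHz" for f in nulls])       # ~2.0, 6.0, 10.0 kHz
```

Note where the first null falls: right at the bottom of the 2-4 kHz consonant band that, as we saw earlier, matters most for intelligibility.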

The Solution

My advice to consumers is to go for a soundbar. This way the sound is anchored to the TV. With a 5.1 system there are 6 speakers, any number of which could be in the wrong place. I remember going to someone’s house to find they had a domestic 5.1 system with the left and right speakers on each side of the TV and the centre and surround speakers propped up along the back of the sofa against the wall, which meant all the dialog was coming from behind them!

What Can We Do About This?

How do we resolve a situation where normal-hearing people feel it necessary to use subtitles to understand what is being said? Just as the problems are manifold, there isn’t one quick fix.

The simple answer to the question "What can we do about this?" is: better everything. Answering the question more fully takes longer, but one thing is for sure - I don’t believe this is going to be something a plug-in can fix.

A Better Understanding Of The Many Issues

The single biggest improvement would be a better understanding and appreciation of all the issues, especially by those who have the influence: those in control of the creative side, like the directors, and those who hold the purse strings, like the producers and the people commissioning the content. This is especially relevant when it comes to the push towards realism, but also in the choice of who should mix the content and where, how the content is acquired, and in the script and location choices.

A Clever Plug-in?

There has already been one attempt at a plug-in designed to improve intelligibility. Back in February 2019, Telos Alliance released their AudioTools Voice plug-in, designed to improve dialog intelligibility. However, following comments that people couldn’t get it to work, the Telos Alliance TV Solutions Group withdrew the product, saying…

"We have decided to pull version 1 of AudioTools Voice. V1 lays the framework for the technology, but it is clear to us that user controls and additional features will better the tech. Development work is ongoing for v2 with algorithm enhancements and a host of new features. We will continue to support existing customers, and they will all receive a free upgrade to v2. Stay tuned!"

Intelligibility Meter

An Intelligibility meter would help by at least giving a quantitative measure of intelligibility. There are already two options.

Public address systems, especially where safety announcements need to be given, are required to have a measurable Speech Transmission Index. The Speech Transmission Index reflects how a transmission path affects speech intelligibility; it is a measurement that does not take listeners and talkers into account, but just measures the transmission channel, which means that factors such as hearing loss, poor articulation and other (human) limitations are not taken into account. If you want to know more about this, then start by reading the white paper entitled Speech intelligibility measurements in practice.

Back to media and broadcasting, even though it is not possible to measure the full transmission path as there are way too many variables, most of which we are covering in this article, it is an issue that is engaging developers.

Back in 2018, iZotope recognized this to be an important factor and included an Intelligibility Meter in their audio visualization and metering software, Insight 2. Not for the first time, iZotope took a concept and turned it into a product - in this case an intelligibility meter for our industry, built into Insight 2, a first in audio metering.

Being first brings its own challenges, in that you have to set the style and the standard, and iZotope has risen to that challenge with the Intelligibility Meter in Insight 2. The top meter, which doesn’t have a scale but rather a target to aim for, is very intuitive, and it is interesting that the target moves when you change the expected environment the consumer might be in when listening to the content. However, the bottom two meters, with a scale in phons, are a throwback to the intelligibility measurements for sound reinforcement and emergency announcement systems, and in the context of broadcasting and OTT their precise meaning is not clear. For me, there is still some work to do, but when you don’t have anything to go on you have to start somewhere, especially when trailblazing.

It is always going to be difficult, but that has never stopped iZotope’s ingenuity before and I am sure they will continue to improve this new concept, which will be so helpful by giving us a quantitative measurement for the intelligibility of our mixes, especially as more and more of the content we mix is being consumed in noisy and challenging locations, with playback systems that do not offer the best sound quality.

Then in December 2020, Steinberg included an Intelligibility Meter in Nuendo 11. It turns out that Steinberg’s Intelligibility Meter was developed by the very clever people at Fraunhofer.

The new feature analyses incoming audio signals via a speech intelligibility model with automatic speech recognition technology and calculates how much effort the listener must put in to understand the spoken words within the mix. Dr. Jan Rennies-Hochmuth, Head of Personalized Hearing Systems at Fraunhofer IDMT explains…

“The tool Intelligibility Meter measures objective speech intelligibility in media production in real time, controlled by artificial intelligence. It is a result of several years of hearing research in Oldenburg.”

You can learn much more about this intelligibility meter in our article Fraunhofer Intelligibility Meter Used In Nuendo 11.

What About Mixing In the Correct Sized Rooms?

I can understand why Netflix might want a delivery spec with a wider dynamic range, because a lot of their content was made for the big screen rather than the small screen. However, transferring the same production values to the content they commission for the small screen is, in my view, flawed. Commenting on our article Loudness and Dialog Intelligibility in TV Mixes - What Can We Do About TV Mixes That Are Too Cinematic?, Reid Caulfield, referring to the new Netflix specs, said...

“Mixes meant for the "At-Home" environment MUST be mixed - or remixed, if it was originally done in a large theatre - in a near-field environment at 79dB. NOT a large theatre at 85 because someone needed to fit 40 people in the room. And, it cannot be mixed in that large environment simply with the large speaker arrays turned off and the near fields turned on. It needs to be mixed in a much smaller TV-oriented room.”

He then suggests how this could be policed...

By specifying all elements be delivered in a Dolby Atmos-At-Home "wrapper." Even if the show has not been mixed as an Atmos presentation, by specifying delivery as an ADM file, they guarantee that the source room's size data and speaker layout is included in the associated metadata that travels with the data file and program content.

I couldn't agree more about the need to remix cinema content, or to mix content commissioned for consumption "At-Home" in smaller spaces at a more appropriate monitor level, like 79 dB. I like his idea of using the Dolby Atmos-At-Home wrapper, as it will include the metadata of the room it was mixed in, which would make it much easier to police.

But until the powers-that-be take up Reid's suggestion, what can we do about mixes that have an excessive LRA for domestic consumption?

Is It Time To Make A Maximum LRA A Requirement In The Spec?

As I have demonstrated, an LRA of anything above 10 LU is too high for content created for domestic consumption. As I have said, in my view a maximum of 18 to 20 LU is way too high, so maybe it is time to add an LRA figure to the BS.1770 standard? At the very least, a maximum LRA should be a requirement in broadcasters’ delivery specs rather than advisory.

What About Using Object Based Audio?

Another option which definitely helps is the use of Object Based Audio and MPEG-H codecs. Check out these two articles, in which we show how object based audio has applications for delivering content to end users with user-friendly controls to adjust what the consumer hears.


In the article Object Based Audio Can Do So Much More Than Just Dolby Atmos? We Explore, we looked at the work of Lauren Ward, a postgraduate audio engineering researcher at Salford University with a passion for broadcast accessibility. Lauren’s research has been looking at a methodology whereby the different audio objects in a piece of content are scored for how important each object is to the narrative. If an object is essential to the story, like the dialog or a door opening, it is scored as essential. Other sounds, like ambiences and music, that add to the narrative but wouldn’t stop you following the story if they weren’t there, are scored progressively less.

Then there is a single control that you can adjust from a full, normal mix through to an essential-only mix for the very hard of hearing. I had a chance to try this out on a visit to Salford University and found it very simple and intuitive, and the process of scoring the objects would be very easy to do during production.

The single control interface is so much simpler than other personalised options where multiple level controls are presented for each of the objects like commentary, FXs, home crowd, away crowd etc.
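To make the idea concrete, here is a sketch of how a single-slider, narrative-importance renderer might work. The gain law and the example importance scores are my own illustrative assumptions, not Lauren Ward’s published method.

```python
# Single-slider narrative-importance rendering sketch: each object carries
# an importance score (1.0 = essential); the slider fades the less
# important objects out first. The gain law is an illustrative assumption.
import numpy as np

def render(objects, slider):
    """objects: list of (audio, importance 0..1) with float audio arrays;
    slider: 1.0 = full mix, 0.0 = essential-only mix."""
    mix = np.zeros_like(objects[0][0], dtype=np.float64)
    for audio, importance in objects:
        # Essential objects always pass at full level; the rest are scaled
        # down progressively faster the less important they are.
        gain = 1.0 if importance >= 1.0 else slider ** (1.0 - importance)
        mix += gain * audio
    return mix

# e.g. dialog scored 1.0, a heart-monitor beep 1.0, music 0.3, ambience 0.2
```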

Since we published this article, Lauren’s research has moved on with a public-beta experiment here in the UK. This experiment took a recent episode of the BBC One TV medical drama ‘Casualty’ and presented a version of it on the BBC website that includes a slider in addition to the volume control. Keeping this additional slider on the right-hand side retains the standard audio mix; moving the slider to the left progressively reduces background noise, including music, making the dialogue crisper. The experiment reached the attention of the UK national press, including this article in The Times. This is a time-limited demo that has been extended and, at the time of writing, is still up on the BBC website for UK-based people to access. For everyone else, here is a small section of the experiment to show you how simple and effective this is…

Although Casualty is based in an Accident and Emergency department of a large UK hospital, A&E in this context stands for ‘Accessible and Enhanced’ audio. In this BBC project, they are trialling a new feature that allows consumers to change the audio mix of the episode to best suit their own needs and preferences.

Although the project is aimed at the 11 million Britons with hearing loss and any others who struggle to make out what actors are saying, the UK press spotted that commuters who stream shows on noisy trains and buses could also benefit.

As we showed in our article Is This The Answer To TV Audio Critics? Object Based Audio Case Studies Presented At The AES 146th Convention In Dublin, this technology can be built into consumer TVs, and as the BBC Casualty experiment shows, web-based and streaming services could easily build this into players hosted on smart TVs. Then everyone, both normal-hearing and hearing-impaired people, could benefit from this excellent system.

I do not believe this will be difficult to implement. There are, of course, two parts: the implementation of the slider at the consumer’s end, and the ranking of the content during production. As Lauren explains…

“Our technology adds two things to the process of making and watching a TV programme. The first occurs after filming, when the audio mixing takes place. At this point each sound, or group of sounds, has an importance level attached to it (stored in metadata) by the dubbing mixer or producer.”

You could have a rating system like Avid has in Pro Tools for rating clips. It would be very easy to have a narrative importance rating system in the production process and then for that metadata to be embedded into the delivery stream. Lauren explains…

“Some non-speech sounds, such as the flatlining beep of a heart monitor in Casualty, are crucial to a show’s narrative. The technology allows these noises to stay prominent while non-essential sounds are turned down.”

Object based audio offers the consumer a lot more control, and it also gives content providers the technology to deliver one stream of object based content and then use the metadata to render the most appropriate version for the hardware the consumer is using to play back the content.

Conclusion

As we said at the beginning of this article, the reasons for normal-hearing people having to resort to using subtitles are many and varied, and they compound to make things even worse.

As an audio post-production editor and mixer, I feel we have failed if normal-hearing consumers have to turn on subtitles to follow the narrative. It is incumbent on us, and on those with influence and control of the budgets and creative choices, to understand the issues and to work together to resolve this failure for the consumers we serve.

That’s what I think. What do you think? Please do share your thoughts and experiences in the comments below…
