
What You Need To Know About Audio Metadata In Broadcast

In this article, Damian Kearns and Michael Nunan discuss the role and detail of metadata in the broadcast chain, with particular reference to AC-3. If you work in Post Production and have ever wondered what metadata is and why it’s important, this discussion is essential reading.

What Is An Expert?

What is an ‘expert’ anyway? I’d been wondering about this since I first started regularly visiting this site with ‘Expert’ in the name. It seems to me, for instance, that Pro Tools is a workstation used by a wide range of audio aficionados across the vast spectrum of creative endeavours. So what makes anyone an expert on anything audio related?

I believe expertise lies not just in the mastery of a tool; in part, it lies in the potential to use this tool not only as a thing in and of itself but as a conduit from creative thought to creative expression. If all goes well, what is perceived by our listeners is exactly what was intended. This is where expertise and genius become one and the same, in the successful manifestation of one’s creative intent. 

In this second conversation with my friend, Michael Francis Nunan, I hope we can all take another step together towards ‘expertise’ in delivering high-quality broadcast audio experiences. We intend to build upon our first conversation here on Pro Tools Expert. What I personally want to know more about is where my mix goes and what processes it undergoes before it launches out to TVs across the country or around the globe. Knowing we have an international readership, I will try to keep this conversation relevant to all.

Damian: Hi again, Michael! Thanks for sitting down with me again to talk tech. If our last conversation was all about framing the broadcast chain, I’m hoping this one will start to add in some of the detail. As Senior Manager for Broadcast Audio and Post Production Operations at Bell Media, a nationwide broadcaster across the entire spectrum here in Canada, you are the man with whom to have this conversation.

Let’s start after the Quality Control (QC) process so we can slowly work our way around the broadcast ecosystem. We’ve established that in the current broadcast environment, my 5.1 surround mix is ultimately what accompanies the picture out through the airwaves and into our TVs and set-top boxes.

What About Metadata?

Michael, my files are always Broadcast Wave Format (BWF) files, 24-bit, 48 kHz. A BWF doesn’t differ sonically from its WAV predecessor or from an AIFF file, but it does contain chunks of metadata that can describe a variety of things: which channel is which in a multichannel sound file, a timestamp, information about levels and so on. Does any of the information contained in the files I deliver aid the broadcast process?

Michael: A conditional “Yes”. Metadata always plays a role in the predictability of our content, but there’s a level of directness that is modified by where in the ecosystem a WAV or BWF (Broadcast Wave File) enters the environment. In most cases, a naked audio-only file (such as might be the output of a Post Sound process as empowered by Pro Tools) rarely enters the Broadcast system directly; it generally first goes to a Picture Editor (or similar functionary) for the modern equivalent of “layback”, where the finished audio is married to the final picture. It’s this final picture ‘Master’ which tends to be handed off to the broadcaster. In this case, your metadata should be helping to ensure that the editor is able to combine your mix with the picture correctly and without error or need for interpretation (or guesswork).

In some cases though, especially with regard to alternate mixes/versions (e.g. alternate languages, Descriptive Video commentary (DV), etc), you may be delivering WAV files directly into the Broadcast system. In these cases, metadata still performs the same functions - largely aimed at ensuring correct channelization and positional synchronization. It’s important to remember that in both of these cases, we’re specifically not talking about the metadata that informs how your mix will be experienced by a listener/viewer. That is, issues like fold-down parameters and loudness normalization (to say nothing of more esoteric elements like Channel/Object behaviour) are not addressed by the kind of metadata that can be carried in a standard WAV or BWF.
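To make that file-level metadata a little more concrete, here is a minimal sketch, using only the Python standard library, that walks the chunks of a BWF and reports the channel count from the ‘fmt ’ chunk and the sample-accurate timestamp from the ‘bext’ chunk. It assumes a standard-layout bext chunk per EBU Tech 3285, and error handling is deliberately thin.

```python
# Minimal sketch: walk the RIFF chunks of a Broadcast Wave File and pull out
# the 'fmt ' channel count and the 'bext' time reference (the sample-accurate
# timestamp used to position the mix against picture).
import struct
import sys

def read_bwf_info(path):
    with open(path, "rb") as f:
        riff, _, wave = struct.unpack("<4sI4s", f.read(12))
        if riff != b"RIFF" or wave != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")
        info = {}
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            chunk_id, chunk_size = struct.unpack("<4sI", header)
            data_start = f.tell()
            if chunk_id == b"fmt ":
                fmt = f.read(16)
                _, channels, sample_rate, _, _, bits = struct.unpack("<HHIIHH", fmt)
                info.update(channels=channels, sample_rate=sample_rate, bits=bits)
            elif chunk_id == b"bext":
                bext = f.read(chunk_size)
                # TimeReference is a 64-bit sample count since midnight, stored
                # after the fixed-width text fields (256+32+32+10+8 = 338 bytes).
                lo, hi = struct.unpack_from("<II", bext, 338)
                info["time_reference_samples"] = (hi << 32) | lo
            # Chunks are word-aligned: skip to the next even boundary.
            f.seek(data_start + chunk_size + (chunk_size & 1))
        return info

if __name__ == "__main__":
    print(read_bwf_info(sys.argv[1]))
```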

Damian: Let’s talk more about metadata.

For those of you struggling with the term, ‘Metadata’ is data that describes other data. Without metadata, your audio would be like players in an orchestra without sheet music or a conductor to guide them. 

The .wav files I create for broadcast are passed through an AC-3 data compression encoding process that necessarily involves some authoring of metadata. Can you give me a sense of what metadata is necessary for a successful television broadcast?

Michael: First off, my answer, in this forum, can’t really be comprehensive. I absolutely encourage everyone who creates content that will be consumed via a Dolby ecosystem to become very familiar with Dolby’s website and the extremely helpful documentation that can be found there.

The short, glib answer to your question is: all of it. Every metadata component in the Dolby Digital/AC-3 environment is important - but a few are crucial. I’ll list them below, but it’s absolutely vital that everyone is very clear on this point:

Metadata does not affect your mix until someone listens to it. Including you! 

Here’s the brief: 

You cannot make AC-3 without having values for all metadata parameters included in the compressed/encoded version.

You cannot listen to AC-3 without decoding it and a consumer cannot decode AC-3 without forcing the metadata to exert itself.

So… the only way to truly know what your show will sound like is to mix it, encode it as AC-3, then decode it and listen to it, switching between multiple listening formats/conditions to evaluate the consequence of said metadata.
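As a hedged illustration of that encode/decode audition loop, here is a sketch that drives ffmpeg’s native AC-3 encoder from Python. The file names are hypothetical, and while the -dialnorm option does exist on ffmpeg’s ac3 encoder, you should verify the metadata options exposed by your own build; a licensed Dolby encoder remains the reference tool.

```python
# Sketch of the "mix it, encode it, decode it, listen to it" cycle via ffmpeg.
# File names are placeholders.
import subprocess

def encode_ac3(src_wav, out_ac3, bitrate="256k", dialnorm=-24):
    # Encode PCM to AC-3, stamping the dialnorm metadata into the bitstream.
    subprocess.run([
        "ffmpeg", "-y", "-i", src_wav,
        "-c:a", "ac3", "-b:a", bitrate,
        "-dialnorm", str(dialnorm),
        out_ac3,
    ], check=True)

def decode_ac3(src_ac3, out_wav):
    # Decoding is where the metadata "exerts itself", just as it would in a
    # consumer set-top decoder.
    subprocess.run(["ffmpeg", "-y", "-i", src_ac3,
                    "-c:a", "pcm_s24le", out_wav], check=True)

encode_ac3("mix_51.wav", "mix_51.ac3")
decode_ac3("mix_51.ac3", "mix_51_decoded.wav")
```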

Here are the super important bits:

  • ACmod

    • Announces the Audio Channel configuration of the Programme. In Dolby parlance, a 5.1 mix is identified as “3/2L” (3 front channels, 2 rear/surround channels, plus the LFE (Low Frequency Effects) channel).

  • BSmod

    • Announces the “bit stream” mode - in other words, what kind of payload this is. For the most part, finished programmes carry a “CM” designation, which means “Complete Main”.

  • Dialnorm

    • In Dolby’s original view, this should describe the average Dialogue Loudness (in LKFS) of the Programme, as measured by Dolby’s Dialogue Intelligence software. Latterly (especially since the arrival of ATSC A/85) this parameter generally communicates the overall loudness of the full mix of the Programme, measured per ITU-R BS.1770.

  • DRC profile

    • Dynamic Range Control. This parameter details the dynamics control that will be exerted on the mix by end-point equipment. That is, DRC is only applied when the material is decoded and listened to, and the particular ‘flavour’ of compression and limiting applied by the decoder is determined by the mode/configuration the Viewer is listening in (i.e. listening in stereo will potentially invoke different DRC behaviour than listening in 5.1).

  • Preferred Downmix (& Universal Downmix parameters)

    • Obviously massively important - these 7 parameters determine how various equipment (based largely on the decoder's age and its use-case) will interpret a demand for a 2-channel variant of a 5.1 channel payload - and how the 6 channels will be summed to create a 2mix (stereo downmix). The sketch after this list illustrates the arithmetic.
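Since the downmix parameters are, at heart, a set of gain coefficients, a small worked example may help. The sketch below shows the basic Lo/Ro fold-down arithmetic a decoder performs under metadata control; the -3 dB (0.707) defaults and the clip-protection scaling are illustrative assumptions, not the behaviour of any particular decoder.

```python
# Illustrative Lo/Ro (stereo) fold-down of a 5.1 channel set, driven by the
# metadata-supplied centre and surround mix levels. The LFE channel is
# accepted but deliberately discarded, as is typical in 2-channel downmixes.
def loro_downmix(L, R, C, LFE, Ls, Rs, cmixlev=0.707, surmixlev=0.707):
    lo = L + cmixlev * C + surmixlev * Ls
    ro = R + cmixlev * C + surmixlev * Rs
    # Scale so that coherent full-scale content cannot clip the 2mix.
    scale = 1.0 / (1.0 + cmixlev + surmixlev)
    return lo * scale, ro * scale

# One-sample sanity check: centre-only content folds equally into both sides.
lo, ro = loro_downmix(0.0, 0.0, 1.0, 0.0, 0.0, 0.0)
print(round(lo, 3), round(ro, 3))   # 0.293 0.293
```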

Damian: For live television, metadata is authored in more or less real-time. Am I right in thinking that for pre-packaged TV content, the metadata would be authored in non-real-time and then contained in the AC-3 files on a media server?

Michael: Sure. That’s absolutely the way it’s supposed to work. Except for the fact that AC-3 is massively lossy! So you’re meant to encode it once, right at the very end of the content chain, in the last possible second before the signal leaves the building and goes out to the consumer.

For practical purposes, this means that everything upstream of that Dolby encoder needs to remain “baseband”. That is, the audio is mated to the video, generally as uncompressed, discrete PCM channels embedded in the HD-SDI picture.

What this means is that there’s no effective way to carry Dolby metadata from the creation point (the mix room where you created the content) to the distribution point (that last step in the chain where the PCM audio is converted to AC-3 before leaving the Broadcaster)... and so most broadcasters are applying a static (or fixed) metadata profile at the encoder. This means that every piece of content must naturally match, or at least match closely enough that when common metadata is applied to everything, the various effects of that metadata are equal.

Remember Dolby E?

Full disclosure: it doesn’t have to be this way. Once upon a time, we commonly used a system called Dolby E as a mezzanine compression format, which allowed us to package 8 channels of audio (5.1+2) and all of the associated metadata and carry that payload all the way through the broadcast system, where at the final step, we would convert from E to AC-3 (inclusive of all metadata).

In this way, we effectively had an agile or dynamic metadata environment where there was a straight wire from the mixer’s hands to the viewer’s ears… it was magical when it worked. But it also required a broadcast infrastructure that could not contain many of the “safety nets” we talked about in the last article, things like Broadcast Loudness Processors. And while this was fantastic for folks who were doing it right, and delivering perfectly formatted and prepared content for us, it also exposed us to a lot of risk and complaints because too few producers dedicated themselves to doing it right.

Look, it was a thankless proposition… Dolby E was expensive and complicated, and it added a huge amount of complexity to even simple media contribution/distribution/transmission challenges. On the other hand, discrete audio was/is easy, but the available technical solutions for carrying Dolby metadata as a sidecar (generally involving embedding the metadata in the vertical ancillary (VANC) data space of the HD-SDI video signal) and maintaining the frame-accuracy of that metadata were not trivial.

Ultimately we had to revert to the safer, but decidedly more confined, working environment in order to achieve the kind of consistent and predictable results we require as a broadcaster. Here’s the kicker - if we’d stuck with it (as an industry), we’d be in much better shape now as we consider the possibility of carrying Dolby Atmos (and other Next Generation Audio formats like MPEG-H) throughout the chain.

Damian: I remember Dolby E very well. Some of the stuff you said earlier reminded me of using the rackmount encoder and decoder units to author the metadata to tape back in the day. I’m just wondering, is the AC-3 encoding/decoding process like the Dolby E encoding/decoding process in the sense that there’s a delay that must be accounted for by either moving the audio ahead or the picture back a couple of frames?

Michael: Indeed, although where Dolby E was/is always a one-frame delay (for both encode and decode), the Dolby Digital/AC-3 encode latency is much closer to 7 frames at 29.97 fps. These latencies are the reason why confidence monitoring via an Encode/Decode cycle is impractical, and the reason why Dolby created devices like the DP570, a metadata authoring and emulation tool specifically designed to overcome these problems.
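To put rough numbers on those latencies (my arithmetic, assuming 48 kHz audio and the NTSC-derived 30000/1001 frame rate):

```python
# Convert the frame delays Michael describes into milliseconds and samples:
# Dolby E costs one video frame per pass, AC-3 encoding roughly 7 frames.
FPS = 30000 / 1001          # 29.97 fps
SAMPLE_RATE = 48000

def frames_to_ms(frames, fps=FPS):
    return 1000.0 * frames / fps

def frames_to_samples(frames, fps=FPS, sr=SAMPLE_RATE):
    return round(frames / fps * sr)

print(f"1 frame  = {frames_to_ms(1):.1f} ms = {frames_to_samples(1)} samples")
print(f"7 frames = {frames_to_ms(7):.1f} ms = {frames_to_samples(7)} samples")
# 1 frame ≈ 33.4 ms (1602 samples); 7 frames ≈ 233.6 ms (11211 samples)
```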

How Good Is Dolby AC-3 Data Compression?

Damian: Let’s look again at AC-3 for a moment. As mentioned before, it’s a lossy data compression scheme based around psychoacoustics. “Perceptual Coding” is the term used to describe the principles behind the algorithm: AC-3 discards data from the uncompressed audio that our ears and brains are unlikely to miss, on the theory that perception fills in the rest. AC-3 has been around since 1991.

A critical listening comparison between a mix done at 24-bit, 48 kHz and an AC-3 file derived from that uncompressed parent does tend to reveal the differences, but in the home viewing context they’re not the least bit bothersome.

The loss, I’m guessing, is mitigated at higher bitrates. When we talk about compressed audio, we generally reference the bitrate: the higher the bitrate, the more data contained in the bitstream per second.

My question for you is: what’s the typical bitrate for an AC-3 encoded, 5.1 television broadcast audio signal?

Michael: 256kbps is a decent way to think of it. Many broadcasters are likely down in the 128kbps range, and a few might run at a higher rate, but 256kbps is a solid average I think.

Everyone, please remember that this isn’t new. In fact, if you live in North America, and you’re under the age of 35, there’s a good chance you’ve never heard a network television signal that wasn’t compressed this way. Network ‘forward feeds’ (the signal emitted from the Network Center which is then handed to a local station before being re-broadcast to the audience) have been compressed using AC-3 since the middle ‘90s - long before HD arrived.

What I mean is, AC-3 is a desperately old technology given the pace of development in our industry. It’s a 30-year-old codec. This is important because while crunching a 5.1 mix down to 256kbps (roughly 27:1 data compression against the 24-bit PCM source) renders perfectly reasonable results under AC-3, the newest variant of the Dolby ecosystem (called AC-4, the defined audio standard for ATSC “3.0”) is a few decades newer - with exactly the increase in efficiency that you might expect. In other words, better days are coming!
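A quick back-of-envelope check on that ratio (my arithmetic, assuming six full-bandwidth channels of 24-bit, 48 kHz PCM as the reference):

```python
# Data reduction of common AC-3 bitrates versus 5.1 channels of 24-bit PCM.
channels, sample_rate, bit_depth = 6, 48000, 24
pcm_bps = channels * sample_rate * bit_depth      # 6,912,000 bits per second
for ac3_bps in (128_000, 256_000, 448_000):
    print(f"{ac3_bps // 1000} kbps -> {pcm_bps / ac3_bps:.0f}:1 reduction")
# 128 kbps -> 54:1, 256 kbps -> 27:1, 448 kbps -> 15:1
```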

Damian: Just for the reader’s edification, I tried using a freeware program called “Omni Converter” to test the AC-3 encode/decode at 256 kbps on a recent mix. This program doesn’t do 5.1, so I used a stereo fold-down: 48 kHz, 24-bit, .bwav. I encoded and decoded the file and then lined it up behind the original on my Pro Tools timeline. Listening to them both, I could hear a bit of loss of detail but nothing too drastic. I popped the files into RX and checked their spectral content, cycling through the frequency scale settings to see if there was any real, noticeable roll-off or ‘hole’ in the spectrum, until 20 kHz, where the AC-3 signal hit a brick wall that the 48 kHz .wav file exceeded. Not bad for 30-year-old tech!
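The same null test can be run in code. This sketch, assuming the third-party soundfile package and NumPy, subtracts the decoded round-trip from the original and reports the level of what’s left; the offset variable stands in for whatever encode/decode latency your converter introduces, and the file names are hypothetical.

```python
# Null test: line the decoded file up against the original and measure the
# residual. Assumes both files share sample rate and channel count.
import numpy as np
import soundfile as sf

original, sr = sf.read("mix_stereo.wav")
roundtrip, _ = sf.read("mix_stereo_roundtrip.wav")

offset = 0  # adjust until the residual hits its minimum
n = min(len(original), len(roundtrip) - offset)
residual = original[:n] - roundtrip[offset:offset + n]

rms_db = 20 * np.log10(np.sqrt(np.mean(residual ** 2)) + 1e-12)
print(f"residual RMS: {rms_db:.1f} dBFS")
```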

For a detailed description of AC-3, I found this article.

Michael, after all this talk about data compression, what some people might be wondering is: ‘Why do you have to compress the audio at all?’

Michael: Easy answer: money. Make stuff consume less space, and suddenly you can fit more stuff! While the original reasons for exploiting data compression were definitely about saving space (and the corresponding dollars!), it’s much more important today due to access. If it did nothing else, the arrival of the Television Everywhere (TVE) era created the conditions where access to content is almost the most important aspect of any content discussion. And since bandwidth isn’t yet free (nor freely available), compression isn’t going anywhere soon.

Damian: Is Dolby ‘King of Audio Metadata?’

Michael: Long Live the King.

Damian: I’ve been in TV master control rooms (MCRs) dozens of times in my career for various reasons. They’re a collection of TVs, speakers, computers and other gear, and they usually have near-blanket access to media servers, satellite and fibre-optic feeds, and other connectivity both internal and external to the broadcaster. Once my mix is delivered to and accepted by a broadcaster, is it then ingested to a media server for master control? If so, can you detail the regular processes?

Michael: Those media-handling procedures are pretty easy to understand actually. Does the audio match the format and channelization we require? Is there a stereo (2ch) mix provided? Is there DV? Is the loudness of those 3 programme variants compatible with our ATSC A/85 expectations? The MCR environment needs to have the tools necessary to arrive at answers for those questions, as well as having the technical ability to make corrections or adaptations in the case that the answer to any of those questions is “No”. So we analyze the content, correct it as required, then digitize the result into our Asset Management systems for playout.
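The loudness leg of that checklist is easy to sketch. The example below, assuming ffmpeg is on the path (the file name is hypothetical), measures integrated loudness with ffmpeg’s ebur128 filter and tests it against an A/85-style window of -24 LKFS ±2 dB; parsing ffmpeg’s text output is inherently fragile, so treat this as illustrative rather than production QC code.

```python
# Measure a programme's integrated loudness with ffmpeg's ebur128 filter and
# check it against an ATSC A/85-style window. Illustrative only.
import re
import subprocess

def integrated_loudness(path):
    result = subprocess.run(
        ["ffmpeg", "-nostats", "-i", path,
         "-filter_complex", "ebur128", "-f", "null", "-"],
        capture_output=True, text=True)
    # ebur128 logs lines like "I:   -23.8 LUFS"; the last match is the
    # end-of-run summary (integrated loudness over the whole programme).
    matches = re.findall(r"I:\s*(-?\d+(?:\.\d+)?)\s*LUFS", result.stderr)
    return float(matches[-1])

TARGET, TOLERANCE = -24.0, 2.0   # LKFS and LUFS name the same measurement
loudness = integrated_loudness("programme_51.wav")
verdict = "pass" if abs(loudness - TARGET) <= TOLERANCE else "fail"
print(f"integrated loudness {loudness:.1f} LKFS -> {verdict}")
```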

What Happens To A Stereo-Only Mix?

Damian: What happens if a distributor only gives you a stereo mix to broadcast?

Michael: Kinda depends on the broadcaster. What should happen is that your 2mix gets broadcast “as is” - since that’s what you supplied. I’m sorry to say though that in many (or maybe most) cases, the broadcaster is going to upmix your 2mix into a faux-5.1 signal… and this is deeply related to the idea of static metadata being applied at the emission point, thus requiring all content to match in form-factor.

Damian: Ouch! I didn’t realize there was any possibility that stereo mixes were being upmixed. Note to distributors: update your specs to the 21st Century!

Now, a major broadcaster like Bell Media is sending out multiple channels simultaneously, and to reduce the infrastructure cost of this, as you’ve mentioned, data compression is used on both the audio and the video. Just generally, are we talking satellite feeds between broadcasters and fibre-optic networks to consumers? What are the general paths?

Michael: All of it. The feed from the Broadcast Centre (or Station) to the BDUs (cable companies, etc) is compressed, and it’s generally a “pass-through” by the BDU… handing our signal directly to the consumer. In some cases though, the carrier subjects the signal to further compression before passing it on. Of course, our OTT signals (web streaming, mobile streaming, VOD, etc) are also compressed, and even the signals emitted by the transmitters (in the case of our terrestrial channels, aka the stuff you receive via an antenna) are compressed. The only thing that really varies is the type or severity of the compression; the presence of compression itself is immutable.

Damian: And to be clear, all of this data compression is to streamline the information to allow the hardware to pass multiple channels worth of data simultaneously. It’s all about getting as much information out there as possible, knowing that some data loss is acceptable. 

What About Dolby Atmos?

Turning the page, I know from personal experience that you and your team have been mixing in Dolby Atmos for years now. Can you broadcast in this format?

Michael: Theoretically, yes of course. We’d broadcast it as something called EC-3+JOC (or DD+JOC) which is basically Dolby Digital Plus (EC-3/DD+) combined with something called Joint Object Coding… all of which is effectively a placeholder while we wait for a full AC-4 eco-system to arrive. Regardless, these Atmos payloads would accompany broadcasts, compliant with the current state-of-the-art standard known as ATSC 3.0. In Canada, no one that I’m aware of is yet broadcasting (terrestrial over-the-air) in 3.0 - and certainly, no one is emitting Atmos. For us, our ability to transmit UHD and Atmos will be restricted to our so-called “digital” distributions - like our web, mobile and smart TV offerings, which collectively are known as “Over The Top” (OTT). 

Damian: Is Bell Media accepting Dolby Atmos files even if they don’t currently broadcast them, and if so, what are the requirements?

Michael: This is all still very new. We’ve dedicated ourselves to mastering Atmos content-creation less from a specific requirement from our partners and clients, and more so that we can be ready when we’re asked. By extension, this has allowed us to produce several seasons’ worth of programmes that will hopefully spend their lives being sold into various foreign markets over the coming years - a lifespan which we hope to have lengthened by having a “future proof” format like Atmos sitting on the shelf, despite the fact that these programmes will have originally aired in their HD & 5.1 variants. Regarding Atmos itself though, and the prospect of carrying it through the entire chain, it’s important to note here that when it comes to Atmos and AC-4, the requirement for metadata synchronization increases enormously: AC-3 utilizes frame-accurate metadata, while Atmos requires sample-accurate metadata!

Damian: I want to return to the topic of audio levels, which we discussed last time. Let’s say I have a commercial that is ‘in theory’ legal at -22 LKFS, since it’s inside the plus-or-minus 2 dB window set out in your broadcast spec. We both know it’s louder than the program. Does something in the broadcast chain alter the levels in any way, or does it play through the system at -22 LKFS and then hit a dynamic range compressor in a set-top box or TV?

Michael: Both. The Loudness processor in the Broadcast Transmission chain will attempt to hold the content within the A/85 window… A spot that arrives with a -22 average must, almost by definition, be louder than -22 for some of its duration… and that means there’s a good chance the Broadcast processor will lower the level slightly. Additionally, depending on how someone is listening, and how their equipment is configured, you may also be subject to the DRC processing in the set-top Dolby decoder.
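A simplified model of the decoder half of that answer may help. In the AC-3 system, the decoder attenuates playback by the difference between the dialnorm value and -31, so under a static dialnorm profile a spot that measures hotter than the target simply plays hotter; the figures below are illustrative, assuming a fixed dialnorm of -24.

```python
# Simplified model of AC-3 dialnorm behaviour under a static metadata profile.
def dialnorm_attenuation_db(dialnorm):
    # Decoders scale playback so the dialnorm level lands at -31 dBFS:
    # dialnorm is negative, e.g. -24 -> a 7 dB cut.
    return 31 + dialnorm

def played_level_lkfs(measured_lkfs, dialnorm):
    return measured_lkfs - dialnorm_attenuation_db(dialnorm)

STATIC_DIALNORM = -24
for spot in (-24.0, -22.0):   # programme at target vs. a "legal but loud" spot
    print(f"{spot} LKFS plays at {played_level_lkfs(spot, STATIC_DIALNORM)} LKFS")
# -24.0 plays at -31.0; -22.0 plays at -29.0, still 2 dB louder after decode
```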

Damian: Really then, it’s a losing proposition to aim for the top, as far as levels go. Some regulators, broadcasters and distributors narrow that dialogue normalization window to just plus or minus 1 dB to put the squeeze on commercials specifically. Is this something you feel ought to be done?

Michael: The Loudness War is just like any other Arms Race… it’s an ebb and flow of compatibility and competition. There are always going to be folks wanting to push the envelope to try and create an advantage for themselves. In my experience, the longest distance between two points is always a shortcut. There’s no substitute for doing it the right way - in the long term, you’ll be more consistent, more predictable and ultimately happier with how your stuff translates.

Damian: Is there anything else you’d like to add?

Michael: If you’re not a card-carrying member of the propeller-head society and are just a beleaguered mixer trying to make your way in the world, I get that this is seriously un-sexy stuff to think about. But it’s massively important. Since our hallmark as professionals should always be our ability to warranty our work, we ignore this stuff at our peril. It would be like a race driver refusing to learn about the inner workings of their car’s transmission! Get dirty! Dig in and figure it out. You’ll be better for it.

Damian: Thanks again for all of your brilliant insight. 

Michael: Always a pleasure Damian. Can’t wait for the next one!

It’s a real pleasure to have someone around to speak with in detail about these widely misunderstood parts of the process downstream from my work.

In Conclusion

I’ve been fortunate to spend a lot of time upstream: not only have I sat in edit suites with a good many top-quality picture editors, I’ve also worked ‘hands-on’ with Avid’s Adrenaline and Media Composer video editing systems during Olympic Games, as audio recording, editing and mixing tools. I regularly import and export my own tracks and create my own EDLs when I’m working in a Media Composer-based broadcast environment. I know that part of the ecosystem well enough to have a bit of insight into the workflow before audio post, but now, thanks to Michael, I have a lot more information about what happens after my contribution is completed.

I’m not sure I’m an “expert” in this field, but I do know that, in order to become an expert at anything, I need to know not just where my materials come from or how to do a job, but how to deliver my creations to the ears that will hear them. For me, to become a Pro Tools Expert in TV Audio Post Production, I need to know more about the things that aren’t Pro Tools, so that what I deliver holds up downstream.
