Production Expert

Professional Techniques For Mixing Dialogue For Picture

Do you struggle with your dialogue mix or are you a total pro? Is your end product underwhelming and hard to understand, or pristine and punchy? Not sure if you’re harming or helping your audio? Should you pan dialogue or leave it in the centre? In this article, Damian Kearns discusses his techniques for dealing with the most important aspect of any post audio mix: Words. 

Intro

Dialogue is the fundamental audio element in storytelling, as the script, performances and technical success of the mix all hinge on the clarity and performance of the words. After spending years organizing, editing, repairing, premixing, and mixing for TV, film, internet, podcasts, corporate content, sporting events, news and current affairs items and more, I believe I have a successful formula for delivering the best possible dialogue for my clients. In this article I’ll outline the various steps I take to help my clients get the most from their words and actors, subjects or interviewees. I’ll also discuss my dialogue mixing philosophy: the reasoning behind my decisions. The idea is to share my workflow, so that readers can parse the info for themselves.

Before anything else is written, I want to underscore the reverence I have for the post production process. I know different readers are involved in differing workflows but I believe this article can be helpful for a wide range of post audio professionals. And, as always, this is my main intent: People learn from each other’s examples and adapt the best ideas to their own style. Let’s walk through the steps.

Step 1: Get It In, Get It Organized

I start off most of my projects by providing the picture editor with a list of things I’m hoping to see when I receive their tracks. As you might expect, I ask for 24 bit, 48kHz, .wav files, typically embedded inside an AAF or OMF with 150 frame handles. I arrived at the 150 frame handle target when I used to do a lot of variety shows, because five seconds is a reasonable amount of crossfade time between big laughs and incoming and outgoing applause. Three seconds was never quite enough. 
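
For anyone working at other frame rates, the handle arithmetic is easy to sanity check. Here’s a quick, purely illustrative Python snippet (not part of any delivery spec) showing that 150 frames works out to roughly five seconds for 29.97/30 fps broadcast material, and noticeably longer at film rates:

    # Illustrative only: convert between handle length in frames and seconds.
    # The 150-frame handle above is ~5 seconds at 29.97/30 fps material.

    def frames_to_seconds(frames: int, fps: float) -> float:
        return frames / fps

    def seconds_to_frames(seconds: float, fps: float) -> int:
        return round(seconds * fps)

    for fps in (23.976, 24.0, 25.0, 29.97, 30.0):
        print(f"{fps:>6} fps: 150 frames = {frames_to_seconds(150, fps):.2f} s, "
              f"5 s handle = {seconds_to_frames(5.0, fps)} frames")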

Something else I advise is that if the picture editor names their tracks as they edit, their delivery tends to be well organized and easy for me to interpret. Keeping mono elements on mono tracks and stereo elements on stereo tracks is another big ‘ask’ from me, but if they are naming their tracks, this can be part of the whole editorial process from the very start. I know the cynics will say this is all great info that’s likely to be ignored by the editor. To those people I say: it’s better to ask up front than to sound like you’re making excuses for taking longer to do your job when things aren’t delivered in a reasonable fashion. It’s better to have asked for it all in print as early as possible, so if someone doesn’t heed the suggestions, it’s a simple matter to justify any overtime spent putting things right. If you cc your Producer or Post Supervisor when sending your delivery requests, you can expect an ally amongst your project’s management.

Once the tracks are imported to my DAW post audio session, I make a copy of all the imported tracks and then deactivate and hide the original tracks so I can always come back to them if needed. I then like to spend time moving actors/interviewees/characters to their own dedicated tracks. I sift through multiple mics to find the best sounding ones for each person and once I’m satisfied I have everyone where they need to be, I’m set for the next step.

Pro Tip: I colour code my dialogue tracks so I know which mic channel I used. In my sessions, gold means I used mic channel 1 and the reddish colour I used for Marvin means I used mic channel 2.

On CNN’s “History of the Sitcom” I had 60 or more interviewees to deal with in each episode of the show so I couldn’t dedicate a track to each person. In that case, I created four to six tracks and checkerboarded interviewees throughout each segment of the show. I’m not a huge fan of checkerboarding my mics but there comes a point where there are simply too many to contend with on individual tracks. For most of my shows, I will go up to 25-30 tracks of dialogue before I start considering checkerboarding mics in order to manage track count. On these larger shows, I tend to premix across all tracks and then when my premix is complete, I’ll move everyone around until I’m left with eight to ten tracks.  

Pro Tip: In Pro Tools, when I have to checkerboard my mics, I create selection markers for each character. Each marker recalls a timeline selection for that character, so I can copy plugin and gain settings to another part of the timeline where the same character pops in.

I name my dialogue tracks for the person who occupies a given track, but in my Pro Tools template each dialogue track also has its own window configuration, noted beside the name (W11 beside Goulson on the first track, for example). A window configuration lets me use a keyboard shortcut (the number, then asterisk, e.g. 11*) to recall the plugin windows for that dialogue track, as well as my master meter so I can see where my dialnorm is at. So W11 opens up my first dialogue track’s plugins. I spent quite a lot of time setting these configurations up but wow, does it ever save me a lot of grief while I’m working.

Why not start my numbering at W1? Well, the first nine window configurations in my template are dedicated to various tools or toolsets I use for signal processing. W1 is my dialogue toolbox.

W10 is my narration track, which isn’t location dialogue and, for most deliverables, is printed through a separate path from the other dialogue. This makes W11 my first dialogue track. I find this easy to remember and I’ve worked this way for years. If you work in a DAW that supports storing and recalling window configurations, you’ll likely find this a powerful, speedy workflow.

Step 2: Level Up

Once organized, I like to look at any clip gain or volume automation that might have been included in the audio I imported to my session. In rare cases there are volume dips that should have been rendered fades, so I put fades in where they’re supposed to be, guided by the editor’s volume automation. Then, after I’m confident I have the fades where they need to be, I remove all the gain (both clip gain and volume automation) so everything is nulled on the dialogue tracks.

At this point, I use an offline dialnorm analyser like iZotope’s RX Loudness Control (RXLC) or, if I’m working in Pro Tools, Avid’s AudioSuite Pro Limiter Loudness Analyser (measured in LUFS). What I really want from these readings is a sense of how much clip gain boost or cut I need to get things to my target dialnorm, which for TV is typically -24 LKFS, without my audio peaking above 0 dB Full Scale (0 dBFS). For really dynamic audio, I find I can’t quite reach -24 LKFS without going above 0 dBFS, so I boost the clip gain until I’m near the peak limit and use my track’s real time plugin compressor makeup gain to add the remaining gain I need to hit my dialnorm target.
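
As a hedged, numbers-only sketch of that decision making, here is the arithmetic in Python. The loudness and peak readings are made up; in practice they would come from RX Loudness Control or Pro Limiter’s analyser:

    # Hypothetical readings from an offline loudness/peak analysis.
    TARGET_LKFS = -24.0        # typical TV dialnorm target
    PEAK_CEILING_DBFS = -0.1   # stay just under full scale for offline processing

    measured_loudness = -31.5  # LKFS/LUFS (made-up reading)
    measured_peak = -4.8       # dBFS sample peak (made-up reading)

    needed_gain = TARGET_LKFS - measured_loudness    # total boost required
    headroom = PEAK_CEILING_DBFS - measured_peak     # boost available before clipping

    clip_gain = min(needed_gain, headroom)           # applied as clip gain
    makeup_gain = max(0.0, needed_gain - clip_gain)  # left for the compressor's makeup gain

    print(f"Clip gain: {clip_gain:+.1f} dB, compressor makeup: {makeup_gain:+.1f} dB")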

Why do I take such great care to keep my dialogue from exceeding 0 dBFS? For real time processing it doesn’t really matter, but if I end up doing any offline processing in iZotope’s RX or through offline AudioSuite plugins like the ones in my W1 window, the clips will be written to my session with 24 bit fixed point processing, which means peaks above 0 dBFS will distort. I’ve encountered this many times with other audio editors who haven’t had their gain structure under control; they’ve ‘cleaned up’ audio with offline noise reduction and, in the process, distorted the audio they were meant to restore. Gain structure is always critical to keeping things sounding their best. If you retain nothing else from this article, keep this one thought in mind: the goal is to make dialogue sound better.

Why do I use clip gain instead of volume automation? Clip gain happens pre-fader and pre-processing. If I have everything properly set at the clip gain level, I’ll probably only need to turn something up a few dB here and there during the final mix. Mixer faders only have 12 dB of gain above unity, so gain staging to the ‘0’ point on my fader should give me all the space I need.

Years ago, when I was a night time post production assistant, I used to apply the same principle to my transfers from the dialogue editors’ workstations to tape. I was a junior operator back then and I remember one of the mixers saying to me, “I don’t know what you’re doing but thanks for keeping my faders at 0 dB”. It’s the perfect place to start the mix balancing act.

Step 3: Go With The Signal Flow

Real time plugins can be both a blessing and a curse. They’re a blessing when they’re making the most out of your sound and a curse when they’re applied as a “Hail Mary, hope this works!” stack of processing modules. By this I mean signal chains built by piling on plugins to make up for a lacklustre sound, which tend to contain redundant processing and little or no gain staging. If we apply a degree of logic to our choices, we can ‘bless’ our dialogue with a deft touch, rather than curse it by trying to wring out some sonic element that isn’t there. For any real time plugin, I suggest enabling automation. More on this later.

For real time track processing I favour:

  • Slot A: Any real time noise reduction. If more than one plugin is needed, my hierarchy is NR, then De-Reverb, then possibly Mouth De-Click.

  • Slot B: High and Low Pass Filters

  • Slot C: Dynamic Range Compression

  • Slot D: Equalization

  • Slot E: De-Essing (this step is optional in many cases)

Why place noise reduction first in the chain? The answer lies in the philosophy that removing unwanted elements from a sound is inherently less destructive than adding more of something that mightn’t be there in the first place. This jibes with the mindset that using equalizers to cut unwanted frequencies adds less colour to a sound than boosting frequencies that aren’t actually there. What’s more, when we remove low end content specifically, our compressors don’t need to work as hard, because bass slams into dynamic range compression and drives the whole signal lower. Noise reduction can do that removal, and do it very well, if carefully applied.

Real time noise reduction from companies like CEDAR, iZotope, Waves, Acon, Accentize and others can end up doing a lot of the work of an EQ, in the sense that NR removes unwanted frequency content. Noise reduction doesn’t have to happen in real time, as we all know. This ‘Slot A’ could well be swapped out in favour of offline noise reduction software options like iZotope’s RX or an AudioSuite plugin but if you think about it, this offline workflow still puts NR as the first processing element in the chain since it’s being applied at the clip level.

As I always say, the less noise reduction applied, the better. The last thing we want to do is over-process performances to the point the audience is (painfully) aware of our work. It happens that sometimes it’s unavoidable but mostly, I think, we can get away with removing less noise than our brains tell us to take away. Real time noise reduction gives us the advantage of nondestructive, ‘in context’ auditioning of our reduction settings which is why I favour this method in a mix. There’s nothing more valuable to me than hearing my dialogue with all other elements in place. Sometimes noise isn’t noise, it’s ‘air’ or ‘ambience’. Context is key.

Following noise reduction are filters. Think of them as another chance to tame the top and bottom of your signal’s frequency spectrum before your compressor has to do it for you. Something like Zynaptiq’s UnFilter, which I have owned and used for years, could also be used here, as it is what I like to refer to as a ‘timbral focuser’. To me, UnFilter is very much in the family of filters and EQ’s. The idea here is to sculpt the energy at the bottom and top of the spectrum, carving out unwanted mud, hiss, high frequency tonal content or noise from our dialogue. My philosophy is that the more unwanted audio I can remove from a sound before compression and EQ, the less work these processors have to do.
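
As a bare-bones illustration of the Slot B idea, here is a high and low pass pair in Python with SciPy. The corner frequencies are illustrative, not a recommendation; in practice they would be set by ear for each voice:

    import numpy as np
    from scipy.signal import butter, sosfilt

    def dialogue_filters(x, sr, hpf_hz=80.0, lpf_hz=16000.0, order=4):
        """High- and low-pass a mono dialogue clip (cutoffs illustrative)."""
        sos_hp = butter(order, hpf_hz, btype="highpass", fs=sr, output="sos")
        sos_lp = butter(order, lpf_hz, btype="lowpass", fs=sr, output="sos")
        return sosfilt(sos_lp, sosfilt(sos_hp, x))

    # Example: run one second of noise through the pair at 48 kHz.
    sr = 48000
    y = dialogue_filters(0.1 * np.random.randn(sr), sr)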

Now that filters have been set, it’s time to look at compression. 

I’m a big fan of starting with a low compression ratio of 2:1 and a threshold at about -20 dB or slightly above. Compressors with ‘lookahead’ (the ability to pre-read incoming audio and automatically adjust attack and release characteristics to optimise processing) work really well for dialogue, but more passive compressors are perfectly fine too. Sometimes I’ll engage compressors with auto-gain control, if I’m looking to speed through a mix or for a certain ‘quality’ that a specific compressor might have. Compressors can add lovely colour and even help fill out aspects of an audio track, by fattening things up a little.
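
On paper, that 2:1 ratio and roughly -20 dB threshold starting point looks like the static curve below (a deliberately minimal sketch; real compressors add knee, attack and release behaviour on top of this):

    def compressor_gain_db(level_db, threshold_db=-20.0, ratio=2.0):
        """Static gain change (dB) of a simple hard-knee compressor."""
        if level_db <= threshold_db:
            return 0.0
        # Above threshold the output only rises 1/ratio dB per input dB.
        output_db = threshold_db + (level_db - threshold_db) / ratio
        return output_db - level_db

    for level in (-30, -20, -12, -6):
        print(f"input {level:>4} dB -> gain {compressor_gain_db(level):+.1f} dB")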

Often, I’m using compressors inside a channel strip like Avid’s Channel Strip, Universal Audio’s SSL 4000 E Channel or API Vision Channel Strip, iZotope’s Neutron 3 (Alloy 2 is still my favourite iZotope Channel Strip of all time), or McDSP’s Channel G Console. Channel Strip plugins are very convenient to use and what’s cool is there’s usually a quick button press to move an EQ pre or post compression to hear what sounds better. I really like that. I used this method on a recent mix with the UA API Vision Channel Strip engaged on some voiceover and settled, once again, on placing my EQ post dynamics. This doesn’t mean it’s always the right place to put an EQ but I do find if I’ve applied NR and filters, the compressor smooths things out and the EQ adds colour. 

Compressor attack and release times are critical to get right. We can end up sawing off transients with quick attack times but something else to avoid is harmonic distortion, particularly on voices that have a lot of low frequency character. If you want to hear harmonic distortion in action, strap a signal generator to a channel strip, set it to 80 Hz, and then start playing with attack times and thresholds. You should start to hear a bit of an edge creep in as you play around. I was able to really hear a lot of distortion by playing with Avid’s Fairchild 660 plugin with its fastest time constant. Backing off the threshold reduces the distortion very quickly but so does moving to a slower time constant or release time. Suffice to say, this harmonic distortion can add grit to your dialogue. Some people like it, some don’t. My opinion is I’ll use distortion when I need something to ‘cut’ in a dense mix but I prefer multiband exciters for this since I can focus the distortion on selected parts of the frequency spectrum. The thing to remember is if we’ve already tamed the bass with NR and a filter before we hit the compressor, we probably won’t hear much or any distortion. This is yet another reason to deal with bass early in the processing chain or by processing the audio clips with offline software.
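
To make that 80 Hz experiment concrete without a signal generator and a channel strip, here is a rough digital stand-in (not the Fairchild plugin or any specific compressor, and every setting is illustrative): a naive envelope-follower compressor with a near-instant attack modulates its gain within each cycle of an 80 Hz tone, and odd harmonics creep into the spectrum. Slowing the release or raising the threshold in this sketch tames the harmonics, mirroring the behaviour described above.

    import numpy as np

    sr = 48000
    t = np.arange(sr) / sr
    x = 0.5 * np.sin(2 * np.pi * 80.0 * t)    # 80 Hz test tone at about -6 dBFS

    def compress(x, sr, threshold_db=-20.0, ratio=4.0, attack_ms=0.1, release_ms=5.0):
        """Naive peak-following compressor; very fast times let the gain ride the waveform."""
        atk = np.exp(-1.0 / (sr * attack_ms / 1000.0))
        rel = np.exp(-1.0 / (sr * release_ms / 1000.0))
        env, out = 1e-9, np.zeros_like(x)
        for i, s in enumerate(x):
            mag = abs(s)
            coeff = atk if mag > env else rel
            env = coeff * env + (1.0 - coeff) * mag
            over = max(0.0, 20 * np.log10(env + 1e-12) - threshold_db)
            out[i] = s * 10 ** (-over * (1.0 - 1.0 / ratio) / 20.0)
        return out

    y = compress(x, sr)

    # Compare the 3rd harmonic (240 Hz) to the fundamental after compression.
    spec = np.abs(np.fft.rfft(y * np.hanning(len(y))))
    freqs = np.fft.rfftfreq(len(y), 1 / sr)
    h1 = spec[np.argmin(np.abs(freqs - 80.0))]
    h3 = spec[np.argmin(np.abs(freqs - 240.0))]
    print(f"3rd harmonic level: {20 * np.log10(h3 / h1):.1f} dB relative to the fundamental")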

Mercifully, we do live in an age with ‘lookahead’ and smart compressors that can gauge how to deal with incoming frequency content before it hits the plugin’s input. 

Equalizers have come a long way since I started mixing in the 1990s. They can be quite intelligent like Sonible’s smart:EQ 3 or Sound Radix’s SurferEQ 2, feature packed like FabFilter’s Pro-Q 3, ultraclean like Avid’s Channel Strip or Nugen Audio’s SEQ-S, or just be straight up great sounding old school EQ’s like Waves’ Kramer PIE or virtually anything from McDSP and Universal Audio. There are also dynamic EQ’s and resonance suppressors like Oeksound’s Soothe 2 that can function, often on a per frequency band basis, as a multi-band compressor/expander. The question only you can answer is: what do I want my equaliser to do?

I know a lot of people favour the modern smart EQ’s and dynamic EQ’s and then there are other operators like myself who are quite content with manually automating everything on a channel strip to sculpt the sound exactly as I want. There’s no right or wrong here but I do think if your EQ isn’t being automated, chances are you are underselling areas of your mix that would benefit from a little boost or cut. There are certain superhero TV series I’ve heard where the mixers obviously weren’t automating their equalisers because there are some brutal differences in spectral quality between perspective cuts and takes that could have been finely tuned to sit nicely together with a little bit of EQ automation. 

Continuing that thought, to round out this section, noise reduction, filters, compressors, equalisers and de-essers/dynamic EQ’s should be automated all the time because no one performs or speaks like a robot. There are dynamic changes, mic position differences, perspective changes, alt lines, ADR lines and set noises that should be reacted to, which is what a mix is: A mix is a canvas of emotional and technical reactions played out in real time but sweated over for hours, days or months.  Whenever I’ve inherited a mix with no automation it always seems to me that the only things reacting are the compressors and that is not mixing, that’s a ‘product’, bereft of the art of mixing.

One last thing about the noise reduction. If it’s in the first slot, as I propose, then rendering the noise reduction will free up the associated DSP without changing the way the audio clips interact with any filters, compressors, EQ’s and de-essers downstream. Regardless of whether or not you agree with everything else I’ve written, I think we can all agree that computer-based mixing definitely entails managing system resources effectively, which means continually addressing RAM, disk space, CPU, DSP and GPU usage.

Step 4: Premix?

Assuming we’ve got everything edited, levelled, cleaned up, compressed and EQ’ed, that’s a premix. I want to submit several more ideas as potential aspects of a good premix.

  1. Use a mono reverb to fix any ‘Franken’ edits between words or sentences where the natural decay of the outgoing clip is truncated. I’ve used a variety of reverbs to rebuild Franken edits over the years. Nothing I’ve tried beats LiquidSonics’ Cinematic Rooms. The parameter that makes it a real standout for fixing brutal dialogue cuts is the proximity knob. I’m able to reintroduce the character of the bass building up in a room with this setting. The repairs I make sound very clean using a mono instance of this plugin on mono microphones. My default dialogue setting is simple: I dial in the reverb time and the proximity and yes, I automate these features because no two rooms sound exactly the same. The pre-delay is also something I mess around with, as well as the size and the bloom. I have used various algorithmic reverbs in the past, so if you don’t have Cinematic Rooms, experiment with your favourite mono reverb that emulates real world spaces. 

    There are also tools like Chameleon by Accentize or iZotope’s Dialogue Match, which aim to emulate natural resonance to repair Franken edits. I’ve been using Dialogue Match since the day it came out. The advantages of this sort of software include being able to quickly edit a reverb model and store it as a plugin setting for later recall. This is extremely helpful on very large scale dialogue premixes. The only downside to Dialogue Match is the EQ Match module doesn’t work very well. That being said, I don’t need to EQ Match when I’m working on the same character/interviewee with the same mic.

  2. Move off-camera dialogue and narration around to sit better with music and create space for SFX and on-camera dialogue. I’ve been known to snip unnecessary words, even sentences, to create breaths between lines of script or to open up the mix so music or sound effects can have a proper moment. Some might say this is a dialogue editor’s job, but my feeling is an editor will rarely have all the mix pieces in place during editorial. Creating these pauses and breaths is the mixer’s job because it’s all about balancing elements and creating rhythmic flow; it’s about the structural interaction of audio elements. 

  3. If a dialogue line doesn’t quite have the right inflection for the mix, consider using a pitch contour program like The Cargo Cult’s Envy 2. While this should be part of dialogue editorial, it’s often the case that our music or sound effects set a tone that an editor might not have heard, if those elements were absent while they were working. This is some ‘next level’ fine tuning stuff but it can take a mix and elevate it. Subtle changes to pitch can really push the balance beyond the ‘connect-the-dots’ stage to a ‘work of art’ mix.

  4. Panning dialogue is typically something I do in the premix stage, since I’m usually premixing while listening in surround sound. The trick with dialogue panning is basically to talk yourself out of doing it unless it’s critical to do so (just kidding).

    The centre speaker in a surround mix is the dialogue channel. It was originally instituted to anchor dialogue so anyone sitting anywhere in a theatre could hear the words well. If I need to walk someone out of the centre of the mix I typically start turning them down so by the time they’re off to the left or right, the dialogue might be 6 dB lower, to account for perspective. I might also automate my low pass filter to intentionally remove a bit of top end for perspective.

    Panning, when it is done, is best judged by referencing what the effect will be on a stereo fold down. Some people ‘shoulder’ dialogue between speakers and then hear a level boost in the stereo fold down. This can be mitigated by reducing a pan’s centre percentage or using delay on a separate copy of the dialogue in question.
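
Here is a hedged numeric sketch of that fold down boost, assuming constant-power panning between the Centre and Left speakers and a standard -3 dB centre-channel downmix coefficient (actual pan laws and downmix coefficients vary with the console and the deliverable):

    import math

    def fold_down_level_db(frac_toward_left, centre_coeff_db=-3.0):
        """Stereo fold down level of a source shouldered between C and L,
        relative to the same source hard-centred. Assumes Lo = L + c*C, Ro = R + c*C."""
        c = 10 ** (centre_coeff_db / 20.0)
        theta = frac_toward_left * math.pi / 2.0      # 0 = centre, 1 = hard left
        g_c, g_l = math.cos(theta), math.sin(theta)   # constant-power pan gains
        lo, ro = g_l + c * g_c, c * g_c
        ref_power = 2 * (c ** 2)                      # hard-centred reference
        return 10 * math.log10((lo ** 2 + ro ** 2) / ref_power)

    for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"{int(frac * 100):>3}% toward L: {fold_down_level_db(frac):+.1f} dB vs hard centre")

In this sketch the boost peaks when the source sits roughly halfway between the two speakers, which is exactly the ‘shouldered’ position described above.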

Step 5: Up Is Louder!

The final stage of dialogue mixing is the fader stage. The goal should be to hear every word clearly, in the right perspective. The best way to do this is with a fader or some other track volume controller. I tend to do some gross clip gain adjustments on dialogue when people start drifting down in level, like at the end of a sentence or when they get quiet for dramatic effect, but I leave the final levelling to be done with a fader in real time. Because I’m premixed and gain staged to my fader’s 0 dB setting, I’m usually just wiggling the faders a bit to make each syllable consistent and clear. I’ll automate EQ to maintain the sparkle and keep bass sounding neat and tidy, and also automate my compressor’s threshold, ratio and makeup gain along the way. This is the stage where any effects like reverb, delay, distortion, futzing and modulation are added. By the time I’m through this stage, I’m ready for playback.

If I’ve gainstaged my dialogue properly at the onset, I really don’t have to worry too much about hitting my dialnorm target. I ought to be close even without looking at my meter, as long as my monitors are set at the right sound pressure level. Here are some things to understand about monitoring: 

  1. If you’re a dB louder than your target, turn your monitors up by a dB so next time you mix, you’ll hit the mark.

  2. If your loudness range (LRA) is too wide and exceeds your delivery spec, turn your monitors down and work through the mix again, pushing the low level elements louder to close that dynamic gap.

  3. Multiple monitors are the way to go. I use Source Elements’ Source-Nexus to route audio from my DAW in real time to my control room’s TV, or to the internal speakers of my Mac’s display. The TV and the computer monitors are unforgiving so having these lower fidelity alternate paths is great for ensuring that every dB of a mix is maximized. Again, if you’ve tried to make every syllable audible and uniformly levelled, a TV or a computer monitor or a phone will let you know how well you’ve done. 
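
Related to points 1 and 2 above: a quick way to spot check a rendered fold down against the target outside the DAW is an open-source BS.1770 meter such as pyloudnorm. Note this reads overall programme loudness rather than dialogue-gated dialnorm, and the file name below is just a placeholder:

    import soundfile as sf       # pip install soundfile
    import pyloudnorm as pyln    # pip install pyloudnorm

    TARGET_LKFS = -24.0

    data, rate = sf.read("final_mix_stereo.wav")   # placeholder path to a rendered fold down
    meter = pyln.Meter(rate)                       # ITU-R BS.1770 meter
    loudness = meter.integrated_loudness(data)

    print(f"Integrated loudness: {loudness:.1f} LUFS "
          f"({loudness - TARGET_LKFS:+.1f} dB relative to target)")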

A Note About Downstream Processing

Dialogue feeds into a dialogue submaster and then on to a mix submaster in all my sessions. The dialogue sub is a great place to add a little weight or sparkle to the overall dialogue track. I tend to use a nice sounding multi-mono EQ on my 5.1 surround dialogue sub, so if I only want to shine up the centre channel, I can unlink the plugin and work on each channel of my 5.1 independently. I usually have a peak limiter on this sub so that I can send my tightly controlled dialogue to print to any stem or deliverable and know it’s already legal, as per my delivery specs. My 5.1 Mix submaster has a very light compressor on it and an overall mix limiter as well. This final compressor allows me to tame my loudness range and also to ensure my stereo fold down is tight and punchy.

If you operate Pro Tools and would like to look at my template, it’s here. There are helpful notes along with it. You’ll note I group my dialogue into a mix group which I use during the fader stage of my final mix.

Final Things To Consider

Although I personally favour manual level adjustments over software-based auto gain control, I sometimes employ levelling tech in my mixes. I’ve friends who use Waves Vocal Rider for their short format commercial mixes, but I tend to use iZotope’s RX 9 Advanced standalone software to even out dialogue sometimes. There’s a Leveller module in there that does the job quite nicely, once the right settings are in place. Unfortunately, iZotope only offers this module in the Advanced version of the software. A product I’ve been meaning to try (and will) is WTAutomixer V2. If you mix a lot of interview style shows or work with multiple character mics, this is a product to demo.
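
For anyone curious what levelling software is doing under the hood, here is a deliberately crude sketch of the idea, nudging windowed RMS toward a target. It is not the algorithm inside RX’s Leveller, Vocal Rider or WTAutomixer; it only illustrates the concept:

    import numpy as np

    def simple_leveller(x, sr, target_rms_db=-23.0, window_ms=400.0, max_gain_db=10.0):
        """Very rough dialogue leveller: ride gain so each window's RMS nears the target."""
        win = max(1, int(sr * window_ms / 1000.0))
        out = np.copy(x)
        gain_db = 0.0
        for start in range(0, len(x), win):
            seg = x[start:start + win]
            rms_db = 20 * np.log10(np.sqrt(np.mean(seg ** 2)) + 1e-12)
            desired = float(np.clip(target_rms_db - rms_db, -max_gain_db, max_gain_db))
            gain_db = 0.5 * gain_db + 0.5 * desired   # smooth gain changes between windows
            out[start:start + win] = seg * 10 ** (gain_db / 20.0)
        return out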

It’s just as likely, for densely packed mixes, that I will key a multiband compressor on my music, using a send from my dialogue and narration tracks. I favour a multiband compressor instead of a single band compressor, as I like to narrow the scope of my level dips to specific areas in the frequency spectrum. I find this very useful for creating room for dialogue, particularly in the midrange.
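
A simplified, single band version of that keying arrangement looks like the sketch below; the multiband approach described above would apply the same gain computation only to a band-passed copy of the music (typically the midrange) and sum it back with the untouched bands. The threshold, ratio and release values are illustrative:

    import numpy as np

    def duck_music(music, dialogue, sr, threshold_db=-35.0, ratio=3.0, release_ms=250.0):
        """Reduce music level whenever the dialogue key signal sits above threshold.
        Single band sketch; a multiband version would duck only a filtered band of the music."""
        rel = np.exp(-1.0 / (sr * release_ms / 1000.0))
        env = 0.0
        out = np.zeros_like(music)
        for i in range(len(music)):
            key = abs(dialogue[i]) if i < len(dialogue) else 0.0
            env = max(key, rel * env)                 # instant attack, smooth release
            over = max(0.0, 20 * np.log10(env + 1e-12) - threshold_db)
            out[i] = music[i] * 10 ** (-over * (1.0 - 1.0 / ratio) / 20.0)
        return out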

I also tend to approach breath reduction manually. Unless a breath is really bothering me or I need to chop up voiceover so I can shift the pieces around in time, I tend to leave in the natural sounding ones. Again, there are various tools out there to get the job done via software, but this element, breathing, is something I only trust myself to tackle.

One element I didn’t delve into is room reverb reduction. I have long used Zynaptiq’s UnVeil for real time reduction but about 50% of the time I use RX’s Dialogue De-Reverb to get the job done. Accentize has a tool called DeRoom Pro which is on my list of software to check out.

Dialogue Epilogue

Dialogue was always the purview of the lead mixer in any mix theatre I attended in my early days. I was fortunate enough to have learned from more than 20 different film and TV re-recording mixers before I had to start dealing with the spoken word, so I wasn’t at a complete loss once I got to the ‘big chair’, but developing my current approach still took years and a lot of trial and error, as well as many internal debates.

Like any art, there’s some science that can be applied but mostly getting better at dialogue mixing happens through experimentation, learning from people who’ve been doing it longer, and lots of practice. If I could offer one final tip, it would be to avoid relying on software to deliver your product for you. A good mix is one that is well balanced, supports the plot, brings those goosebumps up on the arms at all the right places, and transports the listener. There are a million ways to arrive at the mix’s destination but without your emotions guiding your decisions, your destination will always be a heartbeat out of reach. Emotions, I believe, are the most important tools we’ve got.
