
How AI Is Changing Audio Post-Production

Image courtesy of Hitesh Choudhary

In the first article, Mike Thornton examined how artificial intelligence is impacting music production. In this second of two articles, he takes a deep dive into how artificial intelligence is changing the audio post-production sector: how text-to-speech and speech-to-text are already impacting post-production, how generative AI is starting to shape sound effects creation, how stem creation from mixed tracks is coming of age, and what’s next for AI in audio post-production.

We have already seen generative AI have a growing impact on text generation, code building and image creation. It's not perfect. In fact, it is far from perfect, but it is getting better. Large language models like ChatGPT are not always accurate, and we have all seen AI-generated images of people with the wrong number of fingers. But if you treat AI as an assistant whose work you have to check, sending back anything that isn't good enough to be reworked, it becomes a genuinely useful tool. With this in mind, how is artificial intelligence impacting audio post-production?

Avid has been undertaking research into a number of areas of audio post-production, and in the article Avid and the Future of AI: Faster Media Creation, Media Composer developer Rob Gonsalves speculates about where he sees AI going in the future…

“AI is expected to grow exponentially and have a transformative impact on media production. AI could someday suggest relevant clips or segments based on its analysis of a project's context, acting like an assistant to the human creator working with large volumes of footage or audio files. AI also could become increasingly involved in video and audio processing, like color correction, reformatting, and similar audio tasks. It may support pre-production processes by providing an idea of what a scene or project might look like, helping to shape creative direction.

As demonstrated in our HPA 2023 talk, generative AI technology will allow music, movie, television, script, and story creations to be done more creatively and efficiently with recommendations to assist the creative process. The world of asset and metadata management will change to be based on content that is linked together using knowledge management technologies. 

As discussed in the April 2023 SMPTE Motion Imaging Journal paper, the areas of search will change to queries that provide richer contextual results than a simple match to a keyword. Media operations will change to be based on data and analytics driven automation. Cost savings will be obtained due to insights controlling various parts of the media creation process to be more efficient. Interactions with the application will change to accessible models that do not depend on a UI button click, but natural language-based descriptions of what you want to achieve.”

With those thoughts in mind, let's take a look in more detail at some of the areas AI is already having an impact in audio post-production before coming back to look at what the future could bring.

Speech To Text

In the same article, Avid states that it has been involved in AI for ten years already.

“Avid was one of the first companies to incorporate forms of AI into our solutions including PhraseFind and ScriptSync.

“PhraseFind AI is now equipped with modern transcription results. Rather than using phonetic searching, editors can use PhraseFind AI to catalog volumes of dialog-driven media, substantially improving search efficiency. Editors can effortlessly locate clips and begin editing directly within the search results of the displayed sentence and search word. PhraseFind AI also offers full text results in search and will initially support 21 languages for automatic language detection, eliminating limitations when working with foreign languages.

With ScriptSync AI, editors now can save time when creating scripts from clips and automatically align media with text in the script window. For example, by eliminating time-consuming manual sorting of dailies and aligning media with the script text, this streamlined process expedites project timelines and allows for smoother production workflows.

Avid has an AI-related partnership with Microsoft, having integrated Microsoft's Azure media indexer into our asset management solutions. This AI tool provides transcriptions and allows customers with extensive footage to automate the identification of elements like logos and people's faces. It can even determine the number of people in a shot or whether it's a day or night scene - tasks that would normally require tedious labor logging this information manually. Avid has exposed AI tools working with alliance partners such as x.news for AI-assisted social media research. This enables journalists and producers to perform more efficient topic-based research.

A new AI development that Avid is utilizing is semantic speech search, a data searching technique. It understands the nuance of language, interpreting keywords and the intent and meaning behind a search query. Another example is OpenAI’s Whisper, which does a good job on speech-to-text transcription, including multilingual versions. It knows about 100 languages and the quality is great. Once you have good transcription, it opens up closed captioning, subtitling, and search. If you have good transcriptions, you can read, skim, find, and select what you need. Then you can go back and choose sound bites or conversations for your project. Transcriptions help with unscripted material with a lot of dialogue that needs condensing.”
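
Whisper is open source, which makes it easy to see just how little code a basic transcription now takes. Below is a minimal sketch using the openai-whisper Python package; the audio file name is a hypothetical stand-in.

```python
# Minimal transcription sketch using the open-source openai-whisper package.
# Requires: pip install openai-whisper (plus ffmpeg on the system path).
# "interview.wav" is a hypothetical file name.
import whisper

model = whisper.load_model("base")            # larger models trade speed for accuracy
result = model.transcribe("interview.wav")    # language is auto-detected by default

print(result["language"])                     # detected language code, e.g. "en"
print(result["text"])                         # the full transcription

# Each segment carries timestamps, which is what turns a transcription
# into closed captions, subtitles and searchable markers.
for segment in result["segments"]:
    print(f"{segment['start']:.2f}s-{segment['end']:.2f}s: {segment['text']}")
```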

Text To Speech

As well as speech-to-text, which will help enormously with unscripted content production, AI technology is being developed so that users can produce voiceovers from text with no need to book a VO artist.

Tools like Siri have been around for a while now but are not good enough for professional production; however, the technology is developing at speed. Take LOVO as an example: their text-to-speech tool, Genny, can express more than 25 emotions. CineD’s Mascha Deikova asked it to read a poem using a young female voice and then repeated the request with the emotion ‘tired’ applied. She wrote…

“The results were impressive and extremely realistic. What I noticed during this test, though, is that only some of the speakers in Genny’s library deliver “emotional” voice-overs. So, either you have to stick to the standard narrative speech or restrict your choice to the more emotional voice presenters.”

LOVO isn’t free, although there is a limited two-week free trial, and pricing plans start from $24 per month. There are other AI voice generators too, including Speechify and Murf.ai.
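
These services are driven through web dashboards and REST APIs rather than desktop plug-ins, and the shape of a request is broadly similar across them. The sketch below is purely illustrative: the endpoint, headers and parameter names are hypothetical stand-ins, not the documented API of LOVO, Speechify or Murf.ai.

```python
# Illustrative only: the endpoint, header and parameter names below are
# hypothetical stand-ins, not the documented API of LOVO, Speechify or Murf.ai.
import requests

response = requests.post(
    "https://api.example-tts.com/v1/speech",   # hypothetical endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "text": "It was a dark and stormy night.",
        "voice": "young-female-1",             # hypothetical voice preset
        "emotion": "tired",                    # the kind of control Genny exposes
    },
    timeout=60,
)
response.raise_for_status()

# Most services return the rendered audio (or a URL to fetch it from).
with open("voiceover.wav", "wb") as f:
    f.write(response.content)
```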

Finding The Right Music For A Scene

There are AI tools out there that can help you find the right music track for a particular scene. If you have spent hours hunting for the right track, then you will appreciate how useful this could be. For example, the British free music platform Uppbeat launched a new feature, in beta at the time of writing, offering AI-generated playlists based on text input from the user. You describe a scene from your video, or explain what type of music you would like, and Uppbeat comes back with a list of suitable tracks.
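
Under the hood, features like this are typically built on text embeddings: the scene description and each track's descriptive tags are mapped into the same vector space, and the closest tracks win. Uppbeat hasn't published how its beta works, so the sketch below only illustrates the general technique, using the open-source sentence-transformers package and made-up track tags.

```python
# A sketch of the general technique behind text-driven music search: embed the
# scene description and each track's tags in one vector space, rank by similarity.
# Uses the open-source sentence-transformers package; the catalogue is made up.
# This is not Uppbeat's actual system.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

tracks = {
    "Sunrise Drive": "uplifting acoustic folk, warm, optimistic",
    "Neon District": "dark synthwave, tense, driving pulse",
    "Quiet Harbour": "gentle solo piano, reflective, slow",
}
query = "a drone shot over the city at night, moody and futuristic"

query_embedding = model.encode(query)
scores = {
    name: float(util.cos_sim(query_embedding, model.encode(tags)))
    for name, tags in tracks.items()
}

# Highest-scoring tracks are the best matches for the described scene.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.3f}  {name}")
```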


In November 2023, Audiio.com, the music licensing platform, released LinkMatch AI, a new search tool that allows users to find music faster than with traditional search engines.

LinkMatch AI searches for music from a URL pasted in from YouTube, Spotify, TikTok, Apple Music, or SoundCloud. After analyzing the link, the engine recommends similar matches from the Audiio.com catalogue.

The AI analyzes the source in a matter of seconds, using algorithms designed to surface relevant results and tracks. Audiio also claims that the search engine improves the more a customer uses it: “We can also ignore the vocals and fine-tune our search to the background music”.
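
Audiio hasn't said how LinkMatch works internally, but similarity search of this kind generally relies on audio embeddings: each track in the catalogue is reduced to a fingerprint vector, and the engine returns the tracks whose vectors sit closest to the vector of the reference track behind the pasted URL. A minimal sketch of that ranking step, with random placeholder embeddings:

```python
# A sketch of similarity ranking over audio embeddings, the general idea behind
# matching a reference track to a catalogue. The vectors here are random
# placeholders; a real system would compute them from the audio itself.
# This is not Audiio's actual implementation.
import numpy as np

rng = np.random.default_rng(42)
catalogue = {f"track_{i:03d}": rng.standard_normal(128) for i in range(1000)}
reference = rng.standard_normal(128)   # embedding of the track behind the pasted URL

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank the whole catalogue against the reference and keep the five closest.
top_matches = sorted(
    catalogue.items(),
    key=lambda item: cosine_similarity(reference, item[1]),
    reverse=True,
)[:5]

for name, embedding in top_matches:
    print(f"{cosine_similarity(reference, embedding):.3f}  {name}")
```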

You can test LinkMatch AI with three free searches. If you like it, the Audiio.com Pro package includes the LinkMatch AI tool; it usually costs $199 per year, but at the time of writing there is a half-price offer of $99 for the first year.

AI-Generated Sound Effects

Following on from MusicGen, which we featured in How AI Is Changing Music Production, Meta is also developing a sound effects generation tool called AudioGen. It works on the same principle, with the user describing the sounds they are looking for. Meta says…

“With AudioGen, we demonstrated that we can train AI models to perform the task of text-to-audio generation. AudioGen is an auto-regressive transformer language model conditioned on text: Given a textual description of an acoustic scene, the model can generate the environmental sound corresponding to the description with realistic recording conditions and complex scene context.”
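
AudioGen ships as part of Meta's open-source AudioCraft library, so you can experiment with it yourself. Here is a minimal sketch, assuming a machine with a capable GPU and the audiocraft package installed; the prompts are our own examples:

```python
# Minimal text-to-sound-effect sketch using Meta's open-source AudioCraft library.
# Requires: pip install audiocraft (model weights download on first run; GPU advised).
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

model = AudioGen.get_pretrained("facebook/audiogen-medium")
model.set_generation_params(duration=5)        # seconds of audio per prompt

descriptions = [
    "dog barking in a large empty warehouse",
    "rain on a car roof with distant thunder",
]
wavs = model.generate(descriptions)            # one waveform per description

for description, wav in zip(descriptions, wavs):
    # Writes e.g. "dog barking in a large empty warehouse.wav", loudness-normalised.
    audio_write(description, wav.cpu(), model.sample_rate, strategy="loudness")
```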

We understand that the developers used public-domain sound effects to train AudioGen. That in itself may be a problem, as sound effects in the public domain tend not to be of as good a quality as those that are copyrighted and paid for. Again, it's early days, and CineD’s Mascha Deikova said…

“My personal first experiences with AudioGen have been troublesome so far. While the model perfectly understands the wording and tries its best to find matching sounds, the overall track composition doesn’t feel consistent and realistic. Yet, it’s an amazing development, and I guess it won’t take long until AI offers a decent alternative to sound libraries.”

So it would appear there is still some work to do, but as we have seen from other AI sectors like text and image creation, the technology is improving at pace.

Improving Dialog Quality

We have been following this development closely here on Production Expert. It started with Descript, which we first looked at back in July 2021 in our article Impressive Software Takes Aim At iZotope RX - Descript Studio Sound Tested.

At the beginning of 2023, we took a look at Adobe Podcast Enhance in our article Adobe Podcast Enhance - Is It Any Good? Then, in April 2023, we compared Descript, Adobe Podcast and Hush in our Battle Of The AI Noise Reduction Apps.

In September 2023, Accentize released dxRevive and dxRevive Pro, with their Studio algorithm, which we tested for our article Accentize dxRevive Tested - Perhaps The Only Dialogue Cleaner You'll Need. Paul Maunder wrote…

“The Studio algorithm is the default and works to make your audio sound as if it was recorded up close in an acoustically treated studio. It’s great at suppressing annoying background noise and echo. It also tweaks the sound with an EQ to boost or tone down certain parts. This is the same algorithm found in dxRevive Standard. If some frequencies are missing, like from a recording on a phone, it helps bring those back.

In use, I’ve found dxRevive to do an amazing job at restoring dialogue recordings. It’s particularly impressive on band-limited or thin-sounding recordings and does a very effective job at bringing them to life and making them sound as though they were actually recorded well to begin with!”
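
Accentize's algorithms are proprietary and far more sophisticated, but the simpler ancestor of these tools, classical spectral noise reduction, is easy to demonstrate with the open-source noisereduce package. A minimal sketch, assuming a hypothetical mono recording called noisy_dialogue.wav:

```python
# dxRevive's algorithms are proprietary; this shows only classical spectral
# noise reduction via the open-source noisereduce package, the simpler ancestor
# of today's AI dialogue restorers. "noisy_dialogue.wav" is a hypothetical
# mono recording.
import noisereduce as nr
import soundfile as sf

audio, rate = sf.read("noisy_dialogue.wav")    # float samples and sample rate
cleaned = nr.reduce_noise(y=audio, sr=rate)    # spectral gating against the noise floor
sf.write("cleaned_dialogue.wav", cleaned, rate)
```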

ADR And Foreign Language Translation

Anyone who has worked on ADR, short for Automatic Dialog Replacement, will know that there is nothing automatic about it. Most of the time, ADR is about the skill of the actor, directors and audio team to get a good take that will match the picture. There are tools that help, like VocALign and Revoice Pro from Synchro Arts, but what if you could change the video to achieve lip sync rather than the audio? Check this out…

Now, this is a demo by the developers Synthesia, so the left-hand example would never get past any respectable ADR engineer, but the right-hand image looks remarkable. Normal ADR is not the only thing that this AI-powered tech can do.

The CEO of the UK startup pioneering this technology believes that we could have computer-generated versions of actors that are indistinguishable from real humans, with some branding this as ‘deepfake’ technology.

Victor Riparbelli, who cofounded Synthesia, said that the goal is to break into the world of TV and film special effects.

"As the company moves forward we are going to expand our platform and the plan is to start working with film and entertainment and make ideas come to life much [more easily] than they are today."

He went on to explain that the tech Synthesia is developing is the same basic process that is already used in Hollywood films…

"we're just doing it with neural networks which make the process completely automatic."

From this demo, it would seem there are already some great applications in the film and TV industries. Maybe it could signal the end of back-of-the-head shots when the script gets changed after shooting: the actor could voice the new lines, and the AI could then manipulate the video to make the actor’s lips match the new script.

Synthesia first generated interest in what they could do when they used their artificial intelligence software to make BBC news anchor Matthew Amroliwala appear to speak Spanish, Mandarin and Hindi. The software first mapped and then manipulated Matthew’s lips to mouth the different languages. BBC Click’s Lara Lewington finds out more…


The company has also applied its tech to soccer legend David Beckham. In collaboration with the campaign Malaria Must Die, Synthesia manipulated Beckham's facial features so that nine malaria survivors were able to speak through him — in nine different languages. Check out this video with David Beckham…

Foreign-language dubbing becomes so much easier. Apparently, the actual filming on the day was almost identical to a normal shoot; the only difference was that, ahead of the shoot, Synthesia had to train their algorithm to learn Beckham’s face.

To do this, Beckham just had to talk to camera. There was no need for a script, although subjects can dry up in front of the camera, so Synthesia sometimes suggests topics, such as what they had for breakfast that morning, to help them think of something to say.

This footage is then analysed by an algorithm that learns how Beckham's face moves and creates a digital model of him. Unlike some systems, we understand this process takes only about three to four minutes. CEO Victor Riparbelli explains…

“Once you've done that, it doesn't require any special hardware or cameras or anything like that.”

There are other players getting into foreign language translation. For example, Resemble is designed to…

“Dub your native voice into 100 languages to reach a broader audience. Resemble’s AI voice generator engine can create any custom voice from your data source and localize it to other languages automatically.”

On their website is a set of audio examples in English, French and Spanish.

It doesn’t take much imagination to see how localising film and TV content could be simplified by harnessing this technology.

What’s Next?

We can expect more AI-powered tools, including automatic lip sync, with AI algorithms able to analyse actors' lip movements and fit the ADR take to the original performance, matching style and delivery as well as lip sync. All of this would streamline the ADR workflow, saving both time and money. And these time and cost savings will go further. In their Businessweek article looking ahead to 2024, Bloomberg asked, ‘Is There Any Hope for Hollywood?’

“AI will affect filmmaking, but it won’t replace the creatives. Not long after the end of the writers strike, major media companies started asking experts and management consultants how they can best use artificial intelligence. 

The answer? Cut 20% to 30% of your costs. For all the fuss about AI replacing screenwriters and actors, it turns out its greatest utility is in reducing the cost of less glamorous fields such as post-production and marketing.

Dubbing, reshoots, teasers and promotional language can all get a lot easier with AI. That’s not to say the industry can let down its guard. The biggest risk to talent in Hollywood isn’t whether Disney uses AI to cut jobs. It’s whether some outsider uses copyrighted material to create a hit show without paying anyone a dime.”

At their conference in Paris in October 2023, Unesco invited filmmakers to confront the fast-approaching age of artificial intelligence in their industry and asked how AI is reshaping filmmaking.

Cristóbal Valenzuela is the Chilean-born CEO of Runway, a leading AI video generation company. Its technology was used in the Oscar-winning movie ‘Everything Everywhere All at Once’.

He says AI is a groundbreaking new tool that will democratise filmmaking by making it quicker, easier and cheaper, creating opportunities for talented individuals armed with nothing more than their smartphones and AI tech.

Many in Hollywood, however, are deeply suspicious of this positive vision. Duncan Crabtree-Ireland, SAG-AFTRA’s chief negotiator, acknowledged that AI was here to stay, but said there was also a deep sense of fear that it could be abused by studio bosses and powerful corporations looking to maximise profits by cutting costs. 

He said such concerns could be addressed by a fair agreement on informed consent and compensation.

“Each actor has just one face, just one voice. If the companies have free rein to train their AI algorithms on those faces and voices then the actor’s career could potentially be over. Our union won’t be advocating for AI to be stopped or banned from use in the entertainment industry. But we do believe companies using AI technology must be required to secure the informed consent of any individuals whose voice, performance, likeness or commercial property is being replicated or used to generate new material. And companies need to compensate these individuals fairly.”

Another reason filmmakers are reluctant to use AI is because of concerns about plagiarism. 

Sarah Dearing of the International Affiliation of Writers Guilds said professional writers were worried about copyright but also had a moral objection to using original work that AI companies had taken without permission.

“I think we need to ask writers and other artists to opt in to providing their materials for artificial intelligence generative programmes. Opting out, we believe, is a bogus approach. Technically you really can’t. Once a model has been used, there’s no undoing it. So we would like to see licensing of artistic material and using only artistic material that has the consent of the creator in the AI programme.”

In Conclusion

There you have it: a look at some of the benefits and challenges of using AI in post-production. There is no doubt in my mind AI is here to stay and will continue to change almost everything we do. The benefits of AI as an assistant are clear.

How do we make sure that intellectual property is protected when we hear stories of generative AI being trained on people’s creative work without permission? And how do we make sure that people’s identities are protected?

These are the big questions we all need to address. It’s good to see organisations like Unesco getting involved, but if we stay quiet and do nothing, people are going to be taken advantage of.

What About You?

Do you have experience of how AI is changing audio post-production? If so, do share it in the comments below.
