It works with both standard footage and AI-generated video.
DeepMind, Google's artificial intelligence laboratory, is developing technology that can generate soundtracks and even dialogue to accompany videos. The lab has shared an update on its video-to-audio (V2A) project, which could be paired with Google Veo and other video-generation tools such as OpenAI's Sora. In a blog post, the DeepMind team writes that the system can understand raw pixels and combine that information with language cues to generate sound effects matching what happens onscreen. The tool can also create soundtracks for conventional material, whether silent films or any other video without sound.
DeepMind trained the technology on video, audio, and AI-generated annotations containing detailed descriptions of sounds and transcripts of spoken dialogue. This, the researchers say, taught the model to associate particular sounds with visual scenes. As TechCrunch notes, DeepMind is not the first to offer an AI tool that generates sound effects (ElevenLabs recently launched one), and it will not be the last. “Our research stands out from existing video-to-audio solutions because it can understand raw pixels and adding a text prompt is optional,” the researchers explain in their paper.
Though optional, a text prompt can shape the final output and make it more precise and realistic. You can enter positive prompts to steer the output toward sounds you want, or negative prompts to steer it away from sounds you don't. As an example, the team shared the prompt: “Cinematic, thriller, horror film, music, tension, ambience, footsteps on concrete.”
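DeepMind hasn't detailed how V2A implements this steering, but in diffusion-based generators positive and negative prompts are commonly combined through classifier-free guidance: the model's prediction is pushed toward the positive prompt's conditioning and away from the negative one's. A minimal numeric sketch of that general idea (the function name, guidance scales, and toy vectors are all illustrative, not DeepMind's API):

```python
import numpy as np

def guided_prediction(pred_uncond, pred_pos, pred_neg, scale=7.5, neg_scale=1.0):
    """Classifier-free guidance with a negative prompt.

    Each argument is the model's prediction for the same latent under
    different conditioning: none, the positive prompt, and the negative
    prompt. The result is pushed toward the positive direction and away
    from the negative one.
    """
    return (pred_uncond
            + scale * (pred_pos - pred_uncond)
            - neg_scale * (pred_neg - pred_uncond))

# Toy 1-D vectors standing in for latent-space tensors.
uncond = np.array([0.0, 0.0])
pos = np.array([1.0, 0.0])   # direction of a desired sound
neg = np.array([0.0, 1.0])   # direction of an unwanted sound

print(guided_prediction(uncond, pos, neg, scale=2.0, neg_scale=1.0))
# → [ 2. -1.]  (amplified toward pos, repelled from neg)
```

The same arithmetic applies whatever the underlying generator, which is why negative prompting feels consistent across image, video, and audio diffusion tools.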
The researchers acknowledge that they are still working to overcome limitations in the V2A technology. One is that audio quality can drop when the source video contains distortions. They are also continuing to improve lip synchronization for generated dialogue. Finally, they have committed to putting the tool through “rigorous safety assessments and testing” before making it available to the general public.