Google's new LLM, VideoPoet, can create videos from text, audio, and images.
While Midjourney and DALL-E 3 were making tremendous advances in text-to-image generation, Google announced a multimodal large language model (LLM) that generates video. Unlike other LLMs, the model demonstrates video generation capabilities that have not been shown before.
Researchers at Google have developed VideoPoet, which they describe as a powerful LLM that can interpret multimodal inputs including text, images, video, and audio, and use them to generate videos. VideoPoet uses a decoder-only architecture, which enables it to produce content for tasks it was not specifically trained on. Training VideoPoet reportedly follows two steps comparable to those of other LLMs: pretraining and task-specific adaptation. According to the researchers, the pretrained LLM serves as the foundational framework that can then be adapted to a variety of video generation tasks.
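To make the two-stage recipe concrete, here is a minimal, hypothetical PyTorch sketch of a decoder-only model trained with next-token prediction, first on a broad mix of data and then on task-specific data. The model, vocabulary size, and random training batches are invented for illustration and do not reflect VideoPoet's actual architecture or data.

```python
# Hypothetical sketch: a tiny decoder-only LLM pretrained on mixed-modality
# token sequences, then adapted to a specific task. All names, sizes, and
# data here are illustrative only, not VideoPoet's real implementation.
import torch
import torch.nn as nn

VOCAB_SIZE = 1024   # assumed: a combined vocabulary of text/image/video/audio tokens
DIM = 128

class TinyDecoderOnlyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB_SIZE)

    def forward(self, tokens):
        x = self.embed(tokens)
        # causal mask: each position attends only to earlier tokens (decoder-only)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        x = self.blocks(x, mask=mask)
        return self.head(x)

def train_step(model, optimizer, tokens):
    # next-token prediction: predict position t+1 from positions up to t
    logits = model(tokens[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = TinyDecoderOnlyLM()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Stage 1: pretraining on a broad mix of token sequences (random placeholder data)
pretrain_batch = torch.randint(0, VOCAB_SIZE, (4, 32))
train_step(model, opt, pretrain_batch)

# Stage 2: task-specific adaptation, e.g. sequences formatted for text-to-video
adaptation_batch = torch.randint(0, VOCAB_SIZE, (4, 32))
train_step(model, opt, adaptation_batch)
```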
An article on the project website states that “VideoPoet is a simple modeling method that can convert any autoregressive language model or large language model (LLM) into a high-quality video generator.”
What distinguishes VideoPoet from other video models?
VideoPoet combines multiple video generation capabilities within a single, cohesive language model. This contrasts with most current video models, which rely on diffusion models that add noise to training data and then learn to reconstruct it. It also differs from systems built from many components that are trained separately for different tasks: VideoPoet integrates everything into one LLM.
In addition to text-to-video and image-to-video generation, the researchers say the model is particularly effective at video inpainting and outpainting, video stylization, and video-to-audio generation. VideoPoet is an autoregressive model, meaning it generates output conditioned on what it has already generated. It has been trained on video, audio, image, and text using tokenizers that translate inputs from each modality into discrete tokens. Tokenization is a standard technique in artificial intelligence: input text is broken into smaller units called tokens, which may be words or subwords. It is an essential part of natural language processing because it lets a model ingest and analyze human language.
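As a rough illustration of the idea, the toy tokenizer below maps text to integer token IDs and splits one unfamiliar word into subwords. It is purely didactic; VideoPoet's actual tokenizers for text, image, video, and audio are far more sophisticated, and the vocabulary here is invented.

```python
# Illustrative only: a toy subword tokenizer showing how input text is broken
# into tokens and mapped to integer IDs. Not VideoPoet's real tokenizer.
TOY_VOCAB = {"<unk>": 0, "a": 1, "cat": 2, "play": 3, "##ing": 4, "piano": 5}

def tokenize(text: str) -> list[int]:
    ids = []
    for word in text.lower().split():
        if word in TOY_VOCAB:
            ids.append(TOY_VOCAB[word])
        elif word.endswith("ing") and word[:-3] in TOY_VOCAB:
            # split an unknown word into a known stem plus an "##ing" subword
            ids.extend([TOY_VOCAB[word[:-3]], TOY_VOCAB["##ing"]])
        else:
            ids.append(TOY_VOCAB["<unk>"])
    return ids

print(tokenize("A cat playing piano"))  # -> [1, 2, 3, 4, 5]
```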
According to the researchers, the findings demonstrate the highly promising potential of LLMs for video generation. Going forward, they anticipate that the framework will support ‘any-to-any’ generation.
Interestingly, VideoPoet can also combine many video clips to produce a short film. The researchers asked Google Bard to write a brief screenplay along with prompts, generated video clips from those prompts, and then stitched the clips together into a short film.
By default, however, it cannot generate longer videos. Google says VideoPoet overcomes this limitation by conditioning on the final second of a video to predict the subsequent second. The model can also take existing videos and alter the motion of the objects within them; a good example is making the Mona Lisa’s mouth open and close.
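The extension trick described above can be sketched as a simple loop: the discrete tokens for the last second of the clip condition the generation of the next second, and the process repeats until the desired length is reached. The `generate_next_second` function below is a stand-in for the model, and the tokens-per-second figure is an assumption, not a VideoPoet detail.

```python
# Hypothetical sketch of extending a clip autoregressively: the tokens of the
# final second condition the generation of the next second, repeatedly.
import random

TOKENS_PER_SECOND = 16  # assumed number of discrete video tokens per second

def generate_next_second(context_tokens: list[int]) -> list[int]:
    # placeholder for the LLM: in reality, tokens would be sampled one at a
    # time, conditioned on the context
    random.seed(sum(context_tokens))
    return [random.randrange(1024) for _ in range(TOKENS_PER_SECOND)]

def extend_video(video_tokens: list[int], extra_seconds: int) -> list[int]:
    tokens = list(video_tokens)
    for _ in range(extra_seconds):
        last_second = tokens[-TOKENS_PER_SECOND:]  # condition on the final second
        tokens.extend(generate_next_second(last_second))
    return tokens

clip = [random.randrange(1024) for _ in range(2 * TOKENS_PER_SECOND)]  # a 2-second clip
longer = extend_video(clip, extra_seconds=3)  # roughly 5 seconds of tokens
print(len(clip), len(longer))
```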