IBL News | New York
Google this week introduced Lumiere, a text-to-video AI model designed to produce realistic clips. It's one of the most advanced text-to-video generators demonstrated so far, although it remains in an early state.
Existing AI video models typically synthesize a handful of distant keyframes first and then fill in the gaps with temporal super-resolution. Lumiere instead uses a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, in a single pass through the model.
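The structural difference between the two approaches can be sketched in a few lines of Python. This is a hypothetical illustration only: the model calls are stand-ins (random arrays, with naive linear interpolation playing the role of temporal super-resolution), and the function names and dimensions are assumptions, not Lumiere's actual implementation. The 80-frame, 128×128 sizes mirror the figures Google reports for its base model.

```python
import numpy as np

# Conventional cascade (hypothetical sketch): a keyframe model produces
# sparse frames; a second stage fills in the intermediate frames.
def cascade_pipeline(num_frames=80, keyframe_stride=8, h=128, w=128):
    # Stand-in for a keyframe generation model: one frame every `keyframe_stride`.
    keyframes = np.random.rand(num_frames // keyframe_stride, h, w, 3)
    # Stand-in for temporal super-resolution: naive linear interpolation
    # between neighboring keyframes to reach the full frame count.
    idx = np.linspace(0, len(keyframes) - 1, num_frames)
    lo = np.floor(idx).astype(int)
    hi = np.ceil(idx).astype(int)
    frac = (idx - lo)[:, None, None, None]
    return keyframes[lo] * (1 - frac) + keyframes[hi] * frac

# Space-Time U-Net style (hypothetical sketch): a single forward pass
# emits every frame of the clip at once, so no interpolation stage exists.
def single_pass_pipeline(num_frames=80, h=128, w=128):
    return np.random.rand(num_frames, h, w, 3)  # stand-in for one model pass

# Both routes end at the same clip shape; they differ in how frames
# between keyframes come to exist.
print(cascade_pipeline().shape, single_pass_pipeline().shape)
```

The point of the contrast: in the cascade, temporal coherence between keyframes is delegated to a separate super-resolution stage, whereas a single-pass design handles the whole time axis jointly inside one network.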
“We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation,” said the company.
Lumiere does a good job of creating videos of cute animals in ridiculous scenarios, such as using roller skates, driving a car, or playing a piano. It’s worth noting that AI companies often demonstrate video generators with cute animals because generating coherent, non-deformed humans is currently difficult.
As for training data, Google doesn’t say where it got the videos it fed into Lumiere, writing, “We train our T2V [text to video] model on a dataset containing 30M videos along with their text caption. [sic] The videos are 80 frames long at 16 fps (5 seconds). The base model is trained at 128×128.”
Other video generators include Meta's Make-A-Video, Runway's Gen-2, and Stability AI's Stable Video Diffusion, which can generate short clips from still images.