Sora has not publicly launched yet, but its first contender has already popped up. Meet Vidu, the Chinese answer to Sora: China’s first long-duration, highly consistent, and highly dynamic video model. A Sora killer? Check it out.
Vidu is imaginative, can simulate the physical world, and can “produce 16-second videos with consistent characters, scenes, and timeline”.
Vidu: From prompt to video
Filmmakers have not yet decided how to digest Sora, and a contender has already emerged. Meet Vidu, China’s first long-duration, highly consistent, and highly dynamic video model. Vidu can generate a 16-second 1080p video with one click. Developed by Chinese AI firm Shengshu Technology and Tsinghua University, Vidu’s capability lies in its Universal Vision Transformer (U-ViT) architecture. “Vidu is the latest achievement of self-reliant innovation, with breakthroughs in many areas,” said Zhu Jun, chief scientist at Shengshu and deputy dean at Tsinghua’s Institute for AI, announcing the model at the Zhongguancun Forum held in the Chinese capital, as reported by Beijing News. Vidu is “imaginative”, “can simulate the physical world” and can “produce 16-second videos with consistent characters, scenes, and timeline”, Zhu said, adding that the model is also able to comprehend “Chinese elements”.
In order for Sora to produce a one-minute clip, it needs eight Nvidia A100 Tensor Core GPUs running for more than three hours.
Demo clips reminiscent of Sora’s
During the model’s unveiling, Shengshu released several demo clips, including one featuring a panda playing the guitar while sitting on the grass and another of a puppy swimming in a pool, both showing vivid details. The generated imagery was chosen for its similarity to Sora’s, to make the point that Vidu is a solid contender. Indeed, Chinese outlets confirm that Vidu’s debut has raised hopes in the country, which is racing to catch up with leading global generative AI players such as Microsoft-backed OpenAI. By the way, you can read here more about filmmakers using Sora as a tool. It’s not as easy as it seems: it’s not just prompt-to-video, but involves many editing techniques and a long post-production process. Sometimes it will be easier just to shoot things with your camera.
Comparison to Sora
According to Li Yangwei, a Beijing-based technical consultant working in the intelligent computing sector, for Sora to produce a one-minute clip it needs eight Nvidia A100 Tensor Core GPUs running for more than three hours. “Sora requires a lot of computing power for inferencing,” he said, implying that Vidu demands much less than that. That’s interesting, as we haven’t heard anything from OpenAI regarding the horsepower needed to generate video via Sora.

Technically speaking, Vidu merges the strengths of both diffusion and transformer-based text-to-video models, offering imaginative capabilities, the ability to simulate the physical world, and the capacity to generate 16-second videos with consistent characters, scenes, and timelines. Furthermore, Vidu is constructed on a proprietary visual transformation architecture called the Universal Vision Transformer (U-ViT). Developers have indicated that this architecture combines two text-to-video AI approaches: diffusion and the transformer. This architectural framework facilitates the creation of lifelike videos featuring dynamic camera movements, intricate facial expressions, and authentic lighting and shadow effects. You can find more technical details here. Zhu noted that the introduction of Sora resonated with their technical direction, intensifying their resolve to continue their research efforts. For now, Vidu is inferior to Sora, but it is improving fast, and within a year it could look much better. Explore the demonstration below:
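Vidu’s actual U-ViT internals are not public, so as a rough illustration only, here is a minimal NumPy sketch of the diffusion half of such a system: the standard forward noising step that gradually corrupts a clip, which a transformer denoiser (a U-ViT, in Vidu’s case) would then be trained to reverse. All names, shapes, and schedule values here are our own illustrative assumptions, not Vidu’s code.

```python
import numpy as np

# Illustrative DDPM-style forward noising. This is NOT Vidu's
# implementation -- it only sketches the generic diffusion process
# that a transformer-based denoiser learns to invert.

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule; alpha_bar[t] is the cumulative product
    of (1 - beta) up to step t, controlling how much signal remains."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar

def add_noise(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0): a scaled clean clip plus Gaussian noise."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

# A tiny stand-in "video": 16 frames of 8x8 RGB (hypothetical sizes).
rng = np.random.default_rng(0)
clip = rng.standard_normal((16, 8, 8, 3))
_, alpha_bar = make_schedule()
noisy, eps = add_noise(clip, t=500, alpha_bar=alpha_bar, rng=rng)
# A denoiser network would be trained to predict eps from (noisy, t);
# at sampling time the process runs in reverse, from pure noise to video.
```

The design point the architecture makes is exactly this split: diffusion supplies the noising/denoising framework, while the transformer replaces the usual convolutional U-Net as the denoiser, which is what lets it scale to long, temporally consistent clips.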
So what do you think? A Sora killer?