Teaching AI to tell visually consistent stories
A new architecture for videos that actually make sense
We understand stories differently than we understand moments. A moment can be striking or beautiful on its own - a sunset, a dancer's leap, a smile. But stories work by building relationships between moments. Each scene has to flow naturally from the ones before it. Characters need to stay consistent. Actions need to have consequences that persist.
This difference between moments and stories points to one of the hardest open problems in generative AI. Current systems can generate remarkable individual video clips: faces speaking, people dancing, animals moving. But they fail when asked to generate longer, multi-scene videos. A character's face subtly changes between scenes. Movements become jarring and unnatural. The story falls apart.
Many of us assumed this was simply a matter of scale - that with bigger models and more training data, AI would naturally progress from generating moments to generating stories. But one of the top papers on AIModels.fyi today argues that bridging the gap between moments and stories requires fundamental changes to the architecture itself.