How close are we to AI-generated movies?
Imagine being able to create a 2-hour movie with production quality on par with Disney or Netflix, but without actors, a multi-million-dollar budget, or even a full script. Imagine this entire process taking less than a week. Imagine it all being done by one person. What if I told you that this future is at most two years away, maybe less? We're already 50% of the way there, but don't take my word for it. Let me show you what we can already do today, and what we need to do to bridge the gap.
The Plot
While the current version of ChatGPT still struggles to write well (even with GPT-4), it's great for brainstorming ideas and adding filler content. That alone can save you a ton of time. Moreover, a GPT trained on story-writing can help you tie up loose ends and make your story more coherent. There are common story templates the AI could follow to ensure our story has a proper build-up and climax. There are already tools that do this to a degree, such as StoryAI.
Of course, we all know that the longer a conversation with AI goes on, the more likely it is to be plagued with hallucinations. But there is already a version of GPT-4 with a context window of 128k tokens (roughly 200 pages), which is enough to keep our story coherent. Although, as I already explained in one of my videos, our story would be pretty bland without a human driving the plot. And this is where the bulk of the opportunity lies. We all have interesting stories to tell, but most of us lack proper delivery. Now we can have AI help us make our stories engaging.
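To make this concrete, here is a minimal sketch of using GPT-4 as a brainstorming partner through the OpenAI Python client. The premise, the three-act instruction, and the model name are all illustrative choices of mine, not a prescription:

```python
# A minimal sketch: asking GPT-4 to flesh out a three-act outline
# from a one-line premise. The model name and prompts are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

premise = "A retired lighthouse keeper discovers the fog is sentient."

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",  # any large-context GPT-4 variant works
    messages=[
        {"role": "system",
         "content": "You are a story editor. Expand premises into "
                    "three-act outlines with a clear build-up and climax."},
        {"role": "user", "content": f"Premise: {premise}"},
    ],
)

print(response.choices[0].message.content)
```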
Actors
No story would be complete without actors, and this is where things get interesting. There are already libraries like LangChain and Autogen that allow you to create AI agents. An agent is a GPT with a specific task; it retains a specific personality and mission. You can quickly build your own or use existing personas with tools like Character AI.
These agents can be wrapped around characters, effectively creating AI actors. The agent would ensure a consistent personality for each actor, generating text and actions consistent with the scene. I suspect such logic will soon start appearing in game engines, allowing you to tie assets like voice and personality together into a coherent character. There are already LLM-powered NPCs. The beautiful part is that you wouldn't even need to micromanage these assets: you would simply direct your film, and the AI actors would redo the scene as many times as you want, without getting tired.
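You don't need a full framework to see the core idea. Stripped down, an AI actor is just a chat model wrapped in a fixed persona and a running memory of the scene. The sketch below is my own toy illustration of that concept, not the LangChain or Autogen API; the persona and model name are made up for the example:

```python
# A toy "AI actor": a chat model wrapped in a fixed persona plus
# conversation memory, so each retake stays in character.
from openai import OpenAI

client = OpenAI()

class Actor:
    def __init__(self, name: str, persona: str):
        self.name = name
        # The persona acts as a standing system prompt for every scene.
        self.history = [{"role": "system", "content": persona}]

    def perform(self, direction: str) -> str:
        # The director's instruction becomes the next user message.
        self.history.append({"role": "user", "content": direction})
        reply = client.chat.completions.create(
            model="gpt-4-turbo-preview",  # illustrative model choice
            messages=self.history,
        ).choices[0].message.content
        self.history.append({"role": "assistant", "content": reply})
        return reply

detective = Actor(
    "Mara",
    "You are Mara, a weary detective who answers in short, dry sentences "
    "and never breaks character. Respond with dialogue and stage actions.",
)
print(detective.perform("Scene 1: you enter the flooded basement. React."))
```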
Speech
Each actor would need a unique voice, and we're almost at the point where AI voices sound natural. The best text-to-speech generator I've played with so far is ElevenLabs. The ClipChamp editor has a baked-in text-to-speech engine that's not bad either, with some voices sounding better than others (I like the "Brian" and "Davis" voices). The rest of the tools I've played with still sound robotic. There are still occasional words that get messed up and poorly placed pauses, but for the most part the speech is passable. We'll probably see other solutions catch up within the next year. Here is an example of what's currently possible:
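If you'd rather script the process than click through a UI, ElevenLabs also exposes an API. Here is a rough sketch of generating a single voice line with it; the voice ID is a placeholder, and the endpoint shape reflects the API as I understand it at the time of writing:

```python
# A rough sketch of generating a voice line via ElevenLabs'
# text-to-speech REST API. The voice ID is a placeholder; the endpoint
# and model name reflect the API at the time of writing.
import os
import requests

VOICE_ID = "YOUR_VOICE_ID"  # pick one from your ElevenLabs voice library
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"

response = requests.post(
    url,
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    json={
        "text": "The storm will reach the harbor before midnight.",
        "model_id": "eleven_monolingual_v1",
    },
)
response.raise_for_status()

with open("line_01.mp3", "wb") as f:
    f.write(response.content)  # the returned audio is MP3 by default
```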
Music
Music is a crucial yet understated element of any story. It sets the mood and helps the audience connect with the characters. An upbeat tune can inspire; an intense one can leave us on the edge of our seats. Luckily, music is one of the easier things to generate with AI. There are a number of tools that already do a decent job, such as Soundraw and Boomy, which makes sense because music has repetitive sequences that are relatively easy for AI to generate. Here is a tune that I generated in Boomy:
Technically, you don't even need your music to be AI-generated. It just happens to be the cheapest way to get music that no one can claim copyright on, since the copyright laws in the US are asinine.
Sound Effects
Sound effects add a subtle layer of immersion to the story; they help break down the wall between the viewer and the scene. There are tools like MyEdit that allow you to generate sound effects from a description. The quality is hit or miss, however, mainly because the AI "ignores" part of the context. For example, "hammer hitting a nail" and "hammer hitting a hand" will sound almost identical, both with a metallic echo. I'm sure this will improve rapidly over the next year, just like speech.
The other gotcha of sound generation is deciding whether a sound should be made at all based on the video content. This is a whole other can of worms that we haven't begun to tackle with AI yet, and something that would be tedious to do by hand. Imagine a video of a skier racing down a mountain. There are a number of objects we may want sound effects for: the friction of the skis against the snow, the wind, the skier's breathing, the birds chirping, perhaps even the shouting of other skiers. How would AI decide which sounds are relevant? That requires understanding the intent of the scene by observing the video content.
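We can get a taste of how this might work today by handing a single frame to a vision-capable model and asking it to rank the sounds the shot calls for. The sketch below is speculative: the model name, prompt, and frame URL are all assumptions on my part:

```python
# A speculative sketch: ask a vision-capable model which sounds a frame
# actually needs, ranked by importance. Model name, prompt, and the
# frame URL are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

frame_url = "https://example.com/frames/skier_0042.jpg"  # placeholder

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=300,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "List the sound effects this shot needs, ranked by how "
                     "important they are to the scene's intent. Be brief."},
            {"type": "image_url", "image_url": {"url": frame_url}},
        ],
    }],
)

print(response.choices[0].message.content)
# e.g. 1. skis carving snow  2. wind  3. skier's breathing ...
```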
Video
There are two approaches to video generation. One approach is similar to how cartoons are drawn: we draw a frame, then animate it. The other approach is more akin to videogames: we create a scene with actors and assets, moving the camera around as needed.
Approach 1
Let's start by examining the first approach. It would work well for educational videos and documentaries, where we may want to take existing images and turn them into a video. It requires relatively little setup, but it is also less scalable, something I will cover in more detail later. We'll start with images, which could be generated by AI or be real-life photographs used as a base.
Images
You've already seen what Dall-E is capable of from my previous posts. In fact, most of the images in this post are also generated by Dall-E. While minor glitches are still common (such as deformed objects or things facing the wrong direction), Dall-E 3 has gotten pretty good not just at drawing objects, but humans as well. It's able to create photo-realistic images, and even entire scenes. What it currently struggles with is keeping the scene consistent between images, but that's more a limitation of the OpenAI API than of the technology itself, since Dall-E 2 already allows inpainting and outpainting, which can be used to modify objects within the scene or extend the scene beyond the original frame. And Dall-E isn't even the best image generator on the market.
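For reference, both of those capabilities are exposed through OpenAI's image endpoints: text-to-image with Dall-E 3, and inpainting with Dall-E 2 (which takes the original image plus a mask marking the region to redraw). Here is a minimal sketch; the file names and prompts are placeholders:

```python
# A minimal sketch of the two OpenAI image endpoints mentioned above:
# text-to-image with Dall-E 3, and inpainting with Dall-E 2.
# File names and prompts are placeholders.
from openai import OpenAI

client = OpenAI()

# 1) Generate a scene from scratch.
scene = client.images.generate(
    model="dall-e-3",
    prompt="A rainy neon-lit street in a small coastal town, cinematic",
    size="1024x1024",
)
print(scene.data[0].url)

# 2) Inpaint: redraw only the masked (transparent) region of an existing
#    frame, keeping the rest of the scene intact.
edited = client.images.edit(
    model="dall-e-2",
    image=open("street_frame.png", "rb"),
    mask=open("street_mask.png", "rb"),  # transparent where changes go
    prompt="Replace the parked car with a horse-drawn carriage",
    size="1024x1024",
)
print(edited.data[0].url)
```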
Midjourney does a much better job generating life-like faces, capturing human emotion as well as subtle micro-expressions. It also does a better job with lighting and shadows. Although, to OpenAI's credit, they claim they intentionally add a cartoonish feel to their images to avoid having them used for nefarious purposes (such as fake news). There are also open-source solutions like Stable Diffusion, if you have the hardware to run them.
(image taken from this Reddit thread)
Alternatively, if you're using real images as a base, there are tools like Magnific AI for upscaling them. Magnific takes a blurry, low-quality image and redraws it to look like a high-quality version of the original, keeping the angle and colors consistent with the original intent. Now imagine similar logic applied to video frames, but instead of just ensuring consistency with the original image, each frame would be redrawn to match the scene.
Animation
Of course, we can't have movies without actual animation. Currently, consistent video generation is still the Achilles' heel of AI. While there are video tools like Neural Frames, Runway, and Pika that do a decent job generating short (1-5 second) clips, which you can in theory stitch together into a longer video that tells a story, it's hard to keep the scene consistent between frames. And things get even harder when changing the angle. It's not uncommon to see frames with random garbage popping up, or the scene morphing in unrealistic ways. I believe we're in the same place with video today as we were with images about a year ago, with Dall-E 2. The current generation of problems will probably be solved with the next iteration of these video generators. Here is an example video I made in Neural Frames, using one of my photographs as a base and a prompt asking it to show the building getting renovated.
While it did the job, there are clearly irrelevant frames in the middle, showing windows and doors morphing for no reason (it also morphed the photograph into a cartoon, something I didn't ask for). Similarly, the model doesn't quite understand that the rest of the image shouldn't change much (except maybe to account for the seasons). It's not just the building itself that changed, but the entire neighborhood. And that's the weakness of this approach. It's susceptible to hallucinations, and it's hard to keep the scene consistent between frames.
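The stitching step itself, at least, is trivial to automate once you have the clips. Here is a minimal sketch using moviepy; the clip filenames are placeholders:

```python
# A minimal sketch: concatenating short AI-generated clips into one
# longer video with moviepy. Filenames are placeholders.
from moviepy.editor import VideoFileClip, concatenate_videoclips

clip_paths = ["shot_01.mp4", "shot_02.mp4", "shot_03.mp4"]
clips = [VideoFileClip(path) for path in clip_paths]

# "compose" handles clips whose resolutions don't match exactly.
story = concatenate_videoclips(clips, method="compose")
story.write_videofile("story.mp4")
```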
Approach 2
The second approach requires more work upfront, but it is more scalable, especially if we plan to reuse the scene for multiple videos. Instead of having the AI generate video frames from images, the AI could create and control a scene within a game engine like Unity or Unreal. The benefits of this approach are better control of the scene, fewer hallucinations (such as window sizes changing), and not having to deal with image generation at all (aside from generating textures). I think this is the approach films and shows requiring consistency will take.
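To give a flavor of what "AI controlling the scene" could look like, here is a speculative sketch in which a language model is asked to emit stage directions as structured JSON that an engine-side script could then execute. The schema, prompt, and model name are all my own inventions for illustration:

```python
# A speculative sketch: an LLM turns a director's note into structured
# stage directions that a game-engine script could consume. The JSON
# schema here is invented purely for illustration.
import json
from openai import OpenAI

client = OpenAI()

SCHEMA_HINT = (
    'Return a JSON object with a "steps" array. Each step has "actor", '
    '"action" (one of: walk_to, say, gesture), "target", and "line" fields.'
)

direction = (
    "The detective walks to the window, looks outside, "
    "and mutters that the rain won't stop tonight."
)

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",            # illustrative model choice
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": SCHEMA_HINT},
        {"role": "user", "content": direction},
    ],
)

for step in json.loads(response.choices[0].message.content)["steps"]:
    # In a real pipeline, this is where the engine would animate the actor,
    # trigger the voice line, move the camera, and so on.
    print(step)
```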
Scenes
CodeBullet has already explored this approach in his video on AI-generated Rick and Morty episodes. Of course, to do so, we'll also need AI capable of generating assets (meshes, textures, etc.), and that's another can of worms. However, Unreal Engine already has several tools for making asset creation easier, such as MetaHuman and Quixel Bridge, which will only get better with time.
This approach is more resistant to hallucinations, since objects can't morph randomly, nor can portions of the scene that aren't controlled by a given actor change on their own. However, if you watch the above video by CodeBullet, you'll see that hallucinations can take a different form, such as characters behaving strangely or moving in unnatural ways.
Tying It All Together
Right now, putting scenes together still requires quite a bit of manual work, but there are already tools like Descript automating some of it. Moreover, DaVinci Resolve keeps integrating AI tools into its editor as well. Here is an example of a story a user on Reddit put together using the first approach I mentioned above. We can see that the animation is still simplistic for now.
Here is another example, with the user creating an anime based on the John Wick series. The animation is better here, but we're still limited to short 2-3 second bursts, and it's hard to preserve consistency between different camera angles.
As for the second approach, we can already create impressive scenery with Unreal Engine 5; here is a demo their team put together back in 2021. Those aren't the real Keanu and Carrie-Anne. They're deepfakes, and while there is still a bit of an uncanny valley effect to their faces, I'm sure that will be solved within two more years. On top of that, the engine seamlessly generates cityscapes, pedestrians, and traffic (if you want to see the city generation in action, just fast-forward 8 minutes into the video).
Imagine being able to generate such scenes in minutes simply by describing them to AI. We're about to see a world with a lot more user-generated content. And let's be clear: good content will still require a human director behind it. AI is not cruise control for creativity; it simply lets you do the mundane tasks faster, so you can focus on the big picture.