Learning the Limitations of Dall-E 3
I played with Dall-E 2 a while ago and was not impressed. At the time, the images were low quality (despite all the hype surrounding them) and often incoherent: excess appendages and bashed-in faces were the norm, logos looked like someone on shrooms drew them, and the images were too small to be useful.
A lot has changed, and Dall-E 3 actually looks quite impressive. So impressive, in fact, that I may start using it in my web-scraping (more on that in a future post). But before I do, I wanted to understand the limitations of the model so I can avoid them in my use-cases. In this post I'll share what I've learned.
To use Dall-E 3, you will need to pay for the ChatGPT Plus subscription ($20/mo). You're also limited to 40 messages every 3 hours, and images count against that quota (there is a prompt floating around the web that supposedly tricks Dall-E into generating more images without counting against the quota, but I haven't confirmed that it actually bypasses the quota as claimed).
To start off, I asked Dall-E to imagine New York City 100 years after humans are gone, and this is what it came up with:
That's impressive: the overgrown grass is exactly what one would expect, and the sunrise adds depth to the image. The Empire State Building firmly establishes the location as NYC, a nice touch. This is a picture worthy of a blog post cover, whereas the ones from Dall-E 2 were so bad I wouldn't even put them on an NFT. We do see one limitation, though: the glass skyscraper in the back is rendered in pristine condition, an issue I'll explore in more detail later in this post.
Following the same theme, I asked Dall-E to generate Boston in the same setting:
Since the nondescript buildings in the first picture don't really say "Boston", I asked it to focus on a specific part of Boston for the second one: Boston Common. Not bad: Dall-E rendered the park along with the Massachusetts State House, so it got the location right. We can even see what looks like a shorter version of the Prudential Tower in the back. That's not quite where the Prudential should be, but it is in the same general area. Dall-E does seem to understand which buildings are relevant based on context.
Let's try with Chicago:
It's impressive how many landmarks it got right for Chicago (although not their locations). But what's with the exotic cacti? That's probably a hallucination; let's try again:
The first image shows the Chicago River with the Wrigley Building right where it should be. What's odd is the location of the corn-cob towers (Marina Towers): they should be behind us, and for some reason they've been placed apart from each other. The second image has elements of Chicago (such as the John Hancock Center in the distance) but doesn't seem to place the river correctly. If we assume the waterfront on the left is Lake Michigan (no other body of water near Chicago could be this large), then we should see either Grant Park on the left instead of the buildings, or Lake Shore Drive (US-41) if we're further north. In neither case should we see the river.
As before, neither of these images is geographically accurate, but they both show iconic Chicago buildings and a landscape that would definitely pass for Chicago to an untrained eye, especially if used as a backdrop for other content. Images made with Dall-E would be perfect in a real estate slide deck, but not if you're trying to create an accurate sketch/mockup.
Miami took a few tries to get right, and revealed more limitations of Dall-E:
While the architectural style in this picture feels like Miami, it doesn't seem to show any landmarks, and the focus is on Miami Beach rather than one of the modern neighborhoods, like Brickell. Guiding the model to a specific neighborhood, like we did in Boston, didn't help: the image still looked similar, just with taller buildings.
I also tried asking it to focus on the One Thousand Museum building, but it drew a generic museum instead. I then asked it to focus on the Kaseya Center, which at the time of its training cut-off (April 2023) would probably have been known to it as either the Miami-Dade Arena or the FTX Arena. Neither name worked: Dall-E generated a generic arena in pristine condition. When I reminded it that this building should have experienced 100 years of decay, it redrew the building with spider-like appendages:
Clearly, the word "decay" has a different connotation that's causing Dall-E to hallucinate. This goes back to my earlier observation about these models: your word choice can easily bias their output. As a side note, Dall-E has been trained quite extensively on movies, cartoons, video games and comic books, so the model is quite imaginative. With the right prompts, you can even get it to violate copyright (but that's a topic for part 2 of this blog post).
I was eventually able to get it to draw an arena that looked similar enough, but as you can see it's still not the same building:
Another experiment I tried was attaching an image of the building I wanted Dall-E to use as a basis and asking it to render that same building in the setting I described. This didn't work either: Dall-E would instead create a generic building with a feel similar to the one in the original photo. It wouldn't even respect the original angle/layout, implying that it never actually saw the image. My guess is that GPT-4 describes the image to Dall-E in words instead of feeding it the image directly.
The Dall-E 2 editor allows in-painting, but we don't get access to it through ChatGPT, and Dall-E 3 isn't yet available through the same interface as Dall-E 2. Editing existing images appears to still be limited to Dall-E 2, even via the API. This is a big limitation, as it means you still can't use Dall-E to generate artistic renders of your own buildings (which can be useful if you're selling unfinished construction, or land with building permits).
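For anyone going through the API rather than ChatGPT, here's a minimal sketch of what that split looks like, assuming the openai Python SDK (v1+): the images.edit (in-painting) endpoint only accepts dall-e-2, and the generation endpoint takes nothing but a text prompt, so there's no way to hand it a reference image of your building. The file names and prompts below are placeholders, and model availability reflects OpenAI's API docs rather than anything I've tested in this post.

```python
# Minimal sketch, not production code: illustrates that in-painting is a
# Dall-E 2 feature and that Dall-E 3 generation is prompt-only.
# Assumes the openai Python SDK v1+ and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Generation: a text prompt is the only creative input -- there is no
# parameter for a reference image of the building you want rendered.
generated = client.images.generate(
    model="dall-e-3",
    prompt="Miami's Kaseya Center after 100 years of abandonment, photorealistic",
    size="1024x1024",
    n=1,
)
print(generated.data[0].url)

# In-painting: edit an existing image within a masked region.
# This endpoint only accepts model="dall-e-2"; the file names
# ("arena.png", "mask.png") are placeholders.
edited = client.images.edit(
    model="dall-e-2",
    image=open("arena.png", "rb"),
    mask=open("mask.png", "rb"),  # transparent pixels mark the area to repaint
    prompt="the same arena showing 100 years of weathering and overgrowth",
    size="1024x1024",
    n=1,
)
print(edited.data[0].url)
```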
In the end, this was the version of Miami I liked the most:
Moving on, I asked Dall-E to render San Francisco, and while we do see the landmarks, once again the locations are all off:
Detroit's GM building seems to be immune to the weather, but the rest of the city shows visible signs of decay, as we'd expect:
We see the same problem with glass buildings in Dallas:
Dall-E 3 doesn't seem to understand that shiny glass on the skyscrapers ages too, at least not without nudging. When I reminded it that glass buildings should show signs of aging as well, it improved on the image:
The image is still not completely realistic: the tempered glass (with an expected lifespan of 20-30 years) should look far worse than the concrete (with an expected lifespan of 100+ years). The damage to these buildings looks like something out of a comic book, the result of a shockwave rather than aging.
We also see that merely mentioning glass buildings in the previous prompt caused the model to insert a lot more of them into the image. It's very easy to get Dall-E 3 to fixate on a specific detail just by mentioning it, even if it's a detail you're trying to exclude. This is why negative prompts don't work well: telling the model to avoid something can cause it to fixate on it.
And a great example of that is our next rendering, Las Vegas:
The image shows bright lights, typical of Las Vegas, and the model seems unable to imagine Las Vegas without them, even though there would be no power running to them. Asking the model to make sure all the lights and displays are off only makes the problem worse, as the model starts to fixate on them:
Even telling the model to render rubble instead of the billboards didn't work. After multiple unsuccessful attempts, I was finally able to get a more realistic image by asking it to mix in influences from ancient Egypt, an area with a comparable climate but no billboards:
As you can see, the billboards are gone, but the architecture gained some Egyptian influences. These models aren't perfect, and you're always rolling the dice with them. As you play with them more, you gain an intuition for what works and what doesn't, but it's still a lot of trial and error. I should also mention that the images I've shown here were cherry-picked, and you should expect more iterations to get your own scenes to look right.
Moreover, as I already mentioned, the model knows which buildings are relevant based on context, but it won't place them in the correct locations. Similarly, you may see multiple copies of a landmark in the same image: in several of the Dallas renders, Fountain Place appeared two or three times, and I saw a similar trend in Chicago and Atlanta. I'll cover more of my experiments next week, since this post is getting quite long.
If you want to see the full prompts I used to generate these images, I will share them with members of my newsletter.