Midjourney has new competition

Stable Diffusion XL was released yesterday, and it is blowing me away.


Since December I’ve been obsessively tracking AI, and the entire time the most effective image generator has been Midjourney. Its images are incredible - rich, evocative, stylized, beautiful. This is because Midjourney has been given a LOT of feedback by users on what they like and don’t like.

This also presents a problem: Midjourney doesn’t always give you what you ask for; it gives you what it thinks you will like best. Until now, your only alternatives were DALL-E, which is interesting but produces relatively crude and unattractive images, or Stable Diffusion, the open-source alternative that could produce incredible results but required a lot more technical expertise to get started with.

Open Source

Stable Diffusion, being open source, also attracted lots of community innovation. The model can be fine-tuned, and many in the community set out to match Midjourney’s level of photorealism or its performance in other areas, often by training on Midjourney images themselves.

But while these fine-tuned models did improve the quality of the output, they also restricted steerability. Anyone who uses these tools knows how hard it can be to get them to do exactly what you want.

In fact, I have recently been developing TexTex, a daily Wordle-like AI game, and spent many frustrating hours trying to get Stable Diffusion to reliably depict the scene that the player had described. It would give you an image, but a generic, average one that didn’t really follow the story.

Stable Diffusion XL does what you tell it

Just yesterday, Stable Diffusion XL was released to the world. It is a new foundational model with lots of technical advantages - none of which we care about here, other than that the images are now higher resolution (1024 pixels, up from Stable Diffusion’s 512).

What we care about is that it is very steerable. This is important: at the end of the day, we don’t care about an image merely looking good in isolation - we care because it is doing something for us, helping people understand our stories. So it really matters that it does what you want.

Shoutout to InvokeAI

InvokeAI is a great new UI for Stable Diffusion models.

One other shoutout - I have recently switched to using InvokeAI to generate my images and it has been a blast. The workflow is much improved, and there are so many quality-of-life improvements. They’ve even integrated ControlNet and made downloading SDXL a snap. I run this on my local PC, but they offer a cloud alternative for those without a beefy PC.

This is how I will show off image AI going forward.

The Experiment

I’m going to generate 5 sets of images, using the following Image Generation AIs and considering two main criteria: steerability and aesthetic.

  • DALL-E 2: OpenAI’s very accessible but ugly model. I predict it will be somewhat steerable, but not aesthetically appealing.

  • Midjourney: The undisputed king of Image Generation - until now? Midjourney is decently steerable but has some known biases, such as anachronism (depicting people in old styles) and a tendency to warp everything toward modern digital art styles.

  • Stable Diffusion 1.5: This was the first major open source image model and has been the focus of community innovation. As a base model, it is flexible but often downright ugly.

  • Deliberate (SD1.5): A fine-tuned version of SD1.5 that aims to create more aesthetic images - representative of the best that SD1.5 can do.

  • Stable Diffusion XL: Released on Wednesday, July 26th - it is Stability.AI’s answer to Midjourney - how will it perform on our criteria?

I’m going to use the same prompt for all five models (DALL-E 2, Midjourney, Stable Diffusion 1.5, Deliberate and Stable Diffusion XL) so that you can see the difference. To generate our prompts, I’ll ask GPT-4 to come up with 5 scenarios that would test and showcase steerability in particular, with multiple subjects, complex scenes, outlandish goings-on and lots of action. Here are the scenarios GPT-4 came up with for us.

  • Victorian Safari on the Moon

  • Coral City Rush Hour

  • Robot Jungle Jamboree

  • Alien Market in Renaissance Fair

  • Time Traveler's Duel at Sunset
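
If you want to repeat this kind of comparison yourself, the setup above boils down to a grid of 5 scenarios × 5 models, with each image judged on steerability and aesthetic. Here is a minimal Python sketch of that bookkeeping. The function name `build_grid` and the record fields are my own hypothetical choices for illustration, not part of any of these tools, and the code does not call any image generator - it only lists the runs and leaves room for your notes.

```python
from itertools import product

# The five generators and five scenarios named in this post.
MODELS = [
    "DALL-E 2",
    "Midjourney",
    "Stable Diffusion 1.5",
    "Deliberate (SD1.5)",
    "Stable Diffusion XL",
]

SCENARIOS = [
    "Victorian Safari on the Moon",
    "Coral City Rush Hour",
    "Robot Jungle Jamboree",
    "Alien Market in Renaissance Fair",
    "Time Traveler's Duel at Sunset",
]


def build_grid(models, scenarios):
    """Return one record per (scenario, model) pair.

    Hypothetical helper: the two criteria start out empty (None) and are
    meant to be filled in by hand after looking at each image.
    """
    return [
        {"scenario": s, "model": m, "steerability": None, "aesthetic": None}
        for s, m in product(scenarios, models)
    ]


grid = build_grid(MODELS, SCENARIOS)
print(len(grid), "images to generate")
```

Grouping by scenario first (rather than by model) keeps the five images for one prompt next to each other, which is how they are presented below.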

Victorian Safari

“In a surreal, steam-punk style, Victorian-era explorers ride alien beasts across Moon's cratered landscape, a giant Earth rising behind. Whirling gears, fluttering petticoats, and towering brass telescopes abound.”

DALL-E 2 - Neat

Midjourney loves airships.

Stable Diffusion 1.5 - interesting…

Deliberate (SD1.5) - It comes closest to getting the idea of the Earth rising behind.

SDXL - The only image that got the ‘alien beast’ concept fully - no petticoats though.

Coral City Rush Hour

“Styled like a vibrant anime, diverse aquatic species wearing futuristic gear navigate a bustling, bioluminescent coral cityscape, with dolphin-drawn cabs and speedy seahorse scooters.”

DALL-E 2

Midjourney - lots of concepts but disconnected.

SD1.5 - Not what I’m looking for

Deliberate (SD1.5) - I like that dolphin shaped boat at the bottom!

SDXL - Seems like none of the models can get the idea of dolphins or seahorses pulling vehicles.

Robot Jungle Jamboree

“In a glossy, digital-art aesthetic, myriad robots of all shapes, sizes, and functions engage in a boisterous, gear-grinding festival amidst dense, mechanical foliage.”

DALL-E 2 - it has its own taste.

Midjourney - always aesthetic

SD1.5 - really got the diversity of robots I think.

Deliberate (SD1.5) - I like how it used the height of the aspect ratio

SDXL - It’s got the festival idea (missing from others)!

Alien Market in Renaissance Fair

“Painted as a detailed medieval tapestry, extraterrestrial beings sell interstellar trinkets and exotic foods at stalls nestled amongst old-time jesters, knights, and peasants.”

DALL-E 2 - it loves these impressionist takes.

Midjourney - Stunning, but too focused.

SD1.5 - Even the original tapestries can be nightmare fuel, but this…

Deliberate (SD1.5) - This is great if the aliens are made out of clothes stands.

SDXL - Now we’re talking - this really captures the vibe of the prompt.

Time Traveler’s Duel at Sunset

“Rendered in dynamic comic book style, two time travelers, one medieval knight and one futuristic soldier, engage in an epic showdown atop a desert mesa, with a fiery, crimson sky in the backdrop.”

DALL-E 2 - blah

Midjourney - going for an epic feel.

SD1.5 - You can see the model is being pushed here - it doesn’t like the vertical format (trained on square images).

Deliberate (SD1.5) - Suffering from that same basic limitation (square vs vertical aspect)

SDXL - None of the models really get the separation in identity between these two characters - one is supposed to be from the future.

Conclusion & Takeaways

Midjourney still has its place and will likely remain the most photorealistic and aesthetic option for some time to come. But SDXL is here: its images are greatly improved and approaching Midjourney’s quality, it is very steerable and reactive to your prompts, and it is open source, so you can run it privately in your own environment. If you need help exploring this, just reach out.

For a self-serve option, check out invoke.ai and consider using their cloud service if you’d like to give SDXL a try.