If you are interested in AI-generated images and content, you might have heard of two popular models: Disco Diffusion and Stable Diffusion. Both of them use a technique called diffusion to create realistic and diverse images from text prompts. But what are the differences between them and which one should you use for your projects? In this blog post, I will compare their features, output quality, and limitations.
What is diffusion?
A diffusion model generates images by learning to reverse a gradual noising process. Imagine you have a clear image of a cat and you add a little noise to it, step after step, until nothing is left but random static. This is the forward (diffusion) process, which corrupts the image. Now reverse it: start from pure noise and remove a little noise at each step until a cat image emerges. This reverse process, often called denoising, is what the model performs at generation time, and because it starts from random noise it produces a new image each time rather than recovering a stored one.
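The noising and denoising steps above can be sketched numerically. This is a minimal illustration, not code from either tool: it assumes a linear noise schedule from 1e-4 to 0.02 over 1000 steps (a common choice in the diffusion literature), a toy four-pixel "image", and hypothetical helper names.

```python
import math
import random


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left after each step (linear schedule, assumed)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(pixels, alpha_bar, rng):
    """Forward process in closed form: blend clean pixels with Gaussian noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [
        math.sqrt(alpha_bar) * p + math.sqrt(1.0 - alpha_bar) * n
        for p, n in zip(pixels, noise)
    ]
    return noisy, noise


def estimate_clean(noisy, predicted_noise, alpha_bar):
    """Undo the blend above, given a prediction of the noise that was added."""
    return [
        (x - math.sqrt(1.0 - alpha_bar) * n) / math.sqrt(alpha_bar)
        for x, n in zip(noisy, predicted_noise)
    ]


rng = random.Random(0)
alpha_bars = make_alpha_bars()
cat = [0.9, -0.3, 0.5, 0.1]  # toy 4-pixel "image", values in [-1, 1]

# Corrupt the image halfway through the schedule, then invert using the true noise.
noisy, noise = add_noise(cat, alpha_bars[500], rng)
recovered = estimate_clean(noisy, noise, alpha_bars[500])
```

Here the inversion is exact only because the true noise is passed back in. At generation time the noise is unknown, so a trained network predicts it, and the estimate is refined over many small steps.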
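Two learned components make this work. The first is a denoising network, typically a UNet, trained to predict the noise that was added to an image at a given step; subtracting that prediction is what moves a noisy image toward a clean one. The second ties the result to the text prompt. Both Disco Diffusion and Stable Diffusion rely on CLIP for this, a model trained on a large dataset of images and their captions using a technique called contrastive learning. CLIP learns to place matching images and captions close together in a shared embedding space and to push mismatched pairs apart.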
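To make the contrastive idea concrete, here is a toy sketch of a symmetric contrastive loss over image and caption embeddings. The two-dimensional embeddings, the function names, and the temperature of 0.07 are assumptions chosen for illustration; a real model learns high-dimensional embeddings from millions of pairs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cross_entropy(logits, target):
    """Negative log-probability of the target index under a softmax."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Pair k of images and texts is the 'correct' match; all others are negatives."""
    n = len(image_embs)
    sims = [[cosine(i, t) / temperature for t in text_embs] for i in image_embs]
    # Each image should pick out its own caption...
    image_to_text = sum(cross_entropy(sims[k], k) for k in range(n)) / n
    # ...and each caption should pick out its own image.
    text_to_image = sum(
        cross_entropy([sims[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2


images = [[1.0, 0.0], [0.0, 1.0]]
captions = [[0.9, 0.1], [0.1, 0.9]]

matched = contrastive_loss(images, captions)
shuffled = contrastive_loss(images, list(reversed(captions)))
```

The loss is low when each image sits closest to its own caption (`matched`) and high when the captions are swapped (`shuffled`), which is the signal that training pushes on.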
What is Disco Diffusion?
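Disco Diffusion is a community-built Google Colab notebook, first released in late 2021 by Somnai and collaborators. It is not an OpenAI product, although it builds on OpenAI's CLIP and guided-diffusion code. It uses one or more frozen CLIP models (RN50x4 is one of the selectable options) to steer an unconditional diffusion model with a UNet backbone: at each denoising step, the image is nudged toward a higher CLIP similarity with the prompt. Its default diffusion model was trained at 512×512 resolution with a 1000-step noise schedule, though the notebook usually runs far fewer sampling steps and can render at other sizes. It accepts a list of text prompts, each with an optional weight appended after a colon, which can be used to combine different concepts or styles. Options such as symmetry or rotation are set through the notebook's settings, not through special characters in the prompt.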
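Disco Diffusion prompts can carry a numeric weight after a final colon, for example "a lighthouse at dusk:2". The sketch below shows one way such strings could be split into text and weight. It is an illustration with a hypothetical function name, not the notebook's own parser, and it assumes a default weight of 1.

```python
def parse_weighted_prompt(prompt):
    """Split 'text:weight' into (text, weight); the weight defaults to 1.0."""
    text, sep, tail = prompt.rpartition(":")
    if sep:
        try:
            return text.strip(), float(tail)
        except ValueError:
            pass  # the part after the last colon is not a number; treat it as text
    return prompt.strip(), 1.0


prompts = [
    "a lighthouse at dusk:2",
    "blurry:-0.5",
    "portrait: oil on canvas",
    "trending on artstation",
]
parsed = [parse_weighted_prompt(p) for p in prompts]
```

A negative weight, as in the second prompt, is a way to steer the image away from a concept instead of toward it.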
Disco Diffusion is known for its creativity and diversity. It can generate images that are surprising, surreal, or abstract. It can also handle complex scenes with multiple objects and backgrounds. However, it also has some limitations. It struggles with generating realistic faces, animals, or other living things. It often produces artifacts or distortions in the images. It also has difficulty with combining incompatible prompts or respecting spatial constraints.
What is Stable Diffusion?
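Stable Diffusion is a model that was released in August 2022 by Stability AI in collaboration with the CompVis group at LMU Munich and Runway. Version 1 uses a frozen CLIP ViT-L/14 text encoder and a UNet denoiser that works in the compressed latent space of an autoencoder instead of directly on pixels, which keeps memory and compute requirements low enough for consumer GPUs. The version 1 models were trained at 512×512 resolution with a 1000-step noise schedule, but sampling typically needs only about 20 to 50 steps. The text prompt conditions the UNet directly through cross-attention, and a guidance scale controls how strongly the image follows it. The model itself takes a plain text prompt; syntax for weighting or excluding terms, such as negative prompts, comes from the front end you run it in, so it varies between tools.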
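Two details of how Stable Diffusion works are easy to show with arithmetic. Version 1 denoises a latent tensor that is 8 times smaller than the image on each side with 4 channels, and it uses classifier-free guidance to blend a prompt-conditioned noise prediction with an unconditioned one. The function names below are hypothetical and the noise predictions are toy lists; the default scale of 7.5 is a commonly used value, stated here as an assumption.

```python
def latent_shape(width, height, downsample=8, channels=4):
    """Shape (channels, height, width) of the latent the UNet denoises."""
    if width % downsample or height % downsample:
        raise ValueError("width and height must be multiples of the downsample factor")
    return (channels, height // downsample, width // downsample)


def guided_noise(noise_uncond, noise_text, guidance_scale=7.5):
    """Classifier-free guidance: push the prediction away from the unconditioned
    one, in the direction the prompt suggests."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_text)]


shape = latent_shape(512, 512)  # (4, 64, 64)

uncond = [0.0, 1.0]
with_prompt = [1.0, 1.0]
no_guidance = guided_noise(uncond, with_prompt, guidance_scale=1.0)
strong = guided_noise(uncond, with_prompt)
```

A scale of 1 simply returns the prompt-conditioned prediction. Larger values exaggerate the difference between the two predictions, which makes images follow the prompt more closely at some cost in variety.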
Stable Diffusion is known for its realism and precision. It can generate images that are sharp, detailed, and consistent with the text prompts. It handles realistic faces, animals, and other living things better than Disco Diffusion, and it can combine different concepts or styles smoothly and naturally. However, it also has some drawbacks. It tends to be less creative and diverse than Disco Diffusion, often producing images that resemble each other or the training data. It also has more difficulty with abstract or surreal images.
Comparison
To compare the quality and diversity of Disco Diffusion and Stable Diffusion, see the example prompts and the images each model generates for them in this video created by the Quick-Eyed Sky YouTube channel.
Conclusion
- Disco Diffusion is the better choice for creative and diverse images that are surprising, surreal, or abstract, including complex scenes with multiple objects and backgrounds. Expect it to struggle with realistic faces, animals, and other living things, to produce artifacts or distortions, and to have difficulty with incompatible prompts or spatial constraints.
- Stable Diffusion is the better choice for realistic and precise images that are sharp, detailed, and consistent with the text prompt, including faces, animals, and other living things. Expect it to be less creative and diverse than Disco Diffusion, to produce images that resemble each other or the training data, and to have difficulty with abstract or surreal images.