In this project, we explore the implementation and application of diffusion models for image denoising, which lets us sample "novel" images from pure noise and generate artwork and visual anagrams.
We use the pretrained DeepFloyd model to save compute. DeepFloyd is originally intended for text-to-image generation, so we get around this by using "a high quality photo" as a generic "null" prompt, but here we also display some of its pretrained text-to-image capabilities:
Throughout the notebook used to run the code, we use seed = 180 for reproducibility.
One factor influencing the quality and clarity of the generated images is the prompting itself, as a simple prompt like "a rocket ship" is much less detailed than the more complex prompt "an oil painting of a snowy mountain village".
These images were generated with num_inference_steps = 20, but if we increase this to 200 for both stages, the model generates higher-quality images at the cost of running the diffusion process for longer:
First, we need a way to add noise to clean images at different levels. We implement a function that adds scaled Gaussian noise to an image, and show the results at noise levels t = 0 (no noise), 250, 500, and 750. These levels index the noise schedule chosen by the creators of DeepFloyd, and roughly trace the transition from the clean target image toward pure noise.
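A minimal sketch of this forward (noising) step, assuming `alphas_cumprod` holds the model's cumulative noise-schedule coefficients as a 1-D tensor indexed by timestep:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Noise a clean image to timestep t:
    x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps, eps ~ N(0, I)."""
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)
    x_t = torch.sqrt(a_bar) * im + torch.sqrt(1 - a_bar) * eps
    return x_t, eps
```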
As a first attempt at denoising, we can try a classical method such as Gaussian blur filtering. Using a kernel size of 5 and a sigma of 2, we obtain the lackluster blurred/denoised results below:
At lower noise levels the blur denoising works alright, but as the noise increases the remaining noise becomes more obvious, and we also sacrifice image sharpness along the way.
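For reference, this baseline is just a blur applied to the noisy image; a minimal sketch using torchvision, with `noisy_im` standing in for one of the noised images above:

```python
import torchvision.transforms.functional as TF

# Classical baseline (sketch): blur the noisy image and treat the result
# as the "denoised" estimate. Kernel size 5 and sigma 2, as in the text.
blurred = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=2.0)
```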
We can improve on this by using a diffusion model to predict the noise in the image. In this section we use a pretrained model for simplicity: we feed it the noisy image and the timestep t (which determines the noise scale), get a noise estimate as output, and then remove that noise by inverting the scaling applied in the forward step. The results of model-predicted denoising are shown for the three noise levels:
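A sketch of this one-step denoise, assuming `unet` is the pretrained stage-1 UNet (returning a noise estimate with the same shape as its input) and `prompt_embeds` holds the embeddings of the null prompt:

```python
import torch

def one_step_denoise(unet, x_t, t, alphas_cumprod, prompt_embeds):
    """Estimate the clean image by predicting the noise and inverting
    the forward equation."""
    with torch.no_grad():
        eps_hat = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
    a_bar = alphas_cumprod[t]
    # x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps  =>  solve for x_0
    x0_hat = (x_t - torch.sqrt(1 - a_bar) * eps_hat) / torch.sqrt(a_bar)
    return x0_hat
```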
To improve further, we can iteratively denoise the image by stepping through the timesteps and removing a little of the predicted noise at each one. At every step we form an estimate of the clean image from the model's noise prediction, blend a scaled version of that estimate with the current noisy image (the blending weights come from the same noise schedule used in the forward process), and then add back a small amount of fresh noise to aid our "search" for the optimal image. Once we reach the final timestep, we have our clean image, or at least our best estimate of it.
We create our timesteps as a strided schedule from 990 down to 0 in steps of 30, and show the denoising process every fifth step, starting from t = 690.
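A sketch of the iterative denoising loop under the same assumptions as above; the blending weights follow the standard DDPM update, and the added-noise term is simplified here:

```python
import torch

def iterative_denoise(unet, x, alphas_cumprod, prompt_embeds, i_start=0):
    """Denoise along the strided schedule 990, 960, ..., 0."""
    strided_timesteps = list(range(990, -1, -30))
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_next = strided_timesteps[i], strided_timesteps[i + 1]
        a_bar, a_bar_next = alphas_cumprod[t], alphas_cumprod[t_next]
        alpha = a_bar / a_bar_next              # per-step alpha
        beta = 1 - alpha
        with torch.no_grad():
            eps_hat = unet(x, t, encoder_hidden_states=prompt_embeds).sample
        # Current clean-image estimate from the noise prediction
        x0_hat = (x - torch.sqrt(1 - a_bar) * eps_hat) / torch.sqrt(a_bar)
        # Blend the clean estimate with the current image, plus fresh noise
        noise = torch.randn_like(x) if t_next > 0 else torch.zeros_like(x)
        x = (torch.sqrt(a_bar_next) * beta / (1 - a_bar) * x0_hat
             + torch.sqrt(alpha) * (1 - a_bar_next) / (1 - a_bar) * x
             + torch.sqrt(beta) * noise)        # simplified variance term
    return x
```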
Now that we have an iterative denoising framework in place, we can begin to sample, i.e. generate, novel images from the model. We accomplish this by feeding an image of pure noise into the denoising process, so that the generated image depends entirely on the random path the model takes toward the image domain. Using the null prompt "a high quality photo", we generate 5 random images below:
We can improve the quality and coherence of our generated images with classifier-free guidance (CFG): we also generate an unconditional noise estimate using the empty prompt "", then combine the two estimates by adding 7.0 times their difference to the unconditional one. This scale factor controls the strength of the guidance. We get the following images using CFG:
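A sketch of the guidance step, with `cond_embeds` and `uncond_embeds` standing in for the embeddings of the text prompt and the empty prompt:

```python
import torch

def cfg_noise_estimate(unet, x, t, cond_embeds, uncond_embeds, scale=7.0):
    """Classifier-free guidance: push the conditional noise estimate away
    from the unconditional one by `scale` times their difference."""
    with torch.no_grad():
        eps_cond = unet(x, t, encoder_hidden_states=cond_embeds).sample
        eps_uncond = unet(x, t, encoder_hidden_states=uncond_embeds).sample
    return eps_uncond + scale * (eps_cond - eps_uncond)
```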
We can edit existing images by adding noise to them and then using CFG to denoise them back into the image domain. The more noise we add, the further the result drifts from the original, leaving room for "creative" interpretation. The following images use different i_start values, where i_start indexes the starting timestep of the denoising schedule and i_start = 0 is the noisiest.
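As a usage sketch, reusing the hypothetical `forward` and `iterative_denoise` helpers above (with the CFG estimate substituted for the plain one inside the loop); `im` is the clean input image and `i_start` the chosen start index:

```python
# Edit an existing image (sketch): noise it to the i_start-th timestep of
# the strided schedule, then denoise from that index onward.
strided_timesteps = list(range(990, -1, -30))
x_noisy, _ = forward(im, strided_timesteps[i_start], alphas_cumprod)
edited = iterative_denoise(unet, x_noisy, alphas_cumprod, prompt_embeds,
                           i_start=i_start)
```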
Campanile test image:
We also see the results for two other test images:
Bottle on a table:
Laptop:
This technique can also be used with images from the web or hand-drawn images. In the first example, we start with a cartoon image of a penguin from the web:
We also try this on a couple hand-drawn images:
Some buildings:
Some "trees":
We can also generate only within selected parts of the image, i.e. "inpaint" it: we create a mask over the region we want to replace, and after each step of the diffusion process we reset everything outside the mask to a correspondingly noised copy of the original image. This way we generate new content only inside the mask while keeping the seam between the inside and outside relatively clean.
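A sketch of the per-step replacement, reusing the hypothetical `forward` helper above (`mask` is 1 inside the region to regenerate):

```python
# Inside the denoising loop, after computing the new x at timestep t_next:
# force everything outside the mask back to a correspondingly noised copy
# of the original image, so only the masked region is regenerated.
x_orig_noisy, _ = forward(original_im, t_next, alphas_cumprod)
x = mask * x + (1 - mask) * x_orig_noisy
```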
The Campanile test image:
We also test on two more images/masks:
Water bottle:
Computer:
So far, we have been using a generic prompt and letting generation wander in rather random directions. If we instead use a descriptive text prompt to guide the model, the results look much more like the goal we have in mind. In the following images, we use the prompt "a rocket ship" at various i_start levels on the Campanile test image.
In this example, we use the prompt "a lithograph of waterfalls" with the image of bottles on a table:
And finally, we have an image of a pair of chopsticks prompted with "a pencil":
We can also create optical illusions with diffusion models by adding an intermediate step to the denoising process. If we want an image that looks like one thing right side up and another thing upside down, we perform the denoising step normally with the first prompt, then flip the image, denoise it with the second prompt, flip that estimate back, and average the two at each step. The results of this process for the prompts "an oil painting of an old man" and "an oil painting of people around a campfire" are shown below:
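A sketch of the per-step flip-and-average noise estimate, under the same assumptions as the earlier snippets:

```python
import torch

def anagram_noise_estimate(unet, x, t, embeds_a, embeds_b):
    """Visual anagram: denoise normally with prompt A, denoise the
    vertically flipped image with prompt B, flip that estimate back,
    and average the two."""
    with torch.no_grad():
        eps_a = unet(x, t, encoder_hidden_states=embeds_a).sample
        eps_b = unet(torch.flip(x, dims=[-2]), t,
                     encoder_hidden_states=embeds_b).sample
    return (eps_a + torch.flip(eps_b, dims=[-2])) / 2
```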
Also using "an oil painting of an old man" and "a man wearing a hat":
And "a lithograph of waterfalls" with "a photo of a man":
We can create a similar effect with high and low frequencies, so that one image appears when squinting (low frequencies) and another up close (high frequencies). The process is very similar, except that instead of flipping, we take the low frequencies of the noise estimate for one prompt and the high frequencies of the estimate for the other prompt, then combine them at each step.
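A sketch of the combined estimate, using a Gaussian blur as the low-pass filter (the kernel size of 33 is an illustrative choice):

```python
import torch
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, x, t, embeds_low, embeds_high):
    """Hybrid image: low-pass the noise estimate for one prompt,
    high-pass the estimate for the other, and sum them."""
    with torch.no_grad():
        eps_low = unet(x, t, encoder_hidden_states=embeds_low).sample
        eps_high = unet(x, t, encoder_hidden_states=embeds_high).sample
    low = TF.gaussian_blur(eps_low, kernel_size=33, sigma=2.0)
    high = eps_high - TF.gaussian_blur(eps_high, kernel_size=33, sigma=2.0)
    return low + high
```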
The first example has low freq prompt "a lithograph of a skull" and high freq prompt "a lithograph of waterfalls":
The second picture has "a lithograph of a skull" and "a photo of a dog":
The third picture has "an oil painting of an eye" with "an oil painting of an apple":
In this next part, we implement diffusion models from scratch using the U-Net architecture. We train our models on the MNIST handwritten-digit dataset, then apply the same denoising and sampling procedures to generate digits from pure noise.
Our first goal is to build a one-step denoiser like the one in part A.
Using PyTorch, we implement the general structure of the U-Net as described below:
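A condensed sketch of such a U-Net in PyTorch for 28x28 MNIST images; the channel counts, depth, and normalization choices here are illustrative rather than the exact configuration:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv -> BatchNorm -> GELU, applied twice (basic building block)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.GELU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)

class UNet(nn.Module):
    """Minimal encoder-decoder with skip connections for 28x28 images."""
    def __init__(self, in_ch=1, hidden=64):
        super().__init__()
        self.enc1 = ConvBlock(in_ch, hidden)
        self.enc2 = ConvBlock(hidden, hidden * 2)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = ConvBlock(hidden * 2, hidden * 2)
        self.up1 = nn.ConvTranspose2d(hidden * 2, hidden * 2, 2, stride=2)
        self.dec1 = ConvBlock(hidden * 4, hidden)
        self.up2 = nn.ConvTranspose2d(hidden, hidden, 2, stride=2)
        self.dec2 = ConvBlock(hidden * 2, hidden)
        self.out = nn.Conv2d(hidden, in_ch, 1)

    def forward(self, x):
        s1 = self.enc1(x)                                       # 28x28
        s2 = self.enc2(self.pool(s1))                           # 14x14
        b = self.bottleneck(self.pool(s2))                      # 7x7
        d1 = self.dec1(torch.cat([self.up1(b), s2], dim=1))     # 14x14
        d2 = self.dec2(torch.cat([self.up2(d1), s1], dim=1))    # 28x28
        return self.out(d2)
```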
We first need a way to apply noise to clean images, in this case standard normal noise scaled by a constant sigma. The noise levels are shown below for varying sigma:
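Concretely, the noising is a single line (sketch):

```python
import torch

z = x + sigma * torch.randn_like(x)   # noisy version of clean image x at noise level sigma
```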
In order to train the model, we take clean images of digits and apply Gaussian noise to them with sigma = 0.5, then run gradient descent on the mean squared error between the clean image and predicted output. Below is the training loss curve over 5 epochs.
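A training-loop sketch, assuming `train_loader` yields MNIST batches and `device` is set; the optimizer and learning rate are illustrative, while sigma = 0.5, MSE loss, and 5 epochs follow the description above:

```python
import torch
import torch.nn.functional as F

model = UNet().to(device)                      # the denoiser sketched above
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for epoch in range(5):
    for x, _ in train_loader:                  # MNIST batches; labels unused here
        x = x.to(device)
        z = x + 0.5 * torch.randn_like(x)      # noisy input, sigma = 0.5
        loss = F.mse_loss(model(z), x)         # predict the clean image
        opt.zero_grad()
        loss.backward()
        opt.step()
```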
Testing our model on test-set images, also noised with sigma = 0.5, yields the following results after 1 epoch of training:
and the following results after epoch 5:
Further, we also test out of distribution on sigma values other than 0.5, noting the lower quality at the higher noise levels the model was not trained on:
Now, instead of directly predicting the clean image from the noisy one, we predict the noise in the image, much like the DeepFloyd model in part A. We create a schedule of noise variances beta_t, the corresponding alpha_t = 1 - beta_t, and their cumulative products, then iteratively subtract small amounts of predicted noise from the image as in part A.
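A sketch of such a schedule; the number of timesteps and the beta range are illustrative choices:

```python
import torch

T = 300                                        # number of diffusion timesteps
betas = torch.linspace(1e-4, 0.02, T)          # beta_t
alphas = 1.0 - betas                           # alpha_t = 1 - beta_t
alphas_cumprod = torch.cumprod(alphas, dim=0)  # a_bar_t = prod of alpha_s, s <= t
```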
To additionally condition the model on the timestep t, we inject the timestep into the network through a small fully connected block whose output is summed into intermediate feature maps. The timestep is normalized to [0, 1] before injection so the network sees inputs in a consistent range.
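A sketch of the injection, with an assumed `FCBlock` helper; the trailing comment shows how its output would be summed into a feature map:

```python
import torch
import torch.nn as nn

class FCBlock(nn.Module):
    """Small fully connected block used to embed the timestep."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, t):
        # t: (batch, 1) tensor of timesteps already normalized to [0, 1]
        return self.net(t)

# Inside the time-conditioned U-Net's forward pass (sketch), the embedding
# is broadcast over the spatial dimensions and summed into a feature map:
#   t_emb = self.t_block(t)                    # (B, C)
#   feat  = feat + t_emb[:, :, None, None]     # (B, C, H, W)
```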
Training the model follows a similar process to before, but this time we draw a random t for each image and noise it to that level before feeding it in, so the model is exposed to every noise level and timestep. The training loss curve (on a log scale) is shown below:
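A sketch of this training loop, assuming `model` is now the time-conditioned U-Net (taking the normalized timestep as a second input) and that the schedule, `train_loader`, and `device` from above are available; the optimizer and learning rate are illustrative:

```python
import torch
import torch.nn.functional as F

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
a_bar = alphas_cumprod.to(device)
for epoch in range(20):                        # results are shown after epochs 5 and 20
    for x, _ in train_loader:
        x = x.to(device)
        t = torch.randint(0, T, (x.shape[0],), device=device)   # random t per image
        eps = torch.randn_like(x)
        ab = a_bar[t].view(-1, 1, 1, 1)
        x_t = torch.sqrt(ab) * x + torch.sqrt(1 - ab) * eps      # noise each image to its t
        eps_hat = model(x_t, (t.float() / T).unsqueeze(1))       # normalized timestep
        loss = F.mse_loss(eps_hat, eps)                          # regress onto the true noise
        opt.zero_grad()
        loss.backward()
        opt.step()
```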
Finally, we can sample from the model as in part A to generate digits from pure noise: we start with a pure-noise image and iteratively denoise it over decreasing timesteps until we reach timestep 0, which should yield a clean digit.
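A sampling sketch under the same assumptions (schedule tensors `betas`, `alphas`, `alphas_cumprod`, and `T` as defined above):

```python
import torch

@torch.no_grad()
def sample(model, n=16):
    """Generate n digits: start from pure noise and step t = T-1, ..., 0,
    removing a little predicted noise at each step (standard DDPM update)."""
    x = torch.randn(n, 1, 28, 28, device=device)
    for t in range(T - 1, -1, -1):
        t_norm = torch.full((n, 1), t / T, device=device)        # normalized timestep
        eps_hat = model(x, t_norm)
        alpha = alphas[t].item()
        a_bar = alphas_cumprod[t].item()
        beta = betas[t].item()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = (x - beta / (1 - a_bar) ** 0.5 * eps_hat) / alpha ** 0.5 + beta ** 0.5 * z
    return x
```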
Sampling results after epoch 5:
Results after epoch 20:
There are clearly some problems with this approach, as many of the generated digits do not make sense. We fix this in the next section.
We can additionally condition our model on the class (the digit 0-9) of the image. This lets us sample digits from a specific class by passing in a one-hot vector representing that digit; passing a vector of all zeros instead gives unconditional generation. To account for this null condition, we randomly set the class vector to zero during training with probability 10%. The training loss curve (log scale) is shown below:
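A sketch of the class-conditioning step inside the training loop, with `labels` being the batch's digit labels and the rest as in the time-conditioned training sketch:

```python
import torch
import torch.nn.functional as F

# One-hot class vector, dropped to the all-zeros "null" class with prob. 10%
c = F.one_hot(labels.to(device), num_classes=10).float()
keep = (torch.rand(c.shape[0], device=device) >= 0.1).float().unsqueeze(1)
c = c * keep                                   # zeroed rows act as "no class"
eps_hat = model(x_t, t_norm, c)                # class- and time-conditioned U-Net
```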
Much like in part A, we need classifier-free guidance to get good results, so we generate a null-conditioned estimate alongside the class-conditioned one and combine them with a scale factor of 5.0 on their difference. Otherwise the sampling process is the same as the other iterative denoising loops.
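A sketch of the guided estimate inside the sampling loop, with `c_onehot` being the one-hot class vectors for the digits being generated:

```python
# Classifier-free guidance at sampling time (scale = 5.0): combine the
# class-conditioned and null-conditioned noise estimates at each step.
eps_cond = model(x, t_norm, c_onehot)
eps_uncond = model(x, t_norm, torch.zeros_like(c_onehot))
eps_hat = eps_uncond + 5.0 * (eps_cond - eps_uncond)
```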
Sampling after epoch 5:
After epoch 20: