Stable Diffusion can be used for many applications. Two interesting ones are image segmentation and inpainting, which leverage the text-to-image capabilities of diffusion models to generate photo-realistic images by replacing objects selected with masks.

In this article, we will perform image segmentation from a text prompt; once we confirm that the prompt selects the right object, we move to the next step, which is replacing it with something else.

Setup

First, we need to install some libraries, including diffusers, and import all the needed modules.

%%capture
%%bash

pip install --upgrade accelerate diffusers transformers

from PIL import Image
import requests
import torch
from torch import autocast
import matplotlib.pyplot as plt
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation
from diffusers import DiffusionPipeline
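
The inpainting pipeline below expects a CUDA GPU, so it is worth checking that one is visible before going further (a quick, optional sanity check):

# Sanity check: the rest of this walkthrough assumes a CUDA GPU is available
print("CUDA available:", torch.cuda.is_available())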

Segmentation

To perform the inpainting task we need a mask for the objects that will be replaced. Instead of creating the mask manually, we will leverage CLIPSeg, a CLIP-based model capable of performing zero-shot image segmentation given a text prompt. To learn more about this model, you can refer to the paper.

This model is available on Hugging Face (CIDAS/clipseg-rd64-refined), so let's get it.

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

We need an image to play with, so let's download one (you can use a different image too).

url = "https://images.unsplash.com/photo-1587080413959-06b859fb107d?ixlib=rb-4.0.3&dl=guy-basabose-FzdEbrA3Qj0-unsplash.jpg&w=512"
image = Image.open(requests.get(url, stream=True).raw).resize((512, 512))
image

Let's try selecting the cup in this picture, as well as something else, just so we can see how the model performs.

prompts = ["a cup", "beans"]

inputs = processor(text=prompts, images=[image] * len(prompts), padding="max_length", return_tensors="pt")
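
If you are curious about what the processor produced, you can inspect the shapes of the returned tensors (the tokenized prompts plus one preprocessed copy of the image per prompt):

# Inspect the processor output: tokenized prompts and preprocessed pixel values
{k: v.shape for k, v in inputs.items()}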

After processing the inputs, we can run the model to generate a mask for each text prompt.

with torch.no_grad():
  outputs = model(**inputs)

preds = outputs.logits.unsqueeze(1)

Now, let's visualize the predictions:

num_preds = len(prompts)
_, ax = plt.subplots(1, num_preds+1, figsize=(15, 4))
[a.axis('off') for a in ax.flatten()]
ax[0].imshow(image)
[ax[i+1].imshow(torch.sigmoid(preds[i][0])) for i in range(num_preds)];
[ax[i+1].text(0, -15, prompts[i]) for i in range(num_preds)];

This is great: the model is able to select the cup when we pass the prompt "a cup". Let's move on to the next stage and see if we can replace this coffee cup with some delicious desserts.
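
As an aside, the custom inpainting pipeline we use below takes care of masking internally, but if you ever want an explicit mask (for example, to feed the standard StableDiffusionInpaintPipeline), you can threshold the sigmoid of the logits yourself. A minimal sketch, assuming a 0.5 threshold and the prediction for "a cup":

# Optional: turn the "a cup" prediction into a binary mask image (0.5 threshold is an assumption)
import numpy as np

cup_mask = torch.sigmoid(preds[0][0]) > 0.5
mask_image = Image.fromarray((cup_mask.numpy() * 255).astype(np.uint8)).resize((512, 512))
mask_image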

Inpainting

The inpainting pipeline is available on Hugging Face (see stable-diffusion-inpainting), so let's download it.

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    custom_pipeline="text_inpainting",
    segmentation_model=model,
    segmentation_processor=processor
)
pipe = pipe.to("cuda")
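
If you are running on a GPU with limited memory, diffusers pipelines can trade a bit of speed for memory via attention slicing; enabling it here is optional and assumes this custom pipeline exposes the method:

# Optional: reduce peak GPU memory usage at a small speed cost
pipe.enable_attention_slicing()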

Let's first try the pipeline on the task of replacing the cup with, why not, a cupcake!

text = "a cup"      # will mask out this text
prompt = "cupcake"  # the masked out region will be replaced with this

with autocast("cuda"):
    # keep the original image intact so we can reuse it in the loop below
    result = pipe(image=image, text=text, prompt=prompt).images[0]
result
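
To see the edit more clearly, here is a quick matplotlib sketch comparing the original image with the inpainted result:

# Show the original image and the inpainted result side by side
_, ax = plt.subplots(1, 2, figsize=(10, 5))
ax[0].imshow(image); ax[0].set_title("original"); ax[0].axis("off")
ax[1].imshow(result); ax[1].set_title(prompt); ax[1].axis("off")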

This looks great! Let's try different replacement objects and store the generated image each time.

%%time

text = "a cup"
prompts = ["cupcake", "cheesecake", "ice cream", "Butterscotch budino", "tiramisu", "panna cotta", "cannoli", "mascarpone", "Affogato chocolate mousse", "granita", "Zuccotto", "syllabub", "semifreddo", "chocolate cappuccino buttercream", "Florentines", "zabaglione"]

for prompt in prompts:
    output_image = pipe(image=image, text=text, prompt=prompt).images[0]
    filename = f'{prompt.replace(" ", "_")}.jpeg'
    output_image.save(filename)
CPU times: user 6min 52s, sys: 41.3 s, total: 7min 34s
Wall time: 7min 37s
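
One thing to keep in mind: the ffmpeg glob pattern below picks the frames up in alphabetical order, not in the order of the prompts list. If the order matters to you, one option is to rename the saved frames with an index prefix (a small sketch, reusing the filenames produced by the loop above):

# Rename the saved frames so alphabetical order matches the prompt order
import os
for i, prompt in enumerate(prompts):
    name = f'{prompt.replace(" ", "_")}.jpeg'
    os.rename(name, f'{i:02d}_{name}')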

Let's make a video out of those images; check the result here.

!rm -rf cakes.mp4
!ffmpeg -framerate 1 -f image2 -s 512x512 -pattern_type glob -i '*.jpeg' -vcodec libx264 -crf 10 -pix_fmt yuv420p cakes.mp4
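
If you are working in a notebook, you can also preview the video inline (optional):

# Preview the generated video inside the notebook
from IPython.display import Video
Video("cakes.mp4", embed=True)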

That's all folks

Stable Diffusion is flexible and enables many interesting applications. In this article, we saw how CLIPSeg can perform zero-shot segmentation to select objects from a text description. Then, we used a Stable Diffusion inpainting model to replace the selected objects with something else.

I hope you enjoyed this article. Feel free to leave a comment or reach out on Twitter @bachiirc.