VQGAN+CLIP Tutorial

December 13, 2021

VQGAN+CLIP is a text-to-image model that generates images of variable size given a set of text prompts (and some other parameters). In this guide I'll show you how to generate your first image and how to go further for more complex results.

All the following steps are based on this notebook: https://colab.research.google.com/drive/1ZAus_gn2RhTZWzOWUpPERNC0Q8OhZRTZ

So, what is this notebook? Basically, it's someone's code that you can run on Google's servers. To do this you need to connect your Google account so it can assign you the resources to run the code. To start, click Connect:

1.png

After that comes the first part of the code: the selection of libraries. By default the best one is already selected, so don't touch anything here:

2.png

Then, the main part: this is where you configure what you want to generate:

3.png

text: the actual text prompt you want to generate an image from

width, height: size of the output image. Be aware that the default 600x600 already uses a lot of resources; a bigger image will be blocked. I suggest staying within those numbers.

model: don't touch

images_interval: while the program is running, it will show the current result every N iterations; in this case, every 50. Better not to go above 50 or it will take too long to check how the process is going. I was using 10 so I could see much faster how the image was forming.

init_image: by default the process starts from a blank image. If you want to start from an initial image instead, go to the left panel, click the folder icon, and upload a picture. Then change this field to the file name:

4.png

The same applies to target_images.

seed: (edit: this notebook has 42 as the default seed; to get a "random seed" use -1) with a random seed you will always get different results. If you want to get the same result every time, find the seed used after initializing the program and put that number here.

max_iterations: I found that after 350-400 iterations the image stops changing, so 200 is a fair endpoint, but you can try 400 and adjust accordingly. Putting all of this together, the settings cell might look something like the sketch below.
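This is just a hypothetical illustration of the fields described above, not the notebook's exact cell; the values (and the model checkpoint name) are examples, so defer to whatever the notebook preselects:

```python
# Hypothetical values for the notebook's settings cell.
# The parameter names match the fields described above; the values are examples.
text = "a lighthouse in a storm"    # the text prompt
width = 600                         # stay around 600x600 or the runtime may block it
height = 600
model = "vqgan_imagenet_f16_16384"  # the usual preselected checkpoint; don't touch
images_interval = 10                # show a preview every 10 iterations
init_image = ""                     # e.g. "myphoto.png" after uploading it in the left panel
target_images = ""
seed = -1                           # -1 = random seed; a fixed number reproduces the same result
max_iterations = 200                # ~350-400 is where the image usually stops changing
```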

And that's it! After setting everything up, you run the code via Runtime > Run all.

5.png

Let's do an example with the following prompt:

6.png

After executing, we start seeing the results of the iterations:

10.png

i=0 & i=10

11.png

i=50

12.png

i=100

13.png

i=150

14.png

i=200

15.png

After it finishes, all 200 images are stored here:

16.png

The program will now generate a video from all these images, video.mp4, which of course you can download:

17.png

If you open up the Generate Video code, you can modify some parameters like the total duration of the video; if you know ffmpeg, you can do whatever you like in there:

18.png
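For the curious, a cell like that typically shells out to ffmpeg to stitch the saved frames into a video. Here's a minimal sketch, assuming the frames were saved as steps/0001.png, steps/0002.png, and so on (the folder, filename pattern, and frame rate in the actual notebook may differ):

```python
import subprocess

fps = 10  # lower input frame rate = longer video; raise it for a shorter one

# Assumes the iteration previews were saved as steps/0001.png, steps/0002.png, ...
subprocess.run([
    "ffmpeg", "-y",
    "-framerate", str(fps),   # input frame rate controls total duration
    "-i", "steps/%04d.png",   # numbered frame pattern
    "-c:v", "libx264",        # H.264 for broad player compatibility
    "-pix_fmt", "yuv420p",    # pixel format required by many players
    "video.mp4",
], check=True)
```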

After this, it's just a matter of playing around and finding ways to get different results :)

Some examples:

We can use the last iteration's image as the init_image; that way the next run, whatever text you use, will start from the last "frame" of the previous video. You can generate many videos like this and later join them in any video editor to make one longer, continuous video (or stitch them with ffmpeg, as sketched below).
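If you'd rather skip the video editor, ffmpeg's concat demuxer can join the clips directly. A sketch, using hypothetical clip names (part1.mp4, part2.mp4, ...):

```python
import subprocess

# Hypothetical clip names; list them in playback order.
clips = ["part1.mp4", "part2.mp4", "part3.mp4"]

# The concat demuxer reads a text file listing one clip per line.
with open("list.txt", "w") as f:
    for clip in clips:
        f.write(f"file '{clip}'\n")

# "-c copy" joins without re-encoding, which works when all clips share the
# same resolution and codec (they will if they came from the same notebook).
subprocess.run([
    "ffmpeg", "-y", "-f", "concat", "-safe", "0",
    "-i", "list.txt", "-c", "copy", "long_video.mp4",
], check=True)
```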

Another trick: VQGAN+CLIP accepts adjectives, styles, etc. You can go nuts here! You can use things like the following (see the example prompt after these lists):

Based on an artist

Beksinski style / Dali style / Van Gogh style / Giger style / Monet style / Klimt / Katsuhiro Otomo style / Goya / Michelangelo (Sistine Chapel style) / Joaquin Sorolla style / Moebius style / in Raphael style

Quality or camera distortion:

4k / chromatic aberration effect / cinematic effect / diorama / dof / depth of field / field of view / fisheye lens effect / photorealistic / hyperrealistic / raytracing / shaders / stop motion / tilt-shift photography / ultrarealistic / vignetting.
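As a hypothetical example, combining a subject with an artist style and a couple of quality modifiers in the text field might look like this:

```python
# Hypothetical prompt mixing a subject, an artist style, and quality modifiers
text = "a lighthouse in a storm, Beksinski style, 4k, chromatic aberration effect"
```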

Examples of these are here: https://imgur.com/a/SALxbQm

You can also try different notebooks, each with its own possibilities:

Latent3Visions

AlephXMoving

BigGAN DX

Aleph Dancer

Aleph 5.3

Text2Voxel

Get creative!