The architecture used for Parti (that’s what the authors call this model) is fairly simple. It’s a transformer encoder-decoder paired with a ViT-VQGAN at the end to tokenize/detokenize images.
For an autoregressive model to work, we basically have to convert everything to tokens. Tokenizing text is super easy. The problem in this case is that we also have to convert images into a sequence of tokens.
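Just to make “everything becomes tokens” concrete, here is a toy sketch. The whitespace splitting and the vocabulary are made up purely for illustration; real models use learned subword tokenizers.

```python
# Toy sketch of "text to tokens" (real systems use a learned subword tokenizer
# such as BPE/SentencePiece; this vocabulary is invented for illustration).
vocab = {"a": 0, "cat": 1, "in": 2, "tree": 3}

def tokenize(text: str) -> list[int]:
    # Map each whitespace-separated word to its integer id.
    return [vocab[word] for word in text.lower().split()]

print(tokenize("a cat in a tree"))  # [0, 1, 2, 0, 3]
```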
This tokenization of images is done by the VQGAN tokenizer. I have explained how this happens in one of my old blog posts.
The “Q” in VQGAN stands for “quantized”: the VQGAN’s token space is quantized to a codebook of roughly 8k tokens. I think I should read the ViT-VQGAN paper in case I ever have to work on image generation (which I hope might be soon).
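Roughly, the quantization step amounts to a nearest-neighbour lookup into that codebook. Here’s a sketch assuming an ~8k-entry codebook; the sizes and random vectors are illustrative, not the actual ViT-VQGAN configuration.

```python
import numpy as np

# Each patch embedding coming out of the image encoder is snapped to its
# nearest entry in a learned codebook; the index of that entry is the image
# "token". Sizes here are toy values, not the real ViT-VQGAN config.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8192, 32))          # ~8k learned code vectors
patch_embeddings = rng.normal(size=(256, 32))   # embeddings for one image's patches

# Squared L2 distance from every patch embedding to every codebook entry.
dists = (
    (patch_embeddings ** 2).sum(1, keepdims=True)
    - 2 * patch_embeddings @ codebook.T
    + (codebook ** 2).sum(1)
)
image_tokens = dists.argmin(axis=1)  # shape (256,), each id in [0, 8192)
print(image_tokens[:8])
```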
Essentially, we do the following for training:

1. We take a text-image pair (T, I).
2. We feed the text T into the transformer encoder.
3. We tokenize the image I into a sequence of image tokens using the ViT-VQGAN image tokenizer.
4. We feed these tokens [<SoT>, i1, i2, ..., iM] (M image patches) into the transformer decoder (along with a start of sentence <SoT> token).
5. The decoder learns to predict the image tokens, which can be detokenized back into I_reconstructed.

Another thing to note is that the paper mentions they used an image upscaling model to resize the output image from 256x256 to 1024x1024.
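In code, the training step above boils down to teacher-forced next-token prediction over the image tokens. Below is a minimal sketch of that objective with toy sizes; this is not the authors’ code (the real model is a 20B-parameter transformer), just the shape of the idea.

```python
import torch
import torch.nn as nn

# Toy sketch of the training objective: given the encoded text, the decoder is
# trained (with teacher forcing) to predict image token t+1 from tokens <= t.
TEXT_VOCAB, IMAGE_VOCAB, D = 1000, 8192 + 1, 64  # +1 class for the <SoT> token
SOT = IMAGE_VOCAB - 1

text_emb = nn.Embedding(TEXT_VOCAB, D)
image_emb = nn.Embedding(IMAGE_VOCAB, D)
transformer = nn.Transformer(d_model=D, nhead=4, num_encoder_layers=2,
                             num_decoder_layers=2, batch_first=True)
to_logits = nn.Linear(D, IMAGE_VOCAB)

# One (T, I) pair: text token ids and the M image tokens from the ViT-VQGAN.
text_tokens = torch.randint(0, TEXT_VOCAB, (1, 16))
image_tokens = torch.randint(0, 8192, (1, 256))  # M = 256 here

# Decoder input is [<SoT>, i1, ..., i(M-1)]; targets are [i1, ..., iM].
decoder_in = torch.cat([torch.full((1, 1), SOT), image_tokens[:, :-1]], dim=1)
causal_mask = transformer.generate_square_subsequent_mask(decoder_in.size(1))

hidden = transformer(text_emb(text_tokens), image_emb(decoder_in),
                     tgt_mask=causal_mask)
loss = nn.functional.cross_entropy(to_logits(hidden).flatten(0, 1),
                                   image_tokens.flatten())
loss.backward()  # in practice: optimizer step, large batches, etc.
print(float(loss))
```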
And for inference (generating an image from a text prompt):

1. We feed the text prompt into the transformer encoder.
2. We feed the <SoT> token into the transformer decoder and obtain i1.
3. We feed [<SoT>, i1] into the transformer decoder and obtain i2.
4. We keep feeding the growing sequence [<SoT>, i1, i2, i3, ...] back in until we reach iM, where M is the number of image patches required to construct one image.
5. Once we have [i1, i2, i3, ..., iM], we feed it into the ViT-VQGAN image detokenizer and obtain the “generated” image.

In a nutshell, we autoregressively generate image tokens with the decoder.
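That sampling loop looks roughly like the sketch below. Here `decoder_logits` is just a placeholder for “run the encoder on the prompt and the decoder on the tokens generated so far”; it returns random logits instead of calling a real model.

```python
import torch

# Autoregressive sampling sketch: start from <SoT>, repeatedly feed the growing
# token sequence back into the decoder, stop after M image tokens, then hand
# them to the ViT-VQGAN detokenizer.
IMAGE_VOCAB, M, SOT = 8192, 256, 8192

def decoder_logits(text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
    # Stand-in for: encode the prompt, run the decoder on `image_tokens`,
    # return logits over the image vocabulary for the next position.
    return torch.randn(IMAGE_VOCAB)

text_tokens = torch.randint(0, 1000, (16,))  # tokenized prompt (toy)
generated = [SOT]
for _ in range(M):
    logits = decoder_logits(text_tokens, torch.tensor(generated))
    next_token = torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
    generated.append(next_token)

image_tokens = generated[1:]  # M ids; these would go to the ViT-VQGAN detokenizer
print(len(image_tokens), image_tokens[:8])
```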
The largest model they trained was a 20B param model. Yannic was saying that complaining about not getting the exact composition in the generated images is already like “moving the goalposts” at this point, given how far we’ve come from the 2018 StyleGAN era.
I am barely qualified to comment on how good Parti or DALL-E 2 is, but my understanding is that words are a very lossy way to compress the idea of an image.
You can compress a picture of a cat in a tree down to “cat in tree”; the decompression just generates a picture of a cat in a tree, and the loss is that it may be a different cat in a different tree.
My opinion is that in order to solve the problem of compositionality, it makes sense to look into the model itself and its mechanisms instead of just “going bigger”.
Going big is what companies with deep pockets like OpenAI/Google can do. But to me it also sounds like a rather lazy solution, given how little effort people put into understanding these black-box-y systems.