How do they render text as an image?
Text inputs are rendered onto blank images and are subsequently handled entirely as images: the text is literally “written” onto a blank canvas, which is then treated like any other image. A side benefit of this approach is that it removes the need for a tokenizer.
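As a minimal sketch of what “writing text on a blank image” could look like, here is a Pillow-based renderer. The canvas size, font, and naive word-wrapping are assumptions for illustration, not the exact setup used in the paper:

```python
from PIL import Image, ImageDraw, ImageFont

def render_text_as_image(text, size=(224, 224)):
    """Render a text string onto a blank white canvas so it can be
    processed by an image encoder like any other picture."""
    img = Image.new("RGB", size, color="white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # a real pipeline would load a proper TTF font
    # Naive word wrapping so longer prompts stay on the canvas.
    words, lines, line = text.split(), [], ""
    for word in words:
        candidate = (line + " " + word).strip()
        if draw.textlength(candidate, font=font) <= size[0] - 10:
            line = candidate
        else:
            lines.append(line)
            line = word
    lines.append(line)
    draw.multiline_text((5, 5), "\n".join(lines), fill="black", font=font)
    return img

img = render_text_as_image("a photo of a cat sitting on a windowsill")
```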
A common loss objective for these kinds of tasks is the contrastive loss.
When training on image/alt-text pairs, two encoders are typically trained with a contrastive loss that encourages the embeddings of corresponding images and alt-texts to be similar, while pushing them away from the embeddings of all other images and alt-texts in the batch.
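A sketch of this symmetric, CLIP-style contrastive loss in PyTorch; the temperature value and cosine normalization are common choices rather than the specific hyperparameters used here:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss over a batch of paired
    image and (rendered) text embeddings, each of shape (batch, dim)."""
    # Normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarities: logits[i, j] = sim(image_i, text_j) / temperature
    logits = image_emb @ text_emb.t() / temperature
    # The matching pair for row i sits in column i.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # match each image to its text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # match each text to its image
    return (loss_i2t + loss_t2i) / 2
```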
Questions
- What about text prompts that are too long? What do you do then?
- Can we move smoothly between visual concepts by interpolating the “image containing the text”?