DreamSim

Everything related to this paper can be found here: https://dreamsim-nights.github.io/

"This sense of sameness is the very keel and backbone of our thinking." - William James, 1890

Our understanding of the visual world depends crucially on our ability to perceive the similarities between different images.

The vision community has tried to quantify image similarity with metrics such as PSNR, SSIM, LPIPS, and DISTS. The primary weakness of these metrics is that they focus on pixel- or patch-level information and fail to capture high-level structure; for example, two renderings of the same scene that differ by a small spatial shift can score as very dissimilar.
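
For concreteness, here is a minimal sketch of two of these low-level metrics computed with scikit-image; the random images are placeholders, not data from the paper.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

rng = np.random.default_rng(0)
reference = rng.random((64, 64))  # placeholder grayscale image in [0, 1]
distorted = np.clip(reference + 0.05 * rng.standard_normal((64, 64)), 0.0, 1.0)

# Both metrics compare pixel-level statistics; neither "sees" semantic content.
print("PSNR:", peak_signal_noise_ratio(reference, distorted, data_range=1.0))
print("SSIM:", structural_similarity(reference, distorted, data_range=1.0))
```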

This inability to encode high-level information has been partially addressed by self-supervised and vision-language encoders such as DINO and CLIP. However, the authors find that although these models' embeddings work reasonably well for measuring image-to-image distances, they are not necessarily well aligned with human perceptual judgments.
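
In practice, "distance" with such encoders usually means cosine distance between embedding vectors. A minimal sketch, assuming some encoder that maps an image to a feature vector (the placeholder embeddings below stand in for, e.g., a CLIP or DINO image tower):

```python
import torch
import torch.nn.functional as F

def embedding_distance(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    """Cosine distance 1 - cos(a, b) between two embedding vectors."""
    return 1.0 - F.cosine_similarity(emb_a, emb_b, dim=-1)

# Placeholders standing in for encode(image_a), encode(image_b).
emb_a, emb_b = torch.randn(768), torch.randn(768)
print(embedding_distance(emb_a, emb_b))
```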

To train a model whose embeddings are better aligned with human perception, the authors first set out to build a dataset of human similarity judgments.

Human-aligned dataset

To gather data, they used a two-alternative forced choice (2AFC) protocol (a scoring sketch follows the list below). The idea was simple:

  1. Show a participant three images [x, x1, x2], where x1 and x2 are two different distortions of x.
  2. Ask the participant to select whichever of x1 and x2 looks more similar to x.
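
Given any candidate distance function, its 2AFC score is simply the fraction of triplets on which it agrees with the human vote. A minimal sketch (the function and argument names are mine, not the paper's):

```python
from typing import Callable, Sequence, Tuple

def two_afc_score(
    triplets: Sequence[Tuple[object, object, object]],  # (x, x1, x2)
    human_votes: Sequence[int],                         # 0 -> x1 closer, 1 -> x2 closer
    distance: Callable[[object, object], float],
) -> float:
    correct = 0
    for (x, x1, x2), vote in zip(triplets, human_votes):
        # The metric "votes" for whichever distortion it places closer to x.
        model_vote = 0 if distance(x, x1) < distance(x, x2) else 1
        correct += int(model_vote == vote)
    return correct / len(triplets)
```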

The final dataset contains roughly 20k such triplets, each labeled with about 7 unanimous human votes on average.

Training a model

They fine-tuned existing image encoders, first by adding an MLP head and then with LoRA; the latter outperformed the former. Ensembling some of the top-performing models gave them the best human-alignment (2AFC) score.
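
The post doesn't include the training code, but a hedged sketch of what a LoRA setup could look like, using Hugging Face's peft library on a timm ViT backbone (the rank, alpha, dropout, and target module names here are assumptions, not the paper's settings):

```python
import timm
from peft import LoraConfig, get_peft_model

# Pretrained ViT backbone (num_classes=0 returns features instead of logits).
backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)

# Inject low-rank adapters into the attention projections; only these small
# adapter matrices are trained, while the original weights stay frozen.
config = LoraConfig(r=16, lora_alpha=16, lora_dropout=0.1, target_modules=["qkv"])
model = get_peft_model(backbone, config)
model.print_trainable_parameters()  # a tiny fraction of the full backbone
```

Because only the adapter weights are updated, this keeps training cheap while still letting the encoder's features shift toward the human 2AFC judgments.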