MindEye: fMRI-to-Image with Contrastive Learning

What are they doing?

The authors found a way to retrieve and reconstruct images from brain signals (fMRI). The two aspects, i.e. retrieval and reconstruction, are handled by two different modules.

  1. For retrieval, they use contrastive learning.
  2. For reconstruction, they use diffusion.

Note that retrieval means mapping fMRI signals to a useful embedding space, like that of CLIP image embeddings, which can then be used for similarity-search type tasks.
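Below is a minimal sketch of what retrieval looks like once fMRI signals live in CLIP space: embed the brain signal, then rank a gallery of CLIP image embeddings by cosine similarity. The `brain_to_clip` model, the 768-dim CLIP space, and the function shape are my own illustrative assumptions, not the paper's actual interface.

```python
import torch
import torch.nn.functional as F

def retrieve(brain_to_clip, fmri, image_embeddings, k=5):
    """Return indices of the k gallery images whose CLIP embeddings are
    most similar to the embedding predicted from an fMRI sample.

    fmri: (1, n_voxels); image_embeddings: (N, 768). Hypothetical shapes.
    """
    query = F.normalize(brain_to_clip(fmri), dim=-1)   # (1, 768)
    gallery = F.normalize(image_embeddings, dim=-1)    # (N, 768)
    sims = query @ gallery.T                           # cosine similarities
    return sims.topk(k, dim=-1).indices
```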

Key insights:

  1. Using really big linear layers does not lead to overfitting; instead it directly improves the model’s performance. (I believe this is for the retrieval part; see the sketch after this list.)
  2. A bidirectional version of mixup contrastive data augmentation improves model performance in a low-sample setting (like ours).
  3. CLIP embeddings are not good at holding perceptual information (like locations of objects, colors, etc.). This is because CLIP image encoders were trained to maximise similarity with text.
  4. To solve the problem that arises in (3), the authors also build a low-level pipeline which preserves low-level perceptual information, which is useful for reconstruction.
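Here is a sketch of the "really big linear layer" idea: project tens of thousands of voxels straight into a wide hidden space with a single linear layer, then refine with a small MLP head. The sizes (15724 voxels, 4096 hidden units, 768-dim CLIP target) are illustrative assumptions, not the paper's exact architecture.

```python
import torch.nn as nn

class BrainToCLIP(nn.Module):
    def __init__(self, n_voxels=15724, hidden=4096, clip_dim=768):
        super().__init__()
        self.proj = nn.Linear(n_voxels, hidden)   # the huge linear layer
        self.head = nn.Sequential(
            nn.LayerNorm(hidden),
            nn.GELU(),
            nn.Linear(hidden, clip_dim),
        )

    def forward(self, voxels):                    # voxels: (batch, n_voxels)
        return self.head(self.proj(voxels))       # (batch, clip_dim)
```

Note that `n_voxels * hidden` alone is ~64M parameters here; the paper's point is that a projection this wide still trains well on fMRI data.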

Other important things:

  1. All of their models are trained on a single A100 GPU with a batch size of 32.
  2. They used the Natural Scenes Dataset (NSD).

Contrastive training

  1. They use CLIP loss as their training objective (sketched below).
  2. They use additional techniques like MixCo, a mixup-based contrastive augmentation (explained in the sections that follow).
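A minimal sketch of a CLIP-style (InfoNCE) contrastive loss between predicted brain embeddings and the matching CLIP image embeddings: matched pairs sit on the diagonal of the similarity matrix, and the loss is averaged over both contrast directions. The temperature of 0.07 is a common default, assumed here for illustration.

```python
import torch
import torch.nn.functional as F

def clip_loss(brain_emb, image_emb, temperature=0.07):
    """brain_emb, image_emb: (batch, dim); row i of each is a matched pair."""
    brain_emb = F.normalize(brain_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = brain_emb @ image_emb.T / temperature     # (batch, batch)
    targets = torch.arange(len(logits), device=logits.device)
    # Symmetric loss: contrast brain→image and image→brain.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2
```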

Things that I do not understand well:

What is Mixup?

Traditional deep learning trains a model F(x) to return an output y that is as close to the label as possible.

So if there’s a dataset with 2 samples, say [(x1, l1), (x2, l2)], then we train our model on only these 2 datapoints and nothing else.

But in Mixup, we train not only on the 2 dataset samples that we have, but also on datapoints which lie “in between” x1 and x2. Concretely, we train on the mixed input λ·x1 + (1 − λ)·x2 with the correspondingly interpolated label λ·l1 + (1 − λ)·l2, where λ is sampled from a Beta distribution.
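A minimal sketch of mixup in PyTorch, pairing each sample in a batch with a random partner; alpha = 0.2 is a common choice from the original mixup paper, assumed here for illustration.

```python
import torch

def mixup(x, y, alpha=0.2):
    """x: (batch, ...) inputs; y: (batch, n_classes) one-hot labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))          # random pairing of samples
    x_mixed = lam * x + (1 - lam) * x[perm]   # interpolate inputs...
    y_mixed = lam * y + (1 - lam) * y[perm]   # ...and labels the same way
    return x_mixed, y_mixed
```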

What is MixCo?

In contrastive learning, we train a model to embed differently augmented versions of the same image (a positive pair) close together, while pushing apart embeddings that come from different images (negative pairs).

MixCo extends the idea of mixup to the domain of contrastive training objectives.

Mixing two images generates “semi-positive” pairs: the two source images are negative pairs of each other, but the mixed image is partially similar to both, so the contrastive targets are softened in proportion to the mixing coefficient.
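Here is how that might look in MindEye's setting, where fMRI inputs are mixed rather than images: embed the mixture, then use soft contrastive targets that put weight lam on each sample's own image and 1 − lam on its mixing partner's image. This is my reading of the idea, not the paper's exact implementation; the alpha and temperature values are assumptions.

```python
import torch
import torch.nn.functional as F

def mixco_loss(model, voxels, image_emb, alpha=0.15, temperature=0.07):
    """voxels: (batch, n_voxels) fMRI inputs; image_emb: (batch, dim)
    matching CLIP image embeddings."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(voxels.size(0))
    mixed = lam * voxels + (1 - lam) * voxels[perm]    # mix fMRI inputs

    brain_emb = F.normalize(model(mixed), dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = brain_emb @ image_emb.T / temperature     # (batch, batch)

    # Soft targets: lam on each sample's own image, 1 - lam on the image
    # of the sample it was mixed with (each row sums to 1).
    batch = voxels.size(0)
    targets = torch.zeros_like(logits)
    targets[torch.arange(batch), torch.arange(batch)] = lam
    targets[torch.arange(batch), perm] += 1 - lam
    return F.cross_entropy(logits, targets)
```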

  1. Great video on Mixup: https://www.youtube.com/watch?v=hGAKHKqmXdY
  2. Yannic’s video on Mixup: https://www.youtube.com/watch?v=a-VQfQqIMrE
  3. Blog post on InfoNCE: https://jxmo.io/posts/nce