The authors found a way to retrieve and reconstruct images from brain signals (fMRI). The two aspects, retrieval and reconstruction, are handled by two different modules.
Note that retrieval means mapping fMRI signals to a useful embedding space, like that of CLIP image embeddings, which can then be used for similarity-search-type tasks.
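Roughly, retrieval could look like the sketch below: map a brain scan into CLIP space, then find the nearest candidate image by cosine similarity. The names, shapes, and embedding dimension are my own assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

# Hypothetical tensors: `brain_emb` is one fMRI scan already mapped into CLIP
# space; `image_embs` is a bank of CLIP embeddings for candidate images.
brain_emb = torch.randn(512)         # assumed CLIP embedding dim of 512
image_embs = torch.randn(1000, 512)  # 1000 candidate images

# Cosine similarity = dot product of L2-normalized vectors.
brain_emb = F.normalize(brain_emb, dim=0)
image_embs = F.normalize(image_embs, dim=1)
scores = image_embs @ brain_emb      # (1000,) similarity scores

# Retrieval: the candidate whose CLIP embedding is closest to the brain embedding.
best_match = scores.argmax().item()
```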
Key insights:
Other important things:
- All of their models are trained on a single A100 GPU with a batch size of 32.
- They used the Natural Scenes Dataset.
Contrastive training
Things that I do not understand well:
Traditional deep learning trains a model F(x) to return a prediction y that is as close to the label as possible.
So if there’s a dataset with 2 samples, say [(x1, l1), (x2, l2)], then we train our model on only these 2 datapoints and nothing else.
But in mixup, we train not only on the 2 dataset samples that we have, but also on N datapoints that lie “in between” x1 and x2, interpolating the labels accordingly.
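A minimal sketch of mixup (the Beta-sampled mixing weight follows the original mixup paper; the function and variable names are mine):

```python
import torch

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Interpolate two samples and their labels with a random weight.
    Labels must be continuous (e.g. one-hot vectors) to be interpolated."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x_mixed = lam * x1 + (1 - lam) * x2   # a point "in between" x1 and x2
    y_mixed = lam * y1 + (1 - lam) * y2   # labels interpolated the same way
    return x_mixed, y_mixed
```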
In contrastive learning, we train a model to produce similar embeddings for differently augmented versions of the same image (a positive pair), and dissimilar embeddings for versions that come from different images (a negative pair).
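A minimal sketch of an InfoNCE-style contrastive loss, where the positive pairs sit on the diagonal of the similarity matrix (the exact loss used in the paper may differ; names and temperature value are my assumptions):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """z1[i] and z2[i] are embeddings of two augmented views of the same
    image (positive pair); every other pairing in the batch is a negative."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature                       # (B, B) similarities
    targets = torch.arange(z1.size(0), device=z1.device)   # positives on diagonal
    # Symmetric loss: each view must pick out its partner among all others.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```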
MixCo extends the idea of mixup to the domain of contrastive training objectives.
We “mix” images to generate semi-positive pairs: since the two source images are originally a negative pair of each other, the mixed image is partially positive with respect to each of them.
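One way these semi-positive pairs could be handled is as soft targets in a cross-entropy over the similarity matrix, as sketched below. All names and the exact target construction are my assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def mixco_loss(z_mix, z_clean, lam, idx_shuffle, temperature=0.1):
    """Sketch of a MixCo-style objective.
      z_mix:       embeddings of mixed inputs lam*x + (1-lam)*x[idx_shuffle]
      z_clean:     embeddings of the original (unmixed) inputs
      lam:         (B,) mixing weights
      idx_shuffle: (B,) index of the second source image in each mix
    """
    z_mix = F.normalize(z_mix, dim=1)
    z_clean = F.normalize(z_clean, dim=1)
    logits = z_mix @ z_clean.T / temperature   # (B, B) similarities

    # Soft targets: each mixed image is lam-positive with its first source
    # and (1-lam)-positive with its second source (the semi-positive pairs).
    rows = torch.arange(len(lam), device=logits.device)
    targets = torch.zeros_like(logits)
    targets[rows, rows] = lam
    targets[rows, idx_shuffle] += 1 - lam

    log_probs = F.log_softmax(logits, dim=1)
    return -(targets * log_probs).sum(dim=1).mean()
```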