The core idea is to use the hidden representations inside a vision transformer to localize objects (“subjects”) within input images, much like a YOLO model but without any further training.
First, we assume that there’s at least one object to be found in the image.
It relies on selecting patches that are likely to belong to an object. We call these patches “seeds”.
What are these seeds?
We select a seed based on two assumptions:

* regions/patches within an object correlate more with each other than with background patches
* an individual object covers less area than the background
We also rely on the observation that patches in an object correlate positively with each other but negatively with background patches.
We first build a patch similarity matrix, which says how similar each patch is to every other patch. We then pick the first seed: the patch with the smallest number of positive correlations with the other patches. Since the object is assumed to cover less area than the background, such a patch is likely to sit inside an object rather than in the background.
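A minimal sketch of this step is below. It assumes `features` is a `(num_patches, dim)` array of per-patch features taken from a vision transformer (e.g. the keys of the last attention layer); the array name and how it is obtained are assumptions for illustration, not part of any specific library.

```python
import numpy as np

def select_seed(features: np.ndarray) -> tuple[np.ndarray, int]:
    """Return the patch similarity matrix and the index of the first seed."""
    # Normalize features so the dot product becomes cosine similarity.
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    # Patch similarity matrix: entry (i, j) is the similarity of patches i and j.
    sim = feats @ feats.T
    # For each patch, count how many other patches it correlates positively with.
    positive_counts = (sim > 0).sum(axis=1)
    # The first seed is the patch with the fewest positive correlations,
    # since an object is assumed to cover less area than the background.
    seed = int(np.argmin(positive_counts))
    return sim, seed
```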
The next step is to find further patches that are also likely to fall inside the object. We do this by relying on the fact that patches within an object tend to correlate positively with one another.
We keep expanding using this rule, collecting the patches that are positively correlated with the seed patch.
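A simple sketch of this expansion, under the same assumptions as above (the exact expansion rule may be more involved; this keeps only the patches positively correlated with the seed):

```python
import numpy as np

def expand_seed(sim: np.ndarray, seed: int) -> np.ndarray:
    """Return the indices of patches considered part of the object."""
    # Patches positively correlated with the seed are taken to belong
    # to the same object as the seed (the seed itself is included,
    # since its self-similarity is positive).
    return np.flatnonzero(sim[seed] > 0)
```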
Finally, we draw a bounding box around the final set of patches.
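One way to turn the selected patch indices into a pixel-space box is sketched below. `grid_w` (patches per row) and `patch_size` are assumptions about how the image was split into patches and are not taken from the original text.

```python
import numpy as np

def patches_to_box(patch_ids: np.ndarray, grid_w: int, patch_size: int):
    """Return (x_min, y_min, x_max, y_max) in pixels covering the selected patches."""
    # Recover each patch's (row, col) position in the patch grid.
    rows, cols = patch_ids // grid_w, patch_ids % grid_w
    # Scale grid coordinates back to pixel coordinates.
    x_min, x_max = cols.min() * patch_size, (cols.max() + 1) * patch_size
    y_min, y_max = rows.min() * patch_size, (rows.max() + 1) * patch_size
    return int(x_min), int(y_min), int(x_max), int(y_max)
```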
The implementation can be found here.