A Deeper Look at the DINOv3 ConvNeXt Features
Recently, Meta dropped the blockbuster people weren't sure would come: DINOv3. We've all seen the beautiful PCA reductions of the features it can produce, and we've all seen the amazing cosine similarities between different fruits. But maybe the most exciting part was the release of ConvNeXt student models. As I was reading the paper, I wondered what their feature maps look like, and whether they would be anything like the ViT features. So yeah.. let's have a look together!
PCA Reductions
All the following feature maps are produced with the ConvNeXt-Tiny model. Both the DINOv3 and the ImageNet versions come from timm.
First, some within-image PCA reductions, visualized the RGB way. The process is:
- Run the image through DINOv3 ConvNeXt-Tiny and grab the last feature map ("C5" in FPN terms, 1/32 stride)
- Flatten the feature map into an NxD matrix, i.e. a list of N feature-map pixels with D channels each
- Compute a PCA with 3 components, reducing D to 3
- Project the features onto these 3 principal components (PCs)
- Rescale with per-channel min/max normalization
- Display the three dimensions as RGB
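The steps above can be sketched in a few lines of NumPy. I'm using a random array as a stand-in for the C5 feature map here (in the real pipeline it would come from the backbone, e.g. via timm's `features_only=True`), and PCA via SVD instead of a library call:

```python
import numpy as np

# Stand-in for the C5 feature map of ConvNeXt-Tiny: (D, H, W) = (768, 7, 7)
# for a 224x224 input at stride 32. In the real pipeline this comes from the model.
rng = np.random.default_rng(0)
feats = rng.normal(size=(768, 7, 7))

D, H, W = feats.shape
flat = feats.reshape(D, -1).T                 # N x D, one row per feature-map pixel

# PCA via SVD on the centered matrix
centered = flat - flat.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
proj = centered @ Vt[:3].T                    # project onto the first 3 PCs -> N x 3

# per-channel min/max normalization, then reshape to an H x W RGB image
proj = (proj - proj.min(0)) / (proj.max(0) - proj.min(0) + 1e-8)
rgb = proj.reshape(H, W, 3)
```

From here, `rgb` can go straight into `plt.imshow`.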
Let's start with this guy

Within-Image Variance
Running it through ConvNeXt-Tiny-DINOv3 and ConvNeXt-Tiny-ImageNet1k gives us:
Okay.. actually crazy. For some reason I wasn't expecting the ConvNext versions to produce anything close to the ViTs. But here we go.. Holy crap.
Now, considering that a PCA projection is a linear operation, this effectively means that the features themselves are only a single 1x1 convolution away from being a segmentation mask. How incredible is that?
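To make the "one 1x1 conv away" claim concrete: projecting centered features onto PCA directions is exactly a 1x1 convolution whose weights are the components and whose bias absorbs the mean. A small NumPy check (with random stand-in features and random orthonormal directions instead of real PCs):

```python
import numpy as np

rng = np.random.default_rng(1)
feats = rng.normal(size=(768, 7, 7))                 # toy C5 feature map

# stand-ins for the PCA mean and 3 principal directions
mean = feats.reshape(768, -1).mean(axis=1)
pcs = np.linalg.qr(rng.normal(size=(768, 3)))[0].T   # 3 x 768, orthonormal rows

# "PCA as a 1x1 conv": weight = components, bias = -components @ mean
weight = pcs                                          # one output channel per PC
bias = -pcs @ mean

# applying the 1x1 conv is just a contraction over the channel axis
out_conv = np.tensordot(weight, feats, axes=(1, 0)) + bias[:, None, None]

# the usual "flatten, center, project" route gives the same numbers
flat = feats.reshape(768, -1).T                       # N x D
out_pca = ((flat - mean) @ pcs.T).T.reshape(3, 7, 7)

assert np.allclose(out_conv, out_pca)
```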
The ImageNet features seem quite messy, with lots of outlier feature-map pixels. I remember when it was common wisdom that deep network feature maps are not interpretable, and looking at the ImageNet feature map, that surely is the case.
But where we are now with DINO, that doesn't seem to hold anymore. The feature-map pixels are interpretable; maybe not the raw 768-dimensional features directly, but a simple linear transformation is enough to make them so.
Now instead of overlaying the PC projections, let's look at them individually.
Seeing them individually really reinforces the interpretability point. The first principal component is objectness, or rather "subject"ness. So much so that the authors of the DINOv2 paper actually used the first principal component to mask the images before computing across-image PCA maps, for example for their headliner Figure 1.
And luckily this seems to hold for both the ViT and the ConvNeXt distillation. The following principal components start to differ, but still refer to interpretable concepts. For the ConvNeXt distillation, they could be described as inside/outside or core/periphery, centerness, and a left/right basis function, almost like a Fourier decomposition.
Meanwhile the ImageNet principal components? No idea.. Every now and then there are hints of something spatial, but mostly it's just garbled. Which makes sense: if the network was trained to predict a global quantity (the class), why would it learn something spatial at all?
Looking At The Different Backbone Stages
Now, ViTs aggregate global information at every stage; there is nothing hierarchical about how they assemble features. But some applications do need hierarchical features, think FPNs.
Now: if a ConvNeXt student is taught by a ViT teacher, what will its intermediate feature maps look like? Surely they contain the same kinds of hierarchical info as an ImageNet-pretrained backbone. Right?
Right, it seems.. We get the classical edge-extraction features in the early layers, followed by more and more complex concepts. In the C4 stage of DINO, you can already see the beginnings of instance segmentation.
As for the ImageNet-pretrained ConvNeXt, we see a progressive loss of spatial information: we begin with edges, get to something that somewhat resembles segmentation, albeit noisy, and finally end in garble. The ViT, unsurprisingly, contains global info from the beginning (and at the same resolution throughout), but gets cleaner as we go deeper into the network.
To savour this a bit more, let's take a look at five random feature maps from either backbone, per layer.
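The "five random channels per stage" view is simple to reproduce. A sketch on stand-in arrays with ConvNeXt-Tiny's stage shapes for a 224px input (in practice the stages come from the backbone, e.g. timm with `features_only=True`):

```python
import numpy as np

rng = np.random.default_rng(2)
# stand-ins for the four ConvNeXt-Tiny stage outputs for a 224x224 input
stages = {
    "C2": rng.normal(size=(96, 56, 56)),
    "C3": rng.normal(size=(192, 28, 28)),
    "C4": rng.normal(size=(384, 14, 14)),
    "C5": rng.normal(size=(768, 7, 7)),
}

def sample_channel_maps(feats, k=5, rng=rng):
    """Pick k random channels and min/max normalize each to [0, 1] for display."""
    idx = rng.choice(feats.shape[0], size=k, replace=False)
    maps = feats[idx]
    lo = maps.min(axis=(1, 2), keepdims=True)
    hi = maps.max(axis=(1, 2), keepdims=True)
    return (maps - lo) / (hi - lo + 1e-8)

per_stage = {name: sample_channel_maps(f) for name, f in stages.items()}
```

Each entry in `per_stage` is then a stack of five displayable grayscale maps.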
Across-Image Variance
What we have seen so far has been within-image PCA. That is, we visualize the most prominent "things"/concepts as they relate to a single image. These principal components will be different for every image and the subject it portrays. But we know from the DINO paper that the features are general: they contain information that translates across animal species, for example. A foot is a foot, whether it belongs to a horse or a dog.
Let's first try to reproduce the dog images using the ViT version.
Okay.. so this is not exactly as expected. Across the animals, it's quite difficult to make out common features. Rather, we can tell ground from sky. Why?
The reason is that we haven't masked the subjects. Since a lot of the pixels come from the background, the principal components will reflect that, and we lose "contrast" within the animals. So let's try again. This time, we'll use the first within-image principal component to mask the subject, then collect all subject pixels, compute their principal components, and project the features onto those.
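The mask-then-shared-PCA procedure looks roughly like this (again on random stand-in feature maps; the `pc1 > 0` threshold and the sign of the first PC are judgment calls, not something the original papers pin down):

```python
import numpy as np

rng = np.random.default_rng(3)
images = [rng.normal(size=(768, 7, 7)) for _ in range(4)]   # toy feature maps

def first_pc(flat):
    """Projection of centered N x D features onto their first principal component."""
    centered = flat - flat.mean(0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[0]

fg_pixels = []
for feats in images:
    flat = feats.reshape(768, -1).T          # N x D
    mask = first_pc(flat) > 0                # threshold on "subject"ness (assumed sign)
    fg_pixels.append(flat[mask])

# one shared PCA over the foreground pixels of ALL images
all_fg = np.concatenate(fg_pixels, axis=0)
mean = all_fg.mean(0)
_, _, Vt = np.linalg.svd(all_fg - mean, full_matrices=False)
shared_pcs = Vt[:3]

# project each image's foreground pixels onto the shared components
projections = [(p - mean) @ shared_pcs.T for p in fg_pixels]
```

Because the components are shared, the same color now means the same concept across all images.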
And there we have it, that did work. The snout and head are purple-reddish, progressively turning yellow, then green along the animal. Finally, the legs are purple in all cases. Well, almost..
The pug doesn't share exactly the same features. Why? If we take a closer look, we notice that the face and legs are fine, but the torso is not. And if we look at the source image, we can see why: the dog is wearing a sweatshirt. So the features represent not only the position along the body's anatomy, but also whether we are actually seeing the body and not some covering. Great!
But so far, all of this was the ViT. The reason I'm writing this blog post is because I wanted to look at the ConvNeXt distillations... So now let's do the same thing again.
Hm.. So yeah, this is ConvNeXt-Tiny. And we can already tell that the masks are worse, meaning the initial within-image principal component was not as clean (or the threshold suboptimal). But even beyond that, the features are definitely much less clear. It's a bit difficult to tell from the RGB image, so let's turn to the individual PCA projections again.
That makes it a bit clearer: the first PC seems to represent something like "eye", and the third something like ears or snout. A bit more muddied than the ViT, that is for sure. But still.. if this is what we get from the backbone, with barely any processing, that is a win in my book.
Now let's turn to look at something different.
Feature Statistics
An important but easily overlooked aspect: what do the feature distributions look like? I.e. what magnitude do DINO features have vs. ImageNet features? Are they centered the same? I used to think that because of LayerNorm, the activations would all have to be roughly the same across model weights, but boy was I wrong..
If we aggregate all features across all channels, we get the following picture:
The distributions are completely different. In the raw histograms they both look somewhat the same, i.e. zero-centered, but the automatic x-axis scaling looks weird. If we log the y-axis, we can see why.
The ImageNet features actually contain huge numbers of outliers, in a very wild multi-modal distribution, while the DINO features are much more tightly centered around zero. We can also see that both distributions become more "usual looking" the further we progress through the network.
We can see this a bit more clearly if we zoom into the mass of the distribution, by cropping the data at the 1st and 99th percentiles.
Here we are looking at the central 98% of the features, and the same picture emerges: the ImageNet features are much, much wider, i.e. have larger variance, at least in the C3 features. But as we progress through the network, both networks' distributions become more similar, approaching something like a Laplace distribution.
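The percentile-cropping trick is a one-liner. Here it is on synthetic stand-ins, a heavy-tailed Student-t sample playing the "ImageNet-like" role and a Laplace sample playing the "DINO-like" role (these are illustrations, not the actual measured distributions):

```python
import numpy as np

rng = np.random.default_rng(4)
# toy stand-ins for the two activation distributions
imagenet_like = rng.standard_t(df=2, size=100_000) * 5   # heavy-tailed, outlier-rich
dino_like = rng.laplace(scale=1.0, size=100_000)         # tighter around zero

def clipped(x, lo_pct=1, hi_pct=99):
    """Keep the central 98% of the mass, as in the percentile-cropped plots."""
    lo, hi = np.percentile(x, [lo_pct, hi_pct])
    return x[(x >= lo) & (x <= hi)]

for name, x in [("imagenet-like", imagenet_like), ("dino-like", dino_like)]:
    core = clipped(x)
    counts, edges = np.histogram(core, bins=100)
    # with matplotlib, plt.yscale("log") on the full data is what reveals the tails;
    # here we just report the spread of the central mass
    print(name, "std of central 98%:", round(float(core.std()), 2))
```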
To complete the picture, let's look at a randomly picked channel and its statistics.
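The per-channel statistics themselves are cheap to compute. A sketch, with a Gaussian stand-in for a C3-sized feature map and a hypothetical channel index; excess kurtosis is the number to watch for heavy tails:

```python
import numpy as np

rng = np.random.default_rng(5)
feats = rng.normal(size=(768, 28, 28))            # toy C3-like feature map

def channel_stats(feats, c):
    """Mean, std, and excess kurtosis of a single channel's activations."""
    x = feats[c].ravel()
    mu, sigma = x.mean(), x.std()
    kurt = ((x - mu) ** 4).mean() / sigma**4 - 3  # 0 for a Gaussian, >0 = heavy-tailed
    return mu, sigma, kurt

mu, sigma, kurt = channel_stats(feats, c=123)     # channel 123 is arbitrary
```

Note the sample-size caveat from above: at C5 a channel only has 7x7 = 49 values per image, so these estimates get noisy fast.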
One interesting observation, which will come as no surprise to stats-literate people: the distributions start out similar when there are still many samples (the high-res stages C3 and C4), but become quite different as the number of samples drops. For this particular channel, the DINO features are actually more heavy-tailed than the ImageNet features. But such are statistics; individual channels will always behave differently from the aggregate.
Résumé
So there we have it. We found out that the DINOv3 ConvNeXt features are great, and absolutely of a different caliber than their ImageNet counterparts. We also saw that they seem to be a bit less clean than the ViT features, both spatially and semantically. And finally, we saw that their feature distributions are a bit tighter than ImageNet's, at least in the early layers.
As a send-off, here are a few more pretty images (this time from the ViT and at high resolution, because they are a bit crisper).
Cheers,
max
