Sparse Autoencoder
What I thought before reading anything
Autoencoding is about having a neural network learn an alternate representation of the data: it searches for a set of weights that can map the data into that alternate representation and also reconstruct the original from it.
Usually the alternate version is smaller than the original, so autoencoding is about compression.
But, in the case of network interpretability, the alternate version is larger than the original.
By making the alternate representation larger, Anthropic hypothesizes, the polysemanticity of the original activations gets disentangled.
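To make that concrete for myself, here is a minimal sketch in PyTorch of what I understand the setup to be. The dimensions, the L1 weight, and the class name are my own illustrative choices, not Anthropic's actual training configuration: the encoder maps an activation vector into a larger feature vector, the decoder has to reconstruct the original activation from it, and an L1 penalty keeps the feature vector sparse.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    # d_feat >> d_act: the "alternate representation" is larger than the input
    def __init__(self, d_act: int = 512, d_feat: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_act, d_feat)
        self.decoder = nn.Linear(d_feat, d_act)

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))  # overcomplete, non-negative features
        recon = self.decoder(feats)             # reconstruction of the original activations
        return recon, feats

acts = torch.randn(8, 512)        # stand-in for activations pulled from a model
sae = SparseAutoencoder()
recon, feats = sae(acts)

l1_weight = 1e-3                  # made-up value, just to show the objective's shape
loss = F.mse_loss(recon, acts) + l1_weight * feats.abs().mean()
loss.backward()
```

The L1 term is what pushes most features to zero for any given activation; that is the "sparse" part, which is supposed to encourage each feature to correspond to a single concept rather than a polysemantic mixture.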
Now what I don’t get is:
- Why does this kind of autoencoding do anything?
Checking Cunningham et al., "Sparse Autoencoders Find Highly Interpretable Features in Language Models"
I agree this hints at the inherent interpretability of that feature, but this is not clean.
So what would I want out of interpretability? …