SAE Feature Browser

Sparse Autoencoder: features discovered from model activations

Mechanistic Interpretability via Dictionary Learning

The features below were extracted by training a Sparse Autoencoder (SAE) on the model's internal activation vectors. Each feature is a learned direction in activation space that is interpreted as a concept the model uses internally. Click any feature to see which inputs activate it most strongly.
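To make "a learned direction in activation space" and "which inputs activate it most strongly" concrete, here is a minimal sketch in plain Python. It assumes a simple ReLU encoder and a linear decoder, which is a common SAE layout but not necessarily the one behind this browser. The names `encode`, `decode`, and `top_activating`, the weights, and the toy inputs are all hypothetical, chosen only for illustration.

```python
import heapq

def encode(x, W_enc, b_enc):
    """Feature activations f = ReLU(W_enc @ x + b_enc) (assumed encoder form)."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(W_enc, b_enc)]
    return [max(0.0, p) for p in pre]

def decode(f, W_dec, b_dec):
    """Reconstruction x_hat = b_dec + sum_j f[j] * W_dec[j].

    Row W_dec[j] is the direction in activation space for feature j.
    """
    x_hat = list(b_dec)
    for fj, direction in zip(f, W_dec):
        for i, d in enumerate(direction):
            x_hat[i] += fj * d
    return x_hat

def top_activating(inputs, feature, W_enc, b_enc, k=3):
    """Return the k (activation, label) pairs with the largest activation
    of the given feature, strongest first."""
    scored = [(encode(x, W_enc, b_enc)[feature], label) for label, x in inputs]
    return heapq.nlargest(k, scored)

# Hypothetical toy setup: 2-dimensional activations, 3 features.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # one direction per feature
b_dec = [0.0, 0.0]

# (label, activation vector) pairs standing in for real model inputs.
inputs = [("a", [2.0, 0.0]), ("b", [0.5, 1.0]), ("c", [-1.0, 3.0])]

top = top_activating(inputs, feature=0, W_enc=W_enc, b_enc=b_enc, k=2)
recon = decode(encode([-1.0, 3.0], W_enc, b_enc), W_dec, b_dec)
print(top)
print(recon)
```

In a real SAE the weights are learned by minimizing reconstruction error plus a sparsity penalty on the activations; this sketch skips training and only shows how a trained dictionary would be queried.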