
Measuring and Guiding Monosemanticity

TU Darmstadt, Lab1141, Aleph Alpha Research, hessian.AI, DFKI, CERTAIN
Keywords: Interpretability · Monosemanticity · Sparse Autoencoders · LLMs · Control & Steering
[Figure: G-SAE schematic]

Warning:

This paper contains explicit language, discussions of (self-)harm, and other content that some readers may find disturbing.

Abstract

There is growing interest in leveraging mechanistic interpretability and controllability to better understand and influence the internal dynamics of large language models (LLMs). However, current methods face fundamental challenges in reliably localizing and manipulating feature representations. Sparse Autoencoders (SAEs) have recently emerged as a promising direction for feature extraction at scale, yet they, too, are limited by incomplete feature isolation and unreliable monosemanticity. To systematically quantify these limitations, we introduce the Feature Monosemanticity Score (FMS), a novel metric that quantifies feature monosemanticity in latent representations. Building on these insights, we propose Guided Sparse Autoencoders (G-SAE), a method that conditions latent representations on labeled concepts during training. We demonstrate that reliable localization and disentanglement of target concepts within the latent space improve interpretability, behavior detection, and control. Specifically, our evaluations on toxicity detection, writing style identification, and privacy attribute recognition show that G-SAE not only enhances monosemanticity but also enables more effective and fine-grained steering with less quality degradation. Our findings provide actionable guidelines for measuring and advancing mechanistic interpretability and control of LLMs.

Feature Monosemanticity Score (FMS)

Goal. Quantify whether a single latent unit cleanly corresponds to a concept.

Components.

  • Feature capacity — accuracy of the best single latent for the concept.
  • Local disentanglement — remove that latent and re-evaluate; for a monosemantic concept, accuracy should drop toward chance.
  • Global disentanglement — track marginal gains when adding more latents; a truly monosemantic concept should not need backup features.

Procedure (decision-tree based).

  1. Train a shallow decision tree on SAE latents to localize the most informative latent (root node).
  2. Record accuracy as you include top-k features along the tree path; compute marginal gains (global).
  3. Retrain while excluding the top feature(s) to measure the drop (local).

Local disentanglement.

\[ \mathrm{FMS}_{\text{local}@p} \;=\; 2 \times \big(\,\mathrm{accs}_{0} - \mathrm{accs}_{p}\,\big) \]

Global disentanglement.

\[ A(n) \;=\; \sum_{i=1}^{n}\big(\mathrm{accs\_cum}_{i} - \mathrm{accs}_{0}\big), \qquad \mathrm{FMS}_{\text{global}} \;=\; 1 - \frac{A(n)}{n} \]

Overall score.

\[ \mathrm{FMS}@p \;=\; \frac{1}{|C|}\sum_{i=1}^{|C|} \mathrm{accs}^{c_i}_{0}\;\times\; \frac{\mathrm{FMS}^{c_i}_{\text{local}@p} + \mathrm{FMS}^{c_i}_{\text{global}}}{2} \]
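For intuition, plug hypothetical numbers into these definitions for a single binary concept (chance level 0.5): with \(\mathrm{accs}_0 = 0.95\) and \(\mathrm{accs}_p = 0.55\) after removing the root latent, \(\mathrm{FMS}_{\text{local}@1} = 2 \times (0.95 - 0.55) = 0.8\). If \(\mathrm{accs\_cum} = (0.95, 0.96, 0.96)\), then \(A(3) = 0.02\) and \(\mathrm{FMS}_{\text{global}} = 1 - 0.02/3 \approx 0.99\), so \(\mathrm{FMS}@1 \approx 0.95 \times (0.8 + 0.99)/2 \approx 0.85\): a high-capacity, well-isolated feature.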

Features are ranked by a decision tree trained with the Gini criterion; the root gives \(\mathrm{accs}_0\), the tree path gives \(\mathrm{accs\_cum}\), and iterative retraining with the root feature(s) removed gives \(\mathrm{accs}_p\) for the local score.

Guided Sparse Autoencoders (G‑SAE)

Idea. Reserve a small set of latent indices for labeled concepts and condition them during training so each index becomes monosemantic by design.

  • Encoder activations: Sigmoid(Top‑K) to obtain sparse, interpretable [0,1] latents.
  • Conditioning loss: Binary cross-entropy on reserved indices; if concept \(c\) is present, drive latent \(f_{j(c)}\) toward 1, else toward 0.
  • Detection: inspect index \(j(c)\) directly at inference.
  • Steering: use the decoder column \(D_{\cdot,i}\) as a steering vector; modify the residual stream as \(\hat{\mathbf x} = \mathbf x + \alpha \sum\nolimits_{i=0}^{c} \beta_i\, \gamma_i\, D_{\cdot,i}\) with steering strength \(\alpha\), normalization factors \(\beta_i\), and balancing terms \(\gamma_i\) (detailed below).

Architecture.

\[ \operatorname{SAE}(x) = D(\sigma(E(x))),\quad E(x) = W_{\text{enc}}x + b_{\text{enc}} = h,\quad D(f) = W_{\text{dec}}f + b_{\text{dec}} = \hat x,\quad \sigma(h) = \mathrm{Sigmoid}(\mathrm{TopK}(h)) = f \]
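A minimal PyTorch sketch of this forward pass (module and parameter names are ours, and we assume the sigmoid is applied only to the k retained activations while all other latents stay at zero):

import torch
import torch.nn as nn

class GSAE(nn.Module):
    # Sketch of the architecture above, not the reference implementation.
    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)  # W_enc, b_enc
        self.dec = nn.Linear(d_latent, d_model)  # W_dec, b_dec
        self.k = k

    def forward(self, x):
        h = self.enc(x)                       # E(x) = W_enc x + b_enc
        topk = torch.topk(h, self.k, dim=-1)  # keep the k largest pre-activations
        f = torch.zeros_like(h)               # non-selected latents stay 0
        f.scatter_(-1, topk.indices, torch.sigmoid(topk.values))
        x_hat = self.dec(f)                   # D(f) = W_dec f + b_dec
        return x_hat, f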

Losses. Normalized MSE reconstruction and BCE conditioning on a reserved block \( f[0{:}c]=(f_0,\dots,f_c) \):

\[ \mathcal{L}_r = \frac{\lVert \hat x - x \rVert^2}{\lVert x \rVert^2}, \qquad \mathcal{L}_c = \mathrm{BCE}(f[0{:}c], y) = -\frac{1}{c+1}\sum_{i=0}^{c}\big(y_i \log f_i + (1-y_i)\log(1-f_i)\big), \qquad \mathcal{L}_{\text{total}} = \mathcal{L}_r + \mathcal{L}_c \]
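A matching loss sketch under the same assumptions (the clamp guarding \(\log 0\) when a reserved latent is zeroed by TopK is our addition):

import torch
import torch.nn.functional as F

def gsae_loss(x, x_hat, f, y, c):
    # L_r: normalized MSE reconstruction loss.
    l_r = (x_hat - x).pow(2).sum() / x.pow(2).sum()
    # L_c: BCE conditioning on the reserved block f[:, :c+1];
    # the clamp is our guard against log(0) after TopK zeroing.
    f_c = f[:, : c + 1].clamp(1e-6, 1 - 1e-6)
    l_c = F.binary_cross_entropy(f_c, y.float())
    return l_r + l_c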

Detection & Steering. For concept \(i\), the decoder column \(D_{\cdot,i}\in\mathbb{R}^d\) is the steering direction. Normalize and combine with

\[ \beta_i = \frac{\lVert x \rVert_2}{\lVert D_{\cdot,i} \rVert_2}, \qquad \gamma_i \in \{1,\, f_i,\, 1-f_i\}, \qquad \hat x = x + \alpha \sum_{i=0}^{c} \big( \beta_i\, \gamma_i\, D_{\cdot,i} \big) \]
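A sketch of this steering rule (argument names and the gamma_mode switch are ours; the modes select \(\gamma_i \in \{1,\, f_i,\, 1-f_i\}\)):

import torch

def steer(x, f, dec_weight, alpha, c, gamma_mode="const"):
    # x: residual-stream vector (d,); dec_weight: (d, n_latents), so that
    # dec_weight[:, i] is the decoder column D_{.,i}; f: G-SAE latents.
    x_hat = x.clone()
    for i in range(c + 1):
        d_i = dec_weight[:, i]
        beta = x.norm() / d_i.norm()  # beta_i: rescale to the residual norm
        gamma = {"const": 1.0,        # gamma_i = 1: fixed contribution
                 "cond": f[i],        # gamma_i = f_i: scale by concept presence
                 "anti": 1.0 - f[i]}[gamma_mode]  # gamma_i = 1 - f_i
        x_hat = x_hat + alpha * beta * gamma * d_i
    return x_hat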

Results

Monosemanticity (FMS)

Task             Vanilla SAE    G‑SAE (Ours)
Average FMS@1    0.27           0.52
Privacy          0.28           0.62
Shakespeare      0.28           0.57
Toxicity         0.26           0.37

Steering Performance (SuccessRate)

Dataset                    Vanilla SAE    G‑SAE (Ours)
Toxicity                   0.95           0.98
Shakespeare                0.64           0.72
Mixed (T & S)              0.80           0.82
Privacy (multi‑concept)    0.47           0.53

LLM-as-a-judge evaluations indicate no notable degradation in grammar or coherence of the steered outputs.

Downloads & Examples

Minimal working FMS example


from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# ==== MONOSEMANTICITY WITH TREE ====
# (Original prototype function)
def monosemmetric_tree(X_train, X_test, y_train, y_test):
    # 1. Feature capacity: accuracy of the best single latent,
    #    identified as the root of a depth-1 decision tree.
    tree = DecisionTreeClassifier(max_depth=1)
    tree.fit(X_train, y_train)
    root_feature = tree.tree_.feature[0]
    accs_0 = accuracy_score(y_test, tree.predict(X_test))

    # 2. Local disentanglement: remove the root feature and retrain;
    #    a monosemantic concept should drop toward chance (0.5).
    X_train_local = np.delete(X_train, root_feature, axis=1)
    X_test_local = np.delete(X_test, root_feature, axis=1)
    tree_local = DecisionTreeClassifier(max_depth=1)
    tree_local.fit(X_train_local, y_train)
    accs_p = accuracy_score(y_test, tree_local.predict(X_test_local))
    mono_local = np.clip(2 * (accs_0 - accs_p), 0, 1)

    # 3. Global disentanglement: growing the tree depth approximates
    #    adding the top-k features along the tree path; marginal gains
    #    over accs_0 should stay small for a monosemantic concept.
    accs_cum = []
    for d in range(1, X_train.shape[1] + 1):
        tree = DecisionTreeClassifier(max_depth=d)
        tree.fit(X_train, y_train)
        accs_cum.append(accuracy_score(y_test, tree.predict(X_test)))
        if accs_cum[-1] >= 1 - 1e-3:  # early stopping
            break

    A_n = sum(acc - accs_0 for acc in accs_cum)
    mono_global = np.clip(1 - A_n / len(accs_cum), 0, 1)

    # 4. Final score: capacity-weighted mean of local and global terms.
    mono_score = accs_0 * (mono_local + mono_global) / 2
    return {
        "accs_0": accs_0,
        "local": mono_local,
        "global": mono_global,
        "monosemmetric": mono_score,
    }
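A quick sanity check on synthetic latents, where feature 0 alone carries the concept and the remaining features are noise (the data and split are ours):

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16))
y = (X[:, 0] > 0).astype(int)  # concept determined by feature 0 alone
X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]
print(monosemmetric_tree(X_train, X_test, y_train, y_test))
# Expect accs_0 near 1.0 and local/global scores near 1: a monosemantic latent.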

BibTeX

@inproceedings{harle2025monosemanticity,
  title     = {Measuring and Guiding Monosemanticity},
  author    = {Ruben H{\"a}rle and Felix Friedrich and Manuel Brack and Stephan W{\"a}ldchen and Bj{\"o}rn Deiseroth and Patrick Schramowski and Kristian Kersting},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  note      = {Spotlight}
}