Topological Integration in Attention-Based Neural Networks

Cluster Collapse, Counterexamples, and the Role of Direct Interaction

Adam — Revelry Inc. Pre-print draft

Abstract

We use persistent homology to measure the topological structure of hidden representations across layers in three architecturally distinct neural networks: a 4B-parameter dense transformer (Qwen3-4B), a 328M-parameter recursive transformer (NanoChat), and a 370M-parameter state-space model (Mamba). While prior work has characterized the intrinsic dimensionality (ID) profile across layers, persistent homology reveals a qualitatively distinct phenomenon: in attention-based architectures, independent representational clusters collapse to a single connected component at intermediate layers (persistent b0: 519→1 in Qwen3-4B, 517→1 in NanoChat), and in the dense transformer they re-differentiate at the output. This cluster collapse is absent in the state-space model (Mamba b0: 571→987, no collapse) and in untrained transformers (b0: 503→968, clusters proliferate; confirmed across 8 random seeds). A sensitivity analysis across 36 parameter combinations and 100 bootstrap iterations supports the robustness of these findings. We propose that topological integration requires two conditions: (1) an architectural mechanism for direct interaction between representations (attention), and (2) gradient-based optimization to learn the integrated configuration.

1. Introduction

The geometric properties of neural network representations have been studied through intrinsic dimensionality (ID) estimation, revealing a characteristic "hunchback" profile across layers — ID increases in early layers and decreases toward the output [Ansuini et al., 2019]. This pattern has been confirmed in large transformer models [Valeriani et al., 2023] and connected to the Information Bottleneck framework [Tishby et al., 2000], though the relationship between ID compression and generalization remains debated.

We extend this line of work by applying topological analysis — specifically, persistent homology — to neural network representations. Where ID measures the local dimensionality of the data manifold, persistent homology captures global structural properties: how many independent components exist (b0), how many loops or cycles are present (b1), and how these features persist across spatial scales.
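To make these quantities concrete, consider a toy example (ours, not drawn from the analysis below): points sampled from a noisy circle form one connected component containing one loop, so a persistent-homology library such as ripser.py should report b0 = 1 and b1 = 1 once short-lived noise features are discarded.

    import numpy as np
    from ripser import ripser

    # A noisy circle: one component (b0 = 1) containing one loop (b1 = 1).
    rng = np.random.default_rng(0)
    theta = rng.uniform(0, 2 * np.pi, 200)
    X = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0, 0.05, (200, 2))

    dgms = ripser(X, maxdim=1)["dgms"]
    # dgms[0] (H0): one bar dying at infinity, i.e. a single component;
    # dgms[1] (H1): one long-lived bar for the loop, plus short-lived noise.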

Our central finding is a cluster collapse — the number of persistent connected components drops from hundreds to one at intermediate layers in attention-based architectures. This is a qualitatively different observation from the ID hunchback: while ID measures manifold complexity, cluster collapse measures manifold connectivity. A representation can have high intrinsic dimensionality (complex) while being topologically unified (one connected piece).

Critically, this collapse is absent in a state-space model (Mamba), which processes tokens sequentially through a compressed state rather than through direct pairwise interaction (attention). This architectural counterexample suggests that topological integration requires a mechanism for direct interaction between representations — an observation we connect, with appropriate caveats, to Integrated Information Theory [Tononi, 2004] and computational symbiogenesis [Agüera y Arcas et al., 2024].

3. Methods

3.1 Models

We analyze three architecturally distinct models:

  • Qwen3-4B: 36-layer dense transformer, dmodel=2560, 4B parameters, SiLU activation. All parameters active for every token.
  • NanoChat Recursive: 328M-parameter GPT-2 style model, dmodel=1280, ReLU² activation. Architecture: 2 prelude → 4 recurrent (looped 4×) → 2 coda layers.
  • Mamba-370m: 48-layer selective state-space model [Gu & Dao, 2023], dmodel=1024, 370M parameters. Processes tokens through selective state updates with no attention mechanism.

3.2 Activation Capture

For Qwen3-4B, we capture residual-stream activations at 7 layers (L3, L5, L6, L8, L16, L24, L35) across 3.78M tokens. For NanoChat, we capture 9 stages (embed, P0, P1, R0-R3, C0, C1) across 49,664 tokens from WikiText-2. For Mamba, we capture 7 layers (L0, L8, L16, L24, L32, L40, L47) across 40,960 tokens from WikiText-2.
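For readers who want to replicate the capture step, the sketch below shows one standard approach using PyTorch forward hooks. It assumes a HuggingFace-style decoder whose blocks are exposed under model.model.layers (as in Qwen-family checkpoints); it illustrates the technique and is not the authors' capture code.

    import torch

    def capture_residual_stream(model, input_ids, layer_indices):
        """Collect residual-stream activations at selected layers via hooks."""
        captured = {}

        def make_hook(idx):
            def hook(module, inputs, output):
                # Decoder blocks often return a tuple; hidden states come first.
                hidden = output[0] if isinstance(output, tuple) else output
                captured[idx] = hidden.detach().float().cpu()
            return hook

        handles = [model.model.layers[i].register_forward_hook(make_hook(i))
                   for i in layer_indices]
        try:
            with torch.no_grad():
                model(input_ids)
        finally:
            for h in handles:
                h.remove()
        return captured  # {layer_index: tensor of shape (batch, seq, d_model)}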

3.3 Two-NN Intrinsic Dimensionality

We estimate ID using the Two-NN estimator [Facco et al., 2017]: ID = n / Σi log(μi), where μi = d2(xi) / d1(xi) is the ratio of second- to first-nearest-neighbor distances. We subsample 10,000 tokens per layer and derive confidence intervals from 100 bootstrap iterations.
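A minimal NumPy/scikit-learn implementation of the estimator and the subsample-and-bootstrap protocol might look as follows (helper names are ours):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def two_nn_id(X):
        """Two-NN estimate (Facco et al., 2017): ID = n / sum_i log(mu_i)."""
        dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
        d1, d2 = dists[:, 1], dists[:, 2]  # column 0 is the point itself
        keep = d1 > 0                      # drop exact duplicate points
        mu = d2[keep] / d1[keep]
        return keep.sum() / np.log(mu).sum()

    def bootstrap_id(X, n_tokens=10_000, n_boot=100, seed=0):
        """Subsample n_tokens, repeat n_boot times, return mean and 95% CI."""
        rng = np.random.default_rng(seed)
        estimates = [two_nn_id(X[rng.choice(len(X), min(n_tokens, len(X)),
                                            replace=False)])
                     for _ in range(n_boot)]
        return np.mean(estimates), np.percentile(estimates, [2.5, 97.5])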

3.4 Persistent Homology

We compute persistent homology using Ripser.py [Tralie et al., 2018]. Default parameters: 1,000 landmark points (random selection), PCA projection to 50 dimensions, Vietoris-Rips filtration with Euclidean distance, maximum homology dimension 1. We report persistent Betti numbers, counting features whose lifetime exceeds 10% of the maximum lifetime at that dimension.
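The sketch below reproduces this pipeline with ripser.py and scikit-learn. Because the text specifies only the 10%-of-maximum-lifetime rule, two details are our assumptions: the reference lifetime is the maximum over finite lifetimes, and infinite bars (such as the essential H0 class) always count.

    import numpy as np
    from sklearn.decomposition import PCA
    from ripser import ripser

    def persistent_betti(X, n_landmarks=1000, pca_dim=50,
                         threshold=0.10, maxdim=1, seed=0):
        """Persistent Betti numbers under the Section 3.4 defaults."""
        rng = np.random.default_rng(seed)
        pts = X[rng.choice(len(X), min(n_landmarks, len(X)), replace=False)]
        if pca_dim is not None and pca_dim < pts.shape[1]:
            pts = PCA(n_components=pca_dim).fit_transform(pts)

        dgms = ripser(pts, maxdim=maxdim)["dgms"]  # Vietoris-Rips, Euclidean
        betti = []
        for dgm in dgms:
            life = dgm[:, 1] - dgm[:, 0]
            finite = life[np.isfinite(life)]
            cutoff = threshold * finite.max() if finite.size else 0.0
            # Count features outliving the cutoff; infinite bars always count.
            betti.append(int(np.sum(~np.isfinite(life) | (life > cutoff))))
        return betti  # [p_b0, p_b1]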

Sensitivity analysis: We sweep across 36 parameter combinations: landmarks ∈ {500, 1000, 2000}, PCA dimensions ∈ {30, 50, 100, None (raw 2560-dim)}, persistence threshold ∈ {5%, 10%, 20%}.

Bootstrap confidence intervals: We run 100 bootstrap iterations (different random landmark selections) at each layer to establish statistical reliability.
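Both robustness checks are thin drivers over the persistent_betti sketch above (assumed in scope here):

    from itertools import product
    import numpy as np

    # Parameter grid of Section 3.4: 3 x 4 x 3 = 36 combinations.
    LANDMARKS  = (500, 1000, 2000)
    PCA_DIMS   = (30, 50, 100, None)   # None = raw activation dimension
    THRESHOLDS = (0.05, 0.10, 0.20)

    def sensitivity_sweep(X):
        """p_b0 for each of the 36 parameter combinations."""
        return {(n, d, t): persistent_betti(X, n_landmarks=n, pca_dim=d,
                                            threshold=t)[0]
                for n, d, t in product(LANDMARKS, PCA_DIMS, THRESHOLDS)}

    def bootstrap_b0(X, n_boot=100):
        """Re-draw the landmark set n_boot times; report collapse fractions."""
        b0 = np.array([persistent_betti(X, seed=s)[0] for s in range(n_boot)])
        return {"collapsed (p_b0 <= 5)": float((b0 <= 5).mean()),
                "fragmented (p_b0 > 500)": float((b0 > 500).mean())}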

3.5 Controls

  1. Untrained model: Randomly initialized NanoChat (same architecture, no training), repeated across 8 different random seeds.
  2. Architectural counterexample: Mamba-370m (trained model, no attention mechanism).

4. Results

4.1 Intrinsic Dimensionality

All three models show distinct ID profiles:

Qwen3-4B (transformer): Hunchback peaking at L16 (ID=9.8), consistent with prior work.

NanoChat (recursive transformer): ID peaks during recursion (R1=12.1), then rises further at the coda (C1=13.9).

Mamba (SSM): ID increases monotonically from 2.6 (L0) to 8.4 (L32), then plateaus. No hunchback.

4.2 Cluster Collapse in Attention-Based Models

Persistent b0 (connected components) reveals a phase transition in attention-based architectures:

Model        Early          Mid          Late
Qwen3-4B     519 (L3)       1 (L16)      715 (L35)
NanoChat     517 (embed)    –            1 (C1)
Mamba        571 (L0)       987 (L32)    947 (L47)

In both transformers, representations collapse from hundreds of independent clusters to a single connected component. In Mamba, no collapse occurs — clusters increase through the network.

4.3 Loop Formation

Persistent b1 (1-cycles) peaks at or near the integration layer in transformers:

Model        Early          Peak         Late
Qwen3-4B     207 (L3)       342 (L16)    145 (L35)
NanoChat     425 (embed)    400 (R3)     156 (C1)
Mamba        112 (L0)       325 (L40)    264 (L47)

Mamba does develop loop structure, suggesting some internal organization emerges from training even without attention. However, this occurs without the accompanying cluster collapse — the loops form within a fragmented, multi-component manifold rather than a unified one.

4.4 Robustness

Sensitivity analysis (36 parameter combinations on Qwen3-4B L16):

  • Cluster collapse (p_b0 ≤ 5) in 30/36 combinations (83%)
  • Five of the 6 failures involve 500 landmarks (insufficient sampling)
  • With ≥1000 landmarks: collapse in 23/24 combinations (96%)
  • Without PCA (raw 2560 dimensions): p_b0 = 2, confirming the collapse is not a PCA projection artifact

Bootstrap confidence intervals (100 iterations, Qwen3-4B):

Layer    p_b0 ≤ 5    p_b0 > 500    Distribution
L3       0%          79%           Always fragmented
L6       57%         43%           Transitional
L16      87%         12%           Mostly collapsed
L24      90%         10%           Mostly collapsed
L35      0%          37%           Re-differentiated

We interpret the bimodal distribution (runs land at either ≤5 or >800, rarely in between) as landmark-sampling sensitivity rather than ambiguity in the underlying topology.

Multi-seed control (8 random seeds, untrained NanoChat):

  • Clusters proliferate in 8/8 seeds (embed ~297 → C1 ~486)
  • ID flat at ~3.3 across all seeds (std=0.1)
  • The untrained pattern is consistent across initializations

4.5 Emergent Bimodal Gating (Qwen3-4B)

In the course of this analysis we also identified an emergent processing gate at L3 in Qwen3-4B that routes 93% of tokens into a shallow path (Mode A) and 7% into a deep path (Mode B). Causal ablation confirms the gate is functional. Mode B tokens show consistently lower ID (~2.5 dimensions below Mode A) and at L6 form only 4 persistent clusters (PCA explained variance 99.9%), compared to 811 clusters for Mode A. This suggests the model develops specialized processing pathways within the integrated manifold.

5. Discussion

5.1 Two Conditions for Topological Integration

Our results suggest that cluster collapse requires two conditions:

1. Architectural mechanism for direct interaction. Attention provides pairwise interaction between all token representations, enabling them to merge into a unified manifold. Mamba's sequential state updates compress information through a bottleneck but do not enable direct interaction between representations. The absence of collapse in Mamba despite successful training suggests that optimization alone is insufficient.

2. Gradient-based optimization. Untrained transformers have the architectural capacity for integration (attention) but show the opposite pattern — clusters proliferate. Training finds the integrated configuration. This is confirmed across 8 random seeds.

Neither condition alone is sufficient. Both are necessary.

5.2 Structural Analogies

We note structural analogies to several theoretical frameworks, while acknowledging these remain analogies rather than formal equivalences:

Integrated Information Theory. Tononi's Φ measures how much a system is "more than the sum of its parts" through causal irreducibility. Our cluster collapse (b0 → 1) measures topological irreducibility. These are different mathematical objects — Φ is defined over state-transition mechanisms while b0 characterizes a static point cloud. However, the architectural dependence we observe is consistent with IIT's emphasis on causal interaction structure.

Computational symbiogenesis. Agüera y Arcas et al. [2024] demonstrate that self-replicating programs emerge from random computation through the merger of independent computational units. Our cluster collapse follows the same structural pattern. The Mamba counterexample reinforces this — without direct interaction, fusion cannot occur regardless of optimization pressure.

Free Energy Principle. The observation that optimization drives systems toward integrated configurations is consistent with free-energy minimization [Friston, 2010], where the unified manifold would correspond to a lower-energy state. However, we have not formally established this connection.

5.3 Limitations

  1. Scale of study. We analyze three models. Testing across more model families (encoder-decoders, vision transformers, mixture-of-experts) is needed.
  2. Bootstrap variance. The bimodal bootstrap distribution at L16 (87% collapse, 12% no collapse) reflects sensitivity to landmark sampling. Improved selection strategies may reduce this variance.
  3. Causal claims. We observe correlations between architecture and topology but have not established causal mechanisms.
  4. Theoretical connections. Our analogies to IIT, FEP, and symbiogenesis are structural, not formal.
  5. Training dynamics. We compare trained and untrained endpoints but do not measure topology during training.
  6. Token autocorrelation. Tokens within sequences are not independent. Our effective sample size is smaller than the raw token count.

6. Conclusion

Persistent homology reveals a topological phase transition in attention-based neural networks — cluster collapse from hundreds of independent components to a single unified manifold — that is absent in state-space models and untrained networks. This suggests that topological integration requires both an architectural mechanism for direct interaction (attention) and gradient-based optimization.

We summarize this as a two-condition framework: integration emerges when (1) representations can directly interact and (2) optimization pressure drives the system toward a unified state, with both conditions holding simultaneously. The framework is falsifiable: it predicts that any architecture enabling direct pairwise interaction between representations will develop cluster collapse when trained, and that any trained architecture lacking such a mechanism will not.

We hope this work contributes to a deeper understanding of how neural networks organize computation, and encourages further investigation at the intersection of algebraic topology, representation geometry, and computational neuroscience.

References

  • Agüera y Arcas, B., et al. (2024). Computational Life: How Well-formed, Self-replicating Programs Emerge from Simple Interaction. arXiv:2406.19108.
  • Ansuini, A., et al. (2019). Intrinsic dimension of data representations in deep neural networks. NeurIPS.
  • Facco, E., et al. (2017). Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific Reports.
  • Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience.
  • Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.
  • Kornblith, S., et al. (2019). Similarity of Neural Network Representations Revisited. ICML.
  • Marks, S. & Tegmark, M. (2023). The Geometry of Truth. arXiv:2310.06824.
  • Papyan, V., et al. (2020). Prevalence of Neural Collapse during the terminal phase of deep learning training. PNAS.
  • Rovelli, C. (2015). Seven Brief Lessons on Physics. Riverhead Books.
  • Tishby, N., et al. (2000). The information bottleneck method. Proceedings of the 37th Allerton Conference.
  • Tononi, G. (2004). An information integration theory of consciousness. BMC Neuroscience.
  • Tralie, C., et al. (2018). Ripser.py: A Lean Persistent Homology Library for Python. JOSS.
  • Valeriani, L., et al. (2023). The geometry of hidden representations of large transformer models. NeurIPS.