Topological Integration in Attention-Based Neural Networks

Cluster Collapse, Counterexamples, and the Role of Direct Interaction

Adam — Revelry Inc. Pre-print draft

Abstract

We use persistent homology to measure the topological structure of hidden representations across layers in three architecturally distinct neural networks: a 4B-parameter dense transformer (Qwen3-4B), a 328M-parameter recursive transformer (NanoChat), and a 370M-parameter state-space model (Mamba). While prior work has characterized the intrinsic dimensionality (ID) profile across layers, persistent homology reveals a qualitatively distinct phenomenon: in attention-based architectures, independent representational clusters collapse to a single connected component at intermediate layers (persistent b0: 519→1 in Qwen3-4B, 517→1 in NanoChat), and in the dense transformer they re-differentiate at the output. This cluster collapse is absent in the state-space model (Mamba b0: 571→987, no collapse) and in untrained transformers (b0: 503→968, clusters proliferate; confirmed across 8 random seeds). A sensitivity analysis across 36 parameter combinations and 100 bootstrap iterations supports the robustness of these findings. We propose that topological integration requires two conditions: (1) an architectural mechanism for direct interaction between representations (attention), and (2) gradient-based optimization to learn the integrated configuration.

1. Introduction

The geometric properties of neural network representations have been studied through intrinsic dimensionality (ID) estimation, revealing a characteristic "hunchback" profile across layers — ID increases in early layers and decreases toward the output [Ansuini et al., 2019]. This pattern has been confirmed in large transformer models [Valeriani et al., 2023] and connected to the Information Bottleneck framework [Tishby et al., 2000], though the relationship between ID compression and generalization remains debated.

We extend this line of work by applying topological analysis — specifically, persistent homology — to neural network representations. Where ID measures the local dimensionality of the data manifold, persistent homology captures global structural properties: how many independent components exist (b0), how many loops or cycles are present (b1), and how these features persist across spatial scales.
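To make these quantities concrete, consider a toy example (ours, not drawn from the analysis below): points sampled from a noisy circle form one connected component containing one loop, so a persistent-homology library such as ripser.py should report b0 = 1 and b1 = 1 once short-lived noise features are discarded.

    import numpy as np
    from ripser import ripser

    # A noisy circle: one component (b0 = 1) containing one loop (b1 = 1).
    rng = np.random.default_rng(0)
    theta = rng.uniform(0, 2 * np.pi, 200)
    X = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0, 0.05, (200, 2))

    dgms = ripser(X, maxdim=1)["dgms"]
    # dgms[0] (H0): one bar dying at infinity, i.e. a single component;
    # dgms[1] (H1): one long-lived bar for the loop, plus short-lived noise.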

Our central finding is a cluster collapse — the number of persistent connected components drops from hundreds to one at intermediate layers in attention-based architectures. This is a qualitatively different observation from the ID hunchback: while ID measures manifold complexity, cluster collapse measures manifold connectivity. A representation can have high intrinsic dimensionality (complex) while being topologically unified (one connected piece).

Critically, this collapse is absent in a state-space model (Mamba), which processes tokens sequentially through a compressed state rather than through direct pairwise interaction (attention). This architectural counterexample suggests that topological integration requires a mechanism for direct interaction between representations — an observation we connect, with appropriate caveats, to Integrated Information Theory [Tononi, 2004] and computational symbiogenesis [Agüera y Arcas et al., 2024].

3. Methods

3.1 Models

We analyze three architecturally distinct models:

  • Qwen3-4B: 36-layer dense transformer, dmodel=2560, 4B parameters, SiLU activation. All parameters active for every token.
  • NanoChat Recursive: 328M-parameter GPT-2 style model, dmodel=1280, ReLU² activation. Architecture: 2 prelude → 4 recurrent (looped 4×) → 2 coda layers.
  • Mamba-370m: 48-layer selective state-space model [Gu & Dao, 2023], dmodel=1024, 370M parameters. Processes tokens through selective state updates with no attention mechanism.

3.2 Activation Capture

For Qwen3-4B, we capture residual-stream activations at 7 layers (L3, L5, L6, L8, L16, L24, L35) across 3.78M tokens. For NanoChat, we capture 9 stages (embed, P0, P1, R0-R3, C0, C1) across 49,664 tokens from WikiText-2. For Mamba, we capture 7 layers (L0, L8, L16, L24, L32, L40, L47) across 40,960 tokens from WikiText-2.
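For readers who want to replicate the capture step, the sketch below shows one standard approach using PyTorch forward hooks. It assumes a HuggingFace-style decoder whose blocks are exposed under model.model.layers (as in Qwen-family checkpoints); it illustrates the technique and is not the authors' capture code.

    import torch

    def capture_residual_stream(model, input_ids, layer_indices):
        """Collect residual-stream activations at selected layers via hooks."""
        captured = {}

        def make_hook(idx):
            def hook(module, inputs, output):
                # Decoder blocks often return a tuple; hidden states come first.
                hidden = output[0] if isinstance(output, tuple) else output
                captured[idx] = hidden.detach().float().cpu()
            return hook

        handles = [model.model.layers[i].register_forward_hook(make_hook(i))
                   for i in layer_indices]
        try:
            with torch.no_grad():
                model(input_ids)
        finally:
            for h in handles:
                h.remove()
        return captured  # {layer_index: tensor of shape (batch, seq, d_model)}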

3.3 Two-NN Intrinsic Dimensionality

We estimate ID using the Two-NN estimator [Facco et al., 2017]: ID = n / Σi log(μi), where μi = d2(xi) / d1(xi) is the ratio of second- to first-nearest-neighbor distances. We subsample 10,000 tokens per layer and derive confidence intervals from 100 bootstrap iterations.
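A minimal NumPy/scikit-learn implementation of the estimator and the subsample-and-bootstrap protocol might look as follows (helper names are ours):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def two_nn_id(X):
        """Two-NN estimate (Facco et al., 2017): ID = n / sum_i log(mu_i)."""
        dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
        d1, d2 = dists[:, 1], dists[:, 2]  # column 0 is the point itself
        keep = d1 > 0                      # drop exact duplicate points
        mu = d2[keep] / d1[keep]
        return keep.sum() / np.log(mu).sum()

    def bootstrap_id(X, n_tokens=10_000, n_boot=100, seed=0):
        """Subsample n_tokens, repeat n_boot times, return mean and 95% CI."""
        rng = np.random.default_rng(seed)
        estimates = [two_nn_id(X[rng.choice(len(X), min(n_tokens, len(X)),
                                            replace=False)])
                     for _ in range(n_boot)]
        return np.mean(estimates), np.percentile(estimates, [2.5, 97.5])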

3.4 Persistent Homology

We compute persistent homology using Ripser.py [Tralie et al., 2018]. Default parameters: 1,000 landmark points (random selection), PCA projection to 50 dimensions, Vietoris-Rips filtration with Euclidean distance, maximum homology dimension 1. We report persistent Betti numbers, counting features whose lifetime exceeds 10% of the maximum lifetime at that dimension.
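The sketch below reproduces this pipeline with ripser.py and scikit-learn. Because the text specifies only the 10%-of-maximum-lifetime rule, two details are our assumptions: the reference lifetime is the maximum over finite lifetimes, and infinite bars (such as the essential H0 class) always count.

    import numpy as np
    from sklearn.decomposition import PCA
    from ripser import ripser

    def persistent_betti(X, n_landmarks=1000, pca_dim=50,
                         threshold=0.10, maxdim=1, seed=0):
        """Persistent Betti numbers under the Section 3.4 defaults."""
        rng = np.random.default_rng(seed)
        pts = X[rng.choice(len(X), min(n_landmarks, len(X)), replace=False)]
        if pca_dim is not None and pca_dim < pts.shape[1]:
            pts = PCA(n_components=pca_dim).fit_transform(pts)

        dgms = ripser(pts, maxdim=maxdim)["dgms"]  # Vietoris-Rips, Euclidean
        betti = []
        for dgm in dgms:
            life = dgm[:, 1] - dgm[:, 0]
            finite = life[np.isfinite(life)]
            cutoff = threshold * finite.max() if finite.size else 0.0
            # Count features outliving the cutoff; infinite bars always count.
            betti.append(int(np.sum(~np.isfinite(life) | (life > cutoff))))
        return betti  # [p_b0, p_b1]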

Sensitivity analysis: We sweep across 36 parameter combinations: landmarks ∈ {500, 1000, 2000}, PCA dimensions ∈ {30, 50, 100, None (raw 2560-dim)}, persistence threshold ∈ {5%, 10%, 20%}.

Bootstrap confidence intervals: We run 100 bootstrap iterations (different random landmark selections) at each layer to establish statistical reliability.
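Both robustness checks are thin drivers over the persistent_betti sketch above (assumed in scope here):

    from itertools import product
    import numpy as np

    # Parameter grid of Section 3.4: 3 x 4 x 3 = 36 combinations.
    LANDMARKS  = (500, 1000, 2000)
    PCA_DIMS   = (30, 50, 100, None)   # None = raw activation dimension
    THRESHOLDS = (0.05, 0.10, 0.20)

    def sensitivity_sweep(X):
        """p_b0 for each of the 36 parameter combinations."""
        return {(n, d, t): persistent_betti(X, n_landmarks=n, pca_dim=d,
                                            threshold=t)[0]
                for n, d, t in product(LANDMARKS, PCA_DIMS, THRESHOLDS)}

    def bootstrap_b0(X, n_boot=100):
        """Re-draw the landmark set n_boot times; report collapse fractions."""
        b0 = np.array([persistent_betti(X, seed=s)[0] for s in range(n_boot)])
        return {"collapsed (p_b0 <= 5)": float((b0 <= 5).mean()),
                "fragmented (p_b0 > 500)": float((b0 > 500).mean())}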

3.5 Controls

  1. Untrained model: Randomly initialized NanoChat (same architecture, no training), repeated across 8 different random seeds.
  2. Architectural counterexample: Mamba-370m (trained model, no attention mechanism).

4. Results

4.1 Intrinsic Dimensionality

All three models show distinct ID profiles:

Qwen3-4B (transformer): Hunchback peaking at L16 (ID=9.8), consistent with prior work.

NanoChat (recursive transformer): ID peaks during recursion (R1=12.1), then rises further at the coda (C1=13.9).

Mamba (SSM): ID increases monotonically from 2.6 (L0) to 8.4 (L32), then plateaus. No hunchback.

4.2 Cluster Collapse in Attention-Based Models

Persistent b0 (connected components) reveals a phase transition in attention-based architectures:

Model        Early          Mid          Late
Qwen3-4B     519 (L3)       1 (L16)      715 (L35)
NanoChat     517 (embed)    –            1 (C1)
Mamba        571 (L0)       987 (L32)    947 (L47)

In both transformers, representations collapse from hundreds of independent clusters to a single connected component. In Mamba, no collapse occurs — clusters increase through the network.

4.3 Loop Formation

Persistent b1 (1-cycles) peaks at or near the integration layer in transformers:

Model        Early          Peak         Late
Qwen3-4B     207 (L3)       342 (L16)    145 (L35)
NanoChat     425 (embed)    400 (R3)     156 (C1)
Mamba        112 (L0)       325 (L40)    264 (L47)

Mamba does develop loop structure, suggesting some internal organization emerges from training even without attention. However, this occurs without the accompanying cluster collapse — the loops form within a fragmented, multi-component manifold rather than a unified one.

4.4 Robustness

Sensitivity analysis (36 parameter combinations on Qwen3-4B L16):

  • Cluster collapse (p_b0 ≤ 5) in 30/36 combinations (83%)
  • Five of the 6 failures involve 500 landmarks (insufficient sampling)
  • With ≥1000 landmarks: collapse in 23/24 combinations (96%)
  • Without PCA (raw 2560 dimensions): p_b0 = 2, confirming the collapse is not a PCA projection artifact

Bootstrap confidence intervals (100 iterations, Qwen3-4B):

Layer    p_b0 ≤ 5    p_b0 > 500    Distribution
L3       0%          79%           Always fragmented
L6       57%         43%           Transitional
L16      87%         12%           Mostly collapsed
L24      90%         10%           Mostly collapsed
L35      0%          37%           Re-differentiated

We interpret the bimodal distribution (runs land at either ≤5 or >800, rarely in between) as landmark-sampling sensitivity rather than ambiguity in the underlying topology.

Multi-seed control (8 random seeds, untrained NanoChat):

  • Clusters proliferate in 8/8 seeds (embed ~297 → C1 ~486)
  • ID flat at ~3.3 across all seeds (std=0.1)
  • The untrained pattern is consistent across initializations

4.5 Emergent Bimodal Gating (Qwen3-4B)

In the course of this analysis we also identified an emergent processing gate at L3 in Qwen3-4B that routes 93% of tokens into a shallow path (Mode A) and 7% into a deep path (Mode B). Causal ablation confirms the gate is functional. Mode B tokens show consistently lower ID (~2.5 dimensions below Mode A) and at L6 form only 4 persistent clusters (PCA explained variance 99.9%), compared to 811 clusters for Mode A. This suggests the model develops specialized processing pathways within the integrated manifold.

5. Discussion

5.1 Two Conditions for Topological Integration

Our results suggest that cluster collapse requires two conditions:

1. Architectural mechanism for direct interaction. Attention provides pairwise interaction between all token representations, enabling them to merge into a unified manifold. Mamba's sequential state updates compress information through a bottleneck but do not enable direct interaction between representations. The absence of collapse in Mamba despite successful training suggests that optimization alone is insufficient.

2. Gradient-based optimization. Untrained transformers have the architectural capacity for integration (attention) but show the opposite pattern — clusters proliferate. Training finds the integrated configuration. This is confirmed across 8 random seeds.

Neither condition alone is sufficient. Both are necessary.

5.2 Structural Analogies

We note structural analogies to several theoretical frameworks, while acknowledging these remain analogies rather than formal equivalences:

Integrated Information Theory. Tononi's Φ measures how much a system is "more than the sum of its parts" through causal irreducibility. Our cluster collapse (b0 → 1) measures topological irreducibility. These are different mathematical objects — Φ is defined over state-transition mechanisms while b0 characterizes a static point cloud. However, the architectural dependence we observe is consistent with IIT's emphasis on causal interaction structure.

Computational symbiogenesis. Agüera y Arcas et al. [2024] demonstrate that self-replicating programs emerge from random computation through the merger of independent computational units. Our cluster collapse follows the same structural pattern. The Mamba counterexample reinforces this — without direct interaction, fusion cannot occur regardless of optimization pressure.

Free Energy Principle. The observation that optimization drives systems toward integrated configurations is consistent with free-energy minimization [Friston, 2010], where the unified manifold would correspond to a lower-energy state. However, we have not formally established this connection.

5.3 Limitations

  1. Scale of study. We analyze three models. Testing across more model families (encoder-decoders, vision transformers, mixture-of-experts) is needed.
  2. Bootstrap variance. The bimodal bootstrap distribution at L16 (87% collapse, 12% no collapse) reflects sensitivity to landmark sampling. Improved selection strategies may reduce this variance.
  3. Causal claims. We observe correlations between architecture and topology but have not established causal mechanisms.
  4. Theoretical connections. Our analogies to IIT, FEP, and symbiogenesis are structural, not formal.
  5. Training dynamics. We compare trained and untrained endpoints but do not measure topology during training.
  6. Token autocorrelation. Tokens within sequences are not independent. Our effective sample size is smaller than the raw token count.

6. Conclusion

Persistent homology reveals a topological phase transition in attention-based neural networks — cluster collapse from hundreds of independent components to a single unified manifold — that is absent in state-space models and untrained networks. This suggests that topological integration requires both an architectural mechanism for direct interaction (attention) and gradient-based optimization.

We summarize this as a two-condition framework: integration emerges when (1) representations can directly interact and (2) optimization pressure drives the system toward a unified state, with both conditions holding simultaneously. The framework is falsifiable: it predicts that any architecture enabling direct pairwise interaction between representations will develop cluster collapse when trained, and that any trained architecture lacking such a mechanism will not.

We hope this work contributes to a deeper understanding of how neural networks organize computation, and encourages further investigation at the intersection of algebraic topology, representation geometry, and computational neuroscience.

References

  • Agüera y Arcas, B., et al. (2024). Computational Life: How Well-formed, Self-replicating Programs Emerge from Simple Interaction. arXiv:2406.19108.
  • Ansuini, A., et al. (2019). Intrinsic dimension of data representations in deep neural networks. NeurIPS.
  • Facco, E., et al. (2017). Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific Reports.
  • Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience.
  • Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.
  • Kornblith, S., et al. (2019). Similarity of Neural Network Representations Revisited. ICML.
  • Marks, S. & Tegmark, M. (2023). The Geometry of Truth. arXiv:2310.06824.
  • Papyan, V., et al. (2020). Prevalence of Neural Collapse during the terminal phase of deep learning training. PNAS.
  • Rovelli, C. (2015). Seven Brief Lessons on Physics. Riverhead Books.
  • Tishby, N., et al. (2000). The information bottleneck method. Proceedings of the 37th Allerton Conference.
  • Tononi, G. (2004). An information integration theory of consciousness. BMC Neuroscience.
  • Tralie, C., et al. (2018). Ripser.py: A Lean Persistent Homology Library for Python. JOSS.
  • Valeriani, L., et al. (2023). The geometry of hidden representations of large transformer models. NeurIPS.