Kolmogorov Complexity Bounds for LLM Steganography and a Perplexity-Based Detection Proxy
This paper establishes information-theoretic limits on LLM steganography, proving that any semantic-preserving embedding of a payload $P$ into a covertext $M_1$ to produce stegotext $M_2$ must increase Kolmogorov complexity by at least $K(P) - O(\log n)$. Since Kolmogorov complexity is uncomputable, the authors propose perplexity ratios (specifically the Binoculars score) as a practical proxy and validate the approach on a color-based encoding scheme with 300 samples.
The theoretical result (Theorem 3) is sound and elegantly proven using symmetry-of-information arguments, establishing a 'no free lunch' principle for steganography. However, the experimental validation is limited to a single synthetic encoding scheme (synesthetic color embedding), and the connection from Kolmogorov complexity to the proposed Quadroculars score relies on strong, unverifiable assumptions about model optimality. The paper succeeds as a theoretical foundation but overstates the practical certainty of its proxy detection method.
The main theoretical bound is tight and correctly derived. The proof properly applies the chain rule and symmetry-of-information theorem from Kolmogorov complexity to establish that semantic preservation combined with lossless extraction forces the complexity increase. The connection to compression-based complexity proxies (Cilibrasi & Vitányi 2005) is well-founded, and the observation that perplexity serves as a probabilistic analogue of compression length is insightful and technically correct.
The experimental validation tests only one specific steganographic method (3-bit color codes per sentence using synesthetic associations), leaving generalizability to other embedding schemes entirely unverified. Assumption A1 (model quality) posits that $|\frac{K(x)}{T} - H_{\mathcal{M}_1}(x)| \leq \varepsilon$, which is uncheckable since $K(x)$ is uncomputable; the error bounds could be substantial in practice. The semantic-load definition (Definition 1) assumes $K(M_i|S) = O(1)$, which idealizes the fuzzy notion of natural-language equivalence and ignores that different surface forms of the 'same' meaning can have varying Kolmogorov complexity.
The paper fairly distinguishes its contribution from prior work: Cachin (2004) focused on statistical security (KL divergence) while this work addresses Kolmogorov complexity; Ryabko & Ryabko (2009) bounded encoder complexity while this bounds output complexity. The comparison to Roger & Greenblatt (2023) on AI-to-AI covert channels is appropriate. However, the claim that Binoculars detects the predicted complexity increase relies on a single significant test ($p < 10^{-6}$) with moderate effect size (mean shift 0.310), and the paper acknowledges that 'single-threshold detection would have non-trivial error rates'.
Reproducibility is significantly impaired. The encoding and decoding uses a proprietary model ('claude-sonnet-4-20250514') that is not publicly available or versioned. No code, dataset, or hyperparameters are provided—the paper merely states that 300 entries were constructed from 'Three-Body Problem' excerpts without releasing the corpus or the specific prompts used for the LLM rewriter. The Binoculars scores depend on Falcon-7B and Falcon-7B-Instruct, which are accessible, but the exact preprocessing and tokenization details necessary to replicate the scores are omitted.
Large language models can rewrite text to embed hidden payloads while preserving surface-level meaning, a capability that opens covert channels between cooperating AI systems and poses challenges for alignment monitoring. We study the information-theoretic cost of such embedding. Our main result is that any steganographic scheme that preserves the semantic load of a covertext~$M_1$ while encoding a payload~$P$ into a stegotext~$M_2$ must satisfy $K(M_2) \geq K(M_1) + K(P) - O(\log n)$, where $K$ denotes Kolmogorov complexity and $n$ is the combined message length. A corollary is that any non-trivial payload forces a strict complexity increase in the stegotext, regardless of how cleverly the encoder distributes the signal. Because Kolmogorov complexity is uncomputable, we ask whether practical proxies can detect this predicted increase. Drawing on the classical correspondence between lossless compression and Kolmogorov complexity, we argue that language-model perplexity occupies an analogous role in the probabilistic regime and propose the Binoculars perplexity-ratio score as one such proxy. Preliminary experiments with a color-based LLM steganographic scheme support the theoretical prediction: a paired $t$-test over 300 samples yields $t = 5.11$, $p < 10^{-6}$.
Pick a starting point or write your own. Challenges run in the background, so you can keep reading while the AI investigates.
No challenges yet. Disagree with the review? Ask the AI to revisit a specific claim.