AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

cs.LG, cs.PF · Jaber Jaber, Osama Jaber · Mar 22, 2026



Plain-language introduction

GPU kernel optimization is among the most expertise-intensive tasks in ML systems engineering, often requiring weeks of manual tuning per kernel. AutoKernel proposes to automate this via an autonomous agent loop that iteratively edits Triton or CUDA C++ kernels, validates them through a five-stage correctness harness, and keeps or reverts changes based on benchmarked throughput. The system prioritizes kernels by their contribution to total runtime (Amdahl's law) and encodes expert tuning strategies into a six-tier agent playbook. While it demonstrates strong results on memory-bound operations like normalization and softmax, its compute-bound matmul performance remains significantly below vendor library baselines.

Critical review

Verdict

Bottom line

AutoKernel delivers a well-engineered, open-source system for autonomous GPU kernel optimization that excels on memory-bound transformer operations (5.29$\times$ speedup on RMSNorm, 2.82$\times$ on softmax versus eager) but fails to match cuBLAS on matrix multiplication (only 29-43% of PyTorch eager performance in Table 4). The five-stage correctness harness (smoke tests, shape sweeps, numerical stability, determinism, edge cases) is rigorous, and all 34 tested configurations pass without failures. However, the paper's claims about beating torch.compile rest on a limited set of results (12 of the 16 configurations shown), and the impressive 'community deployment' results (FP4 matmul beating CUTLASS by up to 2.15$\times$) are unverified third-party reports rather than controlled experiments. The system is best viewed as a practical tool for optimizing the long tail of memory-bound custom kernels where vendor libraries offer no fused implementation, rather than a replacement for cuBLAS on GEMMs.

“Matmul remains hard... Our Triton starter reaches 278 TFLOPS (28% of H100's 989.5 TFLOPS peak), well below cuBLAS.”

paper · Section 7.1

“All 34 configurations pass all five verification stages with zero failures across eager, compiled, and custom kernel outputs.”

paper · Section 7.1

What holds up

The five-stage correctness verification pipeline is the paper's strongest technical contribution, systematically catching compilation errors, shape-dependent boundary bugs, numerical stability issues under adversarial inputs, race conditions via determinism checks, and non-power-of-two edge cases before any performance is recorded. This ensures that the automated search does not sacrifice correctness for speed. The Amdahl's law orchestrator correctly prioritizes optimization effort by kernel runtime contribution, ensuring that agent iterations are spent where they impact end-to-end model latency. The dual-backend architecture providing both Triton (for fast iteration) and CUDA C++ (for hardware-level control) is well-designed, and the memory-bound kernel results (RMSNorm reaching 83% of H100 peak bandwidth) demonstrate that the agent can effectively apply fusion and tiling strategies from the six-tier playbook.

“Five-stage correctness pipeline. Any failure immediately rejects the candidate. Throughput is only measured after all five stages pass.”

paper · Section 4
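The gate-then-benchmark behaviour described above can be sketched roughly as follows. This is a minimal illustration, not AutoKernel's actual code: the function names (`passes_all_stages`, `search`), the stage identifiers, and the callables passed in are all hypothetical, and the sketch assumes the starting kernel is already known to be correct.

```python
from typing import Callable, Dict, Tuple, TypeVar

K = TypeVar("K")  # stands in for "a kernel implementation"

# Stage identifiers are illustrative labels for the five stages named in the review.
STAGES = ("smoke", "shape_sweep", "numerical_stability", "determinism", "edge_cases")


def passes_all_stages(candidate: K, checks: Dict[str, Callable[[K], bool]]) -> bool:
    """Run the stages in order; all() short-circuits on the first failure."""
    return all(checks[name](candidate) for name in STAGES)


def search(
    baseline: K,
    propose: Callable[[K], K],
    checks: Dict[str, Callable[[K], bool]],
    benchmark: Callable[[K], float],
    iterations: int,
) -> Tuple[K, float]:
    """Keep-or-revert loop: a candidate replaces the current best only if it
    passes every correctness stage and then benchmarks faster."""
    best, best_throughput = baseline, benchmark(baseline)
    for _ in range(iterations):
        candidate = propose(best)
        if not passes_all_stages(candidate, checks):
            continue  # rejected before any timing is recorded
        throughput = benchmark(candidate)
        if throughput > best_throughput:
            best, best_throughput = candidate, throughput  # keep
        # otherwise: revert, i.e. the next proposal starts from `best` again
    return best, best_throughput
```

Because `benchmark` is only reached after the gate, a fast but incorrect candidate never produces a recorded throughput, which is the property the quoted passage emphasizes.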

“The orchestrator applies Amdahl's law... It transitions to the next kernel when: (1) 5 consecutive reverts, (2) 90% of GPU peak reached, (3) 2 hours elapsed, or (4) 2$\times$ speedup achieved.”

paper · Section 3.4
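The four transition conditions quoted above amount to a simple disjunction, sketched below. The `KernelProgress` fields and the `transition_reason` helper are hypothetical names for illustration; whether the thresholds are inclusive, and which baseline the 2$\times$ speedup is measured against, are assumptions not stated in the quote.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KernelProgress:
    """Hypothetical per-kernel search state; field names are illustrative."""
    consecutive_reverts: int
    fraction_of_peak: float  # achieved throughput / hardware peak, in [0, 1]
    elapsed_hours: float
    speedup: float  # current best relative to the starting kernel (assumed)


def transition_reason(p: KernelProgress) -> Optional[str]:
    """Return which of the four quoted conditions fired, or None to keep going."""
    if p.consecutive_reverts >= 5:
        return "5 consecutive reverts"
    if p.fraction_of_peak >= 0.90:
        return "90% of GPU peak reached"
    if p.elapsed_hours >= 2.0:
        return "2 hours elapsed"
    if p.speedup >= 2.0:
        return "2x speedup achieved"
    return None
```

Returning the reason rather than a bare boolean makes it easy to log why the orchestrator left a kernel, which matters when reading back a run of several hundred experiments.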

Main concerns

The matrix multiplication results are a significant weakness: AutoKernel's Triton kernels achieve only 0.29$\times$ to 0.43$\times$ the performance of PyTorch eager (cuBLAS), and 0.33$\times$ to 0.52$\times$ versus torch.compile on larger sizes (Table 4). Since matmul consumes 60-80% of transformer runtime per the paper's own introduction, failing to optimize the dominant operation limits practical impact for end-to-end model speedup. The paper prominently features 'community deployment' results—specifically a Triton FP4 matmul allegedly beating CUTLASS by 1.63$\times$ to 2.15$\times$—but these are unverified third-party reports with no controlled methodology, reproducible artifacts, or verification that the compared CUTLASS configurations were optimal. The scope is also limited to single-GPU individual kernels; distributed kernels, multi-device memory management, and cross-kernel fusion discovery are explicitly out of scope (Section 11).

“Matmul... 8192$^3$: 1679.5$\mu$s eager, 1916.1$\mu$s compiled, 5773.1$\mu$s ours... 0.29$\times$ vs eager, 0.33$\times$ vs compiled”

paper · Section 7.1, Table 4
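The ratios in the quoted Table 4 row follow directly from the latencies: speedup is baseline time divided by AutoKernel's time, so values below 1 mean the custom kernel is slower. A small check, using only the three latencies quoted above (the helper name `speedup` is illustrative):

```python
def speedup(baseline_us: float, ours_us: float) -> float:
    """Speedup of 'ours' over a baseline; below 1.0 means 'ours' is slower."""
    return baseline_us / ours_us


# Latencies for the 8192^3 matmul as quoted from Table 4 (microseconds).
EAGER_US = 1679.5
COMPILED_US = 1916.1
OURS_US = 5773.1

vs_eager = round(speedup(EAGER_US, OURS_US), 2)
vs_compiled = round(speedup(COMPILED_US, OURS_US), 2)
print(f"vs eager: {vs_eager}x, vs compiled: {vs_compiled}x")
```

The results round to the 0.29$\times$ and 0.33$\times$ figures reported in the table.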

“The system currently optimizes individual kernels on a single GPU; distributed kernels and multi-device memory management are out of scope.”

paper · Section 11
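The concern about matmul's share of runtime can be made concrete with Amdahl's law. The sketch below is illustrative arithmetic, not a result from the paper: it assumes matmul stays at cuBLAS speed, and it generously applies the 5.29$\times$ RMSNorm speedup to all non-matmul time, which the paper does not claim.

```python
def amdahl_speedup(fraction_optimized: float, local_speedup: float) -> float:
    """End-to-end speedup when `fraction_optimized` of runtime gets
    `local_speedup` faster and the rest is unchanged (Amdahl's law)."""
    if not 0.0 <= fraction_optimized <= 1.0:
        raise ValueError("fraction_optimized must be in [0, 1]")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction_optimized) + fraction_optimized / local_speedup)


# Matmul at 60% or 80% of runtime leaves 40% or 20% for everything else.
for matmul_share in (0.6, 0.8):
    other = 1.0 - matmul_share
    realistic = amdahl_speedup(other, 5.29)        # assumed uniform 5.29x
    ceiling = amdahl_speedup(other, float("inf"))  # non-matmul time made free
    print(f"matmul {matmul_share:.0%}: {realistic:.2f}x (ceiling {ceiling:.2f}x)")
```

Under these assumptions the end-to-end gain is about 1.48$\times$ when matmul is 60% of runtime and about 1.19$\times$ when it is 80%, with hard ceilings of roughly 1.67$\times$ and 1.25$\times$ even if all non-matmul work became free.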

Evidence and comparison

The evidence strongly supports the claims for memory-bound operations (normalization, softmax, cross-entropy), where the agent successfully fuses multi-operator ATen decompositions into bandwidth-efficient Triton kernels. However, the comparison to torch.compile is presented selectively: the paper states that its kernels 'beat it on 12 of the 16 configurations shown in Table 4,' but they lose on rotary embedding (0.91$\times$ and 0.61$\times$) and perform poorly on matmul. The comparison to KernelBench (Ouyang et al., 2025) is contextual: the paper correctly notes that prior work matched the PyTorch baseline in fewer than 20% of cases with one-shot generation, but iterative methods like CUDA Agent (Dai et al., 2026) have since reported 92-100% faster-than-torch.compile rates, suggesting AutoKernel's simple loop design may not represent the state of the art in success rate. The claim that AutoKernel differs by 'starting from a complete PyTorch model' is valid but difficult to evaluate because the paper reports no end-to-end model speedup numbers.

“AutoKernel outperforms torch.compile on most kernels... our starter kernels beat it on 12 of the 16 configurations shown in Table 4.”

paper · Section 7.1

“frontier reasoning models... matching the PyTorch baseline in less than 20% of the cases using one-shot generation”

Ouyang et al., KernelBench · Abstract

Reproducibility

The system is open-source with over 9,200 lines of Python, 909 lines of agent instructions (program.md), and 18 starter kernel implementations, which supports reproduction. The paper provides specific hardware details (NVIDIA H100 80GB, CUDA 12.8) and measurement methodology (CUDA event timing, 200 iterations, trimmed mean). However, reproducibility may be hindered by the stochastic nature of LLM-based editing—success depends on the specific frontier model used (not always specified), temperature settings, and the non-deterministic trajectory of the 300-400 experiments per run. The 'community deployment' FP4 matmul result lacks code artifacts or exact prompts. While the five-stage harness ensures correctness, reproducing the specific speedups would require re-running the expensive iterative loop. The git-based experiment tracking is good practice, but the paper does not clarify whether the exact experiment histories are logged and available for analysis.

“AutoKernel comprises over 9,200 lines of Python across 14 core scripts, 18 kernel implementations (9 Triton, 9 CUDA C++), 4 model definitions, and a 909-line agent instruction document”

paper · Section 3

“All measurements use FP16 precision with CUDA event timing, 200 iterations per configuration, and trimmed mean (dropping the top and bottom 10%).”

paper · Section 7
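The trimmed-mean step in the quoted methodology can be sketched as below. This is an assumed reading of "dropping the top and bottom 10%", not the paper's code; the function name and the integer `trim_percent` parameter are illustrative. On a GPU the samples would typically come from CUDA event timing (for example `torch.cuda.Event(enable_timing=True)` and `elapsed_time`), but the sketch takes a plain list so it runs anywhere.

```python
from typing import Sequence


def trimmed_mean(samples: Sequence[float], trim_percent: int = 10) -> float:
    """Mean after dropping the lowest and highest `trim_percent` of samples."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 <= trim_percent < 50:
        raise ValueError("trim_percent must be in [0, 50)")
    ordered = sorted(samples)
    # Integer arithmetic avoids float rounding when computing the cut count.
    cut = len(ordered) * trim_percent // 100
    kept = ordered[cut: len(ordered) - cut]
    return sum(kept) / len(kept)


# With 200 timings and a 10% trim, 20 samples are dropped from each end,
# so 160 samples contribute to the reported figure.
timings_ms = [1.0] * 198 + [0.2, 50.0]  # two outliers, e.g. a warm-up spike
print(trimmed_mean(timings_ms))
```

Trimming both tails makes the reported latency robust to occasional stalls and unusually fast runs, at the cost of hiding tail behaviour that may matter for latency-sensitive deployments.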

Abstract

Writing high-performance GPU kernels is among the most labor-intensive tasks in machine learning systems engineering. We present AutoKernel, an open-source framework that applies an autonomous agent loop to GPU kernel optimization for arbitrary PyTorch models. Given a model, AutoKernel profiles it to identify computational bottlenecks, ranks them by Amdahl's law impact, and iteratively refines Triton or CUDA C++ kernel implementations through hundreds of experiments without human intervention. A five-stage correctness harness covering smoke tests, shape sweeps, numerical stability, determinism verification, and edge-case coverage ensures every candidate kernel is validated before any speedup is recorded. The system comprises over 9,000 lines of Python, 18 starter kernel implementations across two backends, a six-tier optimization playbook, and integration with the KernelBench benchmark suite. AutoKernel covers nine kernel types spanning the dominant operations in modern transformer architectures. On an NVIDIA H100, our Triton kernels outperform both PyTorch eager and torch.compile (max-autotune) on the majority of tested configurations: 5.29x over eager on RMSNorm, 2.82x on softmax, and 2.21x on cross-entropy, while beating torch.compile by 2.83x, 3.44x, and 2.94x respectively. In community deployment, an AutoKernel-optimized kernel achieved first place on the vectorsum_v2 B200 leaderboard. The full system is available at https://github.com/RightNow-AI/autokernel.
