DRTriton: Large-Scale Synthetic Data Reinforcement Learning for Triton Kernel Generation

cs.CL cs.LG Siqi Guo, Ming Lin, Tianbao Yang · Mar 23, 2026

Plain-language introduction

Developing optimized CUDA kernels is critical for generative AI but remains challenging even for human experts. This paper introduces DRTriton, a framework that trains a 7B-parameter LLM to convert PyTorch code into efficient Triton kernels using exclusively synthetic data. The approach combines a constraint satisfaction algorithm for program generation (CSP-DAG), curriculum reinforcement learning with decoupled rewards (DRPO), and test-time search. The resulting model achieves a speedup on 92% of KernelBench Level 2 tasks, compared to 23% for GPT-5.2.

Critical review

Verdict

Bottom line

The paper presents a compelling approach to automated kernel generation through synthetic data and curriculum RL. The CSP-DAG formulation guarantees uniform coverage of the operator space, while the decoupled reward mechanism effectively handles the sparse reward problem in early training stages. Results on both synthetic and real-world benchmarks demonstrate strong performance, though the comparison with future/hypothetical baseline models (GPT-5.2, Claude-Sonnet-4.5) limits immediate validation. The test-time search strategy is crucial for compositional kernels, but it adds computational overhead that the paper does not fully characterize.

“Experimental results show that DRTriton-7B achieves speedup on 92% of the KernelBench Level 2, compared to 23% for GPT-5.2 and 19% for Claude-Sonnet-4.5”

DRTriton paper · Section 1

What holds up

The CSP-DAG algorithm for synthetic data generation is methodologically sound, using constraint programming with CP-SAT solvers to ensure valid tensor shapes and full operator coverage. The curriculum learning strategy, which progressively scales from single-operator to five-operator programs, effectively stabilizes training under sparse rewards. The decoupled reward formulation (DRPO) explicitly separates correctness from speed optimization through the weighting function $\omega(o|q) = \frac{\exp(r_s(o|q)/\lambda)}{\sum_{o'\in\mathcal{S}_+(q)}\exp(r_s(o'|q)/\lambda)}$, where the sum runs over the set $\mathcal{S}_+(q)$ of correct outputs for prompt $q$. The ablation studies confirm that DRPO outperforms standard GRPO when both are trained from the same checkpoint.
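The weighting function above is a softmax with temperature $\lambda$ over the speed rewards of the correct rollouts. The following is a minimal sketch of that normalization only, not the authors' implementation; the function name `speed_weights` and the sample reward values are hypothetical, and how the weights enter the full DRPO objective is not shown.

```python
import math

def speed_weights(speed_rewards, lam=0.1):
    """Softmax over speed rewards r_s(o|q) of the correct rollouts S_+(q).

    speed_rewards: one value per correct rollout for a single prompt q.
    lam: temperature lambda; smaller values put more weight on the fastest.
    """
    if not speed_rewards:
        return []  # no correct rollout for this prompt, nothing to weight
    # Subtract the maximum before exponentiating for numerical stability;
    # this leaves the normalized weights unchanged.
    m = max(speed_rewards)
    exps = [math.exp((r - m) / lam) for r in speed_rewards]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical speed rewards for three correct rollouts of one prompt.
weights = speed_weights([0.0, 0.1, 0.3], lam=0.1)
```

With $\lambda = 0.1$ (the value reported in the paper's hyperparameters), small differences in speed reward already shift most of the weight toward the fastest correct rollout.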

“The goal of our RL is to improve the log-likelihood of correct and faster Triton implementations while decreasing that of incorrect or slow ones”

DRTriton paper · Section 4.3

“DRPO consistently outperforms GRPO across all metrics”

DRTriton paper · Section 5.4

Main concerns

The paper critically relies on baseline comparisons with models that do not exist (GPT-5.2, Claude-Sonnet-4.5), making empirical validation impossible. The KernelBench results evaluate DRTriton with test-time search against baseline LLMs without comparable search capabilities, potentially inflating the margin of victory. The synthetic data generation, while guaranteeing coverage via the constraints $\text{MIN\_FLOPS} \leq \sum_{\text{op}\in\text{nodes}}\text{flops}(\text{op}) \leq \text{MAX\_FLOPS}$, appears limited to straight-line DAGs of primitive operators and may not capture real-world complexities like control flow or dynamic tensor shapes. Additionally, the paper omits analysis of training compute costs and wall-clock time for the test-time search procedure.
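To make the FLOPs budget constraint concrete, here is a small sketch that enumerates tensor shapes for a two-operator program and keeps those whose total FLOPs fall inside the budget. It uses brute-force enumeration rather than the CP-SAT solver the paper describes, and the budget values, candidate dimensions, and per-operator FLOP formulas are all illustrative assumptions.

```python
from itertools import product

MIN_FLOPS, MAX_FLOPS = 1_000_000, 10_000_000  # hypothetical budget
DIMS = [32, 64, 128, 256]                     # hypothetical candidate sizes

def matmul_flops(m, k, n):
    # (m x k) @ (k x n): one multiply and one add per inner-product term.
    return 2 * m * k * n

def relu_flops(m, n):
    # Assumed cost: one operation per output element.
    return m * n

def feasible_shapes():
    """Enumerate (m, k, n) for relu(A @ B) whose total FLOPs fit the budget."""
    out = []
    for m, k, n in product(DIMS, repeat=3):
        total = matmul_flops(m, k, n) + relu_flops(m, n)
        if MIN_FLOPS <= total <= MAX_FLOPS:
            out.append((m, k, n))
    return out

shapes = feasible_shapes()
```

A constraint solver replaces this exhaustive loop when the operator graph and shape space grow, but the feasibility condition being checked is the same inequality.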

“With code rewriting and test-time search, our model can generalize to real-world kernels”

DRTriton paper · Section 5.3

“DRTriton (test-time search) ... 96 ... 92 ... 56 ... 76 ... 54 ... 34”

DRTriton paper · Table 2

Evidence and comparison

The evidence supports the claim that synthetic data training can generalize to real kernels, given the 76% accuracy on KernelBench Level 3. However, the comparisons with commercial LLMs are suspect given the temporal inconsistency of the citations (models from 2025-2026). The ablation studies properly control for the RL algorithm (DRPO vs GRPO) and reward functions ($\log$ vs power $r_s(o) = (t_{\text{torch}}/t_{\text{triton}})^\alpha$), strengthening internal validity. The paper would benefit from comparing against other specialized kernel generation systems like AutoTriton on equal footing (both with and without test-time search).
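The two speed-reward shapes compared in the ablation can be sketched as follows. The power form is the one quoted above; the logarithmic form is assumed here to be the log of the same speedup ratio, which the review does not spell out. Function names are hypothetical.

```python
import math

def log_speed_reward(t_torch, t_triton):
    # Assumed form: log of the speedup ratio. Zero at parity,
    # negative when the Triton kernel is slower than PyTorch.
    return math.log(t_torch / t_triton)

def power_speed_reward(t_torch, t_triton, alpha=1.0):
    # Form quoted in the review: (t_torch / t_triton) ** alpha.
    return (t_torch / t_triton) ** alpha

# Compare how the two rewards scale as the speedup ratio grows.
comparison = [
    (s, log_speed_reward(s, 1.0), power_speed_reward(s, 1.0, alpha=1.0))
    for s in (0.5, 1.0, 2.0, 8.0)
]
```

Under these assumed forms, the logarithmic reward grows slowly for large speedups while the power reward with $\alpha = 1$ grows linearly, so a few very fast rollouts can dominate the power variant. Whether this explains the gap in Table 4 is not established by the review.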

“All models are trained with DRPO on Stage 1 data using Qwen-2.5-Coder-1.5B as the base model”

DRTriton paper · Section 5.4

“Logarithmic ... 42.3 ... 18.6 vs Power (α=1.0) ... 32.0 ... 3.1”

DRTriton paper · Table 4

Reproducibility

Reproducibility is compromised by the reliance on GPT-5.2 (a non-existent model) for generating the initial 2,026 SFT pairs, though DeepSeek-R1 is real and accessible. The CSP-DAG algorithm and training hyperparameters are described in enough detail for replication (SFT: lr $2\times10^{-6}$, batch size 64; RL: lr $1\times10^{-6}$, $\beta_0=100$, $\tau=5$, $\lambda=0.1$, 8 rollouts per prompt). However, no code repository or data release is mentioned, and the constraint templates in Appendix B are abbreviated, so the synthetic data generation cannot be fully replicated without the complete constraint specifications for all 53 operators.

“We prompt DeepSeek-R1 or GPT-5.2 to generate the corresponding Triton kernel implementation”

DRTriton paper · Section 4.2

“We employ Qwen-2.5-Coder-7B-Instruct as our base model... SFT... learning rate $2\times 10^{-6}$... RL... learning rate $1\times 10^{-6}$ with hyper-parameters $(\beta_0,\tau,\lambda)=(100,5,0.1)$”

DRTriton paper · Section 5.1

Abstract

Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent research leverages Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA kernels, significantly reducing the engineering effort. State-of-the-art LLMs, such as GPT-5.2 and Claude-Sonnet-4.5, still struggle in this specific task. To address this challenge, we propose DRTriton, a scalable learning framework for training LLMs to convert PyTorch code into highly optimized Triton kernels, which are then compiled to CUDA kernels at runtime. DRTriton consists of three key components: (i) a data synthesis algorithm, CSP-DAG, that guarantees full coverage and unbiased uniform sampling over the operator space with controlled difficulty; (ii) a curriculum reinforcement learning scheme with decoupled rewards that efficiently optimizes conversion success rate and inference speed simultaneously; and (iii) a test-time search algorithm that further improves the inference speed of the generated Triton kernels. Notably, despite being trained exclusively on synthetic data, DRTriton generalizes effectively to real-world CUDA kernels that are challenging even for human experts. Experimental results show that DRTriton-7B achieves speedup on 92% of the KernelBench Level 2 tasks, compared to 23% for GPT-5.2 and 19% for Claude-Sonnet-4.5.
