DRTriton: Large-Scale Synthetic Data Reinforcement Learning for Triton Kernel Generation
Developing optimized CUDA kernels is critical for generative AI but remains challenging even for human experts. This paper introduces DRTriton, a framework that trains a 7B-parameter LLM to convert PyTorch code into efficient Triton kernels using exclusively synthetic data. The approach combines a constraint-satisfaction algorithm for program generation (CSP-DAG), curriculum reinforcement learning with decoupled rewards (DRPO), and test-time search, achieving a speedup on 92% of KernelBench Level 2 tasks compared to 23% for GPT-5.2.
The paper presents a compelling approach to automated kernel generation through synthetic data and curriculum RL. The CSP-DAG formulation guarantees full coverage of, and uniform sampling over, the operator space, while the decoupled reward mechanism effectively handles the sparse reward problem in early training stages. Results on both synthetic and real-world benchmarks demonstrate strong performance, though the comparison with future/hypothetical baseline models (GPT-5.2, Claude-Sonnet-4.5) limits immediate validation. The test-time search strategy is crucial for compositional kernels, but it adds computational overhead that the paper does not fully characterize.
The CSP-DAG algorithm for synthetic data generation is methodologically sound, using constraint programming with CP-SAT solvers to ensure valid tensor shapes and full operator coverage. The curriculum, which scales progressively from single-operator to five-operator programs, effectively stabilizes training under sparse rewards. The decoupled reward formulation (DRPO) explicitly separates correctness from speed optimization through the weighting function $\omega(o|q) = \frac{\exp(r_s(o|q)/\lambda)}{\sum_{o'\in\mathcal{S}_+(q)}\exp(r_s(o'|q)/\lambda)}$; the ablation studies confirm that DRPO outperforms standard GRPO when both are trained from the same checkpoint.
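The weighting function above is a temperature-scaled softmax over speed rewards. The following is a minimal sketch of that computation, assuming $\mathcal{S}_+(q)$ denotes the rollouts for prompt $q$ that passed the correctness check; the function name `speed_weights` and the sample rewards are hypothetical, and how the weights enter the policy update is not shown because the review does not specify it.

```python
import math

def speed_weights(speed_rewards, lam=0.1):
    """Softmax with temperature `lam` over the speed rewards r_s(o|q).

    `speed_rewards` is assumed to hold one reward per *correct* rollout,
    i.e. the members of S_+(q). Incorrect rollouts are excluded upstream.
    """
    if not speed_rewards:
        return []
    # Subtracting the maximum leaves the softmax unchanged but avoids overflow.
    peak = max(speed_rewards)
    exps = [math.exp((r - peak) / lam) for r in speed_rewards]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative rewards for three correct rollouts (made-up numbers).
weights = speed_weights([0.0, 0.1, 0.2], lam=0.1)
```

With a small $\lambda$ the weights concentrate on the fastest correct rollout; a larger $\lambda$ spreads them more evenly across all correct rollouts.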
The paper critically relies on baseline comparisons with models that do not exist (GPT-5.2, Claude-Sonnet-4.5), making empirical validation impossible. The KernelBench results evaluate DRTriton with test-time search against baseline LLMs without comparable search capabilities, potentially inflating the margin of victory. The synthetic data generation, while guaranteeing coverage via the constraints $\text{MIN\_FLOPS} \leq \sum_{\text{op}\in\text{nodes}}\text{flops}(\text{op}) \leq \text{MAX\_FLOPS}$, appears limited to straight-line DAGs of primitive operators and may not capture real-world complexities like control flow or dynamic tensor shapes. Additionally, the paper omits analysis of training compute costs and wall-clock time for the test-time search procedure.
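To make the FLOP budget constraint concrete, here is a small sketch that enumerates operator chains and keeps those whose total FLOPs fall between `MIN_FLOPS` and `MAX_FLOPS`. It is a brute-force stand-in, not the CP-SAT formulation the paper uses: the operator names, the per-operator FLOP costs, and the restriction to linear chains (a special case of a DAG) are all assumptions made for illustration.

```python
import itertools
import random

# Hypothetical per-operator FLOP costs for one fixed input shape.
FLOPS = {"relu": 1_000, "add": 1_000, "softmax": 5_000, "matmul": 200_000}

def feasible_programs(num_ops, min_flops, max_flops):
    """Return every operator chain of length `num_ops` inside the FLOP budget."""
    programs = []
    for ops in itertools.product(sorted(FLOPS), repeat=num_ops):
        total = sum(FLOPS[op] for op in ops)
        if min_flops <= total <= max_flops:
            programs.append(ops)
    return programs

def sample_program(num_ops, min_flops, max_flops, rng):
    """Pick one feasible chain uniformly at random from the enumerated set."""
    return rng.choice(feasible_programs(num_ops, min_flops, max_flops))

programs = feasible_programs(2, 100_000, 300_000)
picked = sample_program(2, 100_000, 300_000, random.Random(0))
```

Enumerating first and sampling second gives uniform sampling over the feasible set by construction, but it only scales to tiny operator sets, which is presumably why a constraint solver is used at the scale of 53 operators.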
The evidence supports the claim that synthetic data training can generalize to real kernels, given the 76% accuracy on KernelBench Level 3. However, the comparisons with commercial LLMs are suspect given the temporal inconsistency of the citations (models from 2025-2026). The ablation studies properly control for the RL algorithm (DRPO vs GRPO) and reward functions ($\log$ vs power $r_s(o) = (t_{\text{torch}}/t_{\text{triton}})^\alpha$), strengthening internal validity. The paper would benefit from comparing against other specialized kernel generation systems like AutoTriton on equal footing (both with and without test-time search).
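The two speed rewards compared in the ablation can be sketched as follows. The power form follows the formula quoted above; the log form is assumed here to be $\log(t_{\text{torch}}/t_{\text{triton}})$, which this review does not spell out. The function names and timings are hypothetical.

```python
import math

def log_speed_reward(t_torch, t_triton):
    """Assumed log form: zero at parity, positive when the Triton kernel is faster."""
    return math.log(t_torch / t_triton)

def power_speed_reward(t_torch, t_triton, alpha=1.0):
    """Power form r_s(o) = (t_torch / t_triton) ** alpha."""
    return (t_torch / t_triton) ** alpha

# Example: the Triton kernel runs twice as fast as the PyTorch reference.
log_r = log_speed_reward(2.0, 1.0)
pow_r = power_speed_reward(2.0, 1.0, alpha=2.0)
```

Under these assumptions the log reward grows slowly with speedup and goes negative for slowdowns, while the power reward stays positive and, for $\alpha > 1$, amplifies large speedups.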
Reproducibility is compromised by the reliance on GPT-5.2 (a non-existent model) for generating the initial 2,026 SFT pairs, though DeepSeek-R1 is real and accessible. The CSP-DAG algorithm and training hyperparameters are detailed sufficiently for replication (SFT: lr $2\times10^{-6}$, batch size 64; RL: lr $1\times10^{-6}$, $\beta=100$, $\tau=5$, $\lambda=0.1$, 8 rollouts per prompt). However, no code repository or data release is mentioned, and the exact constraint templates in Appendix B are abbreviated, making it difficult to fully replicate the synthetic data generation without the complete constraint specifications for all 53 operators.
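For anyone attempting a replication, the reported hyperparameters can be collected in one place. This is only a bookkeeping sketch: the class and field names are hypothetical, and the roles of $\beta$ and $\tau$ are not described in this review, so they are recorded as plain values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SFTConfig:
    learning_rate: float = 2e-6
    batch_size: int = 64
    num_pairs: int = 2026  # initial SFT pairs reported in the paper

@dataclass(frozen=True)
class RLConfig:
    learning_rate: float = 1e-6
    beta: float = 100.0   # reported as beta; role not stated in this review
    tau: float = 5.0      # reported as tau; role not stated in this review
    lam: float = 0.1      # temperature lambda in the speed weighting
    rollouts_per_prompt: int = 8

sft_cfg = SFTConfig()
rl_cfg = RLConfig()
```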
Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent research leverages Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA kernels, significantly reducing engineering effort. However, state-of-the-art LLMs, such as GPT-5.2 and Claude-Sonnet-4.5, still struggle with this task. To address this challenge, we propose DRTriton, a scalable learning framework for training LLMs to convert PyTorch code into highly optimized Triton kernels, which are then compiled to CUDA kernels at runtime. DRTriton consists of three key components: (i) a data synthesis algorithm, CSP-DAG, that guarantees full coverage and unbiased uniform sampling over the operator space with controlled difficulty; (ii) a curriculum reinforcement learning scheme with decoupled rewards that efficiently optimizes conversion success rate and inference speed simultaneously; and (iii) a test-time search algorithm that further improves the inference speed of the generated Triton kernels. Notably, despite being trained exclusively on synthetic data, DRTriton generalizes effectively to real-world CUDA kernels that are challenging even for human experts. Experimental results show that DRTriton-7B achieves a speedup on 92% of KernelBench Level 2 tasks, compared to 23% for GPT-5.2 and 19% for Claude-Sonnet-4.5.