Results indicate that FSR-generated CUDA kernels significantly accelerate inference workloads across a wide range of TensorRT layers. Several tasks exhibit particularly high speedups: on the Normalization, Ragged Softmax, and Reduce layers, our method achieves speedups of 38.7$\times$, 17.2$\times$, and 20.6$\times$, respectively, over the hand-written baselines. The Convolution and Loop Iterator layers also show substantial gains, with speedups of 11.6$\times$ and 4.99$\times$, respectively.
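For clarity, the per-layer speedups reported above follow the standard definition: the ratio of the baseline kernel's latency to the generated kernel's latency. The sketch below illustrates that computation; all timings in it are illustrative placeholders, not measurements from this work.

```python
# Sketch of the standard per-layer speedup computation:
# speedup = baseline latency / generated-kernel latency.
# The timings below are illustrative placeholders only.

def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Speedup of a candidate kernel over a hand-written baseline."""
    return baseline_ms / candidate_ms

# Example: a layer whose baseline runs in 4.0 ms and whose generated
# kernel runs in 0.5 ms achieves an 8.0x speedup.
print(speedup(4.0, 0.5))  # -> 8.0
```

In practice, each latency would be the average over many timed kernel launches (e.g. via CUDA events) after warm-up iterations, so that one-off launch overheads do not distort the ratio.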
Samples are available at https://github.com/KernelPilot/KernelPilot-V1-TensorRT-Samples