Aug 2025: KernelPilot V1.0 Beats TensorRT in Many Cases.

Results indicate that FSR-generated CUDA kernels significantly accelerate inference workloads across a wide range of TensorRT layers, and several tasks exhibit notably high speedups. For example, on the Normalization, Ragged Softmax, and Reduce layers, our method achieves speedups of 38.7$\times$, 17.2$\times$, and 20.6$\times$ respectively over the hand-written baseline. The Convolution and Loop Iterator layers also show exceptional gains, at 11.6$\times$ and 4.99$\times$.
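As a rough illustration of what the figures above mean (this sketch is not the project's benchmark harness; all names here are hypothetical), a speedup is the ratio of the baseline's latency to the generated kernel's latency, each taken as the best of several timed runs:

```python
import timeit

def speedup(baseline_fn, candidate_fn, repeats=5, number=100):
    """Return baseline_time / candidate_time, using the best of `repeats` runs
    of `number` calls each to reduce timing noise."""
    t_base = min(timeit.repeat(baseline_fn, repeat=repeats, number=number))
    t_cand = min(timeit.repeat(candidate_fn, repeat=repeats, number=number))
    return t_base / t_cand

# Toy stand-ins for a hand-written baseline and a generated kernel:
baseline = lambda: sum(i * i for i in range(2000))
candidate = lambda: sum(i * i for i in range(200))
print(f"{speedup(baseline, candidate):.1f}x")  # ratio > 1 means the candidate is faster
```

For real GPU kernels one would time with `cudaEvent` pairs (or framework profilers) after warm-up iterations, since host-side wall-clock timing does not account for asynchronous kernel launches.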

Samples can be found at https://github.com/KernelPilot/KernelPilot-V1-TensorRT-Samples
