Results indicate that FSR-generated CUDA kernels significantly accelerate inference workloads across a wide range of TensorRT layers. Several tasks exhibit particularly high speedups: on the Normalization, Ragged Softmax, and Reduce layers, our method achieves speedups of 38.7$\times$, 17.2$\times$, and 20.6$\times$, respectively, over the hand-written baselines. The Convolution and Loop Iterator layers also show substantial gains, with speedups of 11.6$\times$ and 4.99$\times$, respectively.
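For clarity, the per-layer speedups reported above follow the standard definition: the ratio of the baseline kernel's latency to the generated kernel's latency. The sketch below illustrates that computation; all timings in it are illustrative placeholders, not measurements from this work.

```python
# Sketch of the standard per-layer speedup computation:
# speedup = baseline latency / generated-kernel latency.
# The timings below are illustrative placeholders only.

def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Speedup of a candidate kernel over a hand-written baseline."""
    return baseline_ms / candidate_ms

# Example: a layer whose baseline runs in 4.0 ms and whose generated
# kernel runs in 0.5 ms achieves an 8.0x speedup.
print(speedup(4.0, 0.5))  # -> 8.0
```

In practice, each latency would be the average over many timed kernel launches (e.g. via CUDA events) after warm-up iterations, so that one-off launch overheads do not distort the ratio.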
Samples are available at https://github.com/KernelPilot/KernelPilot-V1-TensorRT-Samples