
AMD MI300X Kernels: Should You Build Custom Ones?

  • 🚀 Custom GPU kernels on MI300X provide up to 3x faster inference than PyTorch defaults.
  • 🧠 RMSNorm and SwiGLU optimizations are key to making LLMs run faster on AMD hardware.
  • 🔧 ROCm's Composable Kernels and new tools make kernel development much easier.
  • 💸 Faster inference means fewer GPUs are needed, which cuts AI deployment costs.
  • ⚙️ Performance gets even better when custom kernels work with async scheduling and batching.

Why Kernel Optimization Matters in the MI300X Era

AI inference at scale is about more than building good models; it is about making them run faster, cheaper, and more reliably. AMD's MI300X gives startups and automation platforms like Bot-Engine a serious alternative to NVIDIA, but powerful hardware alone is not enough. If you build language bots or AI content systems that depend on low latency, slow inference hurts your business directly. That is why custom GPU kernels matter, especially on the MI300X, where default framework kernels often leave performance on the table. So, should you build them?


Meet the AMD MI300X: Specs and Use Case Fit

The AMD MI300X is a data center accelerator built for inference-heavy AI workloads, especially large language models (LLMs). It is AMD's most capable accelerator to date and competes directly with NVIDIA's top AI products.

Here’s what makes it different:

  • 🔹 192GB of Unified HBM3 Memory: Far more capacity than competing parts, so large models can stay fully resident in memory without the expensive sharding or offloading that adds latency.
  • 🔹 5.3 TB/s of Memory Bandwidth: Enough throughput to keep compute-hungry AI workloads fed with data instead of stalling on memory.
  • 🔹 ROCm Software Support: ROCm (Radeon Open Compute) is open source, unlike proprietary stacks, and it continues to mature and become easier for developers to work with.

How It Fits

These specs make the MI300X a good choice for:

  • Chatbots & Conversational Agents: Keeping large transformer models fully resident in memory is critical for fast responses.
  • Multimodal AI Systems: For example, combining video or image input with LLMs to generate text; these pipelines need to move a lot of data smoothly.
  • Heavy Inference Workloads: Serving hundreds or thousands of queries per second, which is exactly the profile of SaaS and cloud automation platforms.

It is not just about power. The MI300X runs bigger models faster and cheaper.


GPU Kernel Performance 101: vLLM, PyTorch, and the Optimization Gap

Most AI inference systems today use frameworks like:

  • PyTorch
  • vLLM (a high-throughput LLM serving engine)
  • ONNX

These platforms ship with general-purpose GPU kernels: default routines that tell the GPU how to carry out operations such as matrix multiplication, normalization, and activation functions.

The Problem

These kernels are well tuned for established stacks like CUDA (from NVIDIA), but they often underperform on newer platforms like AMD ROCm, and especially on the MI300X. This is because:

  • 🧱 The default kernels are not yet tuned for AMD's GPU architecture.
  • 💡 Optimizations written for NVIDIA cards often don't carry over to the MI300X.

For example, early tests show that stock PyTorch or vLLM inference on the MI300X runs well below the hardware's peak, often using only a fraction of the available memory bandwidth or compute.

The Performance Gap

This is the performance gap: your AMD hardware can do more, but your code isn't letting it. The answer? Switch from general GPU kernels to custom ones.


What’s a Custom GPU Kernel — and Should You Care?

What is a Custom Kernel?

A custom GPU kernel is a purpose-built piece of device code for a specific compute task, tailored to a particular workload, model architecture, or piece of hardware. Instead of relying on generic defaults, you write code that controls memory access, data layout, and thread execution to extract the most from the MI300X.

Why It Matters to You

You might want to think about custom kernels if:

  • 🕒 Response Time Matters: Chatbots, automation agents, and search systems need quick responses; faster replies make users happier.
  • 💰 GPU Utilization Matters: Faster inference means fewer GPUs for the same load, which lowers cloud costs, especially at scale.
  • ⚙️ Default Kernels Slow You Down: If moving to AMD cut your throughput, the unoptimized default kernels are usually the culprit, not the hardware.

Even optimizing a single layer type (like RMSNorm) can deliver a 2–3x speedup on that operation, which alone can justify the effort.


Kernel Types That Show Real Speedups on MI300X

Not all parts of an LLM system get the same benefit from custom work. But research and tests have found several key kernels where optimization gives big results:

📌 RMSNorm Kernels

RMSNorm (Root Mean Square Normalization) is a lighter-weight variant of Layer Normalization used throughout modern transformers. It is memory-bandwidth-bound, and the default kernels often leave it as a bottleneck, which makes a fused custom kernel (sketched after the numbers below) pay off quickly.

  • PyTorch Performance: ~170ms on MI300X
  • Custom ROCm Kernel: ~72ms
  • Gain: ~2.3x
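For illustration, here is a minimal sketch of what a fused RMSNorm kernel can look like in Triton (which now has a ROCm backend in addition to CUDA). The block sizing, shapes, and wrapper below are illustrative assumptions, not the exact kernels behind the figures above.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK: tl.constexpr):
    row = tl.program_id(0)                       # one program instance per row
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    # single pass over the row: accumulate the sum of squares, scale, apply weight
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    y = x / rms * w
    tl.store(out_ptr + row * n_cols + cols, y.to(out_ptr.dtype.element_ty), mask=mask)

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    """x is expected to be a contiguous 2D (rows, hidden) tensor."""
    out = torch.empty_like(x)
    n_rows, n_cols = x.shape
    BLOCK = triton.next_power_of_2(n_cols)       # whole row handled by one block
    rmsnorm_kernel[(n_rows,)](x, weight, out, n_cols, eps, BLOCK=BLOCK)
    return out
```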

📌 SwiGLU MLP Kernels

SwiGLU (a gated activation built on SiLU/Swish, used in the MLP layers of many LLMs) can be sped up by fusing its operations into fewer kernels, as sketched after the numbers below.

  • PyTorch Default: ~520ms
  • Hand-Tuned Kernel: ~171ms
  • Gain: ~3x
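To make the idea of fusion concrete, here is a hedged sketch (in Triton, same assumptions as above) of the elementwise half of SwiGLU: the SiLU activation and the gating multiply happen in one pass over memory instead of several separate PyTorch ops. Hand-tuned kernels go further and fuse the surrounding GEMMs as well; that part is omitted here.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def swiglu_kernel(gate_ptr, up_ptr, out_ptr, n_elem, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elem
    g = tl.load(gate_ptr + offs, mask=mask, other=0.0).to(tl.float32)
    u = tl.load(up_ptr + offs, mask=mask, other=0.0).to(tl.float32)
    y = g * tl.sigmoid(g) * u                    # SiLU(gate) * up, fused in one kernel
    tl.store(out_ptr + offs, y.to(out_ptr.dtype.element_ty), mask=mask)

def swiglu(gate: torch.Tensor, up: torch.Tensor):
    out = torch.empty_like(gate)
    n = gate.numel()
    grid = (triton.cdiv(n, 1024),)
    swiglu_kernel[grid](gate, up, out, n, BLOCK=1024)
    return out
```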

📌 Skinny GEMMs

"Skinny" General Matrix Multiplication kernels are tuned for the case where one matrix dimension is much smaller than the others, the shapes that appear throughout the attention and projection layers during autoregressive decoding.

Tuned skinny GEMMs turn the MI300X's large memory bandwidth into immediate throughput, since these operations are dominated by streaming weights rather than by raw compute.
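To make "skinny" concrete, here is a tiny PyTorch snippet; the dimensions are assumptions chosen for illustration, not benchmark settings. During decode the activation matrix has only a handful of rows, so the matmul is dominated by streaming the large weight matrix out of HBM3, which is exactly where the MI300X's bandwidth helps.

```python
import torch

hidden, batch = 8192, 8   # wide weight matrix, tiny activation batch (decode-time shape)
w = torch.randn(hidden, hidden, dtype=torch.float16, device="cuda")  # "cuda" maps to HIP on ROCm
x = torch.randn(batch, hidden, dtype=torch.float16, device="cuda")

# This "skinny" GEMM moves ~128 MiB of weights but does comparatively little math,
# so a kernel tuned for the shape is limited by memory bandwidth, not compute.
y = x @ w
```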

Other Areas Worth Optimizing:

  • Embedding lookup and projection layers
  • Positional encodings
  • Layer fusions (combining multiple operations into one kernel)

How ROCm Supports Custom Kernels

Early ROCm releases made kernel development painful, but the situation has improved dramatically. Here's what you can use now:

🔧 Composable Kernels (CK)

  • A template library of reusable building blocks for defining, benchmarking, and tuning low-level kernels.
  • It lets you control data layouts, tiling, thread-dispatch logic, and more.

🧮 hipBLAS and rocWMMA

  • Fast matrix math basic functions.
  • Needed for speeding up GEMM and MLP operations.

⚡ Tensile + YAML Auto-Tuners

  • These let you set up, create automatically, and test kernels using YAML files.
  • They are very important for finding the best tiling or scheduling plans based on MI300X's design features.

🤖 MIOpen

  • AMD's open-source version of cuDNN.
  • It has more support for convolution operations, which are often used in models that mix vision and text.

With these tools, it’s easier than ever to start. This is especially true when you use them with guides and performance logs from early users like Bot-Engine.


Walking Through a Kernel Build: Zero to Fast Inference

If you're ready to start, here’s a step-by-step guide:

1. Find the Slow Parts

  • Profile with tools such as rocprof (the ROCm profiler) for GPU kernel timings, plus perf or VTune for CPU-side hotspots; the PyTorch profiler also works on ROCm (see the sketch below).
  • Identify the layers, such as normalization, MLP blocks, or attention heads, that dominate latency.
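As one hedged example of this step, the built-in PyTorch profiler reports GPU kernel times on ROCm builds as well; the toy model and input below are placeholders for your own forward pass.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model: a small MLP block, just to have some GPU kernels to profile.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 11008), torch.nn.SiLU(), torch.nn.Linear(11008, 4096)
).half().to("cuda")                               # "cuda" is the HIP device on ROCm builds
x = torch.randn(32, 4096, dtype=torch.float16, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model(x)

# Sort by GPU time to see which kernels (norms, MLP blocks, attention) dominate.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```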

2. Write or Change the Kernel

  • Start with CK templates, or write the kernel directly in HIP C++ (the successor to ROCm's deprecated HCC toolchain).
  • Focus on maximizing shared-memory (LDS) reuse and minimizing uncoalesced global-memory traffic.

3. Adjust and Test

  • Benchmark it against the default PyTorch or vLLM operation (a minimal timing harness is sketched below).
  • Use tools like CK's performance benchmarking utilities to confirm real-world numbers.
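A minimal timing harness along those lines, assuming the rmsnorm() wrapper sketched earlier and using the unfused eager-mode composition as the "default" baseline; torch.cuda events work on ROCm builds too, where they are backed by HIP events.

```python
import torch

def bench(fn, *args, iters=100):
    for _ in range(10):                           # warm-up launches
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters        # milliseconds per call

x = torch.randn(4096, 8192, dtype=torch.float16, device="cuda")
w = torch.ones(8192, dtype=torch.float16, device="cuda")

# Unfused reference: several separate kernels (square, mean, rsqrt, two multiplies).
default_rmsnorm = lambda x, w: x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6) * w

print("default:", bench(default_rmsnorm, x, w), "ms  |  custom:", bench(rmsnorm, x, w), "ms")
```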

4. Put it in Your System

  • Swap in the optimized operation via a PyTorch extension, or register it in vLLM's custom-op registry.
  • Verify that the output matches the reference implementation before you deploy it (see the sketch below).
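A hedged sketch of this last step: check numerical equivalence first, then wrap the custom op in a drop-in module. The PatchedRMSNorm class and tolerances below are hypothetical; real models (and vLLM) each have their own layer classes to swap out.

```python
import torch

def reference_rmsnorm(x, w, eps=1e-6):
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * w

x = torch.randn(64, 8192, dtype=torch.float16, device="cuda")
w = torch.randn(8192, dtype=torch.float16, device="cuda")

# 1. Numerical check: the custom kernel accumulates in fp32, so allow fp16-level tolerance.
assert torch.allclose(rmsnorm(x, w), reference_rmsnorm(x, w), atol=1e-2, rtol=1e-2)

# 2. Drop-in replacement module that routes forward() through the custom kernel.
class PatchedRMSNorm(torch.nn.Module):
    def __init__(self, weight: torch.Tensor, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(weight)
        self.eps = eps

    def forward(self, x):
        return rmsnorm(x, self.weight, self.eps)
```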

Pro Tips:

  • Align HBM3 accesses to 128-byte boundaries to reduce uncoalesced traffic.
  • Avoid wavefront divergence; branch-free kernels usually perform best.
  • Size work-groups around the MI300X's 64-wide wavefronts (CDNA 3 uses wave64).

Performance Trade-Offs: Generality vs. Specialization

Going "custom" costs something. Here’s how they compare:

| Feature | Default Kernel | Custom Kernel |
| --- | --- | --- |
| Portability | ✅ High | ❌ Low |
| Speed | ❌ Low | ✅ High |
| Maintenance Effort | ✅ Low | ❌ High |
| Hardware Specificity | ❌ Generic | ✅ Optimized for MI300X |

What you use it for matters. If you run one model for thousands of users daily, a 30% gain quickly adds up. But for small systems, it might not be worth the extra development work.


Should You Build Custom Kernels at Bot-Scale?

If you are using many AI agents, the benefits compared to the cost are very good:

✅ Yes, If:

  • You run 20–100 bot instances and must meet real-time service agreements.
  • You want to cut the cost for each inference by getting rid of GPUs you don't need.
  • You plan to grow your system in AMD-focused data centers or with mixed hardware.

Custom kernels are not just faster. They make whole systems work better.


Custom vs. Prepackaged: When Off-The-Shelf Is “Good Enough”

Not every situation needs custom work. Use the default options when:

  • You are building early versions of products or testing new features.
  • Your service-level agreement tolerates response times approaching a second.
  • You don't have experts for low-level GPU work in your team.

But in live systems where every millisecond costs, off-the-shelf options can quickly slow things down.


Tooling Support: Helping You Build Faster Kernels

Here's a list of tools that flatten the kernel-development learning curve:

  • Composable Kernels (CK): The core toolkit for kernel authors targeting AMD hardware.
  • MIGraphX: AMD's graph compiler for optimizing ML graphs end to end.
  • Triton: Originally CUDA-only, now with a ROCm backend; great for writing kernels from higher-level building blocks.
  • rocprof / VTune / perf: Profilers for finding hotspots and checking hardware utilization.

Most of these tools come with sample kernels to help you start. Community help and documentation have gotten a lot better in the last few months.


Bottlenecks Beyond Kernels: Optimizing Full Pipelines

Think about more than just the kernel. Sometimes, getting faster results comes from the layers above:

🧰 Key System Tools:

  • Data Batching: Collect and group inference queries efficiently.
  • Async Scheduling: Overlap model loading, execution, and result return.
  • Operator Fusion: Combine operations such as activation + matmul into fewer kernel launches.
  • ONNX Pruning: Simplify model graphs by removing redundant and unused blocks.
  • Efficient Host ↔ Device Transfers: Use pinned memory and streams to move data (sketched below).

These pipeline-level layers multiply kernel gains and help you extract more from the MI300X hardware.
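As a small illustration of the host ↔ device point above, pinned (page-locked) host memory plus non_blocking copies on a separate stream let the runtime overlap data movement with compute; the shapes and the omitted model call are placeholders.

```python
import torch

stream = torch.cuda.Stream()                     # maps to a HIP stream on ROCm builds
batch_cpu = torch.randn(32, 4096).pin_memory()   # page-locked host buffer enables async copies

with torch.cuda.stream(stream):
    # The copy returns immediately because the source is pinned; it overlaps with
    # whatever work is already queued on other streams.
    batch_gpu = batch_cpu.to("cuda", non_blocking=True)
    # ... enqueue the model's forward pass for this batch on the same stream here ...

torch.cuda.current_stream().wait_stream(stream)  # re-synchronize before consuming results
```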


Should You Build Custom Kernels on MI300X?

Say "Yes" If:

  • You are serving large volumes of live requests.
  • You are not satisfied with PyTorch's or vLLM's out-of-the-box performance on AMD GPUs.
  • You are building latency-critical AI systems where every millisecond counts.

Say "No" If:

  • You are experimenting or just checking if an idea works.
  • You don't have the people or time to keep up low-level code.
  • Your CPU is the main limit, not your GPU.

Middle ground: Start with RMSNorm or SwiGLU. They give some of the biggest gains for the least amount of development work.


MI300X and the Future of Inference at Scale

The MI300X may be AMD's best chance yet to claim a major share of AI inference. It is priced competitively, leads on memory capacity and bandwidth, and is finally backed by a maturing software stack. Generic framework kernels still lag their CUDA counterparts, but the ROCm ecosystem now provides everything needed to unlock the hardware's full performance, especially through custom GPU kernels.

Startups and automation providers like Bot-Engine show it can be done: 2–3x speedups on individual operations, fewer GPUs for the same workload, and a clear path to more efficient scaling.

MI300X is here. The chance to get ahead is still here. Kernel optimization is how you take it.


Citations

Advanced Micro Devices. (2023). MI300X Data Center GPU Overview. https://www.amd.com/en/products/server-accelerators/mi300x

AMD Developer Central. (2024). ROCm Performance Kernel Benchmarks on MI300X. https://rocm.docs.amd.com/projects/benchmark-suite

Bot Engine Labs. (2024). 3x Inference Speed on MI300X via RMSNorm & MLP Custom Kernels. https://www.botengine.com/mi300x-speedup-case-study
