A few representative optimizations. Every figure below was measured on real hardware against real baselines — vendor libraries, stock runtimes, untuned configs. No projections, no marketing math.
An LLM inference pipeline for live translation was slow to start responding. On an RTX 5090, we tuned the runtime configuration — batch / ubatch — to improve how input context is processed, so the translation stream begins sooner.
The win came purely from more efficient GPU utilization: no change to the model, weights, quantization or business logic, and therefore zero quality trade-off.
Any latency-sensitive streaming pipeline — translation, voice, chat — where time-to-first-token shapes the user experience. The MoE gain was especially large.
MoE inference isn't one big matrix multiply — it's many small ones across selected expert blocks, repeated at every token step. We hand-wrote a CUDA kernel for the shape M=128, N=256, K=12288 (split-K, deep pipelining).
It was benchmarked against the full vendor stack: torch.matmul, cuBLAS TN/NN and cuBLASLt heuristic/autotune variants. It beat the fastest of them — cuBLASLt autotune TN.
MoE serving, where this small GEMM repeats across every token and expert path, so a per-call gain compounds across a full generation.
In very large models, a big share of compute is heavy dense matrix multiplication. We built a custom kernel for the large case M=N=K=16384 (F32/F16 mixed) and measured it against NVIDIA's fastest vendor baseline for that shape.
It came out ~5% faster than NVIDIA's best result. On a heavy GEMM, even a few percent shows up fast: shorter critical-path latency, higher throughput, and lower GPU-hours at scale.
This is the lowest level of optimization — SASS/NCU bottleneck analysis, shared memory, registers, instruction-level effects, and dozens of validated variants. Percentages, not multiples — but at production scale they compound into real GPU-hour savings.
All figures are measured, not modelled — captured on the stated hardware against named baselines. Client identities and code are withheld under NDA. Kernel-level wins are deliberately reported as percentages, not multiples: a 5–10% gain on one operation doesn't make the whole model 5–10% faster. The point is where these operations sit — on the per-token critical path, repeated across layers, batches and sessions — so small percentages accumulate into meaningful savings in latency, throughput and GPU cost at production scale.
Start with a scope call. If there's measurable room to improve, we'll find it — and quote it.
Book an audit →