Selected Work — TensorTune

01 · Runtime tuning · Streaming LLM

Faster time-to-first-token for streaming translation

An LLM inference pipeline for live translation was slow to start responding. On an RTX 5090, we tuned the runtime configuration, specifically the batch and ubatch sizes, to speed up prefill (the processing of the input context), so the translation stream begins sooner.

The win came purely from more efficient GPU utilization: no change to the model, weights, quantization or business logic, and therefore zero quality trade-off.
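As a rough illustration of this kind of tuning, the sketch below sweeps batch / ubatch pairs and keeps the one with the lowest time-to-first-token. It is a minimal harness under stated assumptions, not the tooling used on the project: `measure_ttft` is a hypothetical callable you would supply to run one prefill on your own runtime and return seconds, and the rule that ubatch may not exceed batch is an assumption about how the two settings relate.

```python
import itertools
from typing import Callable, Dict, Iterable, Tuple

Config = Tuple[int, int]  # (batch, ubatch)


def sweep_configs(
    measure_ttft: Callable[[int, int], float],
    batches: Iterable[int],
    ubatches: Iterable[int],
) -> Tuple[Config, Dict[Config, float]]:
    """Measure every valid (batch, ubatch) pair and return the fastest one.

    measure_ttft is a hypothetical hook: it should run one prefill with the
    given settings and return the time-to-first-token in seconds.
    """
    results: Dict[Config, float] = {}
    for batch, ubatch in itertools.product(batches, ubatches):
        if ubatch > batch:
            # Assumption: the micro-batch cannot be larger than the batch.
            continue
        results[(batch, ubatch)] = measure_ttft(batch, ubatch)
    if not results:
        raise ValueError("no valid (batch, ubatch) pair to measure")
    best = min(results, key=results.get)
    return best, results


if __name__ == "__main__":
    # Stand-in cost model so the sketch runs without a GPU; replace it with a
    # real measurement (ideally the median of several runs per config).
    def fake_ttft(batch: int, ubatch: int) -> float:
        return 1.0 / ubatch + 0.0001 * batch

    best, results = sweep_configs(fake_ttft, [512, 2048], [128, 512, 1024])
    for config, seconds in sorted(results.items()):
        print(config, round(seconds, 4))
    print("best:", best)
```

Because only runtime settings change between runs, the model output should be identical across configs; comparing outputs for a fixed prompt is a cheap way to confirm that.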

Where it applies

Any latency-sensitive streaming pipeline (translation, voice, chat) where time-to-first-token shapes the user experience. The gain was largest on the MoE model: a 44% prefill speedup, against 8.7% on the dense model.

RTX 5090 · runtime config tuning · prefill / TTFT profiling · Qwen3-32B, Qwen3-30B-A3B MoE

44% prefill speedup · Qwen3-30B-A3B MoE

19% overall latency reduction (MoE)

8.7% prefill speedup · Qwen3-32B dense (~3.5% overall)

02 · Custom CUDA kernel · MoE

Beating vendor libraries on a small MoE GEMM (H100)

MoE inference isn't one big matrix multiply — it's many small ones across selected expert blocks, repeated at every token step. We hand-wrote a CUDA kernel for the shape M=128, N=256, K=12288 (split-K, deep pipelining).

It was benchmarked against the full vendor stack: torch.matmul, cuBLAS TN/NN and cuBLASLt heuristic/autotune variants. It beat the fastest of them — cuBLASLt autotune TN.
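For readers unfamiliar with split-K, the idea can be sketched in plain Python: cut the K dimension into equal slices, compute one partial product per slice, and sum the partials. This only illustrates the arithmetic and is not the kernel itself. On a GPU the slices would run concurrently on different thread blocks; here they run one after another. Reading the "split-K32" tag as 32 slices is an assumption; under that reading, K=12288 gives 384 elements per slice. All function names are illustrative.

```python
from typing import List

Matrix = List[List[float]]


def matmul_k_slice(a: Matrix, b: Matrix, k_start: int, k_end: int) -> Matrix:
    """Partial product of a (M x K) and b (K x N) over k in [k_start, k_end)."""
    m, n = len(a), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for k in range(k_start, k_end):
                acc += a[i][k] * b[k][j]
            out[i][j] = acc
    return out


def split_k_matmul(a: Matrix, b: Matrix, splits: int) -> Matrix:
    """Compute a @ b by summing `splits` partial products along K."""
    k = len(b)
    if splits <= 0 or k % splits != 0:
        raise ValueError("K must divide evenly into the requested splits")
    chunk = k // splits
    m, n = len(a), len(b[0])
    result = [[0.0] * n for _ in range(m)]
    for s in range(splits):
        # Each slice is independent work; a GPU kernel would run these in
        # parallel and reduce the partial sums afterwards.
        partial = matmul_k_slice(a, b, s * chunk, (s + 1) * chunk)
        for i in range(m):
            for j in range(n):
                result[i][j] += partial[i][j]
    return result


if __name__ == "__main__":
    # Tiny demo shape (4 x 64 times 64 x 3) so pure Python stays fast.
    a = [[float((i + k) % 5) for k in range(64)] for i in range(4)]
    b = [[float((k * j) % 7) for j in range(3)] for k in range(64)]
    assert split_k_matmul(a, b, 8) == matmul_k_slice(a, b, 0, 64)
    print("slice length for K=12288 with 32 splits:", 12288 // 32)
```

The trade-off is that splitting a long K exposes more parallel work when the output (here 128 x 256) is small, at the cost of an extra reduction step over the partial sums.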

Where it applies

MoE serving, where this small GEMM repeats across every token and expert path, so a per-call gain compounds across a full generation.

H100 · split-K32 · pipe6 · vs cuBLAS / cuBLASLt baselines

~10% faster than the fastest vendor solution for this shape

Relative kernel time (lower = faster):
cuBLASLt autotune (best vendor): 100%
Our kernel: ~90%

03 · Custom CUDA kernel · Large models

Beating NVIDIA's best baseline on a large GEMM (H100)

In very large models, a big share of compute is heavy dense matrix multiplication. We built a custom kernel for the large case M=N=K=16384 (F32/F16 mixed) and measured it against NVIDIA's fastest vendor baseline for that shape.

It came out ~5% faster than NVIDIA's best result. On a heavy GEMM, even a few percent shows up fast: shorter critical-path latency, higher throughput, and lower GPU-hours at scale.
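To put the shape in perspective, the helpers below count the floating-point operations in one GEMM using the conventional 2·M·N·K estimate (one multiply and one add per term) and convert a kernel time into TFLOP/s and seconds saved. This is a back-of-the-envelope sketch: the function names are illustrative, and any timing passed in is a placeholder to be replaced with your own measurement, not a figure from this project.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Conventional FLOP count for C = A @ B: one multiply + one add per term."""
    return 2 * m * n * k


def achieved_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in TFLOP/s for one GEMM that took `seconds` to run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return gemm_flops(m, n, k) / seconds / 1e12


def seconds_saved(baseline_seconds: float, relative_time: float) -> float:
    """Time saved per call when the new kernel takes `relative_time` of baseline."""
    return baseline_seconds * (1.0 - relative_time)


if __name__ == "__main__":
    # M = N = K = 16384 is the shape from the text: 2 * 16384**3 = 2**43 FLOPs.
    print("FLOPs per call:", gemm_flops(16384, 16384, 16384))
    # A relative kernel time of 0.95 saves 5% of whatever the baseline takes.
    baseline = 1.0  # placeholder unit of time, not a measured value
    print("saved per call:", round(seconds_saved(baseline, 0.95), 6))
```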

The level of effort

This is the lowest level of optimization — SASS/NCU bottleneck analysis, shared memory, registers, instruction-level effects, and dozens of validated variants. Percentages, not multiples — but at production scale they compound into real GPU-hour savings.

H100 · F32F16F16F32 · SASS / Nsight Compute · vs cuBLAS / cuBLASLt

~5% faster than NVIDIA's fastest vendor baseline for this shape

Relative kernel time (lower = faster):
NVIDIA best baseline: 100%
Our kernel: ~95%

About these numbers

All figures are measured, not modelled — captured on the stated hardware against named baselines. Client identities and code are withheld under NDA. Kernel-level wins are deliberately reported as percentages, not multiples: a 5–10% gain on one operation doesn't make the whole model 5–10% faster. The point is where these operations sit — on the per-token critical path, repeated across layers, batches and sessions — so small percentages accumulate into meaningful savings in latency, throughput and GPU cost at production scale.
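The relationship between a per-operation gain and the end-to-end gain is Amdahl's law, and a short calculation makes it concrete. In the sketch below, `op_share` is the fraction of total run time spent in the optimized operation; the 30% used in the demo is a made-up value for illustration, not a measurement from these projects.

```python
def overall_time_reduction(op_share: float, op_time_reduction: float) -> float:
    """Fraction of end-to-end time saved when one operation gets faster.

    op_share: fraction of total time spent in the operation (0..1).
    op_time_reduction: fraction by which that operation's time shrinks (0..1).
    """
    if not 0.0 <= op_share <= 1.0:
        raise ValueError("op_share must be between 0 and 1")
    if not 0.0 <= op_time_reduction < 1.0:
        raise ValueError("op_time_reduction must be in [0, 1)")
    return op_share * op_time_reduction


def overall_speedup(op_share: float, op_time_reduction: float) -> float:
    """End-to-end speedup factor (Amdahl's law) for the same inputs."""
    return 1.0 / (1.0 - overall_time_reduction(op_share, op_time_reduction))


if __name__ == "__main__":
    # Hypothetical: the GEMM is 30% of run time and its kernel time drops by 10%.
    saved = overall_time_reduction(0.30, 0.10)
    print(f"end-to-end time saved: {saved:.1%}")
    print(f"end-to-end speedup: {overall_speedup(0.30, 0.10):.3f}x")
```

The larger the share of time the operation occupies on the critical path, the closer the end-to-end gain gets to the kernel-level gain.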

The work, in numbers.

Curious what's hiding
in your inference path?
