TensorTune — AI Inference Performance Engineering

Capabilities

What we actually do

Scoped, measurable engagements. Every recommendation is backed by profiler data, not vibes.

Inference Runtime Optimization

We benchmark and tune the path that serves your model, across llama.cpp/GGUF, ONNX Runtime and TensorRT-LLM, on CUDA, Vulkan and Metal backends. We measure quantization trade-offs (Q4/Q5/Q8/FP16), prefill vs decode cost, KV-cache pressure, context scaling and CPU/GPU offload.

Dense · MoE · multimodal · LLM/VLM/ASR/TTS/CV/forecasting

Low-Latency Multi-User Streaming Server

A serving layer where one warm model handles many client sessions with streamed output. Continuous/in-flight batching, KV-cache slot management, admission control, cancellation — built to hold stable TTFT and inter-token latency under load.

OpenAI-compatible API · SSE/WebSocket · p50/p95/p99
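
To make the latency targets above concrete: TTFT (time to first token) and inter-token latency can both be derived from per-token arrival timestamps recorded on the client side of a stream. The sketch below is a minimal illustration, not our production tooling. The function names, the `(request_sent, token_times)` session shape and the nearest-rank percentile method are all assumptions made for the example.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile of a non-empty list; q is in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def stream_latencies(request_sent, token_times):
    """Return (ttft, inter_token_gaps) for one streamed response.

    request_sent: timestamp (seconds) when the request was sent.
    token_times:  timestamps (seconds) when each streamed token arrived.
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_sent
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, gaps


def summarize(sessions):
    """Aggregate p50/p95/p99 for TTFT and inter-token latency over many sessions.

    sessions: iterable of (request_sent, token_times) pairs (hypothetical shape).
    """
    ttfts, gaps = [], []
    for request_sent, token_times in sessions:
        ttft, session_gaps = stream_latencies(request_sent, token_times)
        ttfts.append(ttft)
        gaps.extend(session_gaps)
    report = {"ttft": {f"p{q}": percentile(ttfts, q) for q in (50, 95, 99)}}
    if gaps:  # single-token responses contribute no inter-token gaps
        report["inter_token"] = {f"p{q}": percentile(gaps, q) for q in (50, 95, 99)}
    return report


if __name__ == "__main__":
    # Two made-up sessions, timestamps in seconds.
    demo = [(0.0, [0.5, 0.75, 1.0]), (10.0, [10.25, 10.5, 11.0])]
    print(summarize(demo))
```

In a real load test the timestamps would come from a monotonic clock such as `time.perf_counter()` wrapped around the SSE or WebSocket read loop, and the percentiles would be tracked per concurrency level rather than as one global figure.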

Profiling & Bottleneck Analysis

Find out where your inference time and hardware budget actually go. We instrument workloads with Nsight, runtime telemetry and load tests to pinpoint CPU, GPU, memory, scheduler, KV-cache and kernel bottlenecks, all before touching a line of code.

Nsight Systems · Nsight Compute · torch.profiler · load tests

Fine-Tuning & Training Optimization

When runtime tuning isn't enough, we adapt the model and optimize the training pipeline for memory, speed and cost. We assess your data and labels, define eval sets and acceptance metrics, and apply parameter-efficient methods where they fit.

LoRA / QLoRA · mixed precision · eval & acceptance metrics

Deployment Strategy & Hardware Planning

Recommendations based on measured workload behavior, not hardware assumptions. Single- and multi-GPU experiments, VRAM capacity planning, and platform-specific optimization for NVIDIA, Vulkan-capable devices and Apple silicon — run on your infrastructure, or on ours when an independent proof-of-concept is faster.

Client hardware or our infra · single/multi-GPU · on-prem · cloud
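
As a rough illustration of the VRAM capacity planning mentioned above, the KV cache of a transformer decoder can be estimated from the model shape before any measurement is taken. The sketch below uses the standard back-of-envelope formula (keys plus values, per layer, per KV head, per token). The model shape, VRAM budget and overhead figure are made-up assumptions, and the estimate ignores paging, prefix sharing and KV-cache quantization, so measured numbers should always take precedence.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache bytes for one token: keys and values for every layer and KV head.

    bytes_per_elem=2 assumes an FP16 (or BF16) cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, sessions,
                   bytes_per_elem=2):
    """Worst-case KV cache when every session fills its whole context window."""
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem)
    return per_token * context_len * sessions


def max_sessions(vram_bytes, weight_bytes, overhead_bytes,
                 n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Full-context sessions that fit in the VRAM left after weights and overhead."""
    per_session = kv_cache_bytes(n_layers, n_kv_heads, head_dim,
                                 context_len, 1, bytes_per_elem)
    free = vram_bytes - weight_bytes - overhead_bytes
    return max(0, free // per_session)


if __name__ == "__main__":
    # Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, FP16 cache.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    total = kv_cache_bytes(context_len=8192, sessions=16, **shape)
    print(f"KV cache for 16 sessions at 8192 tokens: {total / GIB:.1f} GiB")
    # Hypothetical budget: 24 GiB card, 8 GiB of weights, 1 GiB runtime overhead.
    n = max_sessions(24 * GIB, 8 * GIB, 1 * GIB, context_len=8192, **shape)
    print(f"Full-context sessions that fit: {n}")
```

An estimate like this only sets the starting point for an experiment. Activation memory, allocator fragmentation and runtime workspace vary by backend, which is why the final recommendation comes from measured workload behavior.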

Fixed-scope engagement

The Inference Performance Audit

Many teams are overpaying for GPUs or shipping latency they could halve; they just can't see where. Start here: a fixed-price, fixed-timeline audit that finds it.

End-to-end profiling of your live inference pipeline
Ranked bottlenecks with measured latency / throughput / VRAM
Projected cost & latency savings per fix
A prioritised roadmap your team can execute, or that we execute for you

Duration: 1–2 weeks
Deliverable: Report + roadmap
Hardware: Yours or ours
NDA: Standard
Price: Fixed quote

Request a quote → · See how it works ↗

Toolbox

The stack we work in

We pick tools by what the profiler says, not by fashion.

Runtimes & GPU

C++20 · CUDA · inline PTX · llama.cpp / GGUF · ONNX Runtime · TensorRT-LLM · Vulkan · Metal (Apple) · Triton kernels · Kernel fusion · Mixed precision · Quantization Q4/Q5/Q8

Serving, Models & Training

vLLM · Continuous batching · KV-cache slots · OpenAI-compatible API · SSE / WebSocket · LLM · VLM · ASR · TTS · CV · forecasting · LoRA / QLoRA · Docker · K8s · Load testing

Who you work with

Senior engineers, kernel-deep.

You talk to the people writing the kernels — not an account manager. No juniors, no layers.

GPU & Inference

C++ · CUDA · Vulkan · TensorRT

Low-latency inference and rendering specialists. Production speech-to-speech and LLM pipelines (streaming ASR → vLLM → TTS), GEMM/kernel optimization on H100-class hardware, mixed precision and inline-PTX work — driven end-to-end by profiler data.

Multimedia & Vision

CV · ASR/TTS · Streaming

Low-level multimedia and computer-vision engineering with strong optimization-theory foundations. Realtime translation and TTS services, hardware-accelerated WebRTC streaming (NVENC/NVDEC), and ONNX-based detection tuned for production latency budgets.

Start a conversation

Tell us where it hurts: cost, latency, or both.

Describe your model, hardware target and the metric that's keeping you up at night. We'll come back with a scope and a fixed quote — usually within a couple of days.