CloudTune Build Log
CloudTune: Distributed GPU Broker Playbook
When I started CloudTune, I did not want another model-training dashboard. I wanted the missing piece of the LLM stack: a one-click GPU broker that can fine-tune, serve, and attest runs across AWS, GCP, and Azure with the rigor that regulated industries expect. The hard part was never GPUs themselves; it was the volatility of spot markets that behave more like high-frequency trading venues than cloud APIs.
CloudTune is engineered for survival under volatility. Everything in the platform, from Terraform modules to LoRA checkpoints, assumes GPUs can vanish and budgets can spike mid-run. That forced me to design the system like a resilient market infrastructure, not a single-cloud script.
Terraform modules that respect reality
Most IaC GPU templates assume capacity is stable. CloudTune ships three Terraform modules (AWS, GCP, Azure) that expose a GPU bidder with safeguards (the fallback logic is sketched in plain code after this list):
- Attempt spot capacity, then fall back to on-demand when supply collapses.
- Maintain rolling checkpoints so preemptions cost minutes, not hours.
- Record provenance for each run so receipts stay audit-ready.
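The modules themselves are HCL, but the fallback policy reads more clearly as plain code. Below is a minimal Python sketch of the spot-then-on-demand decision; the `ProviderClient` interface, field names, and `allocate_gpus` helper are illustrative, not the actual module contract.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


class ProviderClient(Protocol):
    """Hypothetical thin wrapper over a cloud SDK (boto3, google-cloud, azure-mgmt)."""

    def request_spot(self, gpu_type: str, count: int, max_price: float) -> Optional[str]: ...
    def request_on_demand(self, gpu_type: str, count: int) -> str: ...


@dataclass
class Allocation:
    instance_id: str
    market: str          # "spot" or "on-demand"
    provider: str        # "aws", "gcp", or "azure"


def allocate_gpus(clients: dict[str, ProviderClient], gpu_type: str,
                  count: int, max_spot_price: float) -> Allocation:
    """Try spot capacity on each provider first; fall back to on-demand."""
    for provider, client in clients.items():
        instance_id = client.request_spot(gpu_type, count, max_spot_price)
        if instance_id is not None:
            return Allocation(instance_id, "spot", provider)
    # Spot supply collapsed everywhere: pay the on-demand premium on the first provider.
    provider, client = next(iter(clients.items()))
    return Allocation(client.request_on_demand(gpu_type, count), "on-demand", provider)
```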
With this layer, the platform can shift workloads off AWS when its H100 pools disappear and onto GCP or Azure when their prices reset. Every allocation event is logged and signed so downstream compliance tooling (Axiom OS) can prove where training actually happened.
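Axiom handles the real signing, but the shape of a signed allocation record is roughly the following. This sketch uses a stdlib HMAC-SHA256 tag over a canonical JSON payload; the field names and key handling are assumptions for illustration, not the shipped scheme.

```python
import hashlib
import hmac
import json
import time


def sign_allocation_event(secret_key: bytes, event: dict) -> dict:
    """Serialize an allocation event deterministically and attach an HMAC-SHA256 tag."""
    payload = dict(event, timestamp=time.time())
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    payload["signature"] = hmac.new(secret_key, canonical, hashlib.sha256).hexdigest()
    return payload


record = sign_allocation_event(
    secret_key=b"rotate-me-via-kms",  # illustrative only; real keys come from a KMS
    event={"provider": "aws", "region": "us-east-1", "gpu": "H100", "market": "spot"},
)
```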
FastAPI scheduler as a market-aware broker
The CloudTune control plane is a FastAPI service that behaves like a lightweight matching engine. Incoming jobs are normalized into a schema containing:
- Model family and LoRA/QLoRA parameters.
- Training window, expected runtime, and budget ceiling.
- Urgency class, dataset lineage, and required receipts.
The scheduler selects the cheapest GPU that satisfies both SLA and evidence requirements. Training is never fire-and-forget. Each run produces a cryptographically linked bundle (via Axiom) proving the dataset, recipe, container hash, GPU topology, and final checkpoints. That is how CloudTune earns trust with enterprises that refuse to adopt black-box tooling.
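To make the flow concrete, here is a hedged sketch of what the normalized schema and the selection step could look like as Pydantic models plus a FastAPI route. Field names, the `GpuOffer` shape, and the `fetch_live_offers` stub are illustrative, not CloudTune's actual API.

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()


class TrainingJob(BaseModel):
    family: str                  # model family, e.g. "llama-2-7b"
    lora_rank: int = 16
    quant_bits: int = 4          # QLoRA bit width
    max_runtime_hours: float
    budget_ceiling_usd: float
    urgency: str                 # "batch" or "interactive"
    dataset_uri: str
    receipts_required: bool = True


class GpuOffer(BaseModel):
    provider: str                # "aws", "gcp", or "azure"
    gpu_type: str
    hourly_price_usd: float
    supports_attestation: bool


def fetch_live_offers() -> list[GpuOffer]:
    """Placeholder: a real broker would poll provider spot-price feeds here."""
    return [
        GpuOffer(provider="gcp", gpu_type="A100", hourly_price_usd=3.20, supports_attestation=True),
        GpuOffer(provider="aws", gpu_type="H100", hourly_price_usd=6.50, supports_attestation=True),
    ]


def select_offer(job: TrainingJob, offers: list[GpuOffer]) -> GpuOffer:
    """Cheapest offer that fits the budget and, when receipts are required, can attest."""
    eligible = [
        o for o in offers
        if o.hourly_price_usd * job.max_runtime_hours <= job.budget_ceiling_usd
        and (o.supports_attestation or not job.receipts_required)
    ]
    if not eligible:
        raise HTTPException(status_code=409, detail="no GPU offer satisfies SLA and budget")
    return min(eligible, key=lambda o: o.hourly_price_usd)


@app.post("/jobs")
def submit_job(job: TrainingJob) -> GpuOffer:
    return select_offer(job, fetch_live_offers())
```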
LoRA / QLoRA that does not fall apart under preemption
LoRA is light, but QLoRA with 4-bit and 8-bit quantization can crumble when a GPU is reclaimed. We kept runs stable with three techniques (the checkpoint loop is sketched after this list):
- Shadow checkpoints every 300 to 500 steps.
- Delta-only replays that restore gradients without resetting the world.
- KV cache pruning ahead of recomputation cycles.
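As a rough sketch of the first two techniques, assuming a PyTorch/PEFT-style trainer where LoRA parameters carry `lora_` in their names: checkpoint only the adapter and optimizer state every few hundred steps, and on restart resume from the latest shadow checkpoint. Function names and the checkpoint layout are illustrative, not CloudTune's trainer code.

```python
import os
import torch


def restore_latest(model, optimizer, ckpt_dir: str) -> int:
    """Load the most recent shadow checkpoint, if any; return the step to resume from."""
    ckpts = [f for f in os.listdir(ckpt_dir) if f.startswith("shadow-")]
    if not ckpts:
        return 0
    latest = max(ckpts, key=lambda f: int(f.split("-")[1].split(".")[0]))
    state = torch.load(os.path.join(ckpt_dir, latest))
    model.load_state_dict(state["adapter"], strict=False)   # only LoRA weights are stored
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]


def train_with_shadow_checkpoints(model, optimizer, data_loader,
                                  ckpt_dir: str, interval: int = 400):
    """Checkpoint adapter + optimizer state every `interval` steps so a spot
    preemption replays only the delta since the last shadow checkpoint.
    (Dataloader fast-forwarding is omitted for brevity.)"""
    os.makedirs(ckpt_dir, exist_ok=True)
    start = restore_latest(model, optimizer, ckpt_dir)
    for step, batch in enumerate(data_loader, start=start):
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step > 0 and step % interval == 0:
            adapter = {k: v for k, v in model.state_dict().items() if "lora_" in k}
            torch.save({"step": step, "adapter": adapter, "optimizer": optimizer.state_dict()},
                       os.path.join(ckpt_dir, f"shadow-{step}.pt"))
```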
Together these techniques trim fine-tuning cost by roughly 20 to 35 percent across 7B to 13B families while staying resilient to cloud turbulence.
Observability as a first-class citizen
CloudTune ships with Prometheus, Grafana, and OpenTelemetry from day one. Prometheus scrapes the broker, trainers, and GPUs. Grafana highlights real-time burn, throughput, and failure domains. OpenTelemetry traces cross-cloud scheduling decisions, and all request-scoped logs are compressed and shipped to S3 for audit trails. The observability stack powers a "time-to-receipt" metric that measures how quickly a user request moves through the run -> train -> validate -> receipt pipeline.
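As an example of how the time-to-receipt metric could be wired up with prometheus_client (the metric name, labels, and buckets below are assumptions, not the shipped dashboard):

```python
from prometheus_client import Histogram, start_http_server

# Hypothetical metric; buckets chosen around the 60-second inference-receipt target.
TIME_TO_RECEIPT = Histogram(
    "cloudtune_time_to_receipt_seconds",
    "Seconds from job submission until a signed receipt is available",
    labelnames=["stage"],                      # run, train, validate, receipt
    buckets=(1, 5, 15, 30, 60, 120, 300, 900, 3600),
)


def record_receipt_latency(stage: str, seconds: float) -> None:
    """Called by the broker whenever a stage finishes and its receipt is persisted."""
    TIME_TO_RECEIPT.labels(stage=stage).observe(seconds)


if __name__ == "__main__":
    start_http_server(9100)     # expose /metrics for the Prometheus scraper
    record_receipt_latency("receipt", 42.0)
```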
Inference receipts land in under 60 seconds, and generating a training receipt consumes less than 5 percent of total wall-clock runtime. Those guarantees let founders and researchers operate CloudTune in production without losing visibility.
Lessons learned
LLM infrastructure teams ultimately face two options:
- Move fast and pray the GPUs never fail.
- Design for the chaotic physics of GPU markets.
CloudTune chooses the second. The mission is simple: let anyone fine-tune and serve LLMs with reliability, reproducibility, and compliance baked into the workflow. GPUs are only the beginning; the same playbook can manage inference clusters, evaluation fleets, and agentic pipelines that demand receipts.