MLOps / Deployment · 2025 — Present

LLM Deployment & Inference Optimisation

Llama 3.2 Vision & Gemini Flash on Azure / GCP VMs, tuned for real-time field operations

Latency target: met at p95

Throughput: production-grade

Deployment: Azure + GCP

Problem

Vision-language models had to run at production latency on cloud VMs, serving field operations with strict SLAs.

1. Profiled model inference and tuned batching, warm pools, and request shaping for stable p95 latency.
2. Deployed across Azure and GCP VM instances with health-checked routing.
3. Built dashboards for latency, throughput, and error budgets feeding back into model and infra changes.
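
As a rough illustration of the dashboard metrics in step 3, the sketch below computes a nearest-rank p95 latency and the remaining error budget from a window of request records. It is a minimal sketch, not the project's actual code: the Request type, the function names, and the slo_ms / slo_target parameters are all hypothetical, and it assumes a request counts against the budget if it failed or exceeded the latency threshold.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served inference request (hypothetical record shape)."""
    latency_ms: float
    ok: bool


def p95(latencies):
    """Nearest-rank 95th percentile: the smallest sample with at
    least 95% of all samples at or below it."""
    if not latencies:
        raise ValueError("no latency samples")
    ordered = sorted(latencies)
    # Integer ceiling of 0.95 * n, avoiding float rounding issues.
    rank = (95 * len(ordered) + 99) // 100  # 1-based rank
    return ordered[rank - 1]


def error_budget_remaining(requests, slo_ms, slo_target):
    """Fraction of the error budget still unspent in this window.

    Assumption: a request is 'bad' if it failed or took longer than
    slo_ms. slo_target is the required good fraction, e.g. 0.99.
    A negative result means the budget is overspent.
    """
    total = len(requests)
    bad = sum(1 for r in requests if not r.ok or r.latency_ms > slo_ms)
    allowed = (1.0 - slo_target) * total
    if allowed == 0:
        return 0.0 if bad else 1.0
    return 1.0 - bad / allowed
```

A dashboard job could call p95 on each scrape window and alert when error_budget_remaining falls below some chosen threshold; the window length and thresholds here would come from the SLA, which the source does not specify.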
