Problem
Vision-language models had to run at production latency on cloud VMs, serving field operations with strict SLAs.
Approach
- 1.Profiled model inference and tuned batching, warm pools, and request shaping for stable p95 latency.
- 2.Deployed across Azure and GCP VM instances with health-checked routing.
- 3.Built dashboards for latency, throughput, and error budgets feeding back into model and infra changes.