Deploy Model
📹 See demo video
1. Create a Cluster
Note
Provisioning resources can take time; you may need to wait up to 20 minutes for Lambda Labs to provide GPUs.
In the left sidebar, click Cluster, then hit + Create Cluster.
Fill in Cluster Configuration:
Name (e.g. test)
Cloud Provider (e.g. Lambda Labs)
Region (e.g. us-south-1)
GPU Type & Count (e.g. 8 × H100)
Hugging Face Token (paste your HF access token; see the sanity check below)
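Before pasting the token into the form, you can sanity-check it locally. A minimal sketch using the official huggingface_hub client (the token string is a placeholder):

```python
# Verify a Hugging Face access token before using it in the cluster form.
# Requires: pip install huggingface_hub
from huggingface_hub import HfApi

token = "hf_..."  # placeholder: your actual HF access token

api = HfApi(token=token)
user = api.whoami()  # raises an error if the token is invalid or expired
print(f"Token OK, authenticated as: {user['name']}")

# For gated models such as meta-llama/*, also confirm the token was granted
# access; this raises if you have not accepted the model's license:
api.model_info("meta-llama/Llama-3.1-8B-Instruct")
```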
Click Create Cluster at the bottom right.
You will see an info card showing the status progression: Pending → init → wait_k8s → Active. Wait until the status reads Active.
Note
If the status turns to Fail to create, no instance was available.
Delete the cluster in the web interface (no need to delete the instance from the Lambda Labs dashboard).
Change the configuration and try again. Most often, switching to a different region helps; the availability check below can save some guessing.
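If creation keeps failing, it can help to check which regions currently have capacity before retrying. A sketch against Lambda's public Cloud API (the endpoint and response shape are assumed from Lambda's API documentation; the key is a placeholder):

```python
# List Lambda GPU instance types that currently have capacity, by region.
# Assumes Lambda's Cloud API v1; requires: pip install requests
import requests

LAMBDA_API_KEY = "..."  # placeholder: your Lambda Cloud API key

resp = requests.get(
    "https://cloud.lambdalabs.com/api/v1/instance-types",
    auth=(LAMBDA_API_KEY, ""),  # the API key goes in as the basic-auth username
    timeout=30,
)
resp.raise_for_status()

for name, info in resp.json()["data"].items():
    regions = [r["name"] for r in info["regions_with_capacity_available"]]
    if regions:
        print(f"{name}: {', '.join(regions)}")
```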
2. Create Deployments
Note
The deployment process itself can take up to 10 minutes to complete.
In the left sidebar, click Deployments, then hit + Create Deployment.
Search the existing model cards and pick the model you want to deploy (e.g. meta-llama/Llama-3.1-8B-Instruct).
Configure basics
Deployment Name: give it a descriptive name (e.g. llama-8b-test).
Target Cluster: select one of your Active clusters.
The UI will auto-detect available GPUs and memory in that cluster.
Skip or dive into Advanced
For a quick start, click Create Deployment now.
For finer control, click Next: Advanced. Advanced settings are grouped into three tabs (a sketch of the equivalent vLLM engine arguments follows the list):
🧠 LM Cache
CPU/Disk Offloading Buffer Size, P/D Disaggregation, CacheBlend, etc.
🤖 Model
Max Model Length, Max Number of Sequences, Dtype, etc.
⚡️ vLLM
TP Size, GPU Memory Utilization, Enable Chunked Prefill, etc.
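Most of the Model and vLLM tab settings mirror standard vLLM engine arguments. As a rough reference for what the knobs control, here is a minimal sketch using vLLM's Python API (values are illustrative, not recommendations; the LM Cache tab maps to LMCache-specific settings not shown here):

```python
# Rough mapping of the Model / vLLM tabs onto vLLM engine arguments.
# Requires: pip install vllm, and 8 visible GPUs for a TP size of 8.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=8192,           # Model tab: Max Model Length
    max_num_seqs=256,             # Model tab: Max Number of Sequences
    dtype="bfloat16",             # Model tab: Dtype
    tensor_parallel_size=8,       # vLLM tab: TP Size
    gpu_memory_utilization=0.9,   # vLLM tab: GPU Memory Utilization
    enable_chunked_prefill=True,  # vLLM tab: Enable Chunked Prefill
)
```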
Launch!
Once you click Create Deployment, you'll see an info card for your deployment with its status progression.
If the deployment fails, check its logs. Once it is active, you can smoke-test it as sketched below.
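vLLM-backed deployments typically expose an OpenAI-compatible API. Assuming yours does as well, a quick smoke test might look like the following (the base URL and API key are placeholders taken from your deployment's info card):

```python
# Smoke-test a deployed model through its OpenAI-compatible endpoint.
# Requires: pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://<your-deployment-endpoint>/v1",  # placeholder endpoint
    api_key="EMPTY",  # placeholder; use whatever key your deployment expects
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```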