Deploy Model
📹 See demo video
1. Create a Cluster
Note
Provisioning resources can take time; you may need to wait up to 20 minutes for Lambda Labs to provide GPUs.
In the left sidebar, click Cluster, then hit + Create Cluster.
Fill in Cluster Configuration:
Name (e.g. test)
Cloud Provider (e.g. Lambda Labs)
Region (e.g. us-south-1)
GPU Type & Count (e.g. 8 × H100)
Hugging Face Token (paste your HF access token; see the sanity check below)
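Before pasting the token into the form, you can sanity-check it locally. A minimal sketch using the official huggingface_hub client (the token string is a placeholder):

```python
# Verify a Hugging Face access token before using it in the cluster form.
# Requires: pip install huggingface_hub
from huggingface_hub import HfApi

token = "hf_..."  # placeholder: your actual HF access token

api = HfApi(token=token)
user = api.whoami()  # raises an error if the token is invalid or expired
print(f"Token OK, authenticated as: {user['name']}")

# For gated models such as meta-llama/*, also confirm the token was granted
# access; this raises if you have not accepted the model's license:
api.model_info("meta-llama/Llama-3.1-8B-Instruct")
```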
Click Create Cluster at the bottom right.
You will see an info card showing the status progression: Pending → init → wait_k8s → Active. Wait until the status reads Active.
Note
If the status turns to Fail to create, no instance was available.
Delete the cluster in the web interface (no need to delete the instance from the Lambda Labs dashboard).
Change the configuration and try again. Most often, switching to a different region helps; the availability check below can save some guessing.
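If creation keeps failing, it can help to check which regions currently have capacity before retrying. A sketch against Lambda's public Cloud API (the endpoint and response shape are assumed from Lambda's API documentation; the key is a placeholder):

```python
# List Lambda GPU instance types that currently have capacity, by region.
# Assumes Lambda's Cloud API v1; requires: pip install requests
import requests

LAMBDA_API_KEY = "..."  # placeholder: your Lambda Cloud API key

resp = requests.get(
    "https://cloud.lambdalabs.com/api/v1/instance-types",
    auth=(LAMBDA_API_KEY, ""),  # the API key goes in as the basic-auth username
    timeout=30,
)
resp.raise_for_status()

for name, info in resp.json()["data"].items():
    regions = [r["name"] for r in info["regions_with_capacity_available"]]
    if regions:
        print(f"{name}: {', '.join(regions)}")
```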
2. Create Deployments
Note
The deployment process itself can take up to 10 minutes to complete.
In the left sidebar, click Deployments, then hit + Create Deployment.
Search the existing model cards and pick the model you want to deploy (e.g. meta-llama/Llama-3.1-8B-Instruct).
Configure basics
Deployment Name: give it a descriptive name (e.g. llama-8b-test).
Target Cluster: select one of your Active clusters.
The UI will auto-detect available GPUs and memory in that cluster.
Skip or dive into Advanced
For a quick start, click Create Deployment now.
For finer control, click Next: Advanced. Advanced settings are grouped into three tabs (a sketch of the equivalent vLLM engine arguments follows the list):
🧠 LM Cache
CPU/Disk Offloading Buffer Size, P/D Disaggregation, CacheBlend, etc.
🤖 Model
Max Model Length, Max Number of Sequences, Dtype, etc.
⚡️ vLLM
TP Size, GPU Memory Utilization, Enable Chunked Prefill, etc.
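Most of the Model and vLLM tab settings mirror standard vLLM engine arguments. As a rough reference for what the knobs control, here is a minimal sketch using vLLM's Python API (values are illustrative, not recommendations; the LM Cache tab maps to LMCache-specific settings not shown here):

```python
# Rough mapping of the Model / vLLM tabs onto vLLM engine arguments.
# Requires: pip install vllm, and 8 visible GPUs for a TP size of 8.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=8192,           # Model tab: Max Model Length
    max_num_seqs=256,             # Model tab: Max Number of Sequences
    dtype="bfloat16",             # Model tab: Dtype
    tensor_parallel_size=8,       # vLLM tab: TP Size
    gpu_memory_utilization=0.9,   # vLLM tab: GPU Memory Utilization
    enable_chunked_prefill=True,  # vLLM tab: Enable Chunked Prefill
)
```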
Launch!
Once you click Create Deployment, you'll see an info card for your deployment with its status progression.
If the deployment fails, check its logs. Once it is active, you can smoke-test it as sketched below.
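vLLM-backed deployments typically expose an OpenAI-compatible API. Assuming yours does as well, a quick smoke test might look like the following (the base URL and API key are placeholders taken from your deployment's info card):

```python
# Smoke-test a deployed model through its OpenAI-compatible endpoint.
# Requires: pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://<your-deployment-endpoint>/v1",  # placeholder endpoint
    api_key="EMPTY",  # placeholder; use whatever key your deployment expects
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```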