Llama 3.1 405B is a large model that requires a lot of memory.

Quantization Method | Weight Memory | # 80GB A100 GPUs
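
As a rough guide to where numbers like these come from, weight memory is parameter count times bytes per weight. The sketch below is a back-of-the-envelope calculation, not the source of the table: the function names are made up for illustration, it counts weights only (no KV cache, activations, or framework overhead), and it uses decimal gigabytes.

```shell
# weight_gb PARAMS_IN_BILLIONS BITS_PER_WEIGHT
# billions of params * (bits / 8) bytes each = GB of weights, rounded up
weight_gb() {
    local params_b=$1 bits=$2
    echo $(( (params_b * bits + 7) / 8 ))
}

# min_gpus WEIGHT_GB [GPU_GB]
# smallest number of GPUs whose combined memory holds the weights alone
min_gpus() {
    local gb=$1 gpu_gb=${2:-80}
    echo $(( (gb + gpu_gb - 1) / gpu_gb ))
}

for bits in 16 8 4; do
    gb=$(weight_gb 405 "$bits")
    printf '%2d-bit weights: ~%d GB, at least %d x 80GB GPUs\n' \
        "$bits" "$gb" "$(min_gpus "$gb")"
done
```

For 405B parameters this gives about 810 GB at 16 bits, 405 GB at 8 bits, and 203 GB at 4 bits. These are lower bounds; a real deployment needs extra headroom on every GPU for the KV cache.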

I have access to a 4-node SLURM cluster, where each node has 4 A100 80GB GPUs.

So how do we get these to work together?

Step 1: Multi-Node SLURM Configuration

Running jobs through sbatch requires us to be careful about specifying the number of nodes/jobs/gpus we will be running.

Here we have 2 scripts:

  1. vllm_run.sh - This script will be submitted to SLURM.
  2. vllm_start.sh - This script will be run on each individual node

Here's the first: vllm_run.sh

#!/bin/bash
#SBATCH --job-name=vllm-multinode         # Job name
#SBATCH --nodes=4                         # Number of nodes
#SBATCH --gres=gpu:4                      # Number of GPUs per node (adjust as needed)
#SBATCH --cpus-per-task=128               # Number of CPUs per task (adjust as needed)
#SBATCH --ntasks-per-node=1               # Number of tasks per node
#SBATCH --time=02:00:00                   # Max runtime (HH:MM:SS)
#SBATCH --partition=<YOUR PARTITION>      # Partition (queue) name

srun --ntasks=4 --cpus-per-task=128 --gres=gpu:4 --exclusive ./vllm_start.sh

There are a few important things here:

  1. We launch the job on --nodes=4 nodes.
  2. We specify --ntasks-per-node=1 so that one task will run on each node.
  3. We specify --cpus-per-task=128. Some SLURM clusters default to only a single CPU per task.
  4. We run srun, specifying the number of CPUs/GPUs again. This is important to ensure that the resources are allocated correctly. (If I omit the --cpus-per-task flag in the srun command, each instance crawls with one CPU and 4 A100s.)
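
Since a wrong allocation is easy to miss until the model crawls, it can help to run a tiny script through the same srun line first. This is a minimal sketch, not part of the original setup: the script name check_resources.sh is hypothetical, and it assumes your cluster exports CUDA_VISIBLE_DEVICES for --gres=gpu jobs and binds tasks to their CPUs, which is common but depends on site configuration.

```shell
#!/bin/bash
# check_resources.sh (hypothetical name): print what one SLURM task can actually use.

report_resources() {
    local ngpus=0
    # Assumption: SLURM exports CUDA_VISIBLE_DEVICES as a comma-separated list
    if [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
        ngpus=$(echo "$CUDA_VISIBLE_DEVICES" | awk -F',' '{print NF}')
    fi
    # nproc reports the CPUs available to this process, so it reflects
    # the CPU binding of the task rather than the size of the machine
    echo "$(uname -n): node=${SLURM_NODEID:-unknown} cpus=$(nproc) gpus=${ngpus}"
}

report_resources
```

Running it as `srun --ntasks=4 --cpus-per-task=128 --gres=gpu:4 ./check_resources.sh` should print one line per node. A line showing cpus=1 is the symptom described in point 4.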

Step 2: Downloading the model locally

I have my $HF_HOME environment variable set to a network filesystem. This is fine for single-node jobs, but for multi-node jobs this can mess up loading. Since each node performs non-contiguous reads of different pieces of the model, it's best to copy the model to a local directory on each node.
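
The copy itself is done with rsync in the per-node script later in this post. One detail worth knowing beforehand is how rsync treats a trailing slash on the source, because it decides where the local cache ends up. The helper below only models that rule so you can predict the destination; the function name and the paths are hypothetical.

```shell
# rsync_dest SRC DST
# Prints the directory that `rsync -a SRC DST` populates:
#   SRC without a trailing slash -> DST/<basename of SRC>
#   SRC with a trailing slash    -> DST itself (contents are copied)
rsync_dest() {
    local src=$1
    local dst=${2%/}   # drop one trailing slash from the destination, if any
    case "$src" in
        */) echo "$dst" ;;
        *)  echo "$dst/$(basename "$src")" ;;
    esac
}

rsync_dest /network/hf_cache  /local/scratch   # prints /local/scratch/hf_cache
rsync_dest /network/hf_cache/ /local/scratch   # prints /local/scratch
```

If anything later points at the local copy, it has to use the path printed here, not just the destination argument.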

Let's first set up a virtualenv and install the huggingface-cli:

# install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | env UV_INSTALL_DIR="<somewhere on your path>" sh
# make the venv
uv venv
# install huggingface-cli
source .venv/bin/activate
uv pip install "huggingface_hub[cli]"
# Download it to $HF_HOME
export MODEL_PATH="neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w8a16"
huggingface-cli download $MODEL_PATH

Beware that if you're downloading the weights directly from Meta, you should add --exclude="original/*" to the huggingface-cli download command. This prevents downloading the original PyTorch-format model weights, which are not needed here.
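
Before submitting a multi-node job it can be worth confirming that the download actually landed in $HF_HOME. The sketch below assumes the standard Hugging Face hub cache layout, where a repo org/name is stored under $HF_HOME/hub/models--org--name. Both function names are hypothetical, and the check only looks for a revision directory; it does not verify that every file is complete.

```shell
# hf_cache_dir REPO_ID [HF_HOME]
# Prints the cache directory for a model repo under the hub cache layout.
hf_cache_dir() {
    local repo_id=$1
    local hf_home=${2:-${HF_HOME:-$HOME/.cache/huggingface}}
    # "org/name" becomes "models--org--name"
    echo "${hf_home}/hub/models--${repo_id//\//--}"
}

# model_is_cached REPO_ID [HF_HOME]
# Succeeds if at least one revision directory exists under snapshots/.
model_is_cached() {
    local dir
    dir=$(hf_cache_dir "$@")
    [ -d "$dir/snapshots" ] && [ -n "$(ls -A "$dir/snapshots" 2>/dev/null)" ]
}

if model_is_cached "neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w8a16"; then
    echo "model found in cache"
else
    echo "model not found in cache"
fi
```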

Step 3: Downloading VLLM

On my cluster, I can't use docker or slurm's docker-based system, so I use Singularity. Let's download VLLM to $SINGULARITY_IMAGE_DIR:

export SINGULARITY_IMAGE_DIR=<path to your singularity images>
singularity pull $SINGULARITY_IMAGE_DIR/vllm-openai_v0.6.1.post2.sif docker://vllm/vllm-openai:v0.6.1.post2

Now we're ready to write our vllm_start.sh script.

Step 4: Per-node script

This script is quite complicated. It does the following:

  1. Sets up the environment variables

    source /etc/profile
    export PATH="$HOME/.bin:$PATH"
    # Set environment variables
    export HF_HOME=<YOUR_HF_CACHE>
    export SINGULARITY_IMAGE_DIR=<path to your singularity images>
    export MODEL_PATH="neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w8a16"
  2. Copy the model from a network filesystem to a local directory on the node

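    # This should persist across reboots and be local to the node
    export TARGET_HF_HOME=<a local directory on the node>
    echo "Syncing HF"
    rsync -a --links -r $HF_HOME $TARGET_HF_HOME
    echo "HF Synced"

    Here we use rsync -a --links -r to copy the HF cache (-a already implies -r and --links, so the extra flags are redundant but harmless). Since the HF cache uses relative symlinks that stay inside the cache, everything should be preserved nicely.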

  3. Determine the correct IP address for the head node

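    NODELIST=$(scontrol show hostnames $SLURM_NODELIST)
    NODE0_IP=$(host -4 $(echo $NODELIST | awk '{print $1}') | awk '{print $4}')
    NNODES=$(echo $NODELIST | wc -w)
    # Index of this node within the allocation, set by SLURM for each srun task
    NODE_RANK=$SLURM_NODEID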

    This gets the IP of the first node in the list, the number of nodes, and the node rank.

  4. Set up the required NCCL infiniband environment variables

    export NCCL_DEBUG=INFO
    export NCCL_IB_HCA=mlx5
    # You should run `ifconfig` to see what your infiniband interface is called
    export NCCL_SOCKET_IFNAME=ib0 
    # Modify this based on what you need in your system
    # Same with this
    export NCCL_P2P_LEVEL=SYS 

    To figure out the values to put here, run ibv_devinfo to find your HCA, nvidia-smi topo -m to see the connectivity between your GPUs, PCIe buses, and InfiniBand cards, and ifconfig to find the name of your InfiniBand interface.

  5. Set up the ray command to run

    VLLM_COMMAND="vllm serve --dtype auto --api-key token-abc123 \
        --tensor-parallel-size 4 \
        --pipeline-parallel-size $NNODES \
        --enable-prefix-caching \
        $MODEL_PATH"
    RAY_START_CMD="ray start"
    if [ "${NODE_RANK}" == "0" ]; then
        # We're in rank 0, so we start the head node, wait a bit, print out the status
        RAY_START_CMD+=" --head --port=6379 && sleep 5 && ray status"
        # Then we start the model server
        RAY_START_CMD+=" && $VLLM_COMMAND"
    else
        # Otherwise, we just run a blocking command on each worker
        RAY_START_CMD+=" --block --address=${NODE0_IP}:6379"
    fi

    Here's where we define our VLLM topology. We're using 4 nodes with 4-way pipeline parallelism across the nodes and 4-way tensor parallelism within each node.

  6. Serve the model

    module load singularity/3.7.0 # This is the version of singularity I use on my cluster
    mkdir -p /var/tmp/home
    # --nv                      Use the GPU
    # --no-home                 Don't mount the home directory
    # -B /var/tmp/home:$HOME    Mount a local dir to home
    # -B $HF_HOME:/hf_home      Bind the HF cache
    # --env HF_HOME=/hf_home    Set the HF_HOME environment variable
    # (comments cannot follow a trailing backslash, so they are listed up here)
    singularity exec \
        --nv \
        --no-home \
        -B /var/tmp/home:$HOME \
        -B $HF_HOME:/hf_home \
        --env HF_HOME=/hf_home \
        $SINGULARITY_IMAGE_DIR/vllm-openai_v0.6.1.post2.sif bash -c "$RAY_START_CMD"
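
The head/worker split above depends on every task agreeing on which node is rank 0. SLURM can tell each task its node index directly, but as a sketch of the idea, the rank can also be derived by looking up the local hostname in the expanded node list. The function name and the node names below are made up, and the approach assumes the names printed by scontrol show hostnames match what the node reports as its own short hostname.

```shell
# node_rank MY_HOSTNAME HOST [HOST...]
# Prints the zero-based position of MY_HOSTNAME in the host list,
# or fails if the host is not in the list.
node_rank() {
    local me=$1; shift
    local i=0 host
    for host in "$@"; do
        if [ "$host" = "$me" ]; then
            echo "$i"
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}

# Made-up node list; inside a job this would come from
#   scontrol show hostnames "$SLURM_NODELIST"
NODES="gpu01 gpu02 gpu03 gpu04"
node_rank gpu03 $NODES   # prints 2
```

Because every node sees the same ordered list, every node computes a consistent rank, and exactly one of them ends up as the Ray head.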

Step 5: Putting it all together

Now we can write the whole start script vllm_start.sh


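#!/bin/bash
source /etc/profile
export PATH="$HOME/.bin:$PATH"

# Set environment variables
export HF_HOME=<YOUR_HF_CACHE>
# This should persist across reboots and be local to the node
export TARGET_HF_HOME=<a local directory on the node>
export SINGULARITY_IMAGE_DIR=<path to your singularity images>
export MODEL_PATH="neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w8a16"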


# update hf cache
echo "Syncing HF"
rsync -a --links -r $HF_HOME $TARGET_HF_HOME
echo "HF Synced"

# Get the hostnames
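NODELIST=$(scontrol show hostnames $SLURM_NODELIST)
NODE0_IP=$(host -4 $(echo $NODELIST | awk '{print $1}') | awk '{print $4}')
NNODES=$(echo $NODELIST | wc -w)
# Index of this node within the allocation, set by SLURM for each srun task
NODE_RANK=$SLURM_NODEID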

export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5
# You should run `ifconfig` to see what your infiniband interface is called
export NCCL_SOCKET_IFNAME=ib0
# Modify this based on what you need in your system
# Same with this
export NCCL_P2P_LEVEL=SYS

VLLM_COMMAND="vllm serve --dtype auto --api-key token-abc123 --tensor-parallel-size 4 --pipeline-parallel-size $NNODES --enable-prefix-caching $MODEL_PATH"
RAY_START_CMD="ray start"
if [ "${NODE_RANK}" == "0" ]; then
    # We're in rank 0, so we start the head node, wait a bit, print out the status
    RAY_START_CMD+=" --head --port=6379 && sleep 5 && ray status"
    # Then we start the model server
    RAY_START_CMD+=" && $VLLM_COMMAND"
else
    # Otherwise, we just run a blocking command on each worker
    RAY_START_CMD+=" --block --address=${NODE0_IP}:6379"
fi

module load singularity/3.7.0 # This is the version of singularity I use on my cluster
mkdir -p /var/tmp/home
# --nv                      Use the GPU
# --no-home                 Don't mount the home directory
# -B /var/tmp/home:$HOME    Mount a local dir to home
# -B $HF_HOME:/hf_home      Bind the HF cache
# --env HF_HOME=/hf_home    Set the HF_HOME environment variable
# (comments cannot follow a trailing backslash, so they are listed up here)
singularity exec \
    --nv \
    --no-home \
    -B /var/tmp/home:$HOME \
    -B $HF_HOME:/hf_home \
    --env HF_HOME=/hf_home \
    $SINGULARITY_IMAGE_DIR/vllm-openai_v0.6.1.post2.sif bash -c "$RAY_START_CMD"
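
The parallelism flags in VLLM_COMMAND have to agree with the SLURM allocation: the tensor-parallel size times the pipeline-parallel size is the number of GPUs the server will use. A small pre-flight check like the one below can catch a mismatch before the job sits in the queue. This is only a sketch with a hypothetical function name, and it treats any difference between requested and allocated GPUs as an error, which is stricter than strictly necessary.

```shell
# check_topology TP PP NODES GPUS_PER_NODE
# Verifies that TP * PP matches the number of GPUs in the allocation.
check_topology() {
    local tp=$1 pp=$2 nodes=$3 gpus_per_node=$4
    local needed=$((tp * pp))
    local available=$((nodes * gpus_per_node))
    if [ "$needed" -ne "$available" ]; then
        echo "mismatch: TP*PP=$needed but the allocation has $available GPUs" >&2
        return 1
    fi
    if [ "$tp" -gt "$gpus_per_node" ]; then
        # Tensor-parallel traffic would cross the node interconnect
        echo "warning: tensor-parallel group of $tp spans more than one node" >&2
    fi
    echo "ok: TP=$tp x PP=$pp = $needed GPUs"
}

# The setup in this post: 4-way TP, 4-way PP, 4 nodes with 4 GPUs each
check_topology 4 4 4 4
```

Keeping the tensor-parallel group inside a node, as this post does, means the heaviest communication stays on the intra-node GPU links and only pipeline traffic crosses InfiniBand.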

Final: Running the model

Now we can run sbatch vllm_run.sh and watch our model get served!
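
Loading a model of this size takes a while, so the API will not answer right after the job starts. The sketch below is a generic polling loop you could wrap around a curl call. It assumes the server listens on vLLM's default port 8000 and exposes the OpenAI-compatible /v1/models endpoint; the function name is hypothetical and <head node> is a placeholder for the hostname of rank 0.

```shell
# wait_until TRIES DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or TRIES attempts have failed.
wait_until() {
    local tries=$1 delay=$2; shift 2
    local attempt
    for ((attempt = 1; attempt <= tries; attempt++)); do
        if "$@" >/dev/null 2>&1; then
            echo "ready after $attempt attempt(s)"
            return 0
        fi
        sleep "$delay"
    done
    echo "gave up after $tries attempt(s)" >&2
    return 1
}

# Example (not run here): poll every 30 seconds for up to an hour.
# wait_until 120 30 curl -sf \
#     -H "Authorization: Bearer token-abc123" \
#     "http://<head node>:8000/v1/models"
```

The Authorization header carries the same value that was passed to --api-key in the serve command.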

My nickname is will. Correspondence to 'my nickname' at swaglu dot com will reach me.