Cluster deployment

The following notes detail deployment steps for cutqc2 and cuQuantum on the Della cluster at Princeton. The same steps should work, with minor modifications, on other Princeton clusters.

Installing and deploying cutqc2

The cluster offers several anaconda3 modules, any of which can be used to create a new Python environment for cutqc2. At the time of writing, the following steps were used to install cutqc2:

module load anaconda3/2024.10
conda create --name cutqc2 python=3.12 pip
conda activate cutqc2
git clone https://github.com/PrincetonUniversity/cutqc2.git
cd cutqc2
pip install -e ".[dev,docs,gpu]"

We now run a sample script in the examples/scripts folder to verify that everything works.

Since we’re running on the head node, we’ll choose the simplest example, supremacy_6qubit.sh. Before running it, we’ll load the modules that give us access to MPI. We’ll also set the environment variable CUTQC2_CORE_NUMERIC to numpy, since the head node does not have a GPU (see the FAQ page for more details).

module load nvhpc/25.5
module load openmpi/cuda-12.9/nvhpc-25.5/4.1.8
export CUTQC2_CORE_NUMERIC=numpy
cd examples/scripts
bash supremacy_6qubit.sh

The example should take ~10 seconds to run.

When submitting jobs to the cluster’s job scheduler (SLURM), use a .sbatch file similar to examples/scripts/job.sbatch.
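The repository’s examples/scripts/job.sbatch is the reference; the following is only a sketch of the general shape, assuming cutqc2 was cloned into your home directory, with placeholder resource requests and the module versions used above:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=40G
#SBATCH -t 00:10:00
#SBATCH --gres=gpu:1
#SBATCH --output=log.txt

module purge
module load anaconda3/2024.10
module load nvhpc/25.5
module load openmpi/cuda-12.9/nvhpc-25.5/4.1.8

conda activate cutqc2
cd ~/cutqc2/examples/scripts
bash supremacy_6qubit.sh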

Installing and deploying cuQuantum

cuQuantum (specifically, cuTensorNet) is a dependency of cutqc2, though it is not currently used for tensor contraction. For future work, it may be useful to experiment with cuTensorNet on Della. The following writeup details the steps to get cuQuantum working on a multi-GPU cluster with CUDA-aware MPI. The steps were tested on the Della cluster at Princeton on 09/16/2025, and are largely a tweaked variation on the official install instructions.

cuquantum

wget https://developer.download.nvidia.com/compute/cuquantum/redist/cuquantum/linux-x86_64/cuquantum-linux-x86_64-25.06.0.10_cuda12-archive.tar.xz

Set the CUQUANTUM_ROOT environment variable to wherever you extract (tar xf) the archive above, and add the export to your ~/.bash_profile or equivalent.

cutensor

wget https://developer.download.nvidia.com/compute/cutensor/redist/libcutensor/linux-x86_64/libcutensor-linux-x86_64-2.3.0.6_cuda12-archive.tar.xz

Set the CUTENSOR_ROOT environment variable to wherever you extract (tar xf) the archive above, and add the export to your ~/.bash_profile or equivalent.

In subsequent steps, we will assume the following paths (modify any job.sbatch files accordingly if yours differ):

CUQUANTUM_ROOT=/scratch/gpfs/vineetb/ext/cuquantum-linux-x86_64-25.06.0.10_cuda12-archive
CUTENSOR_ROOT=/scratch/gpfs/vineetb/ext/libcutensor-linux-x86_64-2.3.0.6_cuda12-archive
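For example, assuming the archives were downloaded into /scratch/gpfs/vineetb/ext (substitute your own path), the extraction and the ~/.bash_profile additions might look like:

mkdir -p /scratch/gpfs/vineetb/ext
cd /scratch/gpfs/vineetb/ext
tar xf cuquantum-linux-x86_64-25.06.0.10_cuda12-archive.tar.xz
tar xf libcutensor-linux-x86_64-2.3.0.6_cuda12-archive.tar.xz

# added to ~/.bash_profile (or equivalent)
export CUQUANTUM_ROOT=/scratch/gpfs/vineetb/ext/cuquantum-linux-x86_64-25.06.0.10_cuda12-archive
export CUTENSOR_ROOT=/scratch/gpfs/vineetb/ext/libcutensor-linux-x86_64-2.3.0.6_cuda12-archive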

Clone cuQuantum repo

cd ~
git clone git@github.com:NVIDIA/cuQuantum.git
cd cuQuantum
git checkout v25.06.0

v25.06.0 is the last tag that was tested with CUDA 12.9. Later tags might work, but would likely require module load cudatoolkit/13.0 (and matching module versions) in the steps below.

Compile the CUDA-aware MPI library

export CUDA_PATH=/usr/local/cuda-12.9
export MPI_PATH=/usr/local/openmpi/cuda-12.9/4.1.8/nvhpc255/
cd $CUQUANTUM_ROOT/distributed_interfaces
source activate_mpi_cutn.sh
echo $CUTENSORNET_COMM_LIB

This compiles libcutensornet_distributed_interface_mpi.so. Note down the location of $CUTENSORNET_COMM_LIB; it is used in the job.sbatch files that follow. If you specify it relative to $CUQUANTUM_ROOT, you shouldn’t need to change it.
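A quick sanity check is to confirm that the library was built and that it resolves its MPI dependency against the loaded Open MPI:

ls -l ${CUTENSORNET_COMM_LIB}
ldd ${CUTENSORNET_COMM_LIB} | grep -i libmpi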

cuQuantum CUDA examples

Single GPU example

Compile

module load cudatoolkit/12.9
cd ~/cuQuantum/samples/cutensornet
nvcc tensornet_example.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib -lcutensornet -lcutensor -o tensornet_example

Run

We can run this interactively by requesting a GPU node using srun:

srun -t 00:05:00 --mem=64G --gres=gpu:1 --pty /bin/bash
export LD_LIBRARY_PATH=${CUQUANTUM_ROOT}/lib:${CUTENSOR_ROOT}/lib:${LD_LIBRARY_PATH}
./tensornet_example

CUQUANTUM_ROOT and CUTENSOR_ROOT must be set on the compute node where you try this (add them to your ~/.bash_profile, or redefine them once on the compute node).
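If you are only testing interactively, you can redefine them inline on the compute node using the paths assumed earlier, for example:

export CUQUANTUM_ROOT=/scratch/gpfs/vineetb/ext/cuquantum-linux-x86_64-25.06.0.10_cuda12-archive
export CUTENSOR_ROOT=/scratch/gpfs/vineetb/ext/libcutensor-linux-x86_64-2.3.0.6_cuda12-archive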

Ensure you’re back on the head node before proceeding!

CUDA-aware MPI example

A successful run of this verifies that CUDA-aware MPI is working fine with cuQuantum.
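Independently of cuQuantum, you can also confirm that the OpenMPI module itself was built with CUDA support. This is a generic Open MPI check (the reported value should be true):

module load nvhpc/25.5
module load openmpi/cuda-12.9/nvhpc-25.5/4.1.8
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value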

Compile

module load cudatoolkit/12.9
cd ~/cuQuantum/samples/cutensornet
export MPI_PATH=/usr/local/openmpi/cuda-12.9/4.1.8/nvhpc255/
nvcc tensornet_example_mpi_auto.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include -I${MPI_PATH}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib -lcutensornet -lcutensor -L${MPI_PATH}/lib64 -lmpi -o tensornet_example_mpi_auto

job.sbatch

Submit using sbatch job.sbatch:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=40G
#SBATCH -t 00:03:00
#SBATCH --gres=gpu:1
#SBATCH --constraint=nomig
#SBATCH --output=log.txt

module purge
module load nvhpc/25.5
module load openmpi/cuda-12.9/nvhpc-25.5/4.1.8

export CUQUANTUM_ROOT=/scratch/gpfs/vineetb/ext/cuquantum-linux-x86_64-25.06.0.10_cuda12-archive
export CUTENSOR_ROOT=/scratch/gpfs/vineetb/ext/libcutensor-linux-x86_64-2.3.0.6_cuda12-archive
export LD_LIBRARY_PATH=${CUQUANTUM_ROOT}/lib:${CUTENSOR_ROOT}/lib:${LD_LIBRARY_PATH}
export CUTENSORNET_COMM_LIB=${CUQUANTUM_ROOT}/distributed_interfaces/libcutensornet_distributed_interface_mpi.so

srun ~/cuQuantum/samples/cutensornet/tensornet_example_mpi_auto
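Once the job is submitted, it can be monitored with the standard SLURM tools, for example:

squeue -u $USER
tail -f log.txt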

cuQuantum Python examples

Create conda environment with cuquantum

We create the conda environment at /scratch/gpfs/vineetb/envs/cuquantum simply to save space in the home folder. Modify as needed.

module load anaconda3/2024.10
conda create --prefix /scratch/gpfs/vineetb/envs/cuquantum python=3.12 pip
conda activate /scratch/gpfs/vineetb/envs/cuquantum
conda install conda-forge::cuquantum-python
pip install mpi4py
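As a quick sanity check that the environment is usable, the imports below should succeed (no GPU is needed just to import; if the import fails on the head node because CUDA libraries are not visible, rerun the check on a GPU node):

python -c "import cuquantum, mpi4py; print('cuquantum and mpi4py import OK')"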

Run the Python CUDA-aware MPI example

The example example22_mpi_auto.py is packaged in the cuQuantum repo at python/samples/tensornet/contraction/coarse/example22_mpi_auto.py. A successful run verifies that CUDA-aware MPI works with cuQuantum from Python.

job.sbatch

Submit using sbatch job.sbatch:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=80G
#SBATCH -t 00:01:00
#SBATCH --gres=gpu:1
#SBATCH --constraint="nomig"
#SBATCH --output=log.txt

module purge
module load nvhpc/25.5
module load openmpi/cuda-12.9/nvhpc-25.5/4.1.8
module load cudatoolkit/12.9
module load anaconda3/2024.10

export CUQUANTUM_ROOT=/scratch/gpfs/vineetb/ext/cuquantum-linux-x86_64-25.06.0.10_cuda12-archive
export CUTENSOR_ROOT=/scratch/gpfs/vineetb/ext/libcutensor-linux-x86_64-2.3.0.6_cuda12-archive
export LD_LIBRARY_PATH=${CUQUANTUM_ROOT}/lib:${CUTENSOR_ROOT}/lib:${LD_LIBRARY_PATH}
export CUTENSORNET_COMM_LIB=${CUQUANTUM_ROOT}/distributed_interfaces/libcutensornet_distributed_interface_mpi.so

conda activate /scratch/gpfs/vineetb/envs/cuquantum

mpirun -np 2 python ~/cuQuantum/python/samples/tensornet/contraction/coarse/example22_mpi_auto.py

Run a distributed Kronecker product calculation using CUDA-aware MPI

The following example performs a distributed Kronecker product calculation using CUDA-aware MPI.

import os
import sys 

import cupy as cp
from cupy.cuda.runtime import getDeviceCount
from mpi4py import MPI  # this line initializes MPI

from cuquantum.bindings import cutensornet as cutn
from cuquantum.tensornet import contract, get_mpi_comm_pointer


root = 0 
comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if not "CUTENSORNET_COMM_LIB" in os.environ:
    raise RuntimeError("need to set CUTENSORNET_COMM_LIB to the path of the MPI wrapper library")

if not os.path.isfile(os.environ["CUTENSORNET_COMM_LIB"]):
    raise RuntimeError("CUTENSORNET_COMM_LIB does not point to the path of the MPI wrapper library")

device_id = rank % getDeviceCount()
cp.cuda.Device(device_id).use()

expr = 'i,j->ij'
shapes = [(2**int(sys.argv[1]),), (2**int(sys.argv[2]),)]

if rank == root:
    operands = [cp.arange(*shape).astype('float32') for shape in shapes]
else:
    # Non-root ranks allocate receive buffers with the same dtype as root's operands
    operands = [cp.empty(shape, dtype='float32') for shape in shapes]

# Broadcast the operands from root to all ranks (device buffers go over CUDA-aware MPI)
for operand in operands:
    comm.Bcast(operand, root)

# Create a cuTensorNet handle and hand it the MPI communicator,
# so the contraction below is distributed across ranks.
handle = cutn.create()
cutn.distributed_reset_configuration(
    handle, *get_mpi_comm_pointer(comm)
)

result = contract(expr, *operands, options={'device_id' : device_id, 'handle': handle})

# Check correctness against a single-GPU cupy.einsum - only feasible for small sizes
if rank == root and len(sys.argv) > 3 and sys.argv[3] == 'check':
    result_cp = cp.einsum(expr, *operands, optimize=True)
    print("Does the cuQuantum parallel contraction result match the cupy.einsum result?", cp.allclose(result, result_cp))

Save this as kron.py and run it with the job.sbatch script above, changing the last line to:

mpirun -np 2 python kron.py 10 21

Things should work until the exponents sum to around 33 (a 2^33-element float32 result, about 32 GiB); beyond that, we get an out-of-memory error, as expected. Pass check as a third argument, with exponents small enough that the single-GPU cupy.einsum reference also fits in memory, to verify correctness.
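For example, changing the last line of job.sbatch to the following compares the distributed result against a single-GPU cupy.einsum (the reference computation must also fit in one GPU’s memory):

mpirun -np 2 python kron.py 10 21 check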