Customize the Test Cluster
When you use RCC-Run, an ephemeral RCC cluster is created with Terraform to run Slurm batch jobs on your behalf. The RCC cluster is defined using the rcc-tf module. The default variable concretizations are provided in rcc-run/etc/rcc-ephemeral/default/fluid.auto.tfvars. This default configuration provides you with a n1-standard-16 controller with a 1TB pd-ssd disk and a single compute partition, consisting of 5x c2-standard-8 instances.
The rcc-run builder provides a mechanism to customize the cluster so that you can define compute partitions that meet your testing needs. You are able to add instances with GPUs, specify partitions for a heterogeneous cluster (see machine types available on Google Cloud), specify the zone to deploy to, change the controller size, shape, and disk properties, and even add a Lustre file system.
Getting Started
To customize the cluster, you can add a tfvars definition file that is similar to the rcc-run/etc/rcc-ephemeral/default/fluid.auto.tfvars. For reference, the rcc-run/etc/rcc-ephemeral/io.tf file defines all of the variables available for concretizing a rcc-ephemeral
cluster.
It is recommended that you start by creating a file in your repository called rcc.auto.tfvars
with the following contents
cluster_name = "<name>"
project = "<project>"
zone = "<zone>"
controller_image = "<image>"
disable_controller_public_ips = false
login_node_count = 0
suspend_time = 2
compute_node_scopes = [
"https://www.googleapis.com/auth/cloud-platform"
]
partitions = [
{ name = "c2-standard-8"
machine_type = "c2-standard-8"
image = "<image>"
image_hyperthreads = true
static_node_count = 0
max_node_count = 5
zone = "<zone>"
compute_disk_type = "pd-standard"
compute_disk_size_gb = 50
compute_labels = {}
cpu_platform = null
gpu_count = 0
gpu_type = null
gvnic = false
network_storage = []
preemptible_bursting = false
vpc_subnet = null
exclusive = false
enable_placement = false
regional_capacity = false
regional_policy = null
instance_template = null
},
]
create_filestore = false
create_lustre = false
You’ll notice that their are a few template variables in this example that are demarked by <>
. The rcc-run build step is able to substitute values for these variables at build-time based on options provided to the command lined interface. The example above provides a good starting point with some of the necessary template variables in place. It is not recommended to remove the template variables for <name>
, <project>
, <zone>
, or <image>
.
For your reference, template variables for rcc-ephemeral
clusters that are substituted at run-time are given in the table below.
Template Variable |
Value/CLI Option |
Description |
---|---|---|
<name> |
frun-{build-id}[0:7] |
Name of the ephemeral cluster |
<project> |
–project |
Google Cloud Project ID |
<zone> |
–zone |
Google Cloud zone |
<image> |
–gce-image |
Google Compute Engine VM Image self-link |
<build_id> |
–build-id |
Google Cloud Build build ID |
<vpc_subnet> |
–vpc-subnet |
Google Cloud VPC Subnetwork |
<service_account> |
–service-account |
Google Cloud Service Account email address |
Customize Partitions
Partitions are used to define the type of compute nodes available to you for testing. Each partition consists of a homogeneous pool of machines. While each partition has 22 variables to concretely define it, we’ll cover a few of the options here to help you make informed decisions when defining partitions for testing.
name
The partition name is used to identify a homogeneous group of compute nodes. When writing your RCC Run CI File, you will set the partition
field to one of the partition names set in your tfvars
file.
machine_type
The machine type refers to a Google Compute Engine machine type. If you define multiple partitions with differing machine types, this gives you the ability to see how your code’s performance varies across different hardware
max_node_count
This is the maximum number of nodes that can be created in this partition. When tests are run, the cluster will automatically manage provisioning compute nodes to run benchmarks and tear them down upon completion. Keep in mind that you need to ensure that you have sufficient Quota for the machine type, gpus, and disks in the region that your cluster is deployed to.
image
The image
expects a self-link to a VM image for the cluster. It is recommended that you leave this field set to the template variable "<image>"
so that rcc-run can set this field for you. The default image that RCC uses is projects/research-computing-cloud/global/images/family/rcc-run-foss
, which includes Singularity and OpenMPI 4.0.5.
gpu_type / gpu_count
The gpu_type
field is used to set the type of GPU to attach to each compute node in the partition. Possible values are
nvidia-tesla-k80
nvidia-tesla-p100
nvidia-tesla-v100
nvidia-tesla-p4
nvidia-tesla-t4
nvidia-tesla-a100 (A2 instances only)
The gpu_count
field is used to set the number of GPUs per machine in the partition. For most GPUs, you can set this to 0, 1, 2, 4, or 8. Currently, GPUs must be used with an n1 machine type on Google Cloud (except for the A100 GPUs).
Keep in mind that each GPU type is available in certain zones and that there are restrictions on the ratio of vCPU to GPU.