Use GPUs

Set up a GPU-enabled Kubernetes Cluster

Pachyderm leverages Kubernetes Device Plugins to let Kubernetes Pods access specialized hardware such as GPUs. For instructions on how to set up a GPU-enabled Kubernetes cluster through device plugins, see the Kubernetes documentation.

Pachyderm on NVIDIA DGX A100

Let’s walk through the main steps allowing Pachyderm to leverage the AI performance of your DGX A100 GPUs.

Info

Read NVIDIA DGX A100's full user guide.

TL;DR

Support for scheduling GPU workloads in Kubernetes requires a fair amount of trial and error. To ease the process:

  • This setup page will walk you through very detailed installation steps to prepare your Kubernetes cluster.
  • Take advantage of a user's past experience in this blog.

Here is a quick recap of what will be needed:

  • Have a working Kubernetes control plane and worker nodes attached to your cluster.
  • Install the DGX system in a hosting environment.
  • Add the DGX to your K8s API server as a worker node.

Once the DGX is added to your API server, proceed as follows:

  1. Enable the GPU worker node in the Kubernetes cluster by installing NVIDIA's dependencies:

    Dependency packages and deployment methods may vary. The following list is not exhaustive and is intended to serve as a general guideline.

    • NVIDIA drivers

      For complete instructions on setting up NVIDIA drivers, visit this quickstart guide or check this summary of the steps.

    • NVIDIA Container Toolkit (nvidia-docker2)

      You may need to use different packages depending on your container engine.

    • NVIDIA Kubernetes Device Plugin

      To use GPUs in Kubernetes, the NVIDIA Device Plugin is required. The NVIDIA Device Plugin is a daemonset that enumerates the GPUs on each node of the cluster and allows pods to run on them. Follow these steps to deploy the device plugin as a daemonset using helm.

    Checkpoint: Run NVIDIA System Management Interface (nvidia-smi) on the CLI. It should return the list of NVIDIA GPUs.

  2. Test a sample container with GPU:

    To test whether CUDA jobs can be deployed, run a sample CUDA (vectorAdd) application.

    For reference, find the pod spec below:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-test
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda-vector-add
        image: "nvidia/samples:vectoradd-cuda10.2"
        resources:
          limits:
            nvidia.com/gpu: 1
    

    Save it as gpu-pod.yaml then deploy the application:

    kubectl apply -f gpu-pod.yaml
    
    Check the pod status (it should reach Completed), then the logs, to make sure that the app completed successfully:

    kubectl get pods gpu-test
    kubectl logs gpu-test
    

  3. If the container above is scheduled and completes successfully, install Pachyderm. You are then ready to start leveraging NVIDIA's GPUs in your Pachyderm pipelines.
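
The checkpoints in steps 1 and 2 can also be scripted. The following is a minimal Python sketch, not part of Pachyderm or NVIDIA tooling: it assumes `kubectl` is on your PATH and reads the standard JSON that `kubectl get ... -o json` returns. The helper names (`gpu_nodes`, `pod_succeeded`, `kubectl_json`) are hypothetical and chosen only for illustration.

```python
import json
import subprocess

# Resource name used in the pod spec above (nvidia.com/gpu).
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_nodes(node_list):
    """Map node name -> allocatable GPU count, for nodes that advertise GPUs.

    `node_list` is the parsed output of `kubectl get nodes -o json`.
    """
    result = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports resource quantities as strings, e.g. "8".
        count = int(allocatable.get(GPU_RESOURCE, "0"))
        if count > 0:
            result[node["metadata"]["name"]] = count
    return result


def pod_succeeded(pod):
    """True if the pod (parsed `kubectl get pod <name> -o json`) ran to completion."""
    return pod.get("status", {}).get("phase") == "Succeeded"


def kubectl_json(*args):
    """Run a kubectl command with `-o json` and return the parsed output."""
    completed = subprocess.run(
        ["kubectl", *args, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)


# Example usage against a live cluster (requires kubectl access):
#   print(gpu_nodes(kubectl_json("get", "nodes")))
#   print(pod_succeeded(kubectl_json("get", "pod", "gpu-test")))
```

If `gpu_nodes` returns an empty mapping, the device plugin is probably not advertising GPUs yet, and the `gpu-test` pod is likely to stay unscheduled.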

Note

Note that you have the option to use GPUs for compute-intensive workloads on:

Configure GPUs in Pipelines

Once your GPU-enabled Kubernetes cluster is set up, you can request GPUs in your pipeline specifications by setting GPU resource limits, which specify the GPU type and the number of GPUs.

Important

By default, Pachyderm workers are spun up and wait for new input. That works well for pipelines that process a steady stream of incoming commits. However, with a lower volume of input commits, your pipeline workers could be 'taking' the GPU resource as far as k8s is concerned, but 'idling' as far as you are concerned.

  • Make sure to set the autoscaling field to true so that if your pipeline is not getting used, the worker pods get spun down and the GPU resource freed.
  • Additionally, specify how much GPU your pipeline worker will need via the resource_requests field in your pipeline specification, with resource_requests <= resource_limits.

Below is an example of a pipeline spec for a GPU-enabled pipeline from our market sentiment analysis example:

{
  "pipeline": {
    "name": "train_model"
  },
  "description": "Fine tune a BERT model for sentiment analysis on financial data.",
  "input": {
    "cross": [
      {
        "pfs": {
          "repo": "dataset",
          "glob": "/"
        }
      },
      {
        "pfs": {
          "repo": "language_model",
          "glob": "/"
        }
      }
    ]
  },
  "transform": {
    "cmd": [
      "python", "finbert_training.py", "--lm_path", "/pfs/language_model/", "--cl_path", "/pfs/out", "--cl_data_path", "/pfs/dataset/"
    ],
    "image": "pachyderm/market_sentiment:dev0.25"
  },
  "resource_limits": {
    "gpu": {
      "type": "nvidia.com/gpu",
      "number": 1
    }
  },
  "resource_requests": {
    "memory": "4G",
    "cpu": 1
  }
}
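
The two recommendations above (enable autoscaling, and keep resource_requests <= resource_limits) can be checked before a spec is submitted. The sketch below is a hypothetical lint helper, not a Pachyderm API. It assumes the spec is a parsed JSON object shaped like the example, that the `autoscaling` field sits at the top level of the spec, and that a GPU request, if present, uses the same `gpu.number` layout as `resource_limits`.

```python
import json


def gpu_count(section):
    """Return the GPU count of a resource_limits / resource_requests section (0 if absent)."""
    gpu = (section or {}).get("gpu") or {}
    return gpu.get("number", 0)


def lint_gpu_spec(spec):
    """Return a list of warnings for a parsed pipeline spec (hypothetical checks)."""
    warnings = []
    limit = gpu_count(spec.get("resource_limits"))
    request = gpu_count(spec.get("resource_requests"))
    if limit == 0:
        warnings.append("no GPU limit set in resource_limits")
    if request > limit:
        warnings.append("GPU resource_requests exceed resource_limits")
    if limit > 0 and not spec.get("autoscaling", False):
        warnings.append("autoscaling is not true; idle workers keep the GPU reserved")
    return warnings


# Trimmed copy of the example spec above, used only for illustration.
example = json.loads("""
{
  "pipeline": {"name": "train_model"},
  "resource_limits": {"gpu": {"type": "nvidia.com/gpu", "number": 1}},
  "resource_requests": {"memory": "4G", "cpu": 1}
}
""")

for warning in lint_gpu_spec(example):
    print(warning)
```

Run against the example spec, the only warning concerns autoscaling, since that spec does not set the field.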

Last update: August 20, 2022