Quickstart

Learn how to deploy the latest version of Pachyderm quickly with simplified instructions and pre-set Helm values.

December 5, 2022

On this page, you will find simplified deployment instructions and Helm values to get you started with the latest release of Pachyderm on the Kubernetes engine of your choice: AWS (EKS), Google (GKE), or Azure (AKS).

For each cloud provider, we will give you the option to “quick deploy” Pachyderm with or without an enterprise key. A quick deployment allows you to experiment with Pachyderm without having to go through any infrastructure setup. In particular, you do not need to set up any object store or PostgreSQL instance.

💡

The deployment steps highlighted in this document are not intended for production. For production settings, please read our infrastructure recommendations. In particular, we recommend:

  • the use of a managed PostgreSQL server (RDS, CloudSQL, or PostgreSQL Server) rather than Pachyderm’s default bundled PostgreSQL.
  • the setup of a TCP Load Balancer in front of your pachd service.
  • the setup of an Ingress Controller in front of Console.

Then find your targeted Cloud provider in the Deploy and Manage section of this documentation.

⚠️

We are now shipping Pachyderm with an optional embedded proxy allowing your cluster to expose one single port externally. This deployment setup is optional.

If you choose to deploy Pachyderm with a Proxy, check out our new recommended architecture and deployment instructions.

Deploying with a proxy has several advantages:

  • You only need to set up one TCP Load Balancer (no Ingress in front of Console).
  • You only need one DNS entry.
  • It simplifies the deployment of Console.
  • You no longer need to port-forward.

1. Prerequisites #

Pachyderm is deployed on a Kubernetes Cluster.

Install the following clients on your machine before you start creating your cluster. Use the latest available version of the components listed below.

  • kubectl: the CLI to interact with your cluster.
  • pachctl: the CLI to interact with Pachyderm.
  • helm: the CLI used to install Pachyderm on your cluster.
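
Before you create a cluster, it can help to confirm that all three clients are on your PATH. The following is a minimal sketch; missing_clients is a hypothetical helper name used for illustration and is not part of Pachyderm, kubectl, or Helm.

```shell
# Print the name of every listed command that is not found on PATH.
# `missing_clients` is a hypothetical helper, not a Pachyderm command.
missing_clients() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Example: list which of the required clients still need to be installed.
# missing_clients kubectl pachctl helm
```

If the example call prints nothing, all three clients were found.
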
⚠️

Get a Pachyderm Enterprise key

To get a free-trial token, fill in this form, or get in touch with us at sales@pachyderm.io or on our Slack.

Select your favorite cloud provider.

💡

Note that we often use the acronym CE for Community Edition.

2. Create Your Values.yaml #

ℹ️

Pachyderm comes with a Web UI (Console) by default.

AWS #

  1. Additional client installation: Install AWS CLI

  2. Create an EKS cluster

  3. Create an S3 bucket for your data

  4. Create a values.yaml

Deploy Pachyderm CE (includes Console CE) #

 deployTarget: "AMAZON"
 pachd:
   storage:
     amazon:
       bucket: "bucket_name"      
       # this is an example access key ID taken from https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html (AWS Credentials)
       id: "AKIAIOSFODNN7EXAMPLE"                
       # this is an example secret access key taken from https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html  (AWS Credentials)          
       secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
       region: "us-east-2"
   externalService:
     enabled: true
 console:
   enabled: true
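
If you would rather not type credentials into the file by hand, the values above can be rendered from environment variables. This is only a sketch under assumptions: the function name render_aws_values and the variable names BUCKET_NAME, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION are choices made for this example, not something the Helm chart requires.

```shell
# Write the Community Edition values shown above to stdout, filling in the
# bucket, credentials, and region from environment variables.
# The function and variable names are assumptions made for this sketch.
render_aws_values() {
  cat <<EOF
deployTarget: "AMAZON"
pachd:
  storage:
    amazon:
      bucket: "${BUCKET_NAME}"
      id: "${AWS_ACCESS_KEY_ID}"
      secret: "${AWS_SECRET_ACCESS_KEY}"
      region: "${AWS_REGION}"
  externalService:
    enabled: true
console:
  enabled: true
EOF
}

# Example:
# render_aws_values > values.yaml
```

The generated file still contains the secret in plain text, so treat it like any other credentials file.
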

Deploy Pachyderm Enterprise with Console #

Note that when deploying Pachyderm Enterprise with Console, we create a default mock user (username: admin, password: password) so that you can authenticate to Console without connecting an Identity Provider. The mock user is a Cluster Admin by default.

 deployTarget: "AMAZON"
 pachd:
   storage:
     amazon:
       bucket: "bucket_name"                
       # this is an example access key ID taken from https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html (AWS Credentials)
       id: "AKIAIOSFODNN7EXAMPLE"                
       # this is an example secret access key taken from https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html  (AWS Credentials)          
       secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
       region: "us-east-2"
   # pachyderm enterprise key 
   enterpriseLicenseKey: "YOUR_ENTERPRISE_TOKEN"
 console:
   enabled: true

Jump to Helm install

Google #

  1. Additional client installation: Install Google Cloud SDK

  2. Create a GKE cluster. Note: Add --scopes storage-rw to your gcloud container clusters create command.

  3. Create a GCS Bucket for your data

  4. Create a values.yaml

Deploy Pachyderm CE (includes Console CE) #

 deployTarget: "GOOGLE"
 pachd:
   storage:
     google:
       bucket: "bucket_name"
       cred: |
                  INSERT JSON CONTENT HERE
   externalService:
     enabled: true
 console:
   enabled: true
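
The JSON pasted under cred: | must be indented more deeply than the cred key itself, or the YAML will not parse as intended. As a rough sketch, the helper below indents a file read from stdin; indent_block is a hypothetical name, and the default of 8 spaces is an assumption that you should adjust to match your own values.yaml.

```shell
# Indent every line read from stdin by N spaces (default 8) so that a JSON
# key file can be pasted under a YAML block scalar such as `cred: |`.
# `indent_block` is a hypothetical helper name.
indent_block() {
  n="${1:-8}"
  awk -v n="$n" '
    BEGIN { pad = ""; for (i = 0; i < n; i++) pad = pad " " }
    { print pad $0 }
  '
}

# Example (the key file name is a placeholder):
# indent_block 8 < service-account-key.json
```
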

Deploy Pachyderm Enterprise with Console #

Note that when deploying Pachyderm Enterprise with Console, we create a default mock user (username: admin, password: password) so that you can authenticate to Console without connecting an Identity Provider. The mock user is a Cluster Admin by default.

 deployTarget: "GOOGLE"
 pachd:
   storage:
     google:
       bucket: "bucket_name"
       cred: |
                  INSERT JSON CONTENT HERE
   # pachyderm enterprise key
   enterpriseLicenseKey: "YOUR_ENTERPRISE_TOKEN"
 console:
   enabled: true

Jump to Helm install

Azure #

ℹ️
  1. Additional client installation: Install Azure CLI 2.0.1 or later.

  2. Create an AKS cluster

  3. Create a Storage Container for your data

  4. Create a values.yaml

Deploy Pachyderm CE (includes Console CE) #

 deployTarget: "MICROSOFT"
 pachd:
   storage:
     microsoft:
       # storage container name
       container: "container_name"
       # storage account name
       id: "storage_account_name"
       # storage account key
       secret: "storage_account_key"
   externalService:
     enabled: true
 console:
   enabled: true

Deploy Pachyderm Enterprise with Console #

Note that when deploying Pachyderm Enterprise with Console, we create a default mock user (username: admin, password: password) so that you can authenticate to Console without connecting an Identity Provider. The mock user is a Cluster Admin by default.

 deployTarget: "MICROSOFT"
 pachd:
   storage:
     microsoft:
       # storage container name
       container: "container_name"
       # storage account name
       id: "storage_account_name"
       # storage account key
       secret: "storage_account_key"
   # pachyderm enterprise key
   enterpriseLicenseKey: "YOUR_ENTERPRISE_TOKEN"
 console:
   enabled: true

Jump to Helm install

3. Helm Install #

  • You will be deploying the latest GA release of Pachyderm. Replace my_pachyderm_values.yaml with the path to the values.yaml you created in step 2:

    helm repo add pach https://helm.pachyderm.com
    helm repo update
    helm install pachd pach/pachyderm -f my_pachyderm_values.yaml 
  • Check your deployment:

    kubectl get pods

    The deployment takes some time. You can run kubectl get pods periodically to check the status of your deployment.

    Once all the pods are up, you should see a pod for pachd running (alongside etcd, pg-bouncer or postgres, console, depending on your installation). If you are curious about the architecture of Pachyderm, take a look at our high-level architecture diagram.

    System Response:

    NAME                           READY   STATUS    RESTARTS   AGE
    console-7b69ddf66d-bxmg5       1/1     Running   0          18h
    etcd-0                         1/1     Running   0          18h
    pachd-5db79fb9dd-b2gdq         1/1     Running   2          18h
    pg-bouncer-55d9c86768-g8lx7    1/1     Running   0          18h
    postgres-0                     1/1     Running   0          18h
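
Rather than reading the table by eye, you could filter it for pods that are not ready yet. The sketch below assumes the default column layout shown above (NAME, READY, STATUS, ...); pods_not_ready is a hypothetical helper name.

```shell
# Read `kubectl get pods` table output on stdin and print the name of each
# pod that is not fully ready (READY x/y with x != y) or not Running.
# `pods_not_ready` is a hypothetical helper name.
pods_not_ready() {
  awk 'NR > 1 {
    split($2, ready, "/")
    if (ready[1] != ready[2] || $3 != "Running") print $1
  }'
}

# Example: an empty result means every listed pod is up.
# kubectl get pods | pods_not_ready
```

Note that this simple filter also reports pods in a Completed state. As an alternative, kubectl wait --for=condition=Ready pods --all --timeout=300s blocks until the pods are ready or the timeout expires.
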

4. Have ‘pachctl’ And Your Cluster Communicate #

You have deployed Pachyderm without Console #

  • Retrieve the external IP address of pachd service:

    kubectl get services | grep pachd-lb | awk '{print $4}'
  • Then update your context for pachctl to point at your cluster:

    echo '{"pachd_address": "grpc://<external-IP-address>:30650"}' | pachctl config set context "<choose-a-cluster-context-name>" --overwrite
    pachctl config set active-context "<your-cluster-context-name>"
  • If Authentication is activated (when you deploy with an enterprise key already set, for example), you need to run pachctl auth login, then authenticate to Pachyderm with your mock User (username: admin, password: password), before you use pachctl.
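
The first two steps above can be chained so that the address is not copied by hand. This is a sketch only: pachd_context_json is a hypothetical helper, and the default port 30650 is taken from the command shown above.

```shell
# Print the JSON document that `pachctl config set context` reads on stdin,
# given pachd's external IP and an optional port (30650 in this guide).
# `pachd_context_json` is a hypothetical helper name.
pachd_context_json() {
  ip="$1"
  port="${2:-30650}"
  printf '{"pachd_address": "grpc://%s:%s"}\n' "$ip" "$port"
}

# Example, reusing the commands from the steps above
# ("my-cluster" is a placeholder context name):
# ip=$(kubectl get services | grep pachd-lb | awk '{print $4}')
# pachd_context_json "$ip" | pachctl config set context "my-cluster" --overwrite
# pachctl config set active-context "my-cluster"
```
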

You have deployed Pachyderm with Console #

  • To connect to your new Pachyderm instance, run:

    pachctl config import-kube local --overwrite
    pachctl config set active-context local
  • Then run pachctl port-forward, either in the background or in a separate tab of your terminal.

Check that your cluster is up and running #

pachctl version

System Response:

COMPONENT           VERSION
pachctl             2.4.1
pachd               2.4.1
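
If you want a script to confirm that the client and server report the same version, something along these lines may work. It assumes the two-column table layout shown above; versions_match is a hypothetical helper name.

```shell
# Read `pachctl version` table output on stdin; succeed only when both the
# pachctl and pachd rows are present and report the same version string.
# `versions_match` is a hypothetical helper name.
versions_match() {
  awk '
    $1 == "pachctl" { c = $2 }
    $1 == "pachd"   { d = $2 }
    END { exit !(c != "" && (c "") == (d "")) }
  '
}

# Example:
# pachctl version | versions_match && echo "client and server versions match"
```
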

5. Connect to Console #

To connect to your Console (Pachyderm UI):

  • Point your browser to http://localhost:4000
  • If Authentication is activated (when you deploy with an enterprise key already set, for example), you will be prompted to authenticate. Use your mock User (username: admin, password: password).

You are all set!

6. Try our beginner tutorial. #

7. NOTEBOOKS USERS: Install Pachyderm JupyterLab Mount Extension #

Once your cluster is up and running, you can helm install JupyterHub on your Pachyderm cluster and experiment with your data in Pachyderm from your Notebook cells.

Check out our JupyterHub and Pachyderm Mount Extension page for installation instructions.

Use Pachyderm’s default image and values file (jupyterhub-ext-values.yaml), or follow the instructions to update your own.

ℹ️

Make sure to check out our data science notebook examples running on Pachyderm, from a market sentiment NLP implementation using a FinBERT model to pipelines training a regression model on the Boston Housing Dataset.