Deploy Pachyderm With a Proxy: One Port For All External Traffic

We are now shipping Pachyderm with an optional embedded proxy allowing Pachyderm to expose one single port externally (whether you access pachd over gRPC using pachctl, or console over HTTP, for example).

See Pachyderm's new high-level architecture diagram (figure: High level architecture).

This page is an add-on to existing installation instructions in the case where you chose to deploy Pachyderm with an embedded proxy. The steps below replace all or parts of the existing installation documentation. We will let you know when to use them and which section they overwrite.

TL;DR

  • When the proxy option is activated, Pachyderm is reachable through one TCP port for all incoming grpc (grpcs), console (HTTP/HTTPS), s3 gateway, OIDC, and dex traffic, then routes each call to the appropriate backend microservice without any additional configuration.
  • Enable the proxy as follows:
proxy:
  enabled: true
  service:
    type: LoadBalancer

Warning

The deployment of Pachyderm with a proxy is optional at the moment and will become permanent in the next minor release of Pachyderm.

The diagram below gives a quick overview of the layout of services and pods when using a proxy. In particular, it details how Pachyderm listens to all inbound traffic on one port, then routes each call to the appropriate backend:

Figure: Infrastructure Recommendation

Note

See our reference values.yaml for all available configurable fields of the proxy.

Before any deployment in production, we recommend reading the following section to set up your production infrastructure.

Alternatively, you can skip those infrastructure prerequisites and make a quick cloud installation or jump to our local deployment section for a first encounter with Pachyderm.

Pachyderm General Infrastructure Recommendations

For production deployments, we recommend that you:

  • Provision a TCP load balancer for all incoming HTTP/HTTPS, gRPC/gRPCs, S3 gateway, and /dex traffic. The TCP load balancer (operating at layer 4 of the OSI model) forwards ports 80/443 to the pachyderm-proxy service entry point. See the diagram above.

    When a proxy is enabled with type:LoadBalancer (see the snippet of values.yaml enabling the proxy), Pachyderm creates a pachyderm-proxy service allowing your cloud platform (AWS, GKE...) to provision a TCP Load Balancer automatically.

    Note

    • You can optionally attach any additional Load Balancer configuration information to the metadata of your service by adding the appropriate annotations in the proxy.service of your values.yaml.
    • You can pre-create a static IP (for example, in GCP: gcloud compute addresses create ADDRESS_NAME --global --ip-version IPV4), then pass this external IP to the loadBalancerIP field in the proxy.service section of your values.yaml.
    proxy:
      enabled: true
      service:
        type: LoadBalancer
        annotations: {<add-optional-annotations-here>}
        loadBalancerIP: <insert-your-proxy-external-IP-address-here>
    
  • Use a secure connection

    Make sure that you have Transport Layer Security (TLS) enabled for your incoming traffic.

  • Use Pachyderm authentication/authorization

    Pachyderm authentication is an additional security layer to protect your data from unauthorized access. See the authentication and authorization section to activate access control and set up an Identity Provider (IdP).

  • Configure access to your external IP addresses through firewalls or your Cloud Provider Network Security.

  • (Optional) Create a DNS entry for your public IP
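
As a minimal sketch of the load balancer recommendation above, the following shell function prints a proxy override fragment, adding loadBalancerIP only when a pre-created static IP is supplied. The function name proxy_values and the address 203.0.113.10 are illustrative assumptions, not part of Pachyderm.

```shell
# Hypothetical helper: print a values.yaml fragment for the proxy service.
# $1 (optional) is a pre-created static IP. When it is omitted, loadBalancerIP
# is left out and the cloud provider assigns an address.
proxy_values() {
  printf 'proxy:\n'
  printf '  enabled: true\n'
  printf '  service:\n'
  printf '    type: LoadBalancer\n'
  if [ -n "${1:-}" ]; then
    printf '    loadBalancerIP: %s\n' "$1"
  fi
}

# Example with a documentation-only placeholder address.
proxy_values 203.0.113.10
```

One way to use it is to redirect the output to a separate file (for example, proxy-values.yaml) and pass that file to helm with an additional -f flag next to your main values.yaml.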

Deploy Pachyderm in Production With a Proxy

Once your networking infrastructure is set up, go to the deployment page that matches your cloud provider and follow installation sections 1 through 6. Make sure that you have enabled the proxy by adding the following lines to your values.yaml:

proxy:
  enabled: true
  service:
    type: LoadBalancer
    annotations: {see examples below}

Once your cluster is provisioned and Pachyderm is installed, replace the instructions in section 7 (Have 'pachctl' And Your Cluster Communicate) with this new set of instructions.

If you plan to deploy Console in Production, read the following and adjust your values.yaml accordingly.

Deploying Pachyderm with a proxy simplifies the setup of Console (No more dedicated DNS and ingress needed in front of Console). In a production environment, you will need to:

  • Activate authentication. If you are a Helm user, setting your license key in your values.yaml activates authentication by default, so this step applies only to users activating auth with pachctl.
  • Update the values in the highlighted fields below.
  • Additionally, you will need to configure your Identity Provider (oidc.upstreamIDPs). See examples for the oidc.upstreamIDPs value in the helm chart values specification and read our IDP Configuration page for a better understanding of each field.
deployTarget: "<pick-your-cloud-provider>"

# enable the proxy
proxy:
  enabled: true
  service:
    type: LoadBalancer
    annotations: {...}

ingress:
  host: <insert-external-ip-address-or-dns-name>

pachd:
  storage:
    amazon:
      bucket: "<bucket-name>"
      ...
      region: "<us-east-2>"
  # pachyderm enterprise key
  enterpriseLicenseKey: "<your-enterprise-token>"

oidc:
  # populate oidc.upstreamIDPs with an array of Dex Connector configurations.
  upstreamIDPs: []

To connect your pachctl client to your cluster

The grpc address provided when pointing your pachctl CLI at your cluster changes now that a proxy allows a single entry point. Run the following commands:

  1. Retrieve the external IP address of your TCP load balancer (or use your domain name):
    kubectl get services | grep pachyderm-proxy | awk '{print $4}'
    
  2. Update the context of your cluster using the external IP address/domain name captured above:

    echo '{"pachd_address": "grpc://<external-IP-address-or-domain-name>:80"}' | pachctl config set context "<your-cluster-context-name>" --overwrite
    
    pachctl config set active-context "<your-cluster-context-name>"
    

  3. Check that you are using the right context:

    pachctl config get active-context
    

Your cluster context name should show up. Your pachctl client now points to your cluster.
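
The context update in step 2 can be sketched as a small helper that builds the JSON document read by pachctl config set context. The function name pachd_context_json, the host pachyderm.example.com, and the optional "tls" argument (grpcs on port 443) are assumptions made for illustration. The steps above only show the grpc address on port 80.

```shell
# Hypothetical helper: build the JSON that `pachctl config set context` reads on stdin.
# $1 = external IP address or DNS name of the proxy
# $2 = "tls" to target grpcs on port 443 (assumption); default is grpc on port 80
pachd_context_json() {
  if [ "${2:-}" = "tls" ]; then
    printf '{"pachd_address": "grpcs://%s:443"}\n' "$1"
  else
    printf '{"pachd_address": "grpc://%s:80"}\n' "$1"
  fi
}

# Example with a placeholder host name.
pachd_context_json pachyderm.example.com
```

The output can be piped into pachctl config set context "<your-cluster-context-name>" --overwrite, as in step 2. As an alternative to the grep and awk pipeline in step 1, kubectl get service pachyderm-proxy -o jsonpath='{.status.loadBalancer.ingress[0].ip}' reads the address directly. Some cloud providers report a hostname instead of an ip in that field.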

If you have deployed Console

Point your browser to http://<external-IP-address-or-domain-name>. No port number is needed. You will be prompted to log in to your Console.

If you have installed JupyterHub and the Mount Extension

The connection string to your Pachyderm cluster (check the login form accessible by clicking on the mount extension icon in the far left tab bar of your JupyterLab) now depends on whether you have deployed JupyterHub on:

  • The same cluster: grpc://pachd.<namespace>.svc.cluster.local:30650
  • An external cluster: grpc://<external-IP-address-or-domain-name>:80
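
The two cases above can be captured in a short sketch that returns the connection string for the login form. The function name mount_extension_address and the example values (the default namespace and the pachyderm.example.com host) are hypothetical.

```shell
# Hypothetical helper: choose the connection string for the mount extension login form.
# $1 = "same" (JupyterHub runs on the Pachyderm cluster) or "external"
# $2 = Kubernetes namespace (same) or external IP/DNS name of the proxy (external)
mount_extension_address() {
  case "$1" in
    same)     printf 'grpc://pachd.%s.svc.cluster.local:30650\n' "$2" ;;
    external) printf 'grpc://%s:80\n' "$2" ;;
    *)        echo "usage: mount_extension_address same|external <value>" >&2
              return 1 ;;
  esac
}

mount_extension_address same default
mount_extension_address external pachyderm.example.com
```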

Quick Cloud Deployment With a Proxy

Follow your regular Quick Cloud Deploy documentation, except for the following steps:

  • In section 2 (Create Your Values.yaml), replace your values.yaml with the matching YAML file provided below. Make sure to replace the dummy values with your own information. Then proceed with the helm installation as detailed in section 3.
  • To connect your pachctl client to your cluster, replace section 4 with the instructions detailed in the link.
  • To connect to Console, replace section 5 with the instructions provided in the link.
  • If you deployed JupyterHub (section 7), use the instructions in the link to login to the Mount Extension.

AWS

deployTarget: "AMAZON"

proxy:
  enabled: true
  service:
    type: LoadBalancer

pachd:
  storage:
    amazon:
      bucket: "bucket_name"      
      # this is an example access key ID taken from https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html (AWS Credentials)
      id: "AKIAIOSFODNN7EXAMPLE"                
      # this is an example secret access key taken from https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html  (AWS Credentials)          
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      region: "us-east-2"          

With an enterprise key:

deployTarget: "AMAZON"

proxy:
  enabled: true
  service:
    type: LoadBalancer

ingress:
  host: <insert-external-ip-address-or-dns-name>

pachd:
  storage:
    amazon:
      bucket: "<bucket-name>"                
      # this is an example access key ID taken from https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html (AWS Credentials)
      id: "AKIAIOSFODNN7EXAMPLE"                
      # this is an example secret access key taken from https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html  (AWS Credentials)          
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      region: "<us-east-2>"
  # pachyderm enterprise key 
  enterpriseLicenseKey: "<your-enterprise-token>"
  localhostIssuer: "true"

Google

deployTarget: "GOOGLE"

proxy:
  enabled: true
  service:
    type: LoadBalancer

pachd:
  storage:
    google:
      bucket: "<bucket-name>"
      cred: |
        INSERT JSON CONTENT HERE
  externalService:
    enabled: true

With an enterprise key:

deployTarget: "GOOGLE"

proxy:
  enabled: true
  service:
    type: LoadBalancer

ingress:
  host: <insert-external-ip-address-or-dns-name>

pachd:
  storage:
    google:
      bucket: "<bucket-name>"
      cred: |
        INSERT JSON CONTENT HERE
  # pachyderm enterprise key
  enterpriseLicenseKey: "<your-enterprise-token>"
  localhostIssuer: "true"

Azure

deployTarget: "MICROSOFT"

proxy:
  enabled: true
  service:
    type: LoadBalancer

pachd:
  storage:
    microsoft:
      # storage container name
      container: "<your-container-name>"
      # storage account name
      id: "AKIAIOSFODNN7EXAMPLE"
      # storage account key
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

With an enterprise key:

deployTarget: "MICROSOFT"

proxy:
  enabled: true
  service:
    type: LoadBalancer

ingress:
  host: <insert-external-ip-address-or-dns-name>


pachd:
  storage:
    microsoft:
      # storage container name
      container: "<your-container-name>"
      # storage account name
      id: "AKIAIOSFODNN7EXAMPLE"
      # storage account key
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
  # pachyderm enterprise key
  enterpriseLicenseKey: "<your-enterprise-token>"
  localhostIssuer: "true"

Deploy Pachyderm Locally With a Proxy

This section is an alternative to the default local deployment instructions. It uses a variant of the original one-line command to enable a proxy.

Follow the Prerequisites before deploying Pachyderm (with or without Console) on your local cluster, then Connect 'pachctl' To Your Cluster.

JupyterLab users, you can also install Pachyderm JupyterLab Mount Extension on your local Pachyderm cluster to experience Pachyderm from your familiar notebooks.

Note that you can run both Console and JupyterLab on your local installation.

Prerequisites

  • If you are not using Linux, follow all the default Prerequisites installation instructions.

  • If you are a Linux user, make sure to set up your local Kubernetes cluster with Kind while following the default Prerequisites installation instructions. Use the Kind command below.

Minikube users

Start your Kubernetes environment:

minikube start

Later, we will use minikube tunnel to make the proxy available on localhost.

Check Minikube's documentation for details.

Kind users

Create your cluster with the following command:

cat <<EOF | kind create cluster --name=kind --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "ingress-ready=true"
    extraPortMappings:
      - containerPort: 30080
        hostPort: 80
        protocol: TCP
      - containerPort: 30443
        hostPort: 443
        protocol: TCP
EOF

The extraPortMappings will make NodePorts in the cluster available on localhost; NodePort 30080 becomes localhost:80. This will make Pachyderm available at localhost:80 as long as this kind cluster is running.

Check Kind's documentation for details.

Deploy Pachyderm Community Edition Or Enterprise

  • Get the Repo Info:
helm repo add pach https://helm.pachyderm.com  
helm repo update 
  • Install Pachyderm by running the following command:

Attention Kind users

Set your Service type to NodePort rather than LoadBalancer in the commands below.

--set proxy.service.type=NodePort

helm install pachd pach/pachyderm --set deployTarget=LOCAL --set proxy.enabled=true --set proxy.service.type=LoadBalancer

To install the Enterprise Edition instead, use the command below. It unlocks your enterprise features and installs Console Enterprise. Note that Console Enterprise requires authentication. By default, we create a mock user (username: admin, password: password) so that you can authenticate to Console without connecting your Identity Provider.

  • Create a license.txt file in which you paste your Enterprise Key.
  • Then, run the following helm command to install Pachyderm's latest Enterprise Edition:

    helm install pachd pach/pachyderm --set deployTarget=LOCAL --set proxy.enabled=true --set proxy.service.type=LoadBalancer --set pachd.enterpriseLicenseKey=$(cat license.txt) --set ingress.host=localhost
    
  • Check Your Install

Check the status of the Pachyderm pods by periodically running kubectl get pods. When Pachyderm is ready for use, all Pachyderm pods must be in the Running status.

kubectl get pods

System Response: At a very minimum, you should see the following pods (console depends on your choice above):

NAME                                  READY   STATUS    RESTARTS   AGE
pod/console-55bc9f679-w4xrk           1/1     Running   0          71m
pod/etcd-0                            1/1     Running   0          70m
pod/pachd-84487d6675-cf68x            1/1     Running   0          71m
pod/pachyderm-proxy-89d5c4f65-pst9l   1/1     Running   0          71m
pod/pg-bouncer-5dd558c8dc-zjlpj       1/1     Running   0          71m
pod/postgres-0                        1/1     Running   0          70m
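
Rather than reading the table by eye, the check above can be sketched as a filter over the kubectl get pods output. The function name all_pods_running is hypothetical, and the sketch assumes the default table layout shown above, with STATUS in the third column.

```shell
# Hypothetical helper: read `kubectl get pods` table output on stdin and
# succeed only if at least one pod is listed and every pod has STATUS Running.
all_pods_running() {
  awk '
    NR > 1 && NF >= 3 { seen = 1; if ($3 != "Running") bad = 1 }
    END { if (seen && !bad) exit 0; exit 1 }
  '
}

# Example on a shortened copy of the sample output above.
printf '%s\n' \
  'NAME                     READY   STATUS    RESTARTS   AGE' \
  'etcd-0                   1/1     Running   0          70m' \
  'pachd-84487d6675-cf68x   1/1     Running   0          71m' \
  | all_pods_running && echo "all pods running"
```

Against a live cluster, the same function would be used as kubectl get pods | all_pods_running.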

Connect 'pachctl' To Your Cluster

Assuming your pachd is running as shown above, you can now connect pachctl to your local cluster.

Minikube users

Open a new tab in your terminal and run minikube tunnel (the command creates a network route on your host to the pachyderm-proxy service deployed with type LoadBalancer and sets its ingress to its ClusterIP, here 127.0.0.1). You will be prompted to enter your password.

  • To connect pachctl to your new Pachyderm instance, run:

    echo '{"pachd_address":"grpc://127.0.0.1:80"}' | pachctl config set context local --overwrite && pachctl config set active-context local
    

    Verify that pachctl and your cluster are connected by running pachctl version:

    System Response:

    COMPONENT           VERSION  
    pachctl             2.3.5  
    pachd               2.3.5  
    
    You are all set!

  • To connect to your Console (Pachyderm UI), point your browser to localhost (no port number needed) and authenticate using the mock User (username: admin, password: password).

  • To use pachctl, run pachctl auth login then authenticate again (to Pachyderm this time) with the mock User (username: admin, password: password).

  • Notebook users, if you have installed JupyterHub and the Mount Extension on the same cluster, the connection URL to your Pachyderm cluster in the login form (click on the mount extension icon in the far left tab bar) is now: grpc://pachd.<namespace>.svc.cluster.local:30650

Changes to the S3 Gateway

The pachyderm-proxy service also routes Pachyderm's S3 gateway (allowing you to access Pachyderm's repo through the S3 protocol) on port 80 (note the endpoint in the diagram below).

Figure: Global S3 Gateway with Proxy

Changes to the Enterprise Server Setup

Your enterprise server is deployed in the same way as any regular cluster with a few differences (no object-store and two PostgreSQL databases required: dex and pachyderm). The same applies when deploying an enterprise server with a proxy.

Note that the enterprise server will be deployed behind its proxy, as will each cluster registered to this enterprise server.

Attention

Enabling an embedded enterprise server with your pachd as part of the same helm installation will not work with the proxy. You can use a standalone enterprise server instead.

Follow your regular enterprise server deployment and configuration instructions, except for the following steps:

  • Section 1 (Deploy an enterprise server):

    In the values.yaml provided as examples:

    • Remove the pachd.externalService section and replace it with proxy:

      proxy:
        enabled: true
        service:
          type: LoadBalancer
      
    • Update all mentions of http://<PACHD-IP>:30657/ and http://<PACHD-IP>:30658/ with http://<Enterprise-server-external-IP-or-DNS>:80/ or https://<Enterprise-server-external-IP-or-DNS>:443/

    • Your redirect_uri must be set to http(s)://<insert-external-ip-or-dns-name>/dex/callback in your IdP connector as mentioned in the IdP section of the documentation

  • Section 3: Register your cluster with the enterprise server:

    If you chose to register a cluster to an enterprise server using pachctl, change all the port numbers to 80(http)/443(https) in the pachctl enterprise register command:

    pachctl enterprise register --id <my-pachd-config-name> --enterprise-server-address <pach-enterprise-IP>:80 --pachd-address <pachd-IP>:80
    
  • Section 4: Enable auth on each cluster, use these instructions to:

    • Set up the issuer in the idp config between the enterprise server and your cluster:
    echo "issuer: http://<enterprise-external-IP-or-dns>" | pachctl idp set-config --config -
    
    • For each registered cluster, enable auth:

      pachctl auth activate --client-id <my-pachd-config-name> --redirect http://<pachd-external-IP-or-DNS>/authorization-code/callback 
      
    • Then resume the last part of instructions:

      • Make sure that your enterprise context is set up properly:
      pachctl config get active-enterprise-context
      

      If not:

      pachctl config set active-enterprise-context <my-enterprise-context-name>
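
The URL update described in Section 1 above (ports 30657 and 30658 replaced by the proxy on port 80) can be sketched as a sed filter. The function name rewrite_enterprise_urls, the sample input lines, and the addresses are illustrative assumptions. They are not taken from the enterprise server values.yaml, and the sketch covers only the http case.

```shell
# Hypothetical helper: rewrite legacy pachd endpoints read on stdin so that
# they point at the enterprise server proxy on port 80.
# $1 = old pachd IP or host as it appears in the input
# $2 = enterprise server external IP or DNS name
rewrite_enterprise_urls() (
  # Escape dots so the old address is matched literally, not as a regex wildcard.
  old=$(printf '%s' "$1" | sed 's/\./\\./g')
  sed -E "s#http://${old}:3065[78]/#http://$2:80/#g"
)

# Example on made-up input lines.
printf '%s\n' \
  'first: http://10.0.0.5:30658/' \
  'second: http://10.0.0.5:30657/authorization-code/callback' \
  | rewrite_enterprise_urls 10.0.0.5 enterprise.example.com
```

Reviewing the rewritten file before running helm remains advisable, since the filter touches every matching URL in its input.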
      

Last update: August 20, 2022