OpenShift

OpenShift is a popular enterprise Kubernetes distribution. Pachyderm can run on OpenShift with a few small tweaks to the deployment process, which are outlined below. See Known issues below for problems currently affecting OpenShift deployments.

Prerequisites

Pachyderm needs a few things to install and run successfully in any Kubernetes environment:

  1. A persistent volume, used by Pachyderm's etcd for storage of system metadata. The kind of PV you provision will depend on your infrastructure. For example, many on-premises deployments use Network File System (NFS) access to some kind of enterprise storage.
  2. An object store, used by Pachyderm's pachd for storing all your data. The object store you use will probably be dependent on where you're going to run OpenShift: S3 for AWS, GCS for Google Cloud Platform, Azure Blob Storage for Azure, or a storage provider like Minio, EMC's ECS or Swift providing S3-compatible access to enterprise storage for on-premises deployment.
  3. Access to particular TCP/IP ports for communication.

Persistent volume

You'll need to create a persistent volume with enough space for the metadata associated with the data you plan to store in Pachyderm. The pachctl deploy command for AWS, GCP and Azure creates persistent storage for you when you follow the instructions below. A custom deploy can also create storage.
We'll show you below how to remove the PV that's automatically created, in case you want to create it outside of the Pachyderm deployment and just consume it.

We're currently developing good rules of thumb for scaling this storage as your Pachyderm deployment grows, but it looks like 10G of disk space is sufficient for most purposes.

Object store

Size your object store generously: once you start using Pachyderm, you'll start versioning all your data. You'll need four items to configure object storage:

  1. The access endpoint. For example, Minio's endpoints are usually something like minio-server:9000. Don't begin it with the protocol; it's an endpoint, not a URL.
  2. The bucket name you're dedicating to Pachyderm. Pachyderm will need exclusive access to this bucket.
  3. The access key id for the object store. This is like a user name for logging into the object store.
  4. The secret key for the object store. This is like the above user's password.
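
The endpoint rule in item 1 is easy to get wrong, so here is a minimal Python sketch of a pre-flight check. The function name check_endpoint is hypothetical and not part of Pachyderm; it only assumes the endpoint has the form host or host:port, as in the Minio example above, and it does not handle IPv6 literals.

```python
def check_endpoint(endpoint: str) -> str:
    """Return the endpoint unchanged if it looks like host[:port]; raise otherwise.

    Hypothetical helper for illustration only; not part of pachctl.
    """
    # An endpoint must not carry a protocol such as http:// or https://.
    if "://" in endpoint:
        raise ValueError(f"endpoint must not include a protocol: {endpoint!r}")
    host, sep, port = endpoint.partition(":")
    if not host:
        raise ValueError("endpoint has no host part")
    # If a colon is present, whatever follows it must be a numeric port.
    if sep and not port.isdigit():
        raise ValueError(f"port is not numeric: {port!r}")
    return endpoint
```

Calling check_endpoint("minio-server:9000") returns the string unchanged, while check_endpoint("http://minio-server:9000") raises ValueError.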

TCP/IP ports

For more details on how Kubernetes networking and service definitions work, see the Kubernetes services documentation.

Incoming ports (port)

These are the ports internal to the containers. You'll find these on both the pachd and dash containers. OpenShift runs containers and pods as unprivileged users, which don't have access to port numbers below 1024. Pachyderm's default manifests use ports below 1024, so you'll have to modify the manifests to use other port numbers. It's usually as easy as adding a "1" in front of the port numbers we use.
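
As a rough illustration of the "add a 1 in front" rule, the Python sketch below maps a privileged port to an unprivileged one. The name unprivileged_port is hypothetical. Repeating the prefix for ports that would still fall below 1024 (for example, 80) is an assumption added here; Pachyderm's own three-digit ports only need one pass.

```python
PRIVILEGED_LIMIT = 1024  # ports below this are unavailable to unprivileged users


def unprivileged_port(port: int) -> int:
    """Prefix the port number with "1" until it is at least 1024.

    Hypothetical helper; ports already at or above 1024 are returned as-is.
    """
    if port <= 0:
        raise ValueError(f"not a usable port number: {port}")
    while port < PRIVILEGED_LIMIT:
        port = int("1" + str(port))
    return port
```

With this rule, 650 becomes 1650 and 999 becomes 1999, matching the example values used later in this guide.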

Pod ports (targetPort)

This is the port exposed by the pod to Kubernetes, which is forwarded to the port. You should leave the targetPort set at 0 so it will match the port definition.

External ports (nodePorts)

This is the port accessible from outside of Kubernetes. You probably don't need to change nodePort values unless your network security requirements or architecture require you to change to another method of access. Please see the Kubernetes services documentation for details.

The OCPify script

A bash script that automates many of the substitutions below is available at this gist. You can use it to modify a manifest created using the --dry-run flag to pachctl deploy custom, as detailed below, and then use this guide to ensure the modifications it makes are relevant to your OpenShift environment. It requires certain prerequisites, such as jq and sponge, found in moreutils.

This script may be useful as a basis for automating redeploys of Pachyderm as needed.

Best practices: Infrastructure as code

We highly encourage you to apply the best practices used in developing software to managing the deployment process.

  1. Create scripts that automate as much of your processes as possible and keep them under version control.
  2. Keep copies of all artifacts, such as manifests, produced by those scripts and keep those under version control.
  3. Document your practices in the code and outside it.

Preparing to deploy Pachyderm

Things you'll need

  1. Your PV. It can be created separately.
  2. Your object store information.
  3. Your project in OpenShift.
  4. A text editor for editing your deployment manifest.

Deploying Pachyderm

1. Setting up PV and object stores

How you deploy Pachyderm on OpenShift depends largely on where OpenShift is deployed. Below you'll find links to the documentation for each kind of deployment you can do. Follow the instructions there for setting up persistent volumes and object storage resources. Don't deploy your manifest yet; come back here after you've set up your PV and object store.

  * OpenShift Deployed on AWS
  * OpenShift Deployed on GCP
  * OpenShift Deployed on Azure
  * OpenShift Deployed on-premise

2. Determine your role security policy

Pachyderm is deployed by default with cluster roles. Many institutional OpenShift security policies require namespace-local roles rather than cluster roles. If your security policies require namespace-local roles, use the pachctl deploy command below with the --local-roles flag.

3. Run the deploy command with --dry-run

Once you have your PV, object store, and project, you can create a manifest for editing using the --dry-run argument to pachctl deploy. That step is detailed in the deployment instructions for each type of deployment, above.

Below, find examples, with cluster roles and with namespace-local roles, using AWS elastic block storage as a persistent disk with a custom deploy. We'll show how to remove this PV in case you want to use a PV you create separately.

Cluster roles

$ pachctl deploy custom --persistent-disk aws --object-store s3 \
     <pv-storage-name> <pv-storage-size> \
     <s3-bucket-name> <s3-access-key-id> <s3-access-secret-key> <s3-access-endpoint-url> \
     --static-etcd-volume=<pv-storage-name> > manifest.json

Namespace-local roles

$ pachctl deploy custom --persistent-disk aws --object-store s3 \
     <pv-storage-name> <pv-storage-size> \
     <s3-bucket-name> <s3-access-key-id> <s3-access-secret-key> <s3-access-endpoint-url> \
     --static-etcd-volume=<pv-storage-name> --local-roles > manifest.json

4. Modify pachd Service ports

In the deployment manifest, which we called manifest.json, above, find the stanza for the pachd Service. An example is shown below.

{
    "kind": "Service",
    "apiVersion": "v1",
    "metadata": {
        "name": "pachd",
        "namespace": "default",
        "creationTimestamp": null,
        "labels": {
            "app": "pachd",
            "suite": "pachyderm"
        },
        "annotations": {
            "prometheus.io/port": "9091",
            "prometheus.io/scrape": "true"
        }
    },
    "spec": {
        "ports": [
            {
                "name": "api-grpc-port",
                "port": 650,
                "targetPort": 0,
                "nodePort": 30650
            },
            {
                "name": "trace-port",
                "port": 651,
                "targetPort": 0,
                "nodePort": 30651
            },
            {
                "name": "api-http-port",
                "port": 652,
                "targetPort": 0,
                "nodePort": 30652
            },
            {
                "name": "saml-port",
                "port": 654,
                "targetPort": 0,
                "nodePort": 30654
            },
            {
                "name": "api-git-port",
                "port": 999,
                "targetPort": 0,
                "nodePort": 30999
            },
            {
                "name": "s3gateway-port",
                "port": 600,
                "targetPort": 0,
                "nodePort": 30600
            }
        ],
        "selector": {
            "app": "pachd"
        },
        "type": "NodePort"
    },
    "status": {
        "loadBalancer": {}
    }
}

While the nodePort declarations are fine, the port declarations are below 1024, which is too low for OpenShift. Good example values are shown below.

    "spec": {
        "ports": [
            {
                "name": "api-grpc-port",
                "port": 1650,
                "targetPort": 0,
                "nodePort": 30650
            },
            {
                "name": "trace-port",
                "port": 1651,
                "targetPort": 0,
                "nodePort": 30651
            },
            {
                "name": "api-http-port",
                "port": 1652,
                "targetPort": 0,
                "nodePort": 30652
            },
            {
                "name": "saml-port",
                "port": 1654,
                "targetPort": 0,
                "nodePort": 30654
            },
            {
                "name": "api-git-port",
                "port": 1999,
                "targetPort": 0,
                "nodePort": 30999
            },
            {
                "name": "s3gateway-port",
                "port": 1600,
                "targetPort": 0,
                "nodePort": 30600
            }
        ],
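
If you prefer to script this edit rather than make it by hand, the following Python sketch shows one possible approach. It assumes the Service stanza has already been loaded into a dict (for example with json.loads) and has the shape shown above. The function name bump_service_ports is hypothetical and is not part of pachctl or the OCPify script.

```python
import copy


def bump_service_ports(service: dict) -> dict:
    """Return a copy of a Service whose "port" values below 1024 get a leading "1".

    targetPort and nodePort are deliberately left untouched.
    """
    updated = copy.deepcopy(service)
    for entry in updated["spec"]["ports"]:
        if 0 < entry["port"] < 1024:
            entry["port"] = int("1" + str(entry["port"]))
    return updated
```

Because the function works on a deep copy, the original dict stays available for comparison or for keeping under version control alongside the modified one.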

5. Modify pachd Deployment ports and add environment variables

In this step, you're editing two parts of the pachd Deployment JSON.
Here, we'll omit the example of the unmodified version. Instead, we'll show you the modified version.

5.1 pachd Deployment ports

The pachd Deployment also has a set of port numbers in the spec for the pachd container. Those must be modified to match the port numbers you set above for each port.

{
    "kind": "Deployment",
    "apiVersion": "apps/v1",
    "metadata": {
        "name": "pachd",
        "namespace": "default",
        "creationTimestamp": null,
        "labels": {
            "app": "pachd",
            "suite": "pachyderm"
        }
    },
    "spec": {
        "replicas": 1,
        "selector": {
            "matchLabels": {
                "app": "pachd",
                "suite": "pachyderm"
            }
        },
        "template": {
            "metadata": {
                "name": "pachd",
                "namespace": "default",
                "creationTimestamp": null,
                "labels": {
                    "app": "pachd",
                    "suite": "pachyderm"
                },
                "annotations": {
                    "iam.amazonaws.com/role": ""
                }
            },
            "spec": {
                "volumes": [
                    {
                        "name": "pach-disk"
                    },
                    {
                        "name": "pachyderm-storage-secret",
                        "secret": {
                            "secretName": "pachyderm-storage-secret"
                        }
                    }
                ],
                "containers": [
                    {
                        "name": "pachd",
                        "image": "pachyderm/pachd:1.9.0rc1",
                        "ports": [
                            {
                                "name": "api-grpc-port",
                                "containerPort": 1650,
                                "protocol": "TCP"
                            },
                            {
                                "name": "trace-port",
                                "containerPort": 1651
                            },
                            {
                                "name": "api-http-port",
                                "containerPort": 1652,
                                "protocol": "TCP"
                            },
                            {
                                "name": "peer-port",
                                "containerPort": 1653,
                                "protocol": "TCP"
                            },
                            {
                                "name": "api-git-port",
                                "containerPort": 1999,
                                "protocol": "TCP"
                            },
                            {
                                "name": "saml-port",
                                "containerPort": 1654,
                                "protocol": "TCP"
                            }
                        ],
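
Since the Deployment's container ports must agree with the Service ports, a small consistency check can catch a missed edit. The Python sketch below is one way to do it, assuming both stanzas are loaded as dicts with the shapes shown in this guide. The name mismatched_ports is hypothetical. Ports that appear in only one of the two stanzas (such as peer-port) are ignored.

```python
def mismatched_ports(service: dict, deployment: dict, container_name: str = "pachd") -> dict:
    """Map port name -> (service port, containerPort) for every disagreement.

    Only port names present in both the Service and the named container are compared.
    """
    service_ports = {p["name"]: p["port"] for p in service["spec"]["ports"]}
    mismatches = {}
    for container in deployment["spec"]["template"]["spec"]["containers"]:
        if container["name"] != container_name:
            continue
        for p in container.get("ports", []):
            expected = service_ports.get(p["name"])
            if expected is not None and expected != p["containerPort"]:
                mismatches[p["name"]] = (expected, p["containerPort"])
    return mismatches
```

An empty result means every shared port name carries the same number in both stanzas.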

5.2 Add environment variables

Six environment variables are necessary for OpenShift:

  1. WORKER_USES_ROOT: This controls whether worker pipelines run as the root user or not. You'll need to set it to false.
  2. PORT: This is the gRPC port used by pachd for communication with pachctl and the API. It should be set to the same value you set for api-grpc-port above.
  3. PPROF_PORT: This is used for Prometheus. It should be set to the same value as trace-port above.
  4. HTTP_PORT: The port for the API proxy. It should be set to the same value as api-http-port above.
  5. PEER_PORT: Used to coordinate pachd instances. It should be set to the same value as peer-port above.
  6. PPS_WORKER_GRPC_PORT: Used to talk to pipelines. It should be set to a value above 1024. The example value of 1680 below is recommended.

The added values below are shown inserted above the PACH_ROOT value, which is typically the first value in this array. The rest of the stanza is omitted for clarity.

                        "env": [
                            {
                                "name": "WORKER_USES_ROOT",
                                "value": "false"
                            },
                            {
                                "name": "PORT",
                                "value": "1650"
                            },
                            {
                                "name": "PPROF_PORT",
                                "value": "1651"
                            },
                            {
                                "name": "HTTP_PORT",
                                "value": "1652"
                            },
                            {
                                "name": "PEER_PORT",
                                "value": "1653"
                            },
                            {
                                "name": "PPS_WORKER_GRPC_PORT",
                                "value": "1680"
                            },
                            {
                                "name": "PACH_ROOT",
                                "value": "/pach"
                            },
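
The same insertion can be scripted. The Python sketch below prepends the six variables to a container's env array, using the example values from this guide. It assumes the pachd container stanza is available as a dict; the names OPENSHIFT_ENV and add_openshift_env are hypothetical. Replacing any existing entries with the same names is a design choice made here so that running it twice does not create duplicates.

```python
# Example values from this guide; adjust them to the ports you chose.
OPENSHIFT_ENV = [
    {"name": "WORKER_USES_ROOT", "value": "false"},
    {"name": "PORT", "value": "1650"},
    {"name": "PPROF_PORT", "value": "1651"},
    {"name": "HTTP_PORT", "value": "1652"},
    {"name": "PEER_PORT", "value": "1653"},
    {"name": "PPS_WORKER_GRPC_PORT", "value": "1680"},
]


def add_openshift_env(container: dict, extra: list = OPENSHIFT_ENV) -> None:
    """Put the extra variables at the front of container["env"], in place."""
    names = {entry["name"] for entry in extra}
    # Drop any existing entries with the same names to avoid duplicates.
    kept = [entry for entry in container.get("env", []) if entry["name"] not in names]
    container["env"] = [dict(entry) for entry in extra] + kept
```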

6. (Optional) Remove the PV created during the deploy command

If you're using a PV you've created separately, remove the PV that was added to your manifest by pachctl deploy --dry-run. Here's the example PV we created with the deploy command we used above, so you can recognize it.

{
    "kind": "PersistentVolume",
    "apiVersion": "v1",
    "metadata": {
        "name": "etcd-volume",
        "namespace": "default",
        "creationTimestamp": null,
        "labels": {
            "app": "etcd",
            "suite": "pachyderm"
        }
    },
    "spec": {
        "capacity": {
            "storage": "10Gi"
        },
        "awsElasticBlockStore": {
            "volumeID": "pach-disk",
            "fsType": "ext4"
        },
        "accessModes": [
            "ReadWriteOnce"
        ],
        "persistentVolumeReclaimPolicy": "Retain"
    },
    "status": {}
}
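
Removing the PV can also be done programmatically. The Python sketch below assumes the manifest file is a sequence of JSON objects written one after another, separated only by whitespace; check your own manifest.json before relying on that. The function names split_manifest and drop_persistent_volume are hypothetical, and the default name etcd-volume is taken from the example above.

```python
import json


def split_manifest(text: str) -> list:
    """Parse a string holding several JSON objects back to back into a list."""
    decoder = json.JSONDecoder()
    objects, pos = [], 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        objects.append(obj)
    return objects


def drop_persistent_volume(objects: list, name: str = "etcd-volume") -> list:
    """Return the objects without the PersistentVolume that has the given name."""
    return [
        obj for obj in objects
        if not (obj.get("kind") == "PersistentVolume"
                and obj.get("metadata", {}).get("name") == name)
    ]
```

Only objects of kind PersistentVolume are removed; everything else in the manifest, including any object that refers to the volume, is left as it was.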

7. Deploy the Pachyderm manifest you modified

$ oc create -f manifest.json

You can see the cluster status by using oc get pods as in upstream Kubernetes:

    $ oc get pods
    NAME                     READY     STATUS    RESTARTS   AGE
    dash-6c9dc97d9c-89dv9    2/2       Running   0          1m
    etcd-0                   1/1       Running   0          4m
    pachd-65fd68d6d4-8vjq7   1/1       Running   0          4m
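
If you automate redeploys, you may want to check this output from a script. The Python sketch below parses text in the column layout shown above and lists pods that are not fully ready. The function name pods_not_ready is hypothetical, and treating anything other than a Running status as not ready is a simplification.

```python
def pods_not_ready(pods_output: str) -> list:
    """Return names of pods whose READY count is incomplete or STATUS is not Running.

    Expects the plain-text table printed by "oc get pods", header line included.
    """
    lines = [line for line in pods_output.splitlines() if line.strip()]
    pending = []
    for line in lines[1:]:  # skip the header row
        name, ready, status = line.split()[:3]
        have, want = ready.split("/")
        if status != "Running" or have != want:
            pending.append(name)
    return pending
```

For the sample output above, the function returns an empty list, since all three pods are Running with every container ready.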

Known issues

Problems related to OpenShift deployment are tracked in issues with the "openshift" label.