Deploy Pachyderm on AWS

Learn how to deploy a Pachyderm cluster on AWS.

December 5, 2022

This article walks you through deploying a Pachyderm cluster on Amazon Elastic Kubernetes Service (EKS).

Architecture Diagram #

[Figure: AWS architecture diagram]

Before You Start #

Before you can deploy Pachyderm on an EKS cluster, verify that you have the tools used in this article installed and configured: the AWS CLI (aws), eksctl, kubectl, Helm, and pachctl.

1. Deploy Kubernetes by using eksctl #

⚠️

Pachyderm requires Kubernetes 1.19.0 or later.

Use the eksctl tool to deploy an EKS cluster in your AWS environment. The eksctl create cluster command creates a virtual private cloud (VPC), a security group, and an IAM role that Kubernetes uses to create resources. For detailed instructions, see the Amazon documentation.

To deploy an EKS cluster, complete the following steps:

  1. Deploy an EKS cluster:

    eksctl create cluster --name <name> --version <version> \
    --nodegroup-name <name> --node-type <vm-flavor> \
    --nodes <number-of-nodes> --nodes-min <min-number-nodes> \
    --nodes-max <max-number-nodes> --node-ami auto

    Example:

    eksctl create cluster --name pachyderm-cluster --region us-east-2 --profile <your-named-profile>

  2. Verify the deployment:

    kubectl get all

    System Response:

    NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
    service/kubernetes   ClusterIP   10.100.0.1   <none>        443/TCP   23h
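Because Pachyderm requires a minimum Kubernetes version, it can help to compare the server version against that minimum before going further. The following is a minimal sketch, not part of the official procedure: the helper name version_at_least is hypothetical, and it assumes a sort that supports version ordering (-V).

```shell
# Succeeds (exit 0) when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available.
version_at_least() (
  actual=${1#v}            # strip a leading "v", as in "v1.24.7"
  actual=${actual%%-*}     # strip a build suffix such as "-eks-fb459a0"
  minimum=${2#v}
  lowest=$(printf '%s\n%s\n' "$actual" "$minimum" | sort -V | head -n 1)
  [ "$lowest" = "$minimum" ]
)

# Hypothetical usage (assumes jq is installed):
# server=$(kubectl version -o json | jq -r '.serverVersion.gitVersion')
# version_at_least "$server" "1.19.0" || echo "Kubernetes $server is too old for Pachyderm"
```

The comparison uses version ordering rather than string ordering, so 1.9.0 is correctly treated as older than 1.19.0.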

Once your Kubernetes cluster is up and your infrastructure is configured, you are ready to prepare for the installation of Pachyderm. Some of the steps below require you to update the values.yaml file that you started while setting up the recommended infrastructure.

ℹ️

Pachyderm recommends securing and managing your secrets in a secret manager. Learn how to set up and configure your EKS cluster to retrieve the relevant secrets from AWS Secrets Manager, then resume the following installation steps.

2. Create an S3 bucket #

Create an S3 object store bucket for data #

Pachyderm needs an S3 bucket (Object store) to store your data. You can create the bucket by running the following commands:

⚠️

The S3 bucket name must be globally unique across all AWS accounts and regions, not just within your own account or region.

  • Set up the following environment variables:

    • BUCKET_NAME — A globally unique S3 bucket name.
    • AWS_REGION — The AWS region of your Kubernetes cluster. For example, us-west-2 and not us-west-2a.
  • If you are creating an S3 bucket in the us-east-1 region, run the following command:

    aws s3api create-bucket --bucket ${BUCKET_NAME} --region ${AWS_REGION}
  • If you are creating an S3 bucket in any region but the us-east-1 region, run the following command:

    aws s3api create-bucket --bucket ${BUCKET_NAME} --region ${AWS_REGION} --create-bucket-configuration LocationConstraint=${AWS_REGION}
  • Verify that the S3 bucket was created:

    aws s3 ls
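The two create-bucket variants above differ only in whether LocationConstraint is passed. As a small sketch, the choice can be wrapped in a helper that prints the right command for a given region. The function name create_bucket_command is hypothetical; it only prints the command and does not call AWS.

```shell
# Prints the `aws s3api create-bucket` command appropriate for the region.
# us-east-1 omits LocationConstraint; every other region includes it.
create_bucket_command() (
  bucket=$1
  region=$2
  cmd="aws s3api create-bucket --bucket $bucket --region $region"
  if [ "$region" != "us-east-1" ]; then
    cmd="$cmd --create-bucket-configuration LocationConstraint=$region"
  fi
  printf '%s\n' "$cmd"
)

# Example: review the command first, then pipe it to `sh` to run it.
# create_bucket_command "$BUCKET_NAME" "$AWS_REGION"
```

Printing the command instead of running it lets you check the bucket name and region before anything is created.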

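You now need to give Pachyderm access to your bucket, either by attaching an IAM role and policy to your service account (recommended, described below) or by using access keys.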

📖

IAM roles provide finer grained user management and security capabilities than access keys. Pachyderm recommends the use of IAM roles for production deployments.

Add An IAM Role And Policy To Your Service Account #

Before you can make sure that the containers in your pods have the right permissions to access your S3 bucket, you will need to Create an IAM OIDC provider for your cluster.

Then follow the steps detailed in Create an IAM Role And Policy for your Service Account.

In short, you will:

  1. Retrieve your OpenID Connect provider URL:

    1. Go to the AWS Management console.
    2. Select your cluster instance in Amazon EKS.
    3. In the Configuration tab of your EKS cluster, find your OpenID Connect provider URL and save it. You will need it when creating your IAM Role.
  2. Create an IAM policy that gives access to your bucket:

    1. Create a new Policy from your IAM Console.
    2. Select the JSON tab.
    3. Copy/Paste the following text in the JSON tab:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:ListBucket"
          ],
          "Resource": [
            "arn:aws:s3:::<your-bucket>"
          ]
        },
        {
          "Effect": "Allow",
          "Action": [
            "s3:PutObject",
            "s3:GetObject",
            "s3:DeleteObject"
          ],
          "Resource": [
            "arn:aws:s3:::<your-bucket>/*"
          ]
        }
      ]
    }

    Replace <your-bucket> with the name of your S3 bucket.

  3. Create an IAM role as a Web Identity using the cluster OIDC provider as the identity provider.

    1. Create a new Role from your IAM Console.
    2. Select the Web identity tab.
    3. In the Identity Provider drop-down, select the OpenID Connect provider URL of your EKS cluster, and select sts.amazonaws.com as the Audience.
    4. Attach the newly created policy to the Role.
    5. Name it.
    6. Retrieve the Role ARN. You will need it in your values.yaml annotations when deploying Pachyderm.
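If you prefer the command line to the console for the first two steps, the following sketch generates the bucket policy shown above for a given bucket name. The function name bucket_policy_json, the file name, and the policy name are hypothetical. The commented commands are the standard eksctl and AWS CLI calls for the OIDC provider and policy creation, with <cluster-name> as a placeholder.

```shell
# Prints the bucket-access policy for the given bucket name.
bucket_policy_json() (
  bucket=$1
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::${bucket}"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::${bucket}/*"]
    }
  ]
}
EOF
)

# Hypothetical usage; names below are placeholders, not required values:
# eksctl utils associate-iam-oidc-provider --cluster <cluster-name> --approve
# aws eks describe-cluster --name <cluster-name> --query "cluster.identity.oidc.issuer" --output text
# bucket_policy_json "$BUCKET_NAME" > pachyderm-bucket-policy.json
# aws iam create-policy --policy-name pachyderm-bucket-access --policy-document file://pachyderm-bucket-policy.json
```

Generating the document from the bucket name avoids hand-editing the two Resource entries, which must stay in sync.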

(Optional) Set Up Bucket Encryption #

To set up bucket encryption, see Amazon S3 Default Encryption for S3 Buckets.

3. Enable Your Persistent Volumes Creation #

etcd and PostgreSQL (metadata storage) each claim a persistent volume. Although Pachyderm uses gp2 EBS volumes by default, we strongly recommend using gp3 SSD volumes in production.

For Production #

  1. Create an IAM OIDC provider for your cluster. You might already have completed this step if you chose to create an IAM Role and Policy to give your containers permission to access your S3 bucket.
  2. Create a CSI Driver service account whose IAM Role will be granted the permission (policy) to make calls to AWS APIs.
  3. Install Amazon EBS Container Storage Interface (CSI) driver on your cluster configured with your created service account.

See the official AWS documentation for more details.
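One possible route through steps 2 and 3 is to install the driver as an EKS add-on with eksctl. The sketch below only prints the two commands so that you can review them first. The function name ebs_csi_commands and the role name are illustrative choices, and the AWS documentation remains the reference for the procedure.

```shell
# Prints (does not run) eksctl commands that create an IAM role for the
# EBS CSI controller service account and install the driver as an EKS add-on.
ebs_csi_commands() (
  cluster=$1
  account_id=$2
  role=AmazonEKS_EBS_CSI_DriverRole   # role name is an arbitrary choice
  cat <<EOF
eksctl create iamserviceaccount --name ebs-csi-controller-sa --namespace kube-system --cluster ${cluster} --role-name ${role} --role-only --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy --approve
eksctl create addon --name aws-ebs-csi-driver --cluster ${cluster} --service-account-role-arn arn:aws:iam::${account_id}:role/${role} --force
EOF
)

# Example: ebs_csi_commands pachyderm-cluster <your-aws-account-id>
```

Both commands assume the IAM OIDC provider from step 1 already exists for the cluster.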

For Non-Production #

For non-production deployments, use the default bundled version of PostgreSQL and go directly to the deployment of Pachyderm (step 5).

4. Create an AWS Managed PostgreSQL Database #

By default, Pachyderm runs with a bundled version of PostgreSQL. For production environments, it is strongly recommended that you disable the bundled version and use an RDS PostgreSQL instance.

⚠️

Note that Aurora Serverless PostgreSQL is not supported and will not work.

Create An RDS Instance #

📖

Find the details of all the steps highlighted below in AWS Documentation: “Getting Started” hands-on tutorial.

  1. In the RDS console, create a database in the region matching your Pachyderm cluster.
  2. Choose the PostgreSQL engine.
  3. Select a PostgreSQL version >= 13.3.
  4. Configure your DB instance as follows (setting: recommended value):

    • DB instance identifier: Fill in with a name that is unique across all of your DB instances in the current region.
    • Master username: Choose your admin username.
    • Master password: Choose your admin password.
    • DB instance class: The standard default should work. You can change the instance type later on to optimize your performance and costs.
    • Storage type and Allocated storage: If you choose gp2, remember that Pachyderm’s metadata services require high IOPS (1500). Oversize the disk accordingly (>= 1TB). If you select io1, keep the 100 GiB default size. Read more information on Storage for RDS on Amazon’s website.
    • Storage autoscaling: If your workload is cyclical or unpredictable, enable storage autoscaling to allow RDS to scale up your storage when needed.
    • Standby instance: We highly recommend creating a standby instance for production environments.
    • VPC: Select the VPC of your Kubernetes cluster. Attention: After a database is created, you can’t change its VPC. Read more on VPCs and RDS on Amazon documentation.
    • Subnet group: Pick a Subnet group or create a new one. Read more about DB Subnet Groups on Amazon documentation.
    • Public access: Set the Public access to No for production environments.
    • VPC security group: Create a new VPC security group and open the PostgreSQL port, or use an existing one.
    • Password authentication or Password and IAM database authentication: Choose one or the other.
    • Database name: In the Database options section, enter Pachyderm’s database name (pachyderm in this example), then click Create database to create your PostgreSQL service. Your instance is running. Warning: If you do not specify a database name, Amazon RDS does not create a database.

  5. If you plan to deploy a standalone cluster (i.e., if you do not plan to register your cluster with a separate enterprise server), you must create a second database named dex in your RDS instance for Pachyderm’s authentication service. Read more about dex on PostgreSQL in Dex’s documentation.

  6. Additionally, create a new user account and grant it full CRUD permissions on both the pachyderm and (when applicable) dex databases. Read about managing PostgreSQL users and roles in this blog. Pachyderm will use the same username to connect to pachyderm as well as to dex.
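The last two steps can be sketched as a handful of SQL statements. The helper below only prints them. The function name pachyderm_db_sql, the user name, and the password are hypothetical, and the pachyderm database is assumed to exist already from the Database name setting. Depending on your PostgreSQL version, you may also need to grant privileges on the public schema of each database.

```shell
# Prints SQL that creates the dex database and an application user with
# privileges on both the pachyderm and dex databases.
# The password must not contain a single quote, since it is inlined as-is.
pachyderm_db_sql() (
  user=$1
  password=$2
  cat <<EOF
CREATE DATABASE dex;
CREATE USER ${user} WITH PASSWORD '${password}';
GRANT ALL PRIVILEGES ON DATABASE pachyderm TO ${user};
GRANT ALL PRIVILEGES ON DATABASE dex TO ${user};
EOF
)

# Hypothetical usage; the endpoint and master username are placeholders:
# pachyderm_db_sql pachyderm_user 'change-me' | psql -h <rds-endpoint> -U <master-username> -d postgres
```

The user created here is the one to put in postgresqlUsername when you update your Helm values.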

Update your values.yaml #

Once your databases have been created, add the following fields to your Helm values:

global:
  postgresql:
    postgresqlUsername: "username"
    postgresqlPassword: "password"
    # See also (to reference an existing secret instead of a plaintext password):
    # postgresqlExistingSecretName: "<yoursecretname>"
    # The name of the database should be Pachyderm's ("pachyderm" in the example above), not "dex"
    postgresqlDatabase: "databasename"
    # The postgresql database host to connect to. Defaults to postgres service in subchart
    postgresqlHost: "RDS CNAME"
    # The postgresql database port to connect to. Defaults to postgres server in subchart
    postgresqlPort: "5432"

postgresql:
  # turns off the install of the bundled postgres.
  # If not using the built in Postgres, you must specify a Postgresql
  # database server to connect to in global.postgresql
  enabled: false

5. Deploy Pachyderm #

You have set up your infrastructure, created your S3 bucket and an AWS Managed PostgreSQL instance, and granted your cluster access to both: you can now finalize your values.yaml and deploy Pachyderm.

Update Your Values.yaml #

ℹ️

If you have not created a managed RDS PostgreSQL instance, replace the postgresql section of your values.yaml with postgresql.enabled: true. This setup is not recommended in production environments.

💡

Retain (ideally in version control) a copy of the Helm values used to deploy your cluster. It might be useful if you need to restore a cluster from a backup.

Deploy Pachyderm On The Kubernetes Cluster #

  • You can now deploy a Pachyderm cluster by running this command:

    helm repo add pach https://helm.pachyderm.com
    helm repo update
    helm install pachyderm -f values.yaml pach/pachyderm --version <version-of-the-chart>

    System Response:

    NAME: pachyderm
    LAST DEPLOYED: Mon Jul 12 18:28:59 2021
    NAMESPACE: default
    STATUS: deployed
    REVISION: 1

    The deployment takes some time. You can run kubectl get pods periodically to check the status of the deployment; when Pachyderm is deployed, it shows all pods as READY. Alternatively, wait until the pachd pod reports ready:

    kubectl wait --for=condition=ready pod -l app=pachd --timeout=5m

    System Response

    pod/pachd-74c5766c4d-ctj82 condition met

    Note: If you see a few restarts on the pachd pods, it means that Kubernetes tried to bring up those pods before etcd was ready and therefore restarted them. You can safely ignore these restarts.

  • Finally, make sure that pachctl talks with your cluster.

6. Have ‘pachctl’ And Your Cluster Communicate #

Exposed Publicly?:

7. Check That Your Cluster Is Up And Running #

⚠️

If authentication is activated (for example, when you deploy with an enterprise key already set), you need to run pachctl auth login and authenticate to Pachyderm with your user before you use pachctl.

pachctl version

System Response:

COMPONENT           VERSION
pachctl             2.4.1
pachd               2.4.1
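If you want to script this check, the following sketch reads the output of pachctl version and succeeds only when both components report the same version, as in the response above. The function name versions_match is hypothetical, and it assumes the two-column layout shown here.

```shell
# Reads `pachctl version` output on stdin and succeeds (exit 0) only when
# the pachctl and pachd rows are both present and report the same version.
versions_match() (
  awk '
    $1 == "pachctl" { client = $2 }
    $1 == "pachd"   { server = $2 }
    END { exit !(client != "" && client == server) }
  '
)

# Hypothetical usage:
# pachctl version | versions_match && echo "pachctl and pachd versions match"
```

A missing pachd row, for example when pachctl cannot reach the cluster, is treated as a failure.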