Cluster Backup

Learn how to back up and restore the state of a production cluster.

May 30, 2023

This page will walk you through the main steps required to manually back up and restore the state of a Pachyderm cluster in production. Details on how to perform those steps might vary depending on your infrastructure and cloud provider / on-premises setup. Refer to your provider’s documentation.

Overview

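Pachyderm state is stored in two main places: an object store, which holds your data, and a PostgreSQL instance, whose database(s) hold Pachyderm's metadata.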

Backing up a Pachyderm cluster involves snapshotting both the object store and the PostgreSQL database(s), in a consistent state, at a given point in time.

Restoring it involves re-populating the database(s) and the object store using those backups, then recreating a Pachyderm cluster.

ℹ️
  • Make sure that you have a bucket for backup use, separate from the object store used by your cluster.
  • Depending on the reasons behind your cluster recovery, you might choose to use an existing or a new instance of PostgreSQL and/or the object store.

Manual Backup Of A Pachyderm Cluster

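Before any manual backup, retain a copy of the Helm values used to deploy your cluster (you will need it when restoring), then suspend all operations on the cluster.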

ℹ️
  • Backups incur downtime until operations are resumed.
  • Operational best practices include notifying Pachyderm users of the outage and providing an estimated time when downtime will cease.
  • Downtime duration is a function of the size of the data to be backed up and the networks involved. Testing before going into production and monitoring backup times on an ongoing basis might help you make accurate predictions.

Suspend Operations

⚠️

Before starting, make sure that your context points to the server you want to pause by running pachctl config get active-context.

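To pause Pachyderm, scale pachd and the worker pods down to zero replicas so that nothing can change the cluster state while the backup runs.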
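The pause step can be sketched as the short script below. It is a minimal sketch, not the official procedure: it assumes pachd runs as a Kubernetes deployment named pachd in the current namespace, and that worker pods are managed by replication controllers labeled suite=pachyderm,component=worker. The run helper and the DRY_RUN variable are illustrative additions; by default the script only prints the commands it would execute.

```shell
#!/bin/sh
set -eu

# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

pause_pachyderm() {
  # Confirm which cluster the commands below will affect.
  run pachctl config get active-context
  # Stop pachd so that nothing mutates state during the backup.
  run kubectl scale deployment pachd --replicas 0
  # Assumption: workers are replication controllers carrying these labels.
  run kubectl scale rc --replicas 0 -l suite=pachyderm,component=worker
}

pause_pachyderm
```

Review the printed commands first, then rerun with DRY_RUN=0 once the active context is the one you intend to pause.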

Back Up The Databases And The Object Store

This step is specific to your database and object store hosting.

⚠️

A production setting of Pachyderm implies that you are running a managed PostgreSQL instance.

📖

For on-premises Kubernetes deployments, check the vendor documentation for your on-premises PostgreSQL for details on backing up and restoring your databases.

📖

For on-premises Kubernetes deployments, check the vendor documentation for your on-premises object store for details on backing up and restoring a bucket.
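For a setup where you can reach PostgreSQL directly and your buckets live in an S3-compatible store, the backup might look roughly like the sketch below. Everything specific in it is an assumption: the host, user, and bucket names are hypothetical, the database names pachyderm and dex may differ in your deployment, and the run helper with its DRY_RUN switch is illustrative. Managed PostgreSQL services usually provide their own snapshot mechanism, which your provider's documentation will describe.

```shell
#!/bin/sh
set -eu

# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Hypothetical values; replace them with your own.
PG_HOST="${PG_HOST:-postgres.example.internal}"
PG_USER="${PG_USER:-pachyderm}"
SRC_BUCKET="${SRC_BUCKET:-s3://pachyderm-data}"
BACKUP_BUCKET="${BACKUP_BUCKET:-s3://pachyderm-backup}"
STAMP="${STAMP:-$(date -u +%Y%m%dT%H%M%SZ)}"

backup_cluster_state() {
  # Dump each database in PostgreSQL's custom format, which pg_restore can read.
  # Assumption: the databases are named pachyderm and dex.
  for db in pachyderm dex; do
    run pg_dump --host "$PG_HOST" --username "$PG_USER" --dbname "$db" \
      --format=custom --file "${db}-${STAMP}.dump"
  done
  # Copy the cluster's bucket into a timestamped prefix of the backup bucket.
  run aws s3 sync "$SRC_BUCKET" "$BACKUP_BUCKET/$STAMP"
}

backup_cluster_state
```

Using the same timestamp for the database dumps and the bucket prefix makes it easier to pair them up again at restore time.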

Resume Operations

Once your backup is complete, resume normal operations by scaling pachd back up. pachd then takes care of restoring the worker pods.
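A minimal sketch of the resume step follows. It assumes pachd is a deployment named pachd in the current namespace and that one replica is the desired count; the run helper and DRY_RUN switch are illustrative and make the script print commands by default.

```shell
#!/bin/sh
set -eu

# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

resume_pachyderm() {
  # Bring pachd back; it recreates the worker pods on its own.
  # Assumption: one replica is the desired count for your deployment.
  run kubectl scale deployment pachd --replicas 1
  # Block until the deployment reports that the rollout is complete.
  run kubectl rollout status deployment/pachd
}

resume_pachyderm
```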

Restore Pachyderm

There are two primary use cases for restoring a cluster:

  1. Your data have been corrupted, preventing your cluster from functioning correctly. You want the same version of Pachyderm re-installed on the latest uncorrupted data set.
  2. You have upgraded a cluster and are encountering problems. You decide to uninstall the current version and restore the latest backup of a previous version of Pachyderm.

Depending on your scenario, pick all or a subset of the following steps:

Restore The Databases And Objects
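How you restore depends on how the backup was taken. As one possible sketch, assuming the databases were dumped with pg_dump in custom format and the bucket was copied to an S3-compatible backup bucket, the restore could look like the script below. The host, user, bucket names, and timestamp are hypothetical, the database names pachyderm and dex are assumptions, and the empty target databases are assumed to exist already. The run helper and DRY_RUN switch are illustrative.

```shell
#!/bin/sh
set -eu

# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Hypothetical values; replace them with your own.
NEW_PG_HOST="${NEW_PG_HOST:-postgres-new.example.internal}"
PG_USER="${PG_USER:-pachyderm}"
BACKUP_BUCKET="${BACKUP_BUCKET:-s3://pachyderm-backup}"
NEW_BUCKET="${NEW_BUCKET:-s3://pachyderm-data-restored}"
STAMP="${STAMP:-20230530T000000Z}"   # timestamp of the backup to restore

restore_cluster_state() {
  # Load each dump into the matching (already created, empty) database.
  for db in pachyderm dex; do
    run pg_restore --host "$NEW_PG_HOST" --username "$PG_USER" \
      --dbname "$db" "${db}-${STAMP}.dump"
  done
  # Copy the backed-up objects into the bucket the new cluster will use.
  run aws s3 sync "$BACKUP_BUCKET/$STAMP" "$NEW_BUCKET"
}

restore_cluster_state
```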

Deploy Pachyderm Into The New Cluster

Finally, update the copy of your original Helm values to point Pachyderm to the new databases and the new object store, then use Helm to install Pachyderm into the new cluster.
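The install itself can be sketched as follows. The release name pachyderm, the values file name, and the chart version shown are assumptions for illustration; use the chart version that matches the backup you are restoring. The run helper and DRY_RUN switch are illustrative and make the script print commands by default.

```shell
#!/bin/sh
set -eu

# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Hypothetical values; replace them with your own.
VALUES_FILE="${VALUES_FILE:-restored-values.yaml}"  # updated copy of the original values
PACH_VERSION="${PACH_VERSION:-2.5.0}"               # placeholder chart version

deploy_pachyderm() {
  # Register the Pachyderm chart repository and refresh the local index.
  run helm repo add pachyderm https://helm.pachyderm.com
  run helm repo update
  # Install using the values that point at the new databases and object store.
  run helm install pachyderm pachyderm/pachyderm \
    --version "$PACH_VERSION" -f "$VALUES_FILE"
}

deploy_pachyderm
```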

Connect ‘pachctl’ To Your Restored Cluster

Point pachctl at the restored cluster, then check that the cluster is up and running.
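A few commands that could serve as a quick health check are sketched below. It assumes pachctl's active context already points at the restored cluster and that kubectl targets the namespace Pachyderm was installed into; the run helper and DRY_RUN switch are illustrative.

```shell
#!/bin/sh
set -eu

# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

check_cluster() {
  # Confirm pachctl is talking to the restored cluster.
  run pachctl config get active-context
  # Reports both the pachctl and pachd versions when the server is reachable.
  run pachctl version
  # The Pachyderm pods should be listed as Running.
  run kubectl get pods
  # The repos that existed at backup time should be listed again.
  run pachctl list repo
}

check_cluster
```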