Azure¶
For a quick test installation of Pachyderm on Azure (suitable for development), jump to our Quickstart page.
Before you start your installation process:
- Refer to our generic "Helm Install" page for more information on how to install and get started with Helm.
- Read our infrastructure recommendations. You will find instructions on how to set up an ingress controller, a load balancer, or connect an Identity Provider for access control.
- If you are planning to install the Pachyderm UI, read our Console deployment instructions. Note that, unless your deployment is LOCAL (i.e., on a local machine for development only, for example, on Minikube or Docker Desktop), the deployment of Console requires, at a minimum, the setup of an Ingress.
The following section walks you through deploying a Pachyderm cluster on a Microsoft® Azure® Kubernetes Service (AKS) environment.
In particular, you will:
- Install Prerequisites
- Deploy Kubernetes
- Create an Azure Storage Container For Your Data
- Persistent Volumes Creation
- Create an Azure Managed PostgreSQL Server Database
- Deploy Pachyderm
- Have 'pachctl' and your Cluster Communicate
- Check That Your Cluster Is Up And Running
- (Optional) Install JupyterHub and Pachyderm Mount Extension to experiment with your data in Pachyderm from your Notebook cells.
1. Install Prerequisites¶
Before you start creating your cluster, install the following clients on your machine. If not explicitly specified, use the latest available version of the components listed below.
Note
This page assumes that you have an Azure Subscription.
2. Deploy Kubernetes¶
You can deploy Kubernetes on Azure by following the official Azure Kubernetes Service documentation, using the quickstart walkthrough, or following the steps in this section.
Attention
Pachyderm recommends running your cluster on Kubernetes 1.19.0 and above.
At a minimum, you will need to specify the parameters below:
Variable | Description |
---|---|
RESOURCE_GROUP | A unique name for the resource group where Pachyderm is deployed. For example, pach-resource-group. |
LOCATION | An Azure region where AKS is available. For example, centralus. |
NODE_SIZE | The size of the Kubernetes virtual machine (VM) instances. To avoid performance issues, Pachyderm recommends that you set this value to at least Standard_DS4_v2, which gives you 8 CPUs, 28 GiB of memory, and a 56 GiB SSD. In any case, use VMs that support premium storage. See Azure VM sizes for details on which sizes support premium storage. |
CLUSTER_NAME | A unique name for the Pachyderm cluster. For example, pach-aks-cluster. |
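As a minimal sketch, the parameters in the table above can be set as shell variables before running the Azure CLI commands in this section. The values below simply reuse the examples from the table and are placeholders to replace with your own.

```shell
# Example values taken from the parameter table; replace them with your own.
export RESOURCE_GROUP="pach-resource-group"
export LOCATION="centralus"
export NODE_SIZE="Standard_DS4_v2"
export CLUSTER_NAME="pach-aks-cluster"

# Print the settings so they can be reviewed before any resource is created.
printf '%s\n' \
  "RESOURCE_GROUP=${RESOURCE_GROUP}" \
  "LOCATION=${LOCATION}" \
  "NODE_SIZE=${NODE_SIZE}" \
  "CLUSTER_NAME=${CLUSTER_NAME}"
```

Exporting the variables makes them available to every `az` command run later from the same shell session.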
You can choose to follow the guided steps in Azure Service Portal's Kubernetes Services or use Azure CLI.
-
Log in to Azure:
az login
This command opens a browser window. Log in with your Azure credentials. Resources can now be provisioned on the Azure subscription linked to your account.
-
Create an Azure resource group or retrieve an existing group.
az group create --name ${RESOURCE_GROUP} --location ${LOCATION}
Example:
az group create --name test-group --location centralus
System Response:
{ "id": "/subscriptions/6c9f2e1e-0eba-4421-b4cc-172f959ee110/resourceGroups/pach-resource-group", "location": "centralus", "managedBy": null, "name": "pach-resource-group", "properties": { "provisioningState": "Succeeded" }, "tags": null, "type": null }
-
Create an AKS cluster in the resource group/location:
For more configuration options, see the list of all available flags of the az aks create command.
az aks create --resource-group ${RESOURCE_GROUP} --name ${CLUSTER_NAME} --node-vm-size ${NODE_SIZE} --node-count <node_pool_count> --location ${LOCATION}
Example:
az aks create --resource-group test-group --name test-cluster --generate-ssh-keys --node-vm-size Standard_DS4_v2 --location centralus
-
Confirm the version of the Kubernetes server by running kubectl version.
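The recommendation above (Kubernetes 1.19.0 and above) can be checked with a small helper. The sketch below is one possible approach, not part of Pachyderm or kubectl: `version_ge` is a hypothetical function name, and it assumes plain `MAJOR.MINOR.PATCH` version strings with an optional leading `v`.

```shell
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Assumes numeric MAJOR.MINOR.PATCH components, optionally prefixed with "v".
version_ge() {
  v1=${1#v}
  v2=${2#v}
  for i in 1 2 3; do
    p1=$(printf '%s' "$v1" | cut -d. -f"$i")
    p2=$(printf '%s' "$v2" | cut -d. -f"$i")
    p1=${p1:-0}
    p2=${p2:-0}
    if [ "$p1" -gt "$p2" ]; then return 0; fi
    if [ "$p1" -lt "$p2" ]; then return 1; fi
  done
  return 0
}

# Example with a hard-coded version string.
if version_ge "v1.24.9" "1.19.0"; then
  echo "Kubernetes version is recent enough"
fi
```

In practice, the first argument could come from the cluster itself, for example `kubectl version --output=json | jq -r '.serverVersion.gitVersion'`.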
Once your Kubernetes cluster is up and your infrastructure is configured, you are ready to prepare for the installation of Pachyderm. Some of the steps below require you to keep updating the values.yaml that you started during the setup of the recommended infrastructure.
3. Create an Azure Storage Container For Your Data¶
Pachyderm needs an Azure Storage Container (Object store) to store your data.
To access your data, Pachyderm uses a Storage Account with permissioned access to your desired container. You can either use an existing account or create a new one in your default subscription, then pass the access key associated with the account on to Pachyderm.
To create a new storage account, follow the steps below:
Warning
The storage account name must be globally unique across Azure.
-
Set up the following variables:
- STORAGE_ACCOUNT - The name of the storage account where you store your data.
- CONTAINER_NAME - The name of the Azure blob container where you store your data.
-
Create an Azure storage account:
az storage account create \
  --resource-group="${RESOURCE_GROUP}" \
  --location="${LOCATION}" \
  --sku=Premium_LRS \
  --name="${STORAGE_ACCOUNT}" \
  --kind=BlockBlobStorage
System response:
{ "accessTier": null, "creationTime": "2019-06-20T16:05:55.616832+00:00", "customDomain": null, "enableAzureFilesAadIntegration": null, "enableHttpsTrafficOnly": false, "encryption": { "keySource": "Microsoft.Storage", "keyVaultProperties": null, "services": { "blob": { "enabled": true, ...
Make sure that you set the Stock Keeping Unit (SKU) to Premium_LRS and the kind parameter to BlockBlobStorage. This configuration results in storage that uses SSDs rather than standard Hard Disk Drives (HDD). If you set this parameter to an HDD-based storage option, your Pachyderm cluster will be too slow and might malfunction.
-
Verify that your storage account has been successfully created:
az storage account list
-
Obtain the key for the storage account (STORAGE_ACCOUNT) and the resource group to be used to deploy Pachyderm:
STORAGE_KEY="$(az storage account keys list \
  --account-name="${STORAGE_ACCOUNT}" \
  --resource-group="${RESOURCE_GROUP}" \
  --output=json \
  | jq '.[0].value' -r)"
Note
Find the generated key in the Storage accounts > Access keys section in the Azure Portal or by running the following command: az storage account keys list --account-name=${STORAGE_ACCOUNT}
-
Create a new storage container within your storage account:
az storage container create --name ${CONTAINER_NAME} \
  --account-name ${STORAGE_ACCOUNT} \
  --account-key "${STORAGE_KEY}"
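Before running the storage commands above, it can help to check the chosen names locally, since Azure rejects names that break its naming rules. The sketch below is an illustration only: the function names are hypothetical, and the rules encoded are the Azure naming constraints as understood here (storage accounts: 3 to 24 lowercase letters and digits; containers: 3 to 63 lowercase letters, digits, and single hyphens, starting and ending with a letter or digit).

```shell
# Storage account names: 3-24 characters, lowercase letters and digits only.
valid_storage_account_name() {
  printf '%s' "$1" | grep -Eq '^[a-z0-9]{3,24}$'
}

# Container names: 3-63 characters, lowercase letters, digits, and single
# hyphens; the name must start and end with a letter or digit.
valid_container_name() {
  [ "${#1}" -ge 3 ] && [ "${#1}" -le 63 ] &&
    printf '%s' "$1" | grep -Eq '^[a-z0-9]+(-[a-z0-9]+)*$'
}

# Hypothetical example names.
if valid_storage_account_name "pachstorage01" && valid_container_name "pach-container"; then
  echo "names look valid"
fi
```

A local check like this does not guarantee that the storage account name is still available; only `az storage account create` can confirm that.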
4. Persistent Volumes Creation¶
etcd and PostgreSQL (metadata storage) each claim a persistent volume (PV).
If you plan to deploy Pachyderm with its default bundled PostgreSQL instance, read the warning below and jump to the deployment section:
Warning
The metadata service (Persistent disk) generally requires a small persistent volume size (for example, 10 GB) but high IOPS (1500). Depending on your disk choice, you may therefore need to oversize the volume significantly to ensure enough IOPS.
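To illustrate the oversizing described in the warning, the following sketch computes the smallest disk size that reaches a target IOPS for a disk type whose IOPS scale linearly with size. The function name `min_disk_gib` and the ratio of 3 IOPS per GiB are assumptions made for this example; check the documentation of your chosen disk type for the real figures.

```shell
# min_disk_gib REQUIRED_IOPS IOPS_PER_GIB MIN_SIZE_GIB
# Prints the smallest disk size (in GiB) that provides REQUIRED_IOPS,
# never going below MIN_SIZE_GIB.
min_disk_gib() {
  required_iops=$1
  iops_per_gib=$2
  min_size_gib=$3
  # Ceiling division: round up so the provisioned IOPS reach the target.
  size=$(( (required_iops + iops_per_gib - 1) / iops_per_gib ))
  if [ "$size" -lt "$min_size_gib" ]; then
    size=$min_size_gib
  fi
  printf '%s\n' "$size"
}

# Hypothetical disk type offering 3 IOPS per GiB, 1500 IOPS target, 10 GiB of data.
min_disk_gib 1500 3 10   # prints 500
```

Under these assumed numbers, a volume holding only 10 GiB of metadata would still need to be provisioned at 500 GiB to reach 1500 IOPS.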
If you plan to deploy a managed PostgreSQL instance (Recommended in production), read the following section.
5. Create an Azure Managed PostgreSQL Server Database¶
By default, Pachyderm runs with a bundled version of PostgreSQL. For production environments, we strongly recommend that you disable the bundled version and use a PostgreSQL Server instance.
This section will provide guidance on the configuration settings you will need to:
- Create an environment to run your Azure PostgreSQL Server databases.
- Create two databases (pachyderm and dex).
- Update your values.yaml to turn off the installation of the bundled PostgreSQL and provide your new instance information.
Note
It is assumed that you are already familiar with PostgreSQL Server, or will be working with an administrator who is.
Create A PostgreSQL Server Instance¶
Info
Find the details of the steps and available parameters to create a PostgreSQL Server instance with Azure Console in Azure Documentation "Create an Azure Database for PostgreSQL server by using the Azure portal".
Alternatively, you can use the CLI and run az postgres server create with your relevant parameters.
In the Azure console, choose the Azure Database for PostgreSQL servers service. You will be asked to pick your server type: Single Server or Hyperscale (for multi-tenant applications), then configure your DB instance as follows.
SETTING | Recommended value |
---|---|
subscription and resource group | Pick your existing resource group. Important: your cluster and your database must be deployed in the same resource group. |
server name | Name your instance. |
location | Create a database in the region matching your Pachyderm cluster. |
compute + storage | The standard instance size (GP_Gen5_4 = Gen5 VMs with 4 cores) should work. Remember that Pachyderm's metadata services require high IOPS (1500). Oversize the disk accordingly. |
Master username | Choose your Admin username. ("postgres") |
Master password | Choose your Admin password. |
You are ready to create your instance.
Example
az postgres server create \
--resource-group <your_resource_group> \
--name <your_server_name> \
--location westus \
--sku-name GP_Gen5_2 \
--admin-user <server_admin_username> \
--admin-password <server_admin_password> \
--ssl-enforcement Disabled \
--version 11
Warning
- Make sure that your PostgreSQL version is >= 11.
- Keep the SSL setting Disabled.
Once created, go back to your newly created database, and:
- Open the access to your instance:
Note
Azure provides two options for pods running on AKS worker nodes to access a PostgreSQL DB instance. Pick the one that fits you best:
- Create a firewall rule on the Azure DB Server with a range of IP addresses that encompasses all IPs of the AKS Cluster nodes (this can be a very large range if using node auto-scaling).
- Create a VNet Rule on the Azure DB Server that allows access from the subnet the AKS nodes are in. This is used in conjunction with the Microsoft.Sql VNet Service Endpoint enabled on the cluster subnet.
You can also choose the more secure option of denying public access to your PostgreSQL instance and then creating a private endpoint in the K8s VNet. Read more about how to configure a private link using the CLI in Azure's documentation.
Alternatively, in the Connection Security of your newly created server, select Allow access to Azure services. This is equivalent to running az postgres server firewall-rule create --server-name <your_server_name> --resource-group <your_resource_group> --name AllowAllAzureIps --start-ip-address 0.0.0.0 --end-ip-address 0.0.0.0
- In the Essentials page of your instance, find the full server name and admin username that will be required in your values.yaml.
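As a rough cross-check of the values shown on the Essentials page, the sketch below derives the host and login from a server name. It assumes the Azure Database for PostgreSQL Single Server conventions (host `<server_name>.postgres.database.azure.com`, login `<admin_user>@<server_name>`); the function names and the example server and user are hypothetical, and the Essentials page remains the source of truth.

```shell
# Build the fully qualified host name from a Single Server name.
azure_pg_host() {
  printf '%s.postgres.database.azure.com\n' "$1"
}

# Build the login name from the admin user and the server name.
azure_pg_user() {
  printf '%s@%s\n' "$1" "$2"
}

# Hypothetical server "pach-pg" with admin user "pachadmin".
azure_pg_host "pach-pg"
azure_pg_user "pachadmin" "pach-pg"
```

If the values printed here differ from those on the Essentials page, use the ones from the page in your values.yaml.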
Create Your Databases¶
After the instance is created, run the following two commands to create the databases that Pachyderm uses:
az postgres db create -g <your_group> -s <server_name> -n pachyderm
az postgres db create -g <your_group> -s <server_name> -n dex
Note
Note that the second database must be named dex. Read more about dex on PostgreSQL on Dex's documentation.
Pachyderm will use the same user to connect to pachyderm as well as to dex.
Update your values.yaml¶
Once your databases have been created, add the following fields to your Helm values:
global:
  postgresql:
    postgresqlUsername: "see admin username above"
    postgresqlPassword: "password"
    # The name of the database
    postgresqlDatabase: "pachyderm"
    # The postgresql database host to connect to.
    postgresqlHost: "see server name above"
    # The postgresql database port to connect to. Defaults to postgres server in subchart
    postgresqlPort: "5432"
postgresql:
  # turns off the install of the bundled postgres.
  # If not using the built in Postgres, you must specify a Postgresql
  # database server to connect to in global.postgresql
  enabled: false
6. Deploy Pachyderm¶
You have set up your infrastructure, created your data container and a Managed PostgreSQL instance, and granted your cluster access to both: you can now finalize your values.yaml and deploy Pachyderm.
Optional: If you plan to deploy with Console
If you plan to deploy Pachyderm with Console, follow these additional instructions and add the relevant fields in your values.yaml.
Update Your Values.yaml¶
Note
If you have not created a Managed PostgreSQL Server instance, replace the Postgresql section below with postgresql:enabled: true in your values.yaml. This setup is not recommended in production environments.
If you have previously tried to run Pachyderm locally, make sure that you are using the right Kubernetes context first.
-
Verify cluster context:
kubectl config current-context
This command should return the name of your Kubernetes cluster that runs on Azure.
If you have a different context displayed, configure kubectl to use your Azure configuration:
az aks get-credentials --resource-group ${RESOURCE_GROUP} --name ${CLUSTER_NAME}
System Response:
Merged "${CLUSTER_NAME}" as current context in /Users/test-user/.kube/config
-
Update your values.yaml
Update your values.yaml with your container name (see example of values.yaml here) or use our minimal example below.
deployTarget: "MICROSOFT"
pachd:
  storage:
    microsoft:
      # storage container name
      container: "container_name"
      # storage account name
      id: "AKIAIOSFODNN7EXAMPLE"
      # storage account key
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
  externalService:
    enabled: true
global:
  postgresql:
    postgresqlUsername: "see admin username above"
    postgresqlPassword: "password"
    # The name of the database
    postgresqlDatabase: "pachyderm"
    # The postgresql database host to connect to.
    postgresqlHost: "see server name above"
    # The postgresql database port to connect to. Defaults to postgres server in subchart
    postgresqlPort: "5432"
postgresql:
  # turns off the install of the bundled postgres.
  # If not using the built in Postgres, you must specify a Postgresql
  # database server to connect to in global.postgresql
  enabled: false
Find the list of all available Helm values in our reference documentation or on GitHub.
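The storage part of the minimal values.yaml above can also be generated from the shell variables set in the storage step, which avoids copying the account key by hand. This is only a sketch: the default values and the `VALUES_FILE` variable are hypothetical placeholders, and the PostgreSQL sections still need to be added to the generated file.

```shell
# Hypothetical placeholder defaults; in practice these come from the earlier steps.
STORAGE_ACCOUNT="${STORAGE_ACCOUNT:-pachstorageexample}"
CONTAINER_NAME="${CONTAINER_NAME:-pach-container}"
STORAGE_KEY="${STORAGE_KEY:-replace-with-your-storage-key}"
# Written to a separate file so an existing values.yaml is not overwritten.
VALUES_FILE="${VALUES_FILE:-values.generated.yaml}"

cat > "${VALUES_FILE}" <<EOF
deployTarget: "MICROSOFT"
pachd:
  storage:
    microsoft:
      # storage container name
      container: "${CONTAINER_NAME}"
      # storage account name
      id: "${STORAGE_ACCOUNT}"
      # storage account key
      secret: "${STORAGE_KEY}"
  externalService:
    enabled: true
EOF

# The file contains a secret, so restrict access to the current user.
chmod 600 "${VALUES_FILE}"
```

Review the generated file and merge it with the rest of your values before passing it to `helm install`.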
Deploy Pachyderm On The Kubernetes Cluster¶
-
Now you can deploy a Pachyderm cluster by running this command:
$ helm repo add pach https://helm.pachyderm.com
$ helm repo update
$ helm install pachd -f values.yaml pach/pachyderm --version <version-of-the-chart>
Refer to our generic Helm documentation for more information on how to select your chart version.
System Response:
NAME: pachd
LAST DEPLOYED: Mon Jul 12 18:28:59 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
Pachyderm pulls containers from DockerHub. It might take some time before the pachd pods start. You can check the status of the deployment by periodically running kubectl get all.
When Pachyderm is up and running, get the information about the pods:
kubectl get pods
Once the pods are up, you should see a pod for pachd running (alongside etcd, pg-bouncer, postgres, or console, depending on your installation).
System Response:
NAME READY STATUS RESTARTS AGE
pachd-1971105989-mjn61 1/1 Running 0 54m
...
Note: Sometimes Kubernetes tries to start the pachd pods before the etcd pods are ready, which might result in the pachd pods restarting. You can safely ignore those restarts.
-
Finally, make sure that pachctl talks with your cluster.
7. Have 'pachctl' And Your Cluster Communicate¶
Assuming your pachd is running as shown above, make sure that pachctl can talk to the cluster.
If you are exposing your cluster publicly:
-
Retrieve the external IP address of your TCP load balancer or your domain name:
```shell
kubectl get services | grep pachd-lb | awk '{print $4}'
```
-
Update the context of your cluster with its direct URL, using the external IP address or domain name above:
echo '{"pachd_address": "grpc://<external-IP-address-or-domain-name>:30650"}' | pachctl config set context "<your-cluster-context-name>" --overwrite
pachctl config set active-context "<your-cluster-context-name>"
-
Check that you are using the right context:
$ pachctl config get active-context
Your cluster context name should show up.
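The JSON payload used in the context update above can be built by a small helper, which avoids quoting mistakes when the address changes. The function name `pachd_context_json` is hypothetical, and the default port 30650 is the one used in the command above.

```shell
# pachd_context_json HOST [PORT]
# Prints the JSON expected by `pachctl config set context`.
pachd_context_json() {
  host=$1
  port=${2:-30650}
  printf '{"pachd_address": "grpc://%s:%s"}\n' "$host" "$port"
}

# Example with a documentation-only IP address.
pachd_context_json "203.0.113.10"
```

The output can then be piped into pachctl, for example `pachd_context_json "<external-IP-address-or-domain-name>" | pachctl config set context "<your-cluster-context-name>" --overwrite`.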
If you're not exposing pachd publicly, you can run:
# Background this process because it blocks.
$ pachctl port-forward
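Because the port-forward command blocks, one option is to start it in the background and stop it explicitly when you are done. The sketch below shows that pattern with two hypothetical helper functions; `sleep 30` stands in for `pachctl port-forward` so that the example runs on any machine.

```shell
# Start a command in the background and remember its process ID.
start_background() {
  "$@" &
  BG_PID=$!
}

# Stop the background command and wait for it to exit.
stop_background() {
  kill "$BG_PID" 2>/dev/null || true
  wait "$BG_PID" 2>/dev/null || true
}

# `sleep 30` is a stand-in; in practice: start_background pachctl port-forward
start_background sleep 30
echo "background process running with PID ${BG_PID}"

# ... run your pachctl commands here ...

stop_background
```

Keeping the process ID makes it possible to stop the forwarder cleanly instead of leaving it running after the session ends.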
8. Check That Your Cluster Is Up And Running¶
Attention
If Authentication is activated (when you deploy Console, for example), you need to run pachctl auth login and authenticate to Pachyderm with your user before you use pachctl.
$ pachctl version
System Response:
COMPONENT VERSION
pachctl 2.2.0
pachd 2.2.0
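One possible sanity check on this output is to confirm that the client and server report the same version, as they do in the response above. The sketch below is an illustration, not a Pachyderm feature: `versions_match` is a hypothetical helper that reads the two-column output shown above from standard input.

```shell
# Succeeds when the pachctl and pachd rows report the same, non-empty version.
versions_match() {
  awk '
    $1 == "pachctl" { client = $2 }
    $1 == "pachd"   { server = $2 }
    END { exit !(client != "" && client == server) }
  '
}

# Example using the sample response from this page.
if printf 'COMPONENT VERSION\npachctl 2.2.0\npachd 2.2.0\n' | versions_match; then
  echo "pachctl and pachd versions match"
fi
```

Against a live cluster, the same check would read `pachctl version | versions_match`.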
9. NOTEBOOK USERS: Install Pachyderm JupyterLab Mount Extension¶
Once your cluster is up and running, you can helm install JupyterHub on your Pachyderm cluster and experiment with your data in Pachyderm from your Notebook cells.
Check out our JupyterHub and Pachyderm Mount Extension page for installation instructions.
Use Pachyderm's default image and values.yaml jupyterhub-ext-values.yaml or follow the instructions to update your own.
Note
Make sure to check our data science notebook examples running on Pachyderm, from a market sentiment NLP implementation using a FinBERT model to pipelines training a regression model on the Boston Housing Dataset.