Local Installation¶
This guide covers how you can quickly get started using Pachyderm locally on macOS®, Linux®, or Microsoft® Windows®. To install Pachyderm on Windows, first look at Deploy Pachyderm on Windows.
Pachyderm is a data-centric pipeline and data versioning application written in Go that runs on top of a Kubernetes cluster. A common way to interact with Pachyderm is through its command-line tool, pachctl, from a terminal window. To check the state of your deployment, you will also need to install kubectl, the Kubernetes command-line tool.
Additionally, we will show you how to deploy and access Pachyderm's UIs, the JupyterLab Mount Extension and Console, on your local cluster.
Note that each web UI addresses different use cases:
- JupyterLab Mount Extension allows you to experiment and explore your data, then build your pipelines' code from your familiar Notebooks.
- Console helps you visualize your DAGs (Directed Acyclic Graphs), monitor your pipeline executions, access your logs, and troubleshoot while your pipelines are running.
Warning
- A local installation is not designed to be a production environment. It is meant to help you learn and experiment quickly with Pachyderm.
- A local installation is designed for a single-node cluster. This cluster uses local storage on disk and does not create Persistent Volumes (PVs). If you want to deploy a production multi-node cluster, follow the instructions for your cloud provider or on-prem installation as described in Deploy Pachyderm. New Kubernetes nodes cannot be added to this single-node cluster.
Pachyderm uses Helm for all deployments.
Prerequisites¶
For a successful local deployment of Pachyderm, you will need:
- A Kubernetes cluster running on your local environment (pick the virtual machine of your choice):
  - Docker Desktop
  - Minikube
  - Kind
  - Oracle® VirtualBox™
- Helm to deploy Pachyderm on your Kubernetes cluster.
- Pachyderm Command Line Interface (pachctl) to interact with your Pachyderm cluster.
- Kubernetes Command Line Interface (kubectl) to interact with your underlying Kubernetes cluster.
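Before going further, it can help to confirm which of the command-line prerequisites are already on your PATH. The snippet below is a minimal sketch, not part of Pachyderm: the function name missing_tools is hypothetical, and it only checks that an executable with the given name can be found, not that its version is suitable.

```shell
#!/bin/sh
# Print the name of every tool passed as an argument that is not on PATH.
# (missing_tools is a hypothetical helper name used for illustration.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Report which of the CLI prerequisites still need to be installed.
# No output means all three were found.
missing_tools kubectl helm pachctl
```

Any name printed by the last line still needs to be installed using the sections that follow.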
Set Up a Local Kubernetes Cluster¶
Pick the virtual machine of your choice.
Using Minikube¶
On your local machine, you can run Pachyderm in a minikube virtual machine.
Minikube is a tool that creates a single-node Kubernetes cluster. This limited
installation is sufficient to try basic Pachyderm functionality and complete
the Beginner Tutorial.
To configure Minikube, follow these steps:
- Install minikube and VirtualBox on your operating system as described in the Kubernetes documentation.
- Start minikube:
minikube start
Linux users, add the --driver flag:
minikube start --driver=kvm2
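If you script the start step for several machines, the platform difference above can be captured in a small helper. This is only a sketch: the function name minikube_start_flags is hypothetical, and it assumes that `uname -s` prints Linux on the hosts where the kvm2 driver applies.

```shell
# Print the extra `minikube start` flags for an operating system name as
# reported by `uname -s`. Only Linux gets a driver flag in this sketch.
# (minikube_start_flags is a hypothetical helper name.)
minikube_start_flags() {
    case "${1:-$(uname -s)}" in
        Linux) printf '%s\n' '--driver=kvm2' ;;
        *)     : ;;  # other systems: no extra flags
    esac
}

# Intended use (not executed here):
#   minikube start $(minikube_start_flags)
```

The optional argument exists so the helper can be tried without changing machines, for example `minikube_start_flags Linux`.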
Note
Any time you want to stop and restart Pachyderm, run minikube delete and minikube start. Minikube is not meant to be a production environment and does not handle being restarted well without a full wipe.
Using Kubernetes on Docker Desktop¶
You can use Kubernetes on Docker Desktop instead of Minikube on macOS or Linux
by following these steps:
- In the Docker Desktop Preferences, enable Kubernetes.
- From the command prompt, confirm that Kubernetes is running:
kubectl get all
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 5d
- To reset your Kubernetes cluster that runs on Docker Desktop, click the Reset Kubernetes cluster button.
Using Kind¶
- Install Kind according to its documentation.
- From the command prompt, confirm that Kubernetes is running:
kubectl get all
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 5d
Install pachctl¶
pachctl is a command-line tool that you can use to interact with a Pachyderm cluster in your terminal.
- Run the corresponding steps for your operating system:
- For macOS, run:
brew tap pachyderm/tap && brew install pachyderm/tap/pachctl@2.2
- For a Debian-based 64-bit Linux, or Windows 10 or later running on WSL:
curl -o /tmp/pachctl.deb -L https://github.com/pachyderm/pachyderm/releases/download/v2.2.3/pachctl_2.2.3_amd64.deb && sudo dpkg -i /tmp/pachctl.deb
- For all other Linux flavors:
curl -o /tmp/pachctl.tar.gz -L https://github.com/pachyderm/pachyderm/releases/download/v2.2.3/pachctl_2.2.3_linux_amd64.tar.gz && tar -xvf /tmp/pachctl.tar.gz -C /tmp && sudo cp /tmp/pachctl_2.2.3_linux_amd64/pachctl /usr/local/bin
- Verify that the installation was successful by running pachctl version --client-only:
pachctl version --client-only
System Response:
COMPONENT VERSION
pachctl 2.2.3
If you run pachctl version without the --client-only flag, the command times out. This is expected behavior because Pachyderm has not been deployed yet (pachd is not yet running).
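To check the installed client version from a script rather than by eye, the version table can be parsed. The sketch below assumes the output keeps the two-column layout shown above, with a row that starts with pachctl. The function name pachctl_client_version is hypothetical.

```shell
# Read `pachctl version --client-only` output on stdin and print the
# version found on the row whose first column is "pachctl".
# (pachctl_client_version is a hypothetical helper name.)
pachctl_client_version() {
    awk '$1 == "pachctl" { print $2 }'
}

# Intended use (not executed here):
#   pachctl version --client-only | pachctl_client_version
```

Comparing the printed value with the release you meant to install (2.2.3 in this guide) is a quick way to catch a stale binary earlier on your PATH.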
Tip
If you are new to Pachyderm, try Pachyderm Shell. This add-on tool suggests pachctl commands as you type. It will help you learn Pachyderm's main commands faster.
Architecture
A look at Pachyderm's high-level architecture diagram will help you build a mental image of Pachyderm's various architectural components.
For reference, you can also check what a production setup looks like in this infrastructure diagram.
Install Helm¶
Follow Helm's installation guide.
Deploy Pachyderm¶
When done with the Prerequisites, deploy Pachyderm on your local cluster by following these steps. Your default installation comes with Console (Pachyderm's Web UI).
Additionally, JupyterLab users can install the Pachyderm JupyterLab Mount Extension on their local Pachyderm cluster to work with Pachyderm from familiar notebooks.
Note that you can run both Console and JupyterLab on your local installation.
- Get the Repo Info:
helm repo add pach https://helm.pachyderm.com
helm repo update
- Install Pachyderm:
Request an Enterprise Key
To request a FREE trial enterprise license key, click here.
This command will install Pachyderm's latest available GA version with Console CE.
helm install --wait --timeout 10m pachd pach/pachyderm --set deployTarget=LOCAL
Add --set console.enabled=false to the command above to install without Console.
The Enterprise installation below unlocks your enterprise features and installs Console Enterprise. Note that Console Enterprise requires authentication. By default, we create a mock user (username: admin, password: password) to authenticate to Console without having to connect your Identity Provider.
- Create a license.txt file in which you paste your Enterprise Key.
- Then, run the following helm command to install Pachyderm's latest Enterprise Edition:
helm install --wait --timeout 10m pachd pach/pachyderm --set deployTarget=LOCAL --set pachd.enterpriseLicenseKey=$(cat license.txt) --set console.enabled=true
Note
This installation can take several minutes. Run a quick helm list --all in a separate tab to watch the installation happening in the background.
To uninstall Pachyderm fully
Running helm uninstall pachd leaves persistent volume claims behind. To wipe your instance clean, run:
helm uninstall pachd
kubectl delete pvc -l suite=pachyderm
Check Your Install¶
Check the status of the Pachyderm pods by periodically running kubectl get pods. When Pachyderm is ready for use, all Pachyderm pods must be in the Running status.
Because Pachyderm needs to pull the Pachyderm Docker images from DockerHub, it might take a few minutes for the status of the Pachyderm pods to change to Running.
kubectl get pods
System Response: At a very minimum, you should see the following pods (console depends on your choice above):
NAME READY STATUS RESTARTS AGE
pod/console-5b67678df6-s4d8c 1/1 Running 0 2m8s
pod/etcd-0 1/1 Running 0 2m8s
pod/pachd-c5848b5c7-zwb8p 1/1 Running 0 2m8s
pod/pg-bouncer-7b855cb797-jqqpx 1/1 Running 0 2m8s
pod/postgres-0 1/1 Running 0 2m8s
If you see a few restarts on the pachd pods, that means that Kubernetes tried to bring up those pods before etcd or postgres were ready. Therefore, Kubernetes restarted those pods. Re-run kubectl get pods.
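Rather than re-running the command by hand, the readiness check can be scripted. The sketch below is one possible approach, not Pachyderm tooling: the function name all_pods_running is hypothetical, and it assumes the STATUS value is the third column of the `kubectl get pods --no-headers` output, as in the listing above.

```shell
# Read `kubectl get pods --no-headers` output on stdin. Succeed only if
# at least one pod is listed and every pod's STATUS column (field 3)
# is "Running". (all_pods_running is a hypothetical helper name.)
all_pods_running() {
    awk '
        $3 != "Running" { bad = 1 }
        END { if (NR == 0 || bad) exit 1 }
    '
}

# Intended use (not executed here): poll every 5 seconds until ready.
#   until kubectl get pods --no-headers | all_pods_running; do sleep 5; done
```

The empty-input case is treated as not ready, so the loop keeps waiting while the pods have not been created yet.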
Connect 'pachctl' To Your Cluster¶
Assuming your pachd is running as shown above, the easiest way to connect pachctl to your local cluster is to use the port-forward command.
- To connect to your new Pachyderm instance, run:
pachctl config import-kube local --overwrite
pachctl config set active-context local
- Then, run:
pachctl port-forward
Background this process in a new tab of your terminal.
Verify that pachctl and your cluster are connected.¶
pachctl version
System Response:
COMPONENT VERSION
pachctl 2.2.3
pachd 2.2.3
If You Have Deployed Pachyderm Community Edition¶
You are ready! To connect to your Console (Pachyderm UI), point your browser to localhost:4000.
If You Have Deployed Pachyderm Enterprise¶
- To connect to your Console (Pachyderm UI), point your browser to localhost:4000 and authenticate using the mock User (username: admin, password: password).
- Alternatively, you can connect to your Console (Pachyderm UI) directly by pointing your browser to port 4000 on your minikube IP (run minikube ip to retrieve minikube's external IP) or Docker Desktop IP, http://<dockerDesktopIPaddress-or-minikube>:4000/, then authenticate using the mock User (username: admin, password: password).
- To use pachctl, you need to run pachctl auth login, then authenticate again (to Pachyderm this time) with the mock User (username: admin, password: password).
NOTEBOOKS USERS: Install Pachyderm JupyterLab Mount Extension¶
Note
You do not need a local Pachyderm cluster already running to install Pachyderm JupyterLab Mount Extension. However, you need a running cluster to connect your Mount Extension to; therefore, we recommend that you install Pachyderm locally first.
- To install JupyterHub and the Mount Extension on your local cluster, run the following commands. You will be using our default jupyterhub-ext-values.yaml:
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo update
helm upgrade --cleanup-on-fail \
  --install jupyter jupyterhub/jupyterhub \
  --values https://raw.githubusercontent.com/pachyderm/pachyderm/2.2.x/etc/helm/examples/jupyterhub-ext-values.yaml
- Check the state of your pods by running kubectl get all. Look for the pods hub-xx and proxy-xx; their state should be Running. Run the command a couple of times if necessary. The image takes some time to pull. See the example below:
pod/hub-6fb9bb5847-ndfwc 1/1 Running 0 22h
pod/proxy-57db95fd89-l5pd5 1/1 Running 0 22h
- Once your pods are up, in your terminal, run:
kubectl port-forward svc/proxy-public 8888:80
Then run:
kubectl get services | grep -w "pachd " | awk '{print $3}'
Note the returned IP address. You will need this cluster IP in a later step.
- Point your browser to http://localhost:8888, and authenticate using any mock User (username: admin, password: password will do).
- Now that you are in, click on Pachyderm's Mount Extension icon on the left of your JupyterLab to connect your JupyterLab to your Pachyderm cluster. Enter grpc://<your-pachd-cluster-ip-from-the-previous-step>:30650 to log in.
- If Pachyderm was deployed with Enterprise, you will be prompted to log in again. Use the same mock User (username: admin, password: password).
- Verify that your JupyterLab Extension is connected to your cluster. From the cell of a notebook, run:
!pachctl version
COMPONENT VERSION
pachctl 2.2.3
pachd 2.2.3
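The address entered in the Mount Extension is the pachd cluster IP combined with port 30650. As a sketch, the two pieces can be assembled in one step from the `kubectl get services` output. The function name pachd_grpc_address is hypothetical, and the sketch assumes the service is named exactly pachd and that CLUSTER-IP is the third column, as in the grep and awk command in the steps above.

```shell
# Read `kubectl get services` output on stdin and print the address to
# enter in the Mount Extension: grpc://<CLUSTER-IP of pachd>:30650.
# (pachd_grpc_address is a hypothetical helper name.)
pachd_grpc_address() {
    awk '$1 == "pachd" { print "grpc://" $3 ":30650" }'
}

# Intended use (not executed here):
#   kubectl get services | pachd_grpc_address
```

Matching on the whole first column avoids picking up other services whose names merely begin with pachd, which is the same purpose the trailing space serves in the grep -w "pachd " pattern.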
Try our Notebook examples!
Make sure to check our data science notebook examples running on Pachyderm, from a market sentiment NLP implementation using a FinBERT model to pipelines training a regression model on the Boston Housing Dataset.
Next Steps¶
Complete the Beginner Tutorial to learn the basics of Pachyderm, such as adding data to a repository and building analysis pipelines.
See Also