Beginner Tutorial

Learn how to quickly ingest photos, trace their outlines, and output a collage using the transformed data in Pachyderm.

March 24, 2023

Before You Start #

Context #

Pachyderm runs on a Kubernetes cluster that you interact with using either the pachctl CLI or Console, a GUI.

Within the cluster, you can create projects that contain repos and pipelines. Pipelines can be single-stage or multi-stage; multi-stage pipelines are commonly referred to as DAGs.
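
To make the DAG idea concrete, here is a minimal Python sketch (not Pachyderm code) that models the repo and pipelines built later in this tutorial as a dependency graph and derives an order in which they can run. The `dag` dictionary and the use of `graphlib` are illustrative assumptions; Pachyderm tracks these dependencies for you.

```python
from graphlib import TopologicalSorter

# Hypothetical model: each key is a repo or pipeline, each value is the set
# of inputs it depends on. Names match the ones used in this tutorial.
dag = {
    "images": set(),                 # input repo, no upstream dependencies
    "edges": {"images"},             # single-stage pipeline reading images
    "montage": {"images", "edges"},  # second stage reading both repos
}

def run_order(graph):
    """Return the nodes in an order where every input precedes its consumer."""
    return list(TopologicalSorter(graph).static_order())

print(run_order(dag))
```

Because montage depends on edges, which in turn depends on images, there is only one valid order here.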

Tutorial: Image processing with OpenCV #

In this tutorial you’ll create an image edge detection pipeline that processes new data as it is added and outputs the results.

1. Create a Project #

To keep our work organized, we’re going to create a project named openCV and set it as the active project in our current context.

pachctl create project openCV
pachctl config update context --project openCV

You can confirm which project is set in your context at any time by running the following commands:

# prints current context name (local) 
pachctl config get active-context 

# prints local's context details
pachctl config get context local 

# {
#   "source": "IMPORTED",
#   "cluster_name": "docker-desktop",
#   "auth_info": "docker-desktop",
#   "cluster_deployment_id": "dev",
#   "project": "openCV"
# }
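
If you want to check the project from a script, the context details can be parsed as JSON. The sketch below works on the sample output shown above rather than calling pachctl; the helper name `active_project` is hypothetical, and returning `None` for a context without a project is an assumption of this sketch.

```python
import json

# Sample output copied from the tutorial; in practice this text would come
# from running `pachctl config get context local`.
SAMPLE_CONTEXT = """
{
  "source": "IMPORTED",
  "cluster_name": "docker-desktop",
  "auth_info": "docker-desktop",
  "cluster_deployment_id": "dev",
  "project": "openCV"
}
"""

def active_project(context_json):
    """Return the project recorded in a context, or None if it is unset."""
    return json.loads(context_json).get("project")

print(active_project(SAMPLE_CONTEXT))
```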

2. Create a Repo #

Repos should be dedicated to a single source of data such as log messages from a particular service, a users table, or training data.

pachctl create repo images

You can verify that the repository was created by running the following command:

pachctl list repo

# images 4 seconds ago ≤ 0B          [repoOwner]

3. Add Data #

In Pachyderm, you write data to an explicit commit. Commits are immutable snapshots of your data which give Pachyderm its version control properties. You can add, remove, or update files in a given commit.
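
The idea of immutable snapshots can be illustrated with a toy model. This is a sketch of the concept only, not how Pachyderm stores data: the `commit` function and the in-memory dictionaries are assumptions made for illustration.

```python
from types import MappingProxyType

def commit(parent, put=None, delete=()):
    """Build a new read-only snapshot from a parent snapshot.

    `put` maps file paths to contents to add or overwrite; `delete` lists
    paths to remove. The parent snapshot is left unchanged.
    """
    files = dict(parent)
    files.update(put or {})
    for path in delete:
        files.pop(path, None)
    return MappingProxyType(files)

empty = MappingProxyType({})
c1 = commit(empty, put={"/liberty.png": b"v1"})
c2 = commit(c1, put={"/liberty.png": b"v2", "/kitten.png": b"v1"})
c3 = commit(c2, delete=["/kitten.png"])

print(sorted(c1), sorted(c2), sorted(c3))
```

Each change produces a new snapshot, so the earlier versions of `/liberty.png` stay available in `c1` and `c2`.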

Upload an Image File #

We’re going to use the pachctl put file command with the -f flag, which takes the local path or URL of the file to upload, to add an image to the repo.

pachctl put file images@master:liberty.png -f <path-or-URL-of-image>

pachctl put file automatically starts and finishes a commit for you so you can add files more easily.


If you want to add many files over a period of time, you can do pachctl start commit and pachctl finish commit yourself.
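
As a rough sketch of that manual workflow, the Python below builds the sequence of pachctl commands for one commit that adds several files. The helper names and the local file paths are hypothetical, and the sketch assumes that `put file` writes into the commit that is open on the branch; by default it only prints the commands.

```python
import subprocess

def explicit_commit_commands(repo, branch, files):
    """Build the pachctl commands for one commit that adds several files.

    `files` maps the destination path in the repo to a local path or URL.
    """
    target = f"{repo}@{branch}"
    commands = [["pachctl", "start", "commit", target]]
    for dest, source in files.items():
        commands.append(["pachctl", "put", "file", f"{target}:{dest}", "-f", source])
    commands.append(["pachctl", "finish", "commit", target])
    return commands

def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)

# Hypothetical local files, used only for illustration.
cmds = explicit_commit_commands("images", "master", {"a.png": "./a.png", "b.png": "./b.png"})
run_all(cmds)
```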

You can confirm the commit using the following command:

pachctl list commit images

# REPO          BRANCH COMMIT                           FINISHED       SIZE     ORIGIN DESCRIPTION
# openCV/images master 37559e89ed0c4a0cb354649524050851 10 seconds ago 57.27KiB USER  

You can also view the filename in the commit using the following command:

pachctl list file images@master

# NAME         TYPE SIZE     
# /liberty.png file 57.27KiB

View Image #

In Terminal #
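
One way to look at the file from a terminal is to stream it out of the repo and save it locally. The sketch below assumes the `pachctl get file` command, which writes the file to standard output; the helper names and the local file name are hypothetical.

```python
import subprocess

def get_file_command(repo, branch, path):
    """Build the pachctl command that streams one file to standard output."""
    return ["pachctl", "get", "file", f"{repo}@{branch}:{path}"]

def download(repo, branch, path, local_path):
    """Run the command and save the streamed bytes to local_path."""
    result = subprocess.run(get_file_command(repo, branch, path),
                            check=True, capture_output=True)
    with open(local_path, "wb") as out:
        out.write(result.stdout)

print(" ".join(get_file_command("images", "master", "liberty.png")))
# To fetch the file for real (requires a running cluster):
# download("images", "master", "liberty.png", "liberty.png")
```

Once saved, the image can be opened with any local viewer.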

In Console #

In your Console, click on the images repo to visualize its commit and inspect its file:

[Image: Console showing liberty.png in the images repo]

4. Create a Pipeline #

Now that you have some data in your repo, it is time to do something with it using a pipeline.

Pipelines process data and are defined using a JSON pipeline specification. For this tutorial, we’ve already created the spec for you.

Review Pipeline Spec #

Take a moment to review the details of the provided pipeline spec so that you’ll know how to create one on your own in the future.

{
  // The `pipeline` section contains a `name` for identification; this name is also used to create a corresponding output repo.
  "pipeline": {
    "name": "edges"
  },
  "description": "A pipeline that performs image edge detection by using the OpenCV library.",
  // The `transform` section allows you to specify the docker `image` you want to use (`pachyderm/opencv:1.0`) and the `cmd` that defines the entry point.
  "transform": {
    "cmd": [ "python3", "/" ],
    "image": "pachyderm/opencv:1.0"
  },
  // The `input` section specifies repos visible to the running pipeline, and how to process the data from the repos.
  // Commits to these repos trigger the pipeline to create new processing jobs. In this case, `images` is the repo, and `/*` is the glob pattern.
  "input": {
    "pfs": {
      "repo": "images",
      // The glob pattern defines how the input data is split into datums if you want to distribute computation. `/*` means that each file can be processed individually.
      "glob": "/*"
    }
  }
}

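To see what the glob pattern does, here is a toy model of how files are grouped into datums. It is a simplified sketch that only covers the two patterns used in this tutorial (`/*` here and `/` in the montage pipeline later); the `datums` function is hypothetical and is not Pachyderm's matching logic.

```python
def datums(paths, glob):
    """Toy model of how a glob pattern groups files into datums.

    Only the two patterns used in this tutorial are handled:
    "/"  puts the whole repo in a single datum,
    "/*" makes each top-level file or directory its own datum.
    """
    if glob == "/":
        return [sorted(paths)]
    if glob == "/*":
        groups = {}
        for p in sorted(paths):
            top = "/" + p.strip("/").split("/")[0]
            groups.setdefault(top, []).append(p)
        return list(groups.values())
    raise ValueError(f"pattern not covered by this sketch: {glob}")

files = ["/liberty.png", "/AT-AT.png", "/kitten.png"]
print(datums(files, "/*"))
print(datums(files, "/"))
```

With `/*`, three files give three datums that can be processed independently; with `/`, the same files form one datum.
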
The following extract is the Python code run in this pipeline:

import cv2
import numpy as np
from matplotlib import pyplot as plt
import os

# make_edges reads an image from /pfs/images and outputs the result of running
# edge detection on that image to /pfs/out. Note that /pfs/images and
# /pfs/out are special directories that Pachyderm injects into the container.
def make_edges(image):
    img = cv2.imread(image)
    tail = os.path.split(image)[1]
    edges = cv2.Canny(img, 100, 200)
    plt.imsave(os.path.join("/pfs/out", os.path.splitext(tail)[0] + '.png'), edges, cmap='gray')

# walk /pfs/images and call make_edges on every file found
for dirpath, dirs, files in os.walk("/pfs/images"):
    for file in files:
        make_edges(os.path.join(dirpath, file))

The code simply walks over all the images in /pfs/images, performs edge detection, and writes the result to /pfs/out.


Your code must write out to /pfs/out (see the function make_edges(image) above). Pachyderm gathers data written to /pfs/out, versions it, and maps it to the pipeline’s output repo of the same name.
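
The naming rule that make_edges applies can be isolated without OpenCV. The sketch below mirrors only the path handling; `output_path` is a hypothetical helper, and the second input path is an invented example.

```python
import posixpath

def output_path(input_path, out_dir="/pfs/out"):
    """Mirror the naming in make_edges: same base name, .png extension."""
    tail = posixpath.split(input_path)[1]
    stem = posixpath.splitext(tail)[0]
    return posixpath.join(out_dir, stem + ".png")

print(output_path("/pfs/images/liberty.png"))
print(output_path("/pfs/images/photos/kitten.jpg"))
```

Note that the directory part of the input path is dropped, so two inputs with the same base name in different directories would map to the same output file.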

Now, let’s create the pipeline in Pachyderm:

pachctl create pipeline -f <path-or-URL-of-edges-pipeline-spec>
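
If you prefer to generate the spec yourself, a comment-free version can be produced from a plain dictionary. This is a minimal sketch: `edges_spec` is a hypothetical helper, and `/path/to/script.py` is a placeholder rather than the real entry point of the image.

```python
import json

def edges_spec(script_path):
    """Return the edges pipeline spec as a plain dict, without comments."""
    return {
        "pipeline": {"name": "edges"},
        "description": "A pipeline that performs image edge detection by using the OpenCV library.",
        "transform": {
            "cmd": ["python3", script_path],
            "image": "pachyderm/opencv:1.0",
        },
        "input": {"pfs": {"repo": "images", "glob": "/*"}},
    }

# "/path/to/script.py" is a placeholder, not the real entry point.
spec_text = json.dumps(edges_spec("/path/to/script.py"), indent=2)
print(spec_text)
# Saving spec_text to a file such as edges.json (a hypothetical name) would
# let you pass it to `pachctl create pipeline -f edges.json`.
```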

Again, check the end result in your Console:

[Image: Console showing the edges pipeline]

What Happens When You Create a Pipeline #

When you create a pipeline, Pachyderm runs your user code against all data currently in your input repo and against any data added later. Each such run is known as a job. The initial job downloads the specified Docker image, which is then used for all future jobs.

  1. View the job:
pachctl list job

# ID                               SUBJOBS PROGRESS CREATED       MODIFIED
# 23378d899d3d45738f55df3809841145 1       ▇▇▇▇▇▇▇▇ 5 seconds ago 5 seconds ago
  2. Check the state of your pipeline:
pachctl list pipeline

# edges 1       images:/* 2 minutes ago running / success A pipeline that performs image edge detection by using the OpenCV library.
  3. List your repositories:
pachctl list repo

# edges  10 minutes ago ≤ 22.22KiB    [repoOwner]  Output repo for pipeline edges.
# images 3 hours ago    ≤ 57.27KiB    [repoOwner]

Reading the Output #

We can view the output data from the edges repo in the same fashion that we viewed the input data.

[Image: Console showing liberty.png in the edges repo]

Processing More Data #

  1. Create two new commits:
pachctl put file images@master:AT-AT.png -f <path-or-URL-of-image>
pachctl put file images@master:kitten.png -f <path-or-URL-of-image>
  2. View the list of jobs that have started:
pachctl list job

# ID                               SUBJOBS PROGRESS CREATED        MODIFIED
# 1c1a9d7d36944eabb4f6f14ebca25bf1 1       ▇▇▇▇▇▇▇▇ 31 seconds ago 31 seconds ago
# fe5c4f70ac4347fd9c5934f0a9c44651 1       ▇▇▇▇▇▇▇▇ 47 seconds ago 47 seconds ago
# 23378d899d3d45738f55df3809841145 1       ▇▇▇▇▇▇▇▇ 12 minutes ago 12 minutes ago
  3. View the output data in the edges repo the same way you viewed liberty.png.
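
Each of the three commits made so far (liberty.png, AT-AT.png, and kitten.png) triggered its own job. As a quick check, the sketch below counts the jobs in the sample listing above; it assumes the plain-text table layout shown in this tutorial, and `job_ids` is a hypothetical helper.

```python
SAMPLE = """\
ID                               SUBJOBS PROGRESS CREATED        MODIFIED
1c1a9d7d36944eabb4f6f14ebca25bf1 1       ▇▇▇▇▇▇▇▇ 31 seconds ago 31 seconds ago
fe5c4f70ac4347fd9c5934f0a9c44651 1       ▇▇▇▇▇▇▇▇ 47 seconds ago 47 seconds ago
23378d899d3d45738f55df3809841145 1       ▇▇▇▇▇▇▇▇ 12 minutes ago 12 minutes ago
"""

def job_ids(listing):
    """Return the first column of every row after the header line."""
    rows = [line.split() for line in listing.splitlines()[1:]]
    return [fields[0] for fields in rows if fields]

ids = job_ids(SAMPLE)
print(len(ids), "jobs; first listed:", ids[0])
```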

5. Create a DAG #

Currently, we’ve only set up a single-stage pipeline. Let’s create a multi-stage pipeline (also known as a DAG) by adding a montage pipeline that takes both our original and edge-detected images and arranges them into a single montage of images.


Below is the pipeline spec for this new pipeline:

{
  "pipeline": {
    "name": "montage"
  },
  "description": "A pipeline that combines images from the `images` and `edges` repositories into a montage.",
  "input": {
    "cross": [ {
      "pfs": {
        "glob": "/",
        "repo": "images"
      }
    }, {
      "pfs": {
        "glob": "/",
        "repo": "edges"
      }
    } ]
  },
  "transform": {
    "cmd": [ "sh" ],
    "image": "v4tech/imagemagick",
    "stdin": [ "montage -shadow -background SkyBlue -geometry 300x300+2+2 $(find /pfs -type f | sort) /pfs/out/montage.png" ]
  }
}

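The `$(find /pfs -type f | sort)` part of the stdin command decides which files go into the montage and in what order. The Python sketch below approximates that listing on a throwaway directory tree; `montage_inputs` is a hypothetical helper, and Python's sort compares code points, which may differ from `sort` under some locales.

```python
import os
import tempfile

def montage_inputs(root):
    """List every regular file under root, sorted by path, like `find root -type f | sort`."""
    found = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            found.append(os.path.join(dirpath, name))
    return sorted(found)

# Build a throwaway directory tree that mimics /pfs inside the montage container.
with tempfile.TemporaryDirectory() as pfs:
    for repo, name in [("images", "liberty.png"), ("edges", "liberty.png"), ("images", "kitten.png")]:
        os.makedirs(os.path.join(pfs, repo), exist_ok=True)
        open(os.path.join(pfs, repo, name), "wb").close()
    listing = [os.path.relpath(p, pfs) for p in montage_inputs(pfs)]

print(listing)
```

Because paths are sorted as whole strings, everything under edges is listed before everything under images.
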
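This montage pipeline spec is similar to our edges pipeline except for the following differences:

  1. It uses a different Docker image, v4tech/imagemagick.
  2. It runs a sh command that reads its script from stdin instead of running a python3 command.
  3. It has multiple input data repositories, images and edges.

In the montage pipeline we combine our multiple input data repositories using a cross pattern. Because both inputs use the glob pattern /, each repo is treated as a single datum, so the cross creates a single pairing of all our input images with all our edge-detected images.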
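
The effect of a cross input can be sketched with a Cartesian product. This is a toy model, not Pachyderm's implementation: the `cross` function is hypothetical, and the file lists stand in for the contents of the two repos after the three uploads in this tutorial.

```python
from itertools import product

def cross(*inputs):
    """Pair every datum of each input with every datum of the others."""
    return list(product(*inputs))

images_files = ["/liberty.png", "/AT-AT.png", "/kitten.png"]
edges_files = ["/liberty.png", "/AT-AT.png", "/kitten.png"]

# Glob "/": each repo contributes one datum holding all of its files.
whole_repo = cross([tuple(images_files)], [tuple(edges_files)])

# Glob "/*" (for comparison): each file is its own datum.
per_file = cross(images_files, edges_files)

print(len(whole_repo), len(per_file))
```

With the glob `/` there is one pairing, so the montage command runs once and sees every file; with `/*` the same repos would yield nine pairings.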

  1. Create the montage pipeline:
pachctl create pipeline -f <path-or-URL-of-montage-pipeline-spec>
  2. View the triggered jobs:
pachctl list job

# ID                               SUBJOBS PROGRESS CREATED        MODIFIED
# 01e0c8040e18429daf7f67ce34c3a5d7 1       ▇▇▇▇▇▇▇▇ 11 seconds ago 11 seconds ago
# 1c1a9d7d36944eabb4f6f14ebca25bf1 1       ▇▇▇▇▇▇▇▇ 12 minutes ago 12 minutes ago
# fe5c4f70ac4347fd9c5934f0a9c44651 1       ▇▇▇▇▇▇▇▇ 12 minutes ago 12 minutes ago
# 23378d899d3d45738f55df3809841145 1       ▇▇▇▇▇▇▇▇ 24 minutes ago 24 minutes ago
  3. View the generated montage image, montage.png, in the montage output repo.