Beginner Tutorial

Learn how to quickly ingest photos, trace their outlines, and output a collage using the transformed data.

Before You Start #

Context #

Pachyderm runs on a Kubernetes cluster that you interact with through either the pachctl CLI or Console, a GUI.

  • pachctl is great for users already experienced with using a CLI.
  • Console is great for beginners and helps with visualizing relationships between projects, repos, and pipelines.

Within the cluster, you can create projects that contain repos and pipelines. Pipelines can be single-stage or multi-stage; multi-stage pipelines are commonly referred to as DAGs.

Tutorial: Image processing with OpenCV #

In this tutorial, you’ll create an image edge detection pipeline that processes new data as it is added and outputs the results.

1. Create a Project #

Create Project:

2. Create a Repo #

Repos should be dedicated to a single source of data such as log messages from a particular service, a users table, or training data.

Create Repo:

3. Add Data #

In Pachyderm, you write data to an explicit commit. Commits are immutable snapshots of your data which give Pachyderm its version control properties. You can add, remove, or update files in a given commit.
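
To make the snapshot idea concrete, here is a minimal toy model in plain Python. It is only a sketch of the concept, not Pachyderm's implementation or API: the `ToyRepo` class, its `commit` method, and the file contents are all hypothetical names invented for illustration. Each commit copies its parent's file listing, applies the changes, and freezes the result so earlier snapshots can never change.

```python
from types import MappingProxyType


class ToyRepo:
    """Toy in-memory model of a versioned repo (hypothetical, not Pachyderm's API)."""

    def __init__(self):
        # Each finished commit is a read-only mapping of path -> bytes.
        self.commits = [MappingProxyType({})]

    def commit(self, puts=None, deletes=()):
        """Create a new snapshot from the latest one plus the given changes."""
        files = dict(self.commits[-1])      # copy the parent snapshot
        for path in deletes:
            files.pop(path, None)           # remove a file
        files.update(puts or {})            # add or overwrite files
        snapshot = MappingProxyType(files)  # freeze: no further writes allowed
        self.commits.append(snapshot)
        return len(self.commits) - 1        # commit index used as a toy ID


repo = ToyRepo()
first = repo.commit(puts={"/liberty.png": b"v1"})
second = repo.commit(puts={"/liberty.png": b"v2", "/kitten.png": b"k"})

# Older snapshots are unaffected by later commits.
print(repo.commits[first]["/liberty.png"])   # b'v1'
print(sorted(repo.commits[second]))          # ['/kitten.png', '/liberty.png']
```

Because every snapshot stays readable, any earlier state of the data can still be inspected after later commits, which is the property that version control relies on.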

Add Data:

Bonus: View Image #

You can view the files you’ve uploaded in the Console or in your Terminal.

In Terminal #

In Console #

In your Console, click on the images repo to visualize its commit and inspect its file:

Console images liberty

4. Create a Pipeline #

Now that you have some data in your repo, it is time to do something with it using a pipeline.

Pipelines process data and are defined using a JSON pipeline specification. For this tutorial, we’ve already created the spec for you.

Review Pipeline Spec #

Take a moment to review the details of the provided pipeline spec so that you’ll know how to create one on your own in the future.

{
  // The `pipeline` section contains a `name` for identification; this name is also used to create a corresponding output repo.
  "pipeline": {
    "name": "edges"
  },
  "description": "A pipeline that performs image edge detection by using the OpenCV library.",
  // The `transform` section allows you to specify the Docker `image` you want to use (`pachyderm/opencv:1.0`) and the `cmd` that defines the entry point (``).
  "transform": {
    "cmd": [ "python3", "/" ],
    "image": "pachyderm/opencv:1.0"
  },
  // The `input` section specifies the repos visible to the running pipeline and how to process the data from those repos.
  // Commits to these repos trigger the pipeline to create new processing jobs. In this case, `images` is the repo, and `/*` is the glob pattern.
  "input": {
    "pfs": {
      "repo": "images",
      // The glob pattern defines how the input data is split into datums so that computation can be distributed. `/*` means that each file can be processed individually.
      "glob": "/*"
    }
  }
}

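The effect of the glob pattern can be illustrated with a small sketch. This is a toy model written for this tutorial, not Pachyderm's matching logic: the `split_into_datums` helper is hypothetical and only handles `/` and single-level patterns such as `/*`. It shows how `/*` yields one datum per top-level file, while `/` treats the whole repo as a single datum.

```python
from fnmatch import fnmatch


def split_into_datums(paths, glob):
    """Group file paths into datums (toy model; handles "/" and "/*"-style globs only)."""
    if glob == "/":
        return [sorted(paths)]  # the whole repo is one datum
    datums = {}
    for path in sorted(paths):
        # Reduce each path to its top-level entry, e.g. "/dir/a.png" -> "/dir".
        top = "/" + path.lstrip("/").split("/", 1)[0]
        if fnmatch(top, glob):
            datums.setdefault(top, []).append(path)
    return list(datums.values())


files = ["/liberty.png", "/AT-AT.png", "/kitten.png"]

print(split_into_datums(files, "/*"))
# [['/AT-AT.png'], ['/kitten.png'], ['/liberty.png']]

print(split_into_datums(files, "/"))
# [['/AT-AT.png', '/kitten.png', '/liberty.png']]
```

With one datum per file, each image can be handed to a separate worker, which is why `/*` suits this edge detection pipeline.
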
The following extract is the Python user code that runs in this pipeline:

import cv2
import numpy as np
from matplotlib import pyplot as plt
import os

# make_edges reads an image from /pfs/images and outputs the result of running
# edge detection on that image to /pfs/out. Note that /pfs/images and
# /pfs/out are special directories that Pachyderm injects into the container.
def make_edges(image):
    img = cv2.imread(image)
    tail = os.path.split(image)[1]
    edges = cv2.Canny(img, 100, 200)
    plt.imsave(os.path.join("/pfs/out", os.path.splitext(tail)[0] + '.png'), edges, cmap='gray')

# walk /pfs/images and call make_edges on every file found
for dirpath, dirs, files in os.walk("/pfs/images"):
    for file in files:
        make_edges(os.path.join(dirpath, file))

The code simply walks over all the images in /pfs/images, performs edge detection, and writes the result to /pfs/out.

  • /pfs/images and /pfs/out are special local directories that Pachyderm creates within the container automatically.
  • Input data is stored in /pfs/<input_repo_name>.

Your code must write its results to /pfs/out (see the function make_edges(image) above). Pachyderm gathers the data written to /pfs/out, versions it, and stores it in the pipeline’s output repo, which has the same name as the pipeline (here, edges).
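
One way to try out this read-from-input, write-to-output shape before deploying is to make the directories parameters, so the same function can run against temporary folders on your own machine. The sketch below is an assumption-laden stand-in, not the tutorial's user code: `process_all`, the `transform` parameter, and the file `sample.jpg` are hypothetical, and a simple byte transformation replaces the OpenCV edge detection so that only the standard library is needed.

```python
import os
import tempfile


def process_all(input_dir="/pfs/images", output_dir="/pfs/out", transform=bytes.upper):
    """Walk input_dir, apply transform to each file's bytes, and write results to output_dir."""
    written = []
    for dirpath, _dirs, files in os.walk(input_dir):
        for name in sorted(files):
            with open(os.path.join(dirpath, name), "rb") as src:
                data = src.read()
            # Same naming rule as make_edges: keep the stem, force a .png suffix.
            out_name = os.path.splitext(name)[0] + ".png"
            with open(os.path.join(output_dir, out_name), "wb") as dst:
                dst.write(transform(data))
            written.append(out_name)
    return written


# Local dry run with temporary directories standing in for /pfs/images and /pfs/out.
with tempfile.TemporaryDirectory() as src_dir, tempfile.TemporaryDirectory() as out_dir:
    with open(os.path.join(src_dir, "sample.jpg"), "wb") as f:
        f.write(b"pixels")
    result = process_all(src_dir, out_dir)
    print(result)  # ['sample.png']
```

Inside a pipeline container the defaults would apply, so calling `process_all()` with no arguments reads from /pfs/images and writes to /pfs/out.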

Now, let’s create the pipeline in Pachyderm:

pachctl create pipeline -f

Again, check the end result in your Console:

Console edges pipeline

What Happens When You Create a Pipeline #

When you create a pipeline, Pachyderm runs your user code against the data already in your input repo and against any data added later. Each such run is known as a job. The initial job downloads the specified Docker image, which is used for all future jobs.

  1. View the job:
pachctl list job

# ID                               SUBJOBS PROGRESS CREATED       MODIFIED
# 23378d899d3d45738f55df3809841145 1       ▇▇▇▇▇▇▇▇ 5 seconds ago 5 seconds ago
  2. Check the state of your pipeline:
pachctl list pipeline

# edges 1       images:/* 2 minutes ago running / success A pipeline that performs image edge detection by using the OpenCV library.
  3. List your repositories:
pachctl list repo

# edges  10 minutes ago ≤ 22.22KiB    [repoOwner]  Output repo for pipeline edges.
# images 3 hours ago    ≤ 57.27KiB    [repoOwner]

Reading the Output #

We can view the output data from the edges repo in the same fashion that we viewed the input data.

Console edges liberty

Processing More Data #

  1. Create two new commits:
pachctl put file images@master:AT-AT.png -f
pachctl put file images@master:kitten.png -f
  2. View the list of jobs that have started:
pachctl list job

# ID                               SUBJOBS PROGRESS CREATED        MODIFIED
# 1c1a9d7d36944eabb4f6f14ebca25bf1 1       ▇▇▇▇▇▇▇▇ 31 seconds ago 31 seconds ago
# fe5c4f70ac4347fd9c5934f0a9c44651 1       ▇▇▇▇▇▇▇▇ 47 seconds ago 47 seconds ago
# 23378d899d3d45738f55df3809841145 1       ▇▇▇▇▇▇▇▇ 12 minutes ago 12 minutes ago
  3. View the output data:

5. Create a DAG #

So far, we’ve only set up a single-stage pipeline. Let’s create a multi-stage pipeline (also known as a DAG) by adding a montage pipeline that takes both our original and edge-detected images and arranges them into a single montage.


Below is the pipeline spec for this new pipeline:

{
  "pipeline": {
    "name": "montage"
  },
  "description": "A pipeline that combines images from the `images` and `edges` repositories into a montage.",
  "input": {
    "cross": [ {
      "pfs": {
        "glob": "/",
        "repo": "images"
      }
    },
    {
      "pfs": {
        "glob": "/",
        "repo": "edges"
      }
    } ]
  },
  "transform": {
    "cmd": [ "sh" ],
    "image": "v4tech/imagemagick",
    "stdin": [ "montage -shadow -background SkyBlue -geometry 300x300+2+2 $(find /pfs -type f | sort) /pfs/out/montage.png" ]
  }
}

This montage pipeline spec is similar to our edges pipeline except for the following differences:

  • We are using a different Docker image that has imagemagick installed.
  • We are executing an sh command with stdin instead of a Python script in the pipeline’s transform section.
  • We have multiple input data repositories (images and edges).

In the montage pipeline, we combine our two input repositories using a cross pattern. Because both inputs use the glob pattern /, each repo is treated as a single datum, so the cross creates a single pairing of our input images with our edge-detected images.
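
The pairing behaviour of a cross input can be sketched as a Cartesian product over the datums of each input. This is a conceptual illustration only: the `cross` helper and the file names are assumptions made for the example, and the sketch does not call Pachyderm.

```python
from itertools import product


def cross(*inputs):
    """Pair every datum of each input with every datum of the other inputs."""
    return list(product(*inputs))


# Illustrative file listings for the two repos.
images = ["/AT-AT.png", "/kitten.png", "/liberty.png"]
edges = ["/AT-AT.png", "/kitten.png", "/liberty.png"]

# Glob "/": each repo is a single datum holding all of its files.
whole_repo = cross([tuple(images)], [tuple(edges)])
print(len(whole_repo))  # 1

# Glob "/*": each file is its own datum, so the cross has 3 x 3 combinations.
per_file = cross(images, edges)
print(len(per_file))  # 9
```

A single pairing is what the montage needs, since one command has to see every file from both repos at once to build one combined image.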

  1. Create the montage pipeline:
pachctl create pipeline -f
  2. View the triggered jobs:
pachctl list job

# ID                               SUBJOBS PROGRESS CREATED        MODIFIED
# 01e0c8040e18429daf7f67ce34c3a5d7 1       ▇▇▇▇▇▇▇▇ 11 seconds ago 11 seconds ago
# 1c1a9d7d36944eabb4f6f14ebca25bf1 1       ▇▇▇▇▇▇▇▇ 12 minutes ago 12 minutes ago
# fe5c4f70ac4347fd9c5934f0a9c44651 1       ▇▇▇▇▇▇▇▇ 12 minutes ago 12 minutes ago
# 23378d899d3d45738f55df3809841145 1       ▇▇▇▇▇▇▇▇ 24 minutes ago 24 minutes ago
  3. View the generated montage image: