Beginner Tutorial:


How Pachyderm Works

Pachyderm is deployed within a Kubernetes cluster to manage and version your data using projects, input repositories, pipelines, datums and output repositories. A project can house many repositories and pipelines, and when a pipeline runs a data transformation job it chunks your inputs into datums for processing.

The number of datums is determined by the glob pattern defined in your pipeline specification. If the glob pattern matches all inputs as a single unit, the pipeline processes one datum; if it matches each input file individually, the pipeline processes one datum per file in the input; and so on for patterns in between.
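To make the relationship between glob patterns and datum counts concrete, here is a minimal Python sketch. It is a simplified model for illustration, not Pachyderm's actual matching logic: the function name datums_for_glob and the sample file paths are hypothetical, and the model only assumes that "/" treats the whole input as one datum while each additional path component in the pattern splits the input one level deeper.

```python
from fnmatch import fnmatchcase


def datums_for_glob(paths, glob):
    """Group file paths into datums under a simplified glob model.

    Hypothetical helper for illustration only; not part of Pachyderm.
    """
    pattern_parts = [p for p in glob.split("/") if p]
    if not pattern_parts:
        # "/" covers the whole input, so everything lands in one datum.
        return [sorted(paths)] if paths else []

    groups = {}
    for path in paths:
        components = [c for c in path.split("/") if c]
        if len(components) < len(pattern_parts):
            continue  # path is shallower than the pattern, so no match
        prefix = components[: len(pattern_parts)]
        if all(fnmatchcase(c, g) for c, g in zip(prefix, pattern_parts)):
            # Each distinct matched prefix becomes its own datum.
            groups.setdefault("/" + "/".join(prefix), []).append(path)
    return [sorted(files) for _, files in sorted(groups.items())]


# Hypothetical input repository contents.
files = ["/a.png", "/b.png", "/videos/c.mp4", "/videos/d.mp4"]

whole_repo = datums_for_glob(files, "/")      # everything as one datum
top_level = datums_for_glob(files, "/*")      # one datum per top-level entry
nested = datums_for_glob(files, "/*/*")       # one datum per second-level file
```

In this model, "/" yields a single datum, "/*" yields three (two images plus the videos directory), and "/*/*" yields two (one per video file).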

The end result of your data transformation should always be saved to /pfs/out. The contents of /pfs/out are automatically made accessible from the pipeline’s output repository, which has the same name as the pipeline. For example, all files saved to /pfs/out for a pipeline named foo are accessible from the foo output repository.
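As a rough illustration of the /pfs/out convention, the following Python sketch shows how a transformation script might write its result. The helper name save_result and the example file name are hypothetical; the only detail taken from the text above is that results belong in /pfs/out. The out_dir parameter exists so the function can be tried outside a pipeline.

```python
from pathlib import Path


def save_result(filename: str, data: bytes, out_dir: str = "/pfs/out") -> Path:
    """Write a transformation result into the pipeline's output directory.

    Hypothetical helper; out_dir defaults to /pfs/out, the location the
    tutorial says results should be saved to.
    """
    out_path = Path(out_dir) / filename
    # Create any intermediate directories so nested outputs also work.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(data)
    return out_path


# Inside a pipeline container, a call might look like:
#   save_result("traced/cat.png", image_bytes)
# For a pipeline named foo, that file would then be available
# from the foo output repository.
```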

Pipelines combine to create DAGs (directed acyclic graphs), and a DAG can consist of just one pipeline. Don’t worry if this sounds confusing! We’ll walk you through the process step-by-step.

How to Interact with Pachyderm

You can interact with your Pachyderm cluster using the PachCTL CLI or through Console, a GUI.

  • PachCTL is great for users already experienced with using a CLI.
  • Console is great for beginners and helps with visualizing relationships between projects, repos, and pipelines.

Before You Start

Part 1: Beginner Overview

In this tutorial, we’ll walk you through how to use Pachyderm to process images and videos using OpenCV. OpenCV is a popular open-source computer vision library that can be used to perform image processing and video analysis.

This DAG has 6 steps. Its goal is to take in raw photos and video content, draw edge-detected traces, and output a comparison collage of the original and processed images:

  1. Convert videos to MP4 format
  2. Extract frames from videos
  3. Trace the outline of each frame and standalone image
  4. Create .gifs from the traced video frames
  5. Re-shuffle the content so it is organized by “original” and “traced” images
  6. Build a comparison collage using a static HTML page
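The six steps above can be pictured as a small dependency graph. The sketch below models them in Python with the standard library's graphlib module. The step names and the exact edges between them are assumptions inferred from the wording of the list, not the tutorial's actual pipeline names or inputs.

```python
from graphlib import TopologicalSorter

# Each key is a step; its set holds the steps it reads from.
# Names and edges are hypothetical, inferred from the list above.
dag = {
    "convert_to_mp4": {"raw_input"},
    "extract_frames": {"convert_to_mp4"},
    "trace_outlines": {"extract_frames", "raw_input"},
    "create_gifs": {"trace_outlines"},
    "shuffle_content": {"raw_input", "trace_outlines", "create_gifs"},
    "build_collage": {"shuffle_content"},
}

# A topological sort gives an order in which every step runs
# only after all of the steps it depends on.
run_order = list(TopologicalSorter(dag).static_order())

for position, step in enumerate(run_order, start=1):
    print(f"{position}. {step}")
```

Because every step only reads from earlier steps, the graph has no cycles, which is what allows it to be treated as a DAG.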