If you are using Pachyderm version 1.9.7 or earlier, go to the documentation archive.

Individual Developer Workflow

A typical Pachyderm workflow involves multiple iterations of experimenting with your code and pipeline specs.

Info

Before you read this section, make sure that you understand basic Pachyderm pipeline concepts described in Concepts.

How it works

Working with Pachyderm includes multiple iterations of the following steps:

[Figure: Developer workflow]

Step 1: Write Your Analysis Code

Because Pachyderm is completely language-agnostic, the code that processes data in Pachyderm can be written in any language and can use any libraries of your choice. Whether your code is as simple as a bash command or as complicated as a TensorFlow neural network, it must be built with all of its required dependencies into a container that can run anywhere, including inside Pachyderm. See Examples.

Your code does not have to import any special Pachyderm functionality or libraries. However, it must meet the following requirements:

  • Read files from a local file system. Pachyderm automatically mounts each input data repository as /pfs/<repo_name> in the running containers of your Docker image. Therefore, the code that you write needs to read input data from this directory, similar to any other file system.

    Because Pachyderm automatically spreads data across parallel containers, your analysis code does not have to deal with data sharding or parallelization. For example, if you have four containers that run your Python code, Pachyderm automatically supplies ¼ of the input data to /pfs/<repo_name> in each running container. You can adjust how the workload is distributed through tunable parameters in the pipeline specification.

  • Write files to a local file system, for example, to save results. Your code must write to the /pfs/out directory that Pachyderm mounts in all of your running containers. As with reading data, your code does not have to manage parallelization or sharding.
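The two requirements above can be sketched as a small Python script. This is a minimal illustration, not part of Pachyderm: the repo name `data`, the `process` helper, the upper-casing transform, and the `INPUT_DIR` and `OUTPUT_DIR` environment variables are hypothetical placeholders. Only the /pfs/<repo_name> and /pfs/out paths come from the text above.

```python
import os


def process(text: str) -> str:
    """Hypothetical transform; replace it with your own analysis."""
    return text.upper()


def run(input_dir: str = "/pfs/data", output_dir: str = "/pfs/out") -> list:
    """Read every file under input_dir and write the processed result to
    the same relative path under output_dir.

    Returns the sorted relative paths of the files that were written.
    """
    written = []
    for root, _dirs, files in os.walk(input_dir):
        for name in files:
            src = os.path.join(root, name)
            rel = os.path.relpath(src, input_dir)
            dst = os.path.join(output_dir, rel)
            # Recreate any sub-directories of the input under the output.
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            with open(src) as f_in, open(dst, "w") as f_out:
                f_out.write(process(f_in.read()))
            written.append(rel)
    return sorted(written)


if __name__ == "__main__":
    # The directories are overridable (an assumption of this sketch) so the
    # script can be tried locally, outside of a Pachyderm container.
    run(os.environ.get("INPUT_DIR", "/pfs/data"),
        os.environ.get("OUTPUT_DIR", "/pfs/out"))
```

The script contains no sharding logic: it processes whatever files appear in its input directory, which is all that the requirements above ask of your code.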

Step 2: Build Your Docker Image

When you create a Pachyderm pipeline, you need to specify a Docker image that includes the code or binary that you want to run. Therefore, every time you modify your code, you need to build a new Docker image, push it to your image registry, and update the image tag in the pipeline spec. This section describes one way of building Docker images, but if you have your own routine, feel free to apply it.

To build an image, you need to create a Dockerfile. However, do not use the CMD instruction in your Dockerfile to specify the commands that you want to run. Instead, add them to the cmd field in your pipeline specification. Pachyderm runs these commands inside the container during job execution rather than relying on Docker to run them. The reason is that Pachyderm cannot execute your code immediately when your container starts. Instead, it runs a shim process in your container and calls your pipeline specification's cmd from there.
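As a quick safeguard for the rule above, a few lines of Python can flag a Dockerfile that still declares CMD. This is a naive sketch rather than a Pachyderm or Docker tool: the helper name `find_cmd_lines` is hypothetical, and the scan is line-based, so it does not parse multi-line continuations or heredocs.

```python
import re

# Dockerfile instruction names are case-insensitive, so match "CMD" or "cmd".
# Comment lines start with "#" and therefore never match this pattern.
_CMD_RE = re.compile(r"^\s*CMD\b", re.IGNORECASE)


def find_cmd_lines(dockerfile_text: str) -> list:
    """Return the 1-based numbers of lines that begin with a CMD instruction."""
    return [
        number
        for number, line in enumerate(dockerfile_text.splitlines(), start=1)
        if _CMD_RE.match(line)
    ]
```

An empty result suggests that the commands to run belong only in the cmd field of the pipeline specification, as described above.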

After building your image, you need to upload it to a public or private image registry, such as Docker Hub.

Alternatively, you can use Pachyderm's built-in functionality to build, tag, and push images by running the pachctl update pipeline command with the --build and --push-images flags. For more information, see Update a Pipeline.

Note

The Dockerfile example linked in the steps below is for reference only. Your Dockerfile might look completely different.

To build a Docker image, complete the following steps:

  1. If you do not have a registry, create one with your preferred provider. If you decide to use Docker Hub, follow the Docker Hub Quickstart to create a repository for your project.
  2. Create a Dockerfile for your project. See the OpenCV example.
  3. Log in to an image registry.

    • If you use Docker Hub, run:

      docker login --username=<dockerhub-username> --password=<dockerhub-password> <dockerhub-fqdn>
      
  4. Build a new image from the Dockerfile by specifying a tag:

    docker build -t <IMAGE>:<TAG> .
    
  5. Push your image to your image registry.

    • If you use Docker Hub, run:

      docker push <IMAGE>:<TAG>
      

For more information about building Docker images, see the Docker documentation.
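Because the build and push steps repeat every time you change your code, you might script them. The following Python sketch only assembles the same docker build and docker push commands shown above. The helper names `image_commands` and `run_commands` are hypothetical, and the commands are printed rather than executed unless you pass dry_run=False.

```python
import shlex
import subprocess


def image_commands(image: str, tag: str, context: str = ".") -> list:
    """Return the docker build and docker push commands for <image>:<tag>."""
    ref = f"{image}:{tag}"
    return [
        ["docker", "build", "-t", ref, context],
        ["docker", "push", ref],
    ]


def run_commands(commands: list, dry_run: bool = True) -> list:
    """Print each command and execute it only when dry_run is False.

    Returns the printed command lines so that callers can log them.
    """
    lines = []
    for cmd in commands:
        line = shlex.join(cmd)
        print(line)
        lines.append(line)
        if not dry_run:
            # check=True stops at the first failing step, so a failed
            # build is never followed by a push.
            subprocess.run(cmd, check=True)
    return lines
```

For example, run_commands(image_commands("<IMAGE>", "<TAG>")) prints the two commands without running them. Using a new tag for each build makes it clear which image a pipeline spec refers to.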

Step 3: Load Your Data to Pachyderm

You need to add your data to Pachyderm so that your pipeline runs your code against it. You can do so by using one of the following methods:

  • By using the pachctl put file command
  • By using a special type of pipeline, such as a spout or cron
  • By using one of Pachyderm's language clients
  • By using a compatible S3 client
  • By using the Pachyderm UI (Enterprise version or free trial)

For more information, see Load Your Data Into Pachyderm.

Step 4: Create a Pipeline

Pachyderm's pipeline specifications store the configuration information about the Docker image and code that Pachyderm should run. Pipeline specifications are stored in JSON format. As soon as you create a pipeline, Pachyderm spins up one or more pods on a Kubernetes worker node in which the pipeline code runs. By default, after the pipeline finishes running, the pods continue to run while waiting for new data to be committed into the Pachyderm input repository. You can configure this behavior, as well as many other parameters, in the pipeline specification.

A standard pipeline specification must include the following parameters:

  • name
  • transform
  • parallelism
  • input

Note

Some special types of pipelines, such as a spout pipeline, do not require you to specify all of these parameters.

You can store your pipeline specification locally or in a remote location, such as a GitHub repository.

To create a pipeline, complete the following steps:

  1. Create a pipeline specification. Here is an example of a pipeline spec, saved as my-pipeline.json:

    {
      "pipeline": {
        "name": "my-pipeline"
      },
      "transform": {
        "image": "my-pipeline-image",
        "cmd": ["/binary", "/pfs/data", "/pfs/out"]
      },
      "input": {
        "pfs": {
          "repo": "data",
          "glob": "/*"
        }
      }
    }
    
  2. Create a Pachyderm pipeline from the spec:

    pachctl create pipeline -f my-pipeline.json
    

    You can specify a local file or a file stored in a remote location, such as a GitHub repository. For example, https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/edges.json.
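If you generate pipeline specs from a script, for example to point the spec at a freshly pushed image tag on each iteration, the steps above can be sketched in Python as follows. The `SPEC` constant mirrors the example spec, and the `write_spec` helper and its `image` parameter are hypothetical names introduced for this illustration. The function only writes the file and returns the pachctl command. It does not run it.

```python
import json
from pathlib import Path
from typing import Optional

# Same content as the example pipeline spec shown above.
SPEC = {
    "pipeline": {"name": "my-pipeline"},
    "transform": {
        "image": "my-pipeline-image",
        "cmd": ["/binary", "/pfs/data", "/pfs/out"],
    },
    "input": {"pfs": {"repo": "data", "glob": "/*"}},
}


def write_spec(spec: dict, path: str, image: Optional[str] = None) -> list:
    """Write the spec to path as JSON and return the matching pachctl command.

    If image is given, transform.image is set to it in the written file.
    The spec that was passed in is left unmodified.
    """
    spec = json.loads(json.dumps(spec))  # deep copy via a JSON round trip
    if image is not None:
        spec["transform"]["image"] = image
    Path(path).write_text(json.dumps(spec, indent=2) + "\n")
    return ["pachctl", "create", "pipeline", "-f", path]
```

You can pass the returned list to subprocess.run on a machine where pachctl is installed and connected to your cluster. For a pipeline that already exists, use the pachctl update pipeline command instead.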

Last updated: April 28, 2020