In this tutorial, we’ll build a scalable inference pipeline for breast cancer detection using task parallelism.

Before You Start

Tutorial

The user code for this tutorial runs in a Docker image built on top of the pytorch/pytorch base image, which includes the necessary dependencies. The underlying code and pre-trained breast cancer detection model come from this repo, developed by the Center for Data Science and the Department of Radiology at NYU. Their original paper can be found here.

1. Create an Input Repo

2. Create CPU Pipelines

In task parallelism, we separate the CPU-based preprocessing from the GPU-based tasks, so GPUs are requested only for the steps that need them, which reduces cloud costs when scaling. Because inference is split into multiple tasks, each pipeline can be updated independently, which simplifies model deployment and collaboration.
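
As a rough illustration of how the CPU/GPU split can show up in configuration, the sketch below writes a hypothetical pipeline spec that asks for a GPU in a single step. The pipeline name, input repo, image, and command are placeholders and not the tutorial's actual values. The `resource_limits.gpu` stanza is the part that matters here; its exact spelling may differ between versions, so treat it as an assumption and check it against the pipeline specification for your release. A CPU-only step would simply omit that stanza.

```shell
#!/bin/sh
# Hypothetical output path; override by exporting SPEC_FILE first.
SPEC_FILE="${SPEC_FILE:-./gpu_step_example.json}"

# Write a placeholder spec for a GPU-backed step.
# Every name below (pipeline, repo, image, cmd) is illustrative only.
cat > "$SPEC_FILE" <<'EOF'
{
  "pipeline": { "name": "gpu_step_example" },
  "input": { "pfs": { "repo": "cpu_step_example", "glob": "/" } },
  "transform": {
    "image": "example/inference-image:latest",
    "cmd": ["/bin/bash", "gpu_step.sh"]
  },
  "resource_limits": {
    "gpu": { "type": "nvidia.com/gpu", "number": 1 }
  }
}
EOF

echo "wrote $SPEC_FILE"
```

With this layout, only the pods for the GPU step are scheduled onto GPU nodes, while the preprocessing steps can scale out on cheaper CPU nodes.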

We can split the run.sh script used in the previous tutorial (Data Parallelism Pipeline) into five processing steps: the four already defined in the script, plus a visualization step. Each step becomes its own Pachyderm pipeline, so each can be scaled separately.
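
One way to picture the five-step split is as five spec files created one after another. The sketch below assumes one JSON spec per step; the file names are hypothetical, and the position of the visualization step in the order is an assumption, since upstream pipelines must exist before the pipelines that read from them. It prints the commands by default rather than running them, so it is safe to try without a cluster.

```shell
#!/bin/sh
# Dry-run helper: print the command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical spec file names, one per processing step.
# CPU steps first, then GPU steps, then visualization (placement assumed).
SPECS="crop.json extract_centers.json generate_heatmaps.json classify.json visualize_heatmaps.json"

create_pipelines() {
  for spec in $SPECS; do
    run pachctl create pipeline -f "$spec"
  done
}

create_pipelines
```

Because each step has its own spec, a change to one step (for example, a new model version in the classification step) only requires updating that one pipeline.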

Crop Pipeline

Extract Centers Pipeline

3. Create GPU Pipelines

Generate Heatmaps Pipeline

Classify Pipeline

4. Upload Dataset

  1. Open or download this GitHub repo.

    gh repo clone pachyderm/docs-content

  2. Navigate to this tutorial.

    cd docs-content/content/products/mldm/latest/build-dags/tutorials/task-parallelism

  3. Upload the sample_data and models folders to your repos.
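
The upload step can be sketched as follows. This assumes the repos are named after the folders (`sample_data` and `models`), that the target branch is `master`, and that you run it from the tutorial directory; none of these are confirmed here, so adjust them to your setup. As a precaution, the sketch prints the commands by default instead of executing them.

```shell
#!/bin/sh
# Dry-run helper: print the command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

upload_dataset() {
  # -r uploads a directory recursively; -f names the local source.
  # Repo and branch names are assumptions based on the folder names.
  run pachctl put file -r sample_data@master -f sample_data/
  run pachctl put file -r models@master -f models/
}

upload_dataset
```

Run it with `DRY_RUN=0` once the printed commands match your repo names.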


User Code Assets

The Docker image used in this tutorial was built with the following assets: