You can use Pachyderm to build an automated machine learning pipeline that trains a model on a CSV file.

Before You Start

  • You must have Pachyderm installed and running on your cluster
  • You should have already completed the Standard ML Pipeline tutorial
  • You must be familiar with Jsonnet
  • This tutorial assumes your active context is localhost:80

Tutorial

The user code for this tutorial runs in a Docker image built on the python:3.7-slim-buster base image. It uses the mljar-supervised package to automate feature engineering, model selection, and hyperparameter tuning, which makes it easy to train high-quality machine learning models on structured data.

1. Create a Project & Input Repo
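
As a minimal sketch, this step could be scripted as shown below. The project name `automl-tutorial` and the repo name `csv_data` are assumptions chosen for illustration; substitute your own. The commands are wrapped in a function so that the names are easy to change.

```shell
# Sketch: create a project, make it the active project, and add an input repo.
# The default names below are assumptions, not values required by Pachyderm.
create_automl_inputs() {
  project="${1:-automl-tutorial}"   # hypothetical project name
  repo="${2:-csv_data}"             # hypothetical input repo name

  # Create the project, point the active context at it, then create the repo.
  pachctl create project "$project" &&
  pachctl config update context --project "$project" &&
  pachctl create repo "$repo"
}
```

Calling `create_automl_inputs` with no arguments uses the assumed defaults; pass a project and repo name to override them.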

2. Create a Jsonnet Pipeline

The model automatically starts training. Once complete, the trained model and evaluation metrics are output to the AutoML output repo.
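
Creating the pipeline from a Jsonnet template might look like the sketch below. The `--jsonnet` and `--arg` flags belong to `pachctl create pipeline`, but the template path and the parameter names (`name`, `input`, `target_col`) are hypothetical here: they must match whatever parameters your own template declares.

```shell
# Sketch: instantiate a pipeline from a Jsonnet template.
# The parameter names passed with --arg are assumptions; check your template.
create_automl_pipeline() {
  template="$1"     # path or URL of the Jsonnet template
  name="$2"         # pipeline name (also the name of its output repo)
  input_repo="$3"   # repo that holds the CSV data
  target="$4"       # column the model should predict

  pachctl create pipeline --jsonnet "$template" \
    --arg name="$name" \
    --arg input="$input_repo" \
    --arg target_col="$target"
}
```

For example, `create_automl_pipeline ./automl.jsonnet regression csv_data charges` would create a pipeline named `regression` that reads from `csv_data`; all four values are placeholders.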

3. Upload the Dataset
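
A minimal sketch of the upload, plus a way to inspect the results, is shown below. The repo name, the `master` branch, and the pipeline name are assumptions for illustration. The sketch relies on a pipeline's output repo sharing the pipeline's name.

```shell
# Sketch: upload a CSV file to the input repo's master branch.
upload_dataset() {
  repo="$1"   # input repo, e.g. the one created in step 1
  csv="$2"    # local path to the CSV file

  # Store the file under its own file name at the root of the repo.
  pachctl put file "${repo}@master:$(basename "$csv")" -f "$csv"
}

# Sketch: list jobs, then the files written to the pipeline's output repo.
show_results() {
  pipeline="$1"   # pipeline name; assumed to match the output repo name

  pachctl list job &&
  pachctl list file "${pipeline}@master"
}
```

For example, `upload_dataset csv_data ./data.csv` followed by `show_results regression` uploads a file and then lists the jobs and output files; the names are placeholders.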

Repeat the upload as many times as you want. Each time, Pachyderm automatically retrains the model and writes the new model and evaluation metrics to the AutoML output repo.


User Code Assets

The Docker image used in this tutorial was built with the following assets: