Multi-Pipeline DAG

Learn how to build a DAG with multiple pipelines.

In this tutorial, we’ll build a multi-pipeline DAG to train a regression model on housing market data to predict the value of homes in Boston. This tutorial builds on the skills learned in the previous tutorials (Standard ML Pipeline and AutoML Pipeline).

Before You Start #

Tutorial #

Our Docker image’s user code for this tutorial is built on top of the civisanalytics/datascience-python base image, which includes the necessary dependencies. It uses pandas to import the structured dataset and the scikit-learn library to train the model.

Each pipeline in this tutorial executes a Python script, versions the artifacts (datasets, models, etc.), and gives you a full lineage of the model. Once it is set up, you can change, add, or remove data and Pachyderm will automatically keep everything up to date, creating data splits, computing data analysis metrics, and training the model.

1. Create an Input Repo #

  1. Create a project named multipipeline-tutorial.

    pachctl create project multipipeline-tutorial
  2. Set the project as current.

    pachctl config update context --project multipipeline-tutorial
  3. Create a new data repository called csv_data where we will put our dataset.

    pachctl create repo csv_data

2. Create the Pipelines #

We’ll deploy each stage in our ML process as a Pachyderm pipeline. Organizing our work into pipelines allows us to keep track of artifacts created in our ML development process. We can extend or add pipelines at any point to add new functionality or features, while keeping track of code and data changes simultaneously.

1. Data Analysis Pipeline #

The data analysis pipeline creates a pair plot and a correlation matrix showing the relationships between features. Seeing which features are positively or negatively correlated with the target value (or with each other) helps us understand which features may be valuable to the model.
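The actual data_analysis.py script is packaged in the pipeline’s Docker image, so you don’t need to write it yourself. As a rough illustration, a minimal version of such a script might look like the sketch below: the --input, --target-col, and --output flags match the pipeline spec that follows, while the output file names and plotting details are assumptions for illustration only.

    # Illustrative sketch of a data analysis script (not the exact code in the image).
    import argparse
    import glob
    import os

    import matplotlib
    matplotlib.use("Agg")  # render plots to files; the pipeline container has no display
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--input", required=True, help="directory of input csv files")
        parser.add_argument("--target-col", default="MEDV")
        parser.add_argument("--output", required=True, help="directory for output artifacts")
        args = parser.parse_args()

        for csv_path in glob.glob(os.path.join(args.input, "*.csv")):
            df = pd.read_csv(csv_path)
            name = os.path.splitext(os.path.basename(csv_path))[0]

            # Pair plot of every feature against every other feature
            grid = sns.pairplot(df)
            grid.savefig(os.path.join(args.output, f"{name}_pairplot.png"))
            plt.close("all")

            # Correlation matrix heatmap
            corr = df.corr()
            sns.heatmap(corr, annot=True, cmap="coolwarm")
            plt.savefig(os.path.join(args.output, f"{name}_corr_matrix.png"))
            plt.close("all")

            # Rank features by their correlation with the target column (e.g. MEDV)
            corr[args.target_col].sort_values(ascending=False).to_csv(
                os.path.join(args.output, f"{name}_target_corr.csv")
            )

    if __name__ == "__main__":
        main()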

  1. Create a file named data_analysis.json with the following contents:

    {
     "pipeline": {
         "name": "data_analysis"
     },
     "description": "Data analysis pipeline that creates pairplots and correlation matrices for csv files.",
     "input": {
         "pfs": {
             "glob": "/*",
             "repo": "csv_data"
         }
     },
     "transform": {
         "cmd": [
             "python", "data_analysis.py",
             "--input", "/pfs/csv_data/",
             "--target-col", "MEDV",
             "--output", "/pfs/out/"
         ],
         "image": "jimmywhitaker/housing-prices-int:dev0.2"
     }
    }
  2. Save the file.

  3. Create the pipeline.

    pachctl create pipeline -f /path/to/data_analysis.json

2. Split Pipeline #

The split pipeline splits the input csv files into train and test sets. As new data is added, we will always have access to previous versions of the splits to reproduce experiments and test results.

Both the split pipeline and the data_analysis pipeline take csv_data as input but have no dependencies on each other. Pachyderm recognizes this and can run the two pipelines simultaneously, scaling each horizontally.
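As with the data analysis step, the real split.py lives in the Docker image. A minimal sketch of what it could look like is shown below, assuming scikit-learn’s train_test_split is used. The --input, --test-size, and --output flags come from the pipeline spec that follows; the per-dataset output directory (which lines up with the /*/ glob the regression pipeline uses) and the train.csv/test.csv file names are illustrative assumptions.

    # Illustrative sketch of a train/test split script (not the exact code in the image).
    import argparse
    import glob
    import os

    import pandas as pd
    from sklearn.model_selection import train_test_split

    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--input", required=True, help="directory of input csv files")
        parser.add_argument("--test-size", type=float, default=0.1)
        parser.add_argument("--output", required=True, help="directory for the split datasets")
        args = parser.parse_args()

        for csv_path in glob.glob(os.path.join(args.input, "*.csv")):
            df = pd.read_csv(csv_path)
            name = os.path.splitext(os.path.basename(csv_path))[0]

            # Hold out a fraction of rows for testing; a fixed random_state keeps
            # the split reproducible for a given input commit
            train_df, test_df = train_test_split(
                df, test_size=args.test_size, random_state=42
            )

            out_dir = os.path.join(args.output, name)
            os.makedirs(out_dir, exist_ok=True)
            train_df.to_csv(os.path.join(out_dir, "train.csv"), index=False)
            test_df.to_csv(os.path.join(out_dir, "test.csv"), index=False)

    if __name__ == "__main__":
        main()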

  1. Create a file named split.json with the following contents:

    {
     "pipeline": {
         "name": "split"
     },
     "description": "A pipeline that splits tabular data into training and testing sets.",
     "input": {
         "pfs": {
             "glob": "/*",
             "repo": "csv_data"
         }
     },
     "transform": {
         "cmd": [
             "python", "split.py",
             "--input", "/pfs/csv_data/",
             "--test-size", "0.1",
             "--output", "/pfs/out/"
         ],
         "image": "jimmywhitaker/housing-prices-int:dev0.2"
     }
    }
  2. Save the file.

  3. Create the pipeline.

    pachctl create pipeline -f /path/to/split.json

3. Regression Pipeline #

The regression pipeline trains the regression model using scikit-learn. In our case, we train a Random Forest Regressor ensemble. After splitting the data into features and targets (X and y), we fit the model to the training data. Once the model is trained, we compute its score (r^2) on the test set.

After the model is trained, we output visualizations, such as the learning curve and other statistics, to evaluate its effectiveness.
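The training code itself ships in the Docker image as regression.py; a minimal sketch of what it could look like follows. The --input, --target-col, and --output flags match the pipeline spec below, while the train.csv/test.csv file names, the model hyperparameters, and the output file names are illustrative assumptions (the learning-curve visualization is omitted for brevity).

    # Illustrative sketch of a regression training script (not the exact code in the image).
    import argparse
    import glob
    import os

    import joblib  # installed alongside scikit-learn; used here to serialize the model
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--input", required=True, help="directory of train/test splits")
        parser.add_argument("--target-col", default="MEDV")
        parser.add_argument("--output", required=True, help="directory for model and metrics")
        args = parser.parse_args()

        for split_dir in glob.glob(os.path.join(args.input, "*/")):
            train_df = pd.read_csv(os.path.join(split_dir, "train.csv"))
            test_df = pd.read_csv(os.path.join(split_dir, "test.csv"))

            # Separate features (X) from the target column (y)
            X_train = train_df.drop(columns=[args.target_col])
            y_train = train_df[args.target_col]
            X_test = test_df.drop(columns=[args.target_col])
            y_test = test_df[args.target_col]

            # Fit a Random Forest Regressor ensemble and score it (r^2) on the test set
            model = RandomForestRegressor(n_estimators=100, random_state=42)
            model.fit(X_train, y_train)
            r2 = model.score(X_test, y_test)

            name = os.path.basename(os.path.normpath(split_dir))
            joblib.dump(model, os.path.join(args.output, f"{name}_model.joblib"))
            with open(os.path.join(args.output, f"{name}_metrics.txt"), "w") as f:
                f.write(f"r2: {r2}\n")

    if __name__ == "__main__":
        main()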

  1. Create a file named regression.json with the following contents:

    {
     "pipeline": {
         "name": "regression"
     },
     "description": "A pipeline that trains and tests a regression model for tabular.",
     "input": {
         "pfs": {
             "glob": "/*/",
             "repo": "split"
         }
     },
     "transform": {
         "cmd": [
             "python", "regression.py",
             "--input", "/pfs/split/",
             "--target-col", "MEDV",
             "--output", "/pfs/out/"
         ],
         "image": "jimmywhitaker/housing-prices-int:dev0.2"
     }
    }
  2. Save the file.

  3. Create the pipeline.

    pachctl create pipeline -f /path/to/regression.json

3. Upload the Dataset #

  1. Download our first example data set, housing-simplified-1.csv.
  2. Add the data to your repo.
    pachctl put file csv_data@master:housing-simplified.csv -f /path/to/housing-simplified-1.csv

4. Download the Results #

Once the pipeline has finished, download the results.

pachctl get file regression@master:/ --recursive --output .

5. Update the Dataset #

  1. Download our second example data set, housing-simplified-2.csv.
  2. Add the data to your repo.
    pachctl put file csv_data@master:housing-simplified.csv -f /path/to/housing-simplified-2.csv

6. Inspect the Data #

We can use the diff command and ancestry syntax (where ^ refers to the previous commit on a branch) to see what has changed between the two versions of the dataset.

pachctl diff file csv_data@master csv_data@master^

Bonus Step: Rolling Back #

If you need to roll back to a previous dataset commit, you can do so with the create branch command and ancestry syntax.

pachctl create branch csv_data@master --head csv_data@master^

User Code Assets #

The Docker image used in this tutorial was built with the following assets:

Assets: