Deferred Processing of Data
While a Pachyderm pipeline is running, it processes any new data that you commit to its input branch. However, in some cases, you want to commit data more frequently than you want to process it.
Because Pachyderm pipelines do not reprocess the data that has already been processed, in most cases, this is not an issue. But, some pipelines might need to process everything from scratch. For example, you might want to commit data every hour, but only want to retrain a machine learning model on that data daily because it needs to train on all the data from scratch.
In these cases, deferred processing can give you a large performance benefit, because the pipeline runs once over the accumulated data instead of once for every commit. This section covers how to achieve that and control what gets processed.
Pachyderm controls what is being processed by using the filesystem, rather than at the pipeline level. Although pipelines are inflexible, they are simple and always try to process the data at the heads of their input branches. In contrast, the filesystem is very flexible and gives you the ability to commit data in different places and then efficiently move and rename the data so that it gets processed when you want.
Configure a Staging Branch in an Input Repository
When you want to load data into Pachyderm without triggering a pipeline, you can upload it to a staging branch and then submit accumulated changes in one batch by re-pointing the HEAD of your master branch to a commit in the staging branch.
Although the branch in which you consolidate changes is called staging in this section, you can name it as you like. You can also have multiple staging branches. For example, dev1, dev2, and so on.
In the example below, the repository that is created is called data.
To configure a staging branch, complete the following steps:
Create a repository. For example, data:
$ pachctl create repo data
Create a master branch:
$ pachctl create branch data@master
View the created branch:
$ pachctl list branch data
BRANCH HEAD
master -
An empty HEAD, shown as -, means that nothing has yet been committed into this branch. When you commit data to the master branch, the pipeline immediately starts a job to process it. However, if you want to commit something without immediately processing it, you need to commit it to a different branch.
Commit a file to the staging branch:
$ pachctl put file data@staging -f <file>
Pachyderm automatically creates the staging branch. Your repo now has two branches, staging and master. In this example, the staging name is used, but you can name the branch as you want.
Verify that the branches were created:
$ pachctl list branch data
BRANCH HEAD
staging f3506f0fab6e483e8338754081109e69
master -
The master branch still does not have a HEAD commit, but the new branch, staging, does. No jobs have run yet, because no pipeline takes staging as an input. You can continue to commit to staging to add new data to the branch, and the pipeline will not process anything.
When you are ready to process the data, update the master branch to point it to the head of the staging branch:
$ pachctl create branch data@master --head staging
List your branches to verify that the master branch has a HEAD commit:
$ pachctl list branch data
BRANCH HEAD
staging f3506f0fab6e483e8338754081109e69
master f3506f0fab6e483e8338754081109e69
The master and staging branches now have the same HEAD commit. This means that your pipeline has data to process.
Verify that the pipeline has new jobs:
$ pachctl list job
ID PIPELINE STARTED DURATION RESTART PROGRESS DL UL STATE
061b0ef8f44f41bab5247420b4e62ca2 test 32 seconds ago Less than a second 0 6 + 0 / 6 108B 24B success
You should see one job that Pachyderm created for all the changes you have submitted to the staging branch. While the earlier commits to the staging branch are ancestors of the current HEAD of master, they were never the actual HEAD of master themselves, so they do not get processed individually. This behavior works for most use cases because commits in Pachyderm are generally additive, so processing the HEAD commit also processes data from previous commits.
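The two commands at the core of the steps above can be wrapped in small shell helpers. This is only a sketch: `stage_file` and `release_staging` are hypothetical names that are not part of pachctl, and the branch names `staging` and `master` follow the example in this section.

```shell
# Sketch only: thin wrappers around the pachctl commands shown above.
# stage_file and release_staging are hypothetical helper names.

# Commit a local file to the staging branch of a repo.
# $1 = repo name, $2 = path to a local file
stage_file() {
  pachctl put file "$1@staging" -f "$2"
}

# Point master at the current head of staging so that the pipeline
# processes everything accumulated there.
# $1 = repo name
release_staging() {
  pachctl create branch "$1@master" --head staging
}
```

For example, you could call `stage_file data <file>` as often as new data arrives and `release_staging data` once, when you are ready to process it.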
Process Specific Commits
Sometimes you want to process specific intermediary commits that are not at the HEAD of the branch. To do this, you need to set master to have these commits as HEAD. For example, if you submitted ten commits in the staging branch and you want to process the commit that is seven commits behind the head, then the commit that is three commits behind the head, and then the most recent commit, run the following commands in order:
$ pachctl create branch data@master --head staging^7
$ pachctl create branch data@master --head staging^3
$ pachctl create branch data@master --head staging
When you run the commands above, Pachyderm creates a job for each of the commands one after another. Therefore, when one job is completed, Pachyderm starts the next one. To verify that Pachyderm created jobs for these commands, run pachctl list job.
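If you repoint master to several commits regularly, a loop keeps the commands in one place. The following is a minimal sketch; `process_commits` is a hypothetical helper name, and it only issues the same `pachctl create branch` command shown above for each reference you pass.

```shell
# Sketch only: process_commits is a hypothetical helper.
# Usage: process_commits REPO REF [REF...]
process_commits() {
  pc_repo="$1"
  shift
  for pc_ref in "$@"; do
    # Each call moves master to the given commit; stop at the first failure.
    pachctl create branch "$pc_repo@master" --head "$pc_ref" || return 1
  done
}
```

A call such as `process_commits data 'staging^7' 'staging^3' staging` issues the three commands in order. Quoting the references is a precaution, because some shells give `^` a special meaning.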
Change the HEAD of your Branch
You can move backward to previous commits as easily as advancing to the latest commits. For example, if you want to change the final output to be the result of processing staging^1, you can roll back your HEAD commit by running the following command:
$ pachctl create branch data@master --head staging^1
This command starts a new job to process staging^1. The HEAD commit on your output repo will be the result of processing staging^1 instead of staging.
Copy Files from One Branch to Another
Using a staging branch allows you to defer processing. To use this functionality, you need to know your input commits in advance. However, sometimes you want to be able to commit data in an ad-hoc, disorganized manner and then organize it later. Instead of pointing your master branch to a commit in a staging branch, you can copy individual files from staging to master. When you run copy file, Pachyderm only copies references to the files and does not move the actual data for the files around.
To copy files from one branch to another, complete the following steps:
Start a commit:
$ pachctl start commit data@master
Copy files:
$ pachctl copy file data@staging:file1 data@master:file1
$ pachctl copy file data@staging:file2 data@master:file2
...
Close the commit:
$ pachctl finish commit data@master
Also, you can run pachctl delete file and pachctl put file while the commit is open if you want to remove something from the parent commit or add something that is not stored anywhere else.
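The start, copy, and finish steps above can be combined into one helper when you copy many files. This is a sketch under stated assumptions: `copy_from_staging` is a hypothetical name, the branches are called `staging` and `master` as in this section, and each file keeps the same path in the destination branch.

```shell
# Sketch only: copy_from_staging is a hypothetical helper.
# Usage: copy_from_staging REPO FILE [FILE...]
copy_from_staging() {
  cs_repo="$1"
  shift
  # Open a commit on master to hold all of the copied files.
  pachctl start commit "$cs_repo@master" || return 1
  for cs_file in "$@"; do
    # On failure, return early and leave the commit open to resolve manually.
    pachctl copy file "$cs_repo@staging:$cs_file" "$cs_repo@master:$cs_file" || return 1
  done
  # Close the commit once every file has been copied.
  pachctl finish commit "$cs_repo@master"
}
```

For example, `copy_from_staging data file1 file2` runs the same sequence as the steps above inside a single commit.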
Deferred Processing in Output Repositories
You can perform the same deferred processing operations with data in output repositories. To do so, rather than committing to a staging branch, configure the output_branch field in your pipeline specification.
To configure deferred processing in an output repository, complete the following steps:
In the pipeline specification, add the output_branch field with the name of the branch in which you want to accumulate your data before processing. For example, "output_branch": "staging".
When you want to process data, run:
$ pachctl create branch pipeline@master --head staging
Automate Branch Switching
Typically, repointing from one branch to another happens when a certain condition is met. For example, you might want to repoint your branch when you have a specific number of commits, when the amount of unprocessed data reaches a certain size, or at a specific time interval, such as daily. To configure this functionality, you need to create a Kubernetes application that uses Pachyderm APIs and watches the repositories for the specified condition. When the condition is met, the application switches the Pachyderm branch from staging to master.
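As a lighter-weight illustration of the time-interval case, the check can be sketched as a shell function that a scheduler such as cron calls, for example once a day. This is not the Kubernetes application described above and does not use the Pachyderm APIs directly. The helper names `branch_head` and `promote_if_changed` are hypothetical, and the parsing assumes the two-column BRANCH HEAD output of `pachctl list branch` shown earlier in this section.

```shell
# Sketch only: branch_head and promote_if_changed are hypothetical helpers.

# Print the HEAD column for a branch, assuming that `pachctl list branch`
# prints the branch name in column 1 and its HEAD commit in column 2.
# $1 = repo name, $2 = branch name
branch_head() {
  pachctl list branch "$1" | awk -v b="$2" '$1 == b { print $2 }'
}

# Repoint master to staging only if staging holds commits that master lacks.
# $1 = repo name
promote_if_changed() {
  pic_staging="$(branch_head "$1" staging)"
  pic_master="$(branch_head "$1" master)"
  if [ -z "$pic_staging" ] || [ "$pic_staging" = "-" ]; then
    echo "nothing staged"
    return 0
  fi
  if [ "$pic_staging" = "$pic_master" ]; then
    echo "master already at staging head"
    return 0
  fi
  pachctl create branch "$1@master" --head staging
}
```

Comparing the two HEAD commits before repointing keeps the scheduled run from issuing a `create branch` command when nothing new has been staged. Conditions based on commit count or data size would need additional queries that this sketch does not attempt.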