Defer Processing via Staging Branch
Learn how to defer processing of data by using a staging branch in an input repository.
When you want to load data into Pachyderm without triggering a pipeline, you can upload it to a staging branch and then submit the accumulated changes in one batch by re-pointing the HEAD of your master branch to a commit in the staging branch. Let’s see how this works.
How to Use a Staging Branch
Create a repository and a master branch. For example:
pachctl create repo data
pachctl create branch data@master
View the commit created on the new branch:
pachctl list commit data
REPO BRANCH COMMIT                           FINISHED           SIZE ORIGIN DESCRIPTION
data master 8090bfb4d4fe44158eac12199c37a591 About a minute ago 0B   AUTO
Pachyderm automatically created an empty HEAD commit on the new branch, as you can see from the 0B (zero-byte) size and the AUTO origin.
Commit a file to a staging branch:
pachctl put file data@staging -f <file>
Pachyderm automatically creates the staging branch. Your repo now has two branches, staging and master. This example uses the name staging, but you can name the branch anything you like and have as many staging branches as you need.
Verify that the branches were created:
pachctl list branch data
BRANCH  HEAD                             TRIGGER
staging f3506f0fab6e483e8338754081109e69 -
master  8090bfb4d4fe44158eac12199c37a591 -
The master branch still has the same HEAD commit. No jobs have started to process the new file, because no pipeline takes staging as its input. You can keep committing to staging to add new data to the branch, and the pipeline will not process anything.
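The loop below is a minimal sketch of accumulating several files on the staging branch before anything is promoted. It assumes the data repo from the steps above. The stage_files function and the DRY_RUN variable are illustrative names, not part of pachctl. With DRY_RUN=1 the function only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helper: upload each given file to data@staging.
# Only the staging branch is written to; master is left untouched.
stage_files() {
    for f in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Show the command instead of running it.
            echo "pachctl put file data@staging -f $f"
        else
            # Stop at the first failed upload.
            pachctl put file data@staging -f "$f" || return 1
        fi
    done
}
```

For example, DRY_RUN=1 stage_files a.csv b.csv prints one put file command per file without contacting the cluster.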
When you are ready to process the data, update the master branch to point to the head of the staging branch:
pachctl create branch data@master --head staging
List your branches to verify that the master branch’s HEAD commit has changed:
pachctl list branch data
staging f3506f0fab6e483e8338754081109e69
master  f3506f0fab6e483e8338754081109e69
The master and staging branches now have the same HEAD commit. This means that your pipeline has data to process.
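If you want to script this check rather than compare the IDs by eye, the following is one possible sketch. It assumes the output of pachctl list branch keeps the layout shown above, with the branch name in the first column and the HEAD commit in the second. The branch_head and heads_match helpers are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# Print the HEAD column for one branch, reading list-branch output on stdin.
# Assumes column 1 is the branch name and column 2 is the HEAD commit ID.
branch_head() {
    awk -v b="$1" '$1 == b { print $2 }'
}

# Succeed only when both named branches are listed with the same HEAD.
heads_match() {
    listing=$(cat)
    first=$(printf '%s\n' "$listing" | branch_head "$1")
    second=$(printf '%s\n' "$listing" | branch_head "$2")
    [ -n "$first" ] && [ "$first" = "$second" ]
}
```

A possible use is pachctl list branch data | heads_match master staging && echo "master points at the staging head".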
Verify that the pipeline has new jobs:
pachctl list job data@f3506f0fab6e483e8338754081109e69
ID                               PIPELINE STARTED        DURATION           RESTART PROGRESS  DL   UL  STATE
f3506f0fab6e483e8338754081109e69 data     32 seconds ago Less than a second 0       6 + 0 / 6 108B 24B success
You should see one job that Pachyderm created for all the changes you submitted to the staging branch. The job has the same ID as the commit. Although the earlier commits to the staging branch are ancestors of the current HEAD of master, they were never the HEAD of master themselves, so they do not get processed individually. This behavior works for most use cases because commits in Pachyderm are generally additive, so processing the HEAD commit also processes the data from previous commits.
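The toy model below illustrates why a single job is enough. It does not call Pachyderm. Plain shell variables stand in for the branch heads, and every name in it is made up for the illustration. Each staged commit keeps its parent's files and adds one more, and the pipeline only reacts when master moves.

```shell
#!/bin/sh
# Toy model only: these variables stand in for branch HEADs.
staging_files=""   # files visible at the HEAD of staging
master_files=""    # files visible at the HEAD of master
job_count=0        # jobs a pipeline reading master would have run

# Each commit is additive: it keeps the parent's files and adds one more.
commit_to_staging() {
    staging_files="${staging_files:+$staging_files }$1"
}

# Re-pointing master is the only event the pipeline sees, so it runs one job.
promote() {
    master_files=$staging_files
    job_count=$((job_count + 1))
}

commit_to_staging a.csv
commit_to_staging b.csv
commit_to_staging c.csv
promote

echo "jobs: $job_count, input: $master_files"
```

Running it prints jobs: 1, input: a.csv b.csv c.csv. Three staged commits lead to one job, and that job still sees every file.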