Pachyderm provides users with a simple way to follow a change throughout their DAG (i.e., traverse
Pachyderm associates a commit ID with each new commit. You can quickly check this new commit by running
pachctl list commit <repo>@<branch>. All resulting downstream commits and jobs in your DAG then share that same ID (Global Identifier).
The commits and jobs sharing the same ID represent a logically-related set of objects. The ID of a commit is also:
- the ID of any commits created alongside it due to provenance relationships,
- and the ID of any jobs triggered by the creation of those commits.
This ability to track down related commits and jobs with one global identifier brought the need to introduce a new scope to our original concepts of job and commit. The nuance in the scope of a commit or a job ("Global" or "Local") gives each term two possible meanings.
|CONCEPT|SCOPE|DEFINITION|
|---|---|---|
|Commit|Global|A commit with global scope (global commit) represents the set of all provenance-dependent commits sharing the same ID. You can retrieve a global commit by running pachctl list commit <commitID>.|
|Commit|Local Repo|The same term of commit, applied to the more focused scope of a repo (pachctl list commit <repo>@<commitID> or pachctl list commit <repo>@<branch>=<commitID>), represents "the Git-like" record of one commit in a single branch of a repository's file system.|
|Job|Global|A job with global scope (global job) is the set of jobs triggered due to commits in a global commit. You can retrieve a global job by running pachctl list job <commitID>.|
|Job|Local Pipeline|Narrowing down the scope to a single pipeline (pachctl list job <pipeline>@<commitID>) shifts the meaning to the execution of a given job in a pipeline of your DAG.|
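The scoping rule in the table can be sketched as a small shell helper. The function name scope_of is hypothetical and is not part of pachctl. It only assumes, as in the commands above, that a bare ID denotes the global scope and that a <repo>@<commitID> or <pipeline>@<commitID> reference denotes a local one.

```shell
# scope_of is a hypothetical helper, not a pachctl command.
# It prints "local" when the reference names a repo or pipeline
# (<name>@<commitID>) and "global" for a bare ID.
scope_of() {
  case "$1" in
    *@*) echo "local" ;;
    *)   echo "global" ;;
  esac
}

scope_of 1035715e796f45caae7a1d3ffd1f93ca          # bare ID
scope_of images@1035715e796f45caae7a1d3ffd1f93ca   # repo-scoped reference
```

The first call prints global and the second prints local, mirroring the two meanings each term can take.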
List All Global Commits And Global Jobs
You can list all global commits by running the following command:
$ pachctl list commit
Each global commit displays how many (sub) commits it is made of.
ID SUBCOMMITS PROGRESS CREATED MODIFIED
1035715e796f45caae7a1d3ffd1f93ca 7 ▇▇▇▇▇▇▇▇ 7 seconds ago 7 seconds ago
28363be08a8f4786b6dd0d3b142edd56 6 ▇▇▇▇▇▇▇▇ 24 seconds ago 24 seconds ago
e050771b5c6f4082aed48a059e1ac203 4 ▇▇▇▇▇▇▇▇ 24 seconds ago 24 seconds ago
Similarly, if you run the equivalent command for global jobs:
$ pachctl list job
you will notice that the job IDs are shared with the global commit IDs.
ID SUBJOBS PROGRESS CREATED MODIFIED
1035715e796f45caae7a1d3ffd1f93ca 2 ▇▇▇▇▇▇▇▇ 55 seconds ago 55 seconds ago
28363be08a8f4786b6dd0d3b142edd56 1 ▇▇▇▇▇▇▇▇ About a minute ago About a minute ago
e050771b5c6f4082aed48a059e1ac203 1 ▇▇▇▇▇▇▇▇ About a minute ago About a minute ago
In this example, 7 commits and 2 jobs are involved in the changes that occurred in the global commit ID 1035715e796f45caae7a1d3ffd1f93ca.
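Because global commits and global jobs share their IDs, the two listings can be joined on the ID column. The following is a minimal sketch that works on rows copied from the sample output above rather than on a live cluster. The helper name summarize_ids and the two variables are illustrative assumptions, and the progress bars and timestamps are left out for brevity.

```shell
# ID and SUBCOMMITS / SUBJOBS columns copied from the sample listings above.
commits='1035715e796f45caae7a1d3ffd1f93ca 7
28363be08a8f4786b6dd0d3b142edd56 6
e050771b5c6f4082aed48a059e1ac203 4'
jobs='1035715e796f45caae7a1d3ffd1f93ca 2
28363be08a8f4786b6dd0d3b142edd56 1
e050771b5c6f4082aed48a059e1ac203 1'

# summarize_ids (hypothetical helper) pairs each global ID's sub-commit count
# with its sub-job count. A "---" line separates the two inputs.
summarize_ids() {
  printf '%s\n---\n%s\n' "$1" "$2" | awk '
    $1 == "---" { in_jobs = 1; next }
    !in_jobs    { subcommits[$1] = $2; order[++n] = $1; next }
                { subjobs[$1] = $2 }
    END {
      for (i = 1; i <= n; i++) {
        id = order[i]
        printf "%s commits=%s jobs=%s\n", id, subcommits[id], ((id in subjobs) ? subjobs[id] : 0)
      }
    }'
}

summarize_ids "$commits" "$jobs"
```

For the first ID, the sketch prints commits=7 and jobs=2, which matches the counts read off the two listings.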
The progress bar is divided equally among the steps, or pipelines, in your DAG. In the example above,
1035715e796f45caae7a1d3ffd1f93ca has two steps. If one of the sub-jobs fails, the progress bar turns red for that pipeline step. To troubleshoot, look into that particular pipeline execution.
List All Commits And Jobs With A Global ID
To list all (sub) commits involved in a global commit:
$ pachctl list commit 1035715e796f45caae7a1d3ffd1f93ca
REPO BRANCH COMMIT FINISHED SIZE ORIGIN DESCRIPTION
images master 1035715e796f45caae7a1d3ffd1f93ca 5 minutes ago 238.3KiB USER
edges.spec master 1035715e796f45caae7a1d3ffd1f93ca 5 minutes ago 244B ALIAS
montage.spec master 1035715e796f45caae7a1d3ffd1f93ca 5 minutes ago 405B ALIAS
montage.meta master 1035715e796f45caae7a1d3ffd1f93ca 4 minutes ago 1.656MiB AUTO
edges master 1035715e796f45caae7a1d3ffd1f93ca 5 minutes ago 133.6KiB AUTO
edges.meta master 1035715e796f45caae7a1d3ffd1f93ca 5 minutes ago 373.9KiB AUTO
montage master 1035715e796f45caae7a1d3ffd1f93ca 4 minutes ago 1.292MiB AUTO
Similarly, replace commit with job to list all (sub) jobs linked to your global job ID.
$ pachctl list job 1035715e796f45caae7a1d3ffd1f93ca
ID PIPELINE STARTED DURATION RESTART PROGRESS DL UL STATE
1035715e796f45caae7a1d3ffd1f93ca montage 5 minutes ago 4 seconds 0 1 + 0 / 1 79.49KiB 381.1KiB success
1035715e796f45caae7a1d3ffd1f93ca edges 5 minutes ago 2 seconds 0 1 + 0 / 1 57.27KiB 22.22KiB success
For each pipeline execution (sub job) within this global job, Pachyderm shows the time since each sub job started and its duration, the number of datums in the PROGRESS section, and other information. The format of the progress column is DATUMS PROCESSED + DATUMS SKIPPED / TOTAL DATUMS.
For more information, see Datum Processing States.
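As an illustration of the PROGRESS format, the sketch below splits a value such as 1 + 0 / 1 into its three counts. The helper name parse_progress is hypothetical, and the unaccounted figure is simply the total minus the processed and skipped counts. It is not a state reported by pachctl.

```shell
# parse_progress is a hypothetical helper: it splits a PROGRESS value of the
# form "PROCESSED + SKIPPED / TOTAL" into its three counts.
parse_progress() {
  echo "$1" | awk '{
    # Fields: $1 = processed, $2 = "+", $3 = skipped, $4 = "/", $5 = total
    unaccounted = $5 - ($1 + $3)
    printf "processed=%d skipped=%d total=%d unaccounted=%d\n", $1, $3, $5, unaccounted
  }'
}

# Value taken from the sample job listing above.
parse_progress "1 + 0 / 1"
```

For the sample value this prints processed=1 skipped=0 total=1 unaccounted=0, that is, the single datum of each sub job was processed and none was skipped.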
The global commit and global job above are the result of a
pachctl put file images@master -i images.txt in the images repo of the OpenCV example.
The following diagram illustrates the global commit and its various components:
Let's take a look at the origin of each commit.
Inspect the commit ID 1035715e796f45caae7a1d3ffd1f93ca in the images repo, the repo in which our change (put file) originated:
$ pachctl inspect commit images@1035715e796f45caae7a1d3ffd1f93ca --raw
Note that this original commit is of USER origin (i.e., the result of a user change).
Inspect the commit 1035715e796f45caae7a1d3ffd1f93ca produced in the output repo of the edges pipeline:
$ pachctl inspect commit edges@1035715e796f45caae7a1d3ffd1f93ca --raw
Note that the origin of the commit is of kind AUTO, as it was triggered by the arrival of a commit in the upstream images repo.
The same origin (AUTO) applies to the commits sharing that same ID in the montage output repo as well as in the edges.meta and montage.meta system repos.
Besides the USER and AUTO commits, notice a set of ALIAS commits in edges.spec and montage.spec:
$ pachctl inspect commit edges.spec@336f02bdbbbb446e91ba27d2d2b516c6 --raw
The versions of each pipeline within their respective .spec repos are neither the result of a user change nor of an automatic change. They have, however, contributed to the creation of the previous AUTO commits. To make sure that we have a complete view of all the data and pipeline versions involved in all the commits resulting from the initial put file, their version is kept as ALIAS commits under the same global ID.
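To see how the three origins are distributed within this global commit, the rows of the pachctl list commit 1035715e796f45caae7a1d3ffd1f93ca output shown earlier can be tallied by their ORIGIN column. This is a minimal sketch on copied sample rows. The helper name count_origins is hypothetical, and it assumes the DESCRIPTION column is empty so that ORIGIN is the last field of each row.

```shell
# Rows copied from the list commit output for the global commit shown earlier.
listing='images master 1035715e796f45caae7a1d3ffd1f93ca 5 minutes ago 238.3KiB USER
edges.spec master 1035715e796f45caae7a1d3ffd1f93ca 5 minutes ago 244B ALIAS
montage.spec master 1035715e796f45caae7a1d3ffd1f93ca 5 minutes ago 405B ALIAS
montage.meta master 1035715e796f45caae7a1d3ffd1f93ca 4 minutes ago 1.656MiB AUTO
edges master 1035715e796f45caae7a1d3ffd1f93ca 5 minutes ago 133.6KiB AUTO
edges.meta master 1035715e796f45caae7a1d3ffd1f93ca 5 minutes ago 373.9KiB AUTO
montage master 1035715e796f45caae7a1d3ffd1f93ca 4 minutes ago 1.292MiB AUTO'

# count_origins is a hypothetical helper. It assumes DESCRIPTION is empty,
# so the last whitespace-separated field of each row is the ORIGIN.
count_origins() {
  awk '{ count[$NF]++ } END { for (o in count) print o, count[o] }' | sort
}

printf '%s\n' "$listing" | count_origins
```

On these rows the tally is 2 ALIAS, 4 AUTO, and 1 USER commit: one user change, four automatically triggered commits, and two pipeline versions recorded alongside them.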
For a full view of GlobalID in action, take a look at our GlobalID illustration.
Track Provenance Downstream
Pachyderm provides the wait commit <commitID> command that enables you to track your commits downstream as they are produced.
Unlike list commit <commitID>, each line is printed as soon as a new (sub) commit of your global commit finishes.
Replace commit with job to list the jobs related to your global job as they finish processing a commit.
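The two commands can be chained, for instance in a script that follows a change and then reports on the jobs it triggered. The wrapper below is a sketch under assumptions: follow_global_id and the PACHCTL override are conventions of this example only, and the dry run substitutes echo for pachctl so that nothing contacts a cluster.

```shell
# PACHCTL can be overridden (for example with "echo") to dry-run the sketch
# without a cluster; the override is a convention of this example only.
PACHCTL="${PACHCTL:-pachctl}"

# follow_global_id is a hypothetical wrapper: it follows the (sub) commits of
# a global commit with wait commit, then lists the (sub) jobs sharing that ID.
follow_global_id() {
  "$PACHCTL" wait commit "$1" && "$PACHCTL" list job "$1"
}

# Dry run in a subshell: prints the two commands' arguments instead of
# running them.
(PACHCTL=echo; follow_global_id 1035715e796f45caae7a1d3ffd1f93ca)
```

Without the override, the wrapper would call pachctl wait commit followed by pachctl list job with the same global ID.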
Squash A Global Commit
pachctl squash commit <commitID> combines all the file changes in the commits of a global commit into their children and then removes the global commit. This behavior is inspired by the squash option in git rebase. No data stored in PFS is removed since it remains in the child commits.
Squashing a global commit on the head of a branch (no children) will fail.
Last update: November 1, 2021