Provenance

Learn about the concept of provenance in Pachyderm.

February 8, 2023

Data versioning (History) enables Pachyderm users to go back in time and see the state of a dataset or repository at a particular moment.

Data provenance (from the French noun provenance which means the place of origin), also known as data lineage, tracks the dependencies and relationships between datasets. It answers the question “Where does the data come from?”, but also “How was the data transformed along the way?”.

Pachyderm enables its users to have both: track all revisions of their data and understand the connection between the data stored in one repository and the results in the other repository.

It automatically maintains a complete audit trail, allowing all results to be fully reproducible.

The following diagram is an illustration of how provenance works:

Provenance example

In the diagram above, two input repositories (model and test_data) feed a scoring pipeline. The model repository continuously collects model artifacts from an outside source. The pipeline combines the data from these two repositories and scores each model against the validation dataset.

Global ID is an easy tool to help you track down your entire provenance chain and understand what data and transformation processes were involved in a specific version of a dataset. Here, the ID1 is shared by all commits and jobs involved in creating the final scoring v1. A simple pachctl list commit ID1 (pachctl list job ID1) will return that list at once.

Tracking Direct Provenance in Pachyderm #

Pachyderm provides the pachctl inspect command that enables you to track the direct provenance of your commits and learn where the data in the repository originates in.

Example #

pachctl inspect commit edges@71c791f3252c492a8f8ad9a51e5a5cd5 --raw

System Response:

{
    "commit": {
        "branch": {
        "repo": {
            "name": "edges",
            "type": "user"
        },
        "name": "master"
        },
        "id": "71c791f3252c492a8f8ad9a51e5a5cd5"
    },
    "origin": {
        "kind": "AUTO"
    },
    "parent_commit": {
        "branch": {
        "repo": {
            "name": "edges",
            "type": "user"
        },
        "name": "master"
        },
        "id": "b6fc0ab2d8d04972b0c31b0e35133323"
    },
    "started": "2021-07-07T19:53:17.242981574Z",
    "finished": "2021-07-07T19:53:19.598729672Z",
    "direct_provenance": [
        {
        "repo": {
            "name": "edges",
            "type": "spec"
        },
        "name": "master"
        },
        {
        "repo": {
            "name": "images",
            "type": "user"
        },
        "name": "master"
        }
    ],
    "details": {
        "size_bytes": "22754"
    }
}

Traversing Provenance and Subvenance #

In Pachyderm, all the related steps in a DAG share the same identifier, making it easy to traverse the provenance and subvenance in any commit.

All it takes is to run pachctl list commit <commitID> to get the full list of all the branches with commits created along the way due to provenance relationships.

Visit the Global ID Page for more details.