Versioned Data Concepts
Learn about the data versioning concepts used in Pachyderm.
December 5, 2022
Pachyderm data concepts describe version-control primitives that you interact with when you use Pachyderm.
These concepts resemble those of the Git version-control system, with a few notable exceptions. Because Pachyderm handles binary files and large datasets as well as plain text, it does not process data the same way Git does. With Git, you keep a copy of the repository on your local machine, work with that copy, apply your changes, and then push them to the upstream copy of the repository, where they are merged.
Pachyderm version control works slightly differently. In Pachyderm, only a centralized repository exists, and you do not store local copies of it. Therefore, a merge in the traditional Git sense does not occur.
Instead, your data can be continuously updated in the master branch of your repo while you experiment with specific data commits in one or more separate branches. Because of this behavior, you cannot run into a merge conflict with Pachyderm.
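The branching behavior described above can be illustrated with a small toy model. This is only a sketch and not Pachyderm's API: the `Repo` and `Commit` classes, their methods, and the `/data.csv` path are hypothetical names invented for illustration. It shows a single central repository in which the master branch keeps receiving new commits while a separate branch stays pinned to an earlier commit.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Commit:
    """Hypothetical immutable snapshot of all files at one point in time."""
    id: int
    parent: Optional["Commit"]
    files: Tuple[Tuple[str, bytes], ...]  # sorted (path, content) pairs


class Repo:
    """Toy centralized repository (an assumption, not Pachyderm's API)."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        # A branch is just a name that points at a commit.
        self.branches: Dict[str, Commit] = {}

    def commit(self, branch: str, changes: Dict[str, bytes]) -> Commit:
        """Snapshot the branch head plus `changes` and advance the branch."""
        parent = self.branches.get(branch)
        files = dict(parent.files) if parent else {}
        files.update(changes)
        new = Commit(next(self._ids), parent, tuple(sorted(files.items())))
        self.branches[branch] = new
        return new

    def create_branch(self, name: str, head: Commit) -> None:
        """Point a new branch at an existing commit; no data is copied."""
        self.branches[name] = head


repo = Repo()
first = repo.commit("master", {"/data.csv": b"v1"})
repo.create_branch("experiment", first)               # pin an experiment to `first`
second = repo.commit("master", {"/data.csv": b"v2"})  # master keeps moving

print(repo.branches["master"].id, repo.branches["experiment"].id)
```

In this sketch the experiment branch still sees the first snapshot after master has advanced, and nothing is ever merged back, so there is no step at which a merge conflict could arise.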
Branch: A pointer to a commit that moves along with new commits as they are submitted.
Commit: An atomic operation that snapshots and preserves the state of files and directories within a repository.
File: A Unix filesystem object (directory or file) that stores data.
History: The collective record of version-controlled commits for pipelines and jobs.
Provenance: The recorded data lineage that tracks the dependencies and relationships between datasets.
Repository: A top-level data object inside Pachyderm that behaves as a location where data is stored.
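To make the idea of data lineage in the list above more concrete, here is a minimal sketch of how recorded dependencies between datasets could be walked to find everything a dataset was derived from. It is an illustration under stated assumptions, not how Pachyderm stores or queries lineage: the `upstream` function, the dictionary layout, and the repository names `images`, `edges`, and `montage` are all hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def upstream(provenance: Dict[str, List[str]], repo: str) -> Set[str]:
    """Return every dataset that `repo` depends on, directly or indirectly.

    `provenance` maps a dataset name to the datasets it was derived from
    (a hypothetical layout chosen for this sketch).
    """
    seen: Set[str] = set()
    queue = deque(provenance.get(repo, ()))
    while queue:  # breadth-first walk over the dependency graph
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(provenance.get(current, ()))
    return seen


# Hypothetical lineage: `edges` is derived from `images`,
# and `montage` is derived from both `images` and `edges`.
provenance = {
    "edges": ["images"],
    "montage": ["images", "edges"],
}

lineage = upstream(provenance, "montage")
print(sorted(lineage))
```

A dataset with no recorded inputs, such as `images` here, has an empty upstream set, which marks it as a source of the lineage.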