Data Caching (CDRs)

Pachyderm’s Common Data Refs (CDRs) feature optimizes the handling of large, remote datasets in pipeline operations. Instead of downloading entire datasets for each pipeline run, CDRs enable local caching of data, significantly improving efficiency and performance. This approach offers several benefits:

  • Reduces time spent on data transfer
  • Minimizes network usage
  • Enables faster pipeline execution
  • Allows for efficient handling of version-controlled data

By leveraging CDRs, Pachyderm balances the need for up-to-date data with the performance advantages of local data access, making it ideal for workflows involving large, frequently used datasets.

Before You Start

Usage of the Common Data Refs (CDRs) feature requires the following:

  • You must use Pachyderm version 2.11.0 or later
  • You must install the Pachyderm SDK with the cdr extra: pachyderm_sdk[cdr]>=2.11.0
  • You must use an S3-compatible storage backend

How to Cache Data via Common Data Refs (CDRs)

The following high-level walkthrough uses the JupyterLab extension to create a pipeline and define your user code.

  1. Create an input repo with your files, for example default/cdrs-demo-input. You can create it through the console, with pachctl, or with the SDK, as in the sketch below.
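    If you prefer to script this step, the following is a minimal sketch using the Pachyderm SDK. It assumes a local pachctl configuration for the connection; the file name and contents are placeholders.

    from pachyderm_sdk import Client
    from pachyderm_sdk.api import pfs

    # Connect using your local pachctl configuration.
    client = Client.from_config()

    # Create the input repo in the default project.
    repo = pfs.Repo(name="cdrs-demo-input", project=pfs.Project(name="default"))
    client.pfs.create_repo(repo=repo)

    # Commit a sample file to the master branch.
    branch = pfs.Branch(repo=repo, name="master")
    with client.pfs.commit(branch=branch) as commit:
        commit.put_file_from_bytes(path="/example.txt", data=b"hello, world")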
  2. Add pachyderm_sdk[cdr]==2.11.2 to your requirements.txt file.
  3. Create a notebook. For example, notebook.ipynb.
  4. Add the following imports and define a cache location.
    import os
    from pachyderm_sdk import Client
    from pachyderm_sdk.api import pfs, storage
    
    # Cache assembled chunks in a "cache" directory under the working directory.
    # (Note: the second argument must be relative; an absolute path like "/cache"
    # would make os.path.join discard the working directory entirely.)
    CACHE_LOCATION = os.path.join(os.getcwd(), "cache")
  5. Read the FILESET_ID and PACH_DATUM_ID environment variables, which the pipeline worker sets and which are needed to assemble the fileset.
    # Both variables are set by the pipeline worker at runtime.
    fileset_id = os.environ['FILESET_ID']
    datum_path = f"/pfs/{os.environ['PACH_DATUM_ID']}"
  6. Initialize the Pachyderm client.
    client = Client(
        host='192.168.64.3',  # example address; point this at your own pachd instance
        port=80,
        auth_token=os.environ['PACH_TOKEN'],
    )
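    The host and port above are example values for a local cluster. As an alternative sketch, if the pipeline pod runs in the same Kubernetes namespace as a service named pachd (an assumption here), you can derive them from the service environment variables Kubernetes injects automatically:

    # PACHD_SERVICE_HOST / PACHD_SERVICE_PORT are injected by Kubernetes for a
    # service named "pachd" in the pod's namespace (assumed here).
    client = Client(
        host=os.environ["PACHD_SERVICE_HOST"],
        port=int(os.environ["PACHD_SERVICE_PORT"]),
        auth_token=os.environ['PACH_TOKEN'],
    )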
  7. Assemble the fileset. This reconstructs the datum's files into the destination directory, reusing chunks already in the local cache and fetching any missing chunks from the storage backend.
    client.storage.assemble_fileset(
        fileset_id,
        path=datum_path,
        cache_location=CACHE_LOCATION,
        destination="/pfs/out/",
        fetch_missing_chunks=True,
    )
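    Chunks written to CACHE_LOCATION can be reused by later datums in the same job, and by later jobs if the cache path points at storage that outlives the container; this reuse is where the transfer savings come from. As a quick sanity check, you can list what was assembled into the output directory:

    # Walk /pfs/out/ and print each assembled file with its size.
    for root, _, files in os.walk("/pfs/out/"):
        for name in files:
            full = os.path.join(root, name)
            print(full, os.path.getsize(full), "bytes")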
  8. Create an input spec with the following details. Setting empty_files: true is required: the worker then mounts each input file as an empty placeholder instead of downloading its contents, and your code fetches the real data through the CDR cache.
    pfs:
      name: default_cdrs-demo-input_master
      repo: cdrs-demo-input
      glob: /*
      empty_files: true # required
  9. Create and run the pipeline with the input spec and the notebook you created.
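After the pipeline has run, you can confirm it produced output by listing the files in its output repo. The sketch below assumes a local pachctl configuration and a pipeline named cdrs-demo, which is a placeholder:

    from pachyderm_sdk import Client
    from pachyderm_sdk.api import pfs

    client = Client.from_config()

    # The output repo shares the pipeline's name; "cdrs-demo" is hypothetical.
    for info in client.pfs.list_file(file=pfs.File.from_uri("cdrs-demo@master:/")):
        print(info.file.path, info.size_bytes)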