Ingest Data

Learn how to ingest data using the pachctl put command.

February 8, 2023

pachctl put file

ℹ️

At any time, run pachctl put file --help for the complete list of flags available to you.

  1. Loading your data into Pachyderm with pachctl requires that one or more input repositories have been created. To create one, run:

    pachctl create repo <repo name>
  2. Use the pachctl put file command to put your data into the created repository. Select from the following options:

    • Atomic commit: if no open commit exists in your input repo, Pachyderm automatically starts a new commit, adds your data, and finishes the commit.
    pachctl put file <repo>@<branch>:</path/to/file1> -f <file1>
    • Alternatively, you can manually start a new commit, add your data in multiple put file calls, and close the commit by running pachctl finish commit.

      1. Start a commit:
        pachctl start commit <repo>@<branch>
      2. Put your data:
        pachctl put file <repo>@<branch>:</path/to/file1> -f <file1>
      3. Put more data:
        pachctl put file <repo>@<branch>:</path/to/file2> -f <file2>
      4. Close the commit:
        pachctl finish commit <repo>@<branch>
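
The manual commit flow above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name put_files_in_one_commit, the PACHCTL override variable, and the choice to store each file at the repo root under its base name are hypothetical and are not part of pachctl. The three pachctl commands are the ones shown in the steps above.

```shell
#!/bin/sh
# Hypothetical helper: add several local files to one manually managed commit.
# PACHCTL can be overridden (for example PACHCTL=echo) to preview the commands
# without contacting a cluster.
PACHCTL="${PACHCTL:-pachctl}"

put_files_in_one_commit() {
  repo="$1"
  branch="$2"
  shift 2

  # 1. Start a commit.
  "$PACHCTL" start commit "${repo}@${branch}" || return 1

  # 2-3. Put each file. Assumption: every file lands at the repo root
  # under its own base name.
  for f in "$@"; do
    "$PACHCTL" put file "${repo}@${branch}:/$(basename "$f")" -f "$f" || return 1
  done

  # 4. Close the commit.
  "$PACHCTL" finish commit "${repo}@${branch}"
}
```

For example, put_files_in_one_commit images master a.png b.png would issue one start commit, two put file calls, and one finish commit. If a put file call fails, this sketch returns early and leaves the commit open, so you would still need to finish or delete it yourself.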

Filepath Formats

⚠️

Pachyderm uses *?[]{}!()@+^ as reserved characters for glob patterns. Because of this, you cannot use these characters in your filepath.
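
Before uploading, you may want to check a destination path against the reserved characters listed above. The following is a minimal sketch, not a Pachyderm feature: the function name is_safe_pachyderm_path is hypothetical, and the check only covers the characters named in this warning.

```shell
#!/bin/sh
# Hypothetical check: succeeds (exit status 0) only if the path contains
# none of the reserved glob characters *?[]{}!()@+^
is_safe_pachyderm_path() {
  # Delete every reserved character; if the result differs from the input,
  # the path contained at least one of them.
  stripped=$(printf '%s' "$1" | tr -d '*?[]{}!()@+^')
  [ "$stripped" = "$1" ]
}
```

A call such as is_safe_pachyderm_path "/data/file(1).csv" fails because of the parentheses, so a wrapper script could rename the file or stop before calling pachctl put file.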

In Pachyderm, you specify the path to a file by using the -f option. The path can be a local path or a URL to an external resource. You can add multiple files or directories by using the -i option, which reads the list of paths or URLs from a file. To add the contents of a directory, use the -r flag.
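
As a rough illustration of how -f and -r combine, the sketch below adds -r only when the source is a local directory. The put_path function and the PACHCTL override variable are hypothetical names introduced here. The destination argument is assumed to be in the <repo>@<branch>:</path> form used earlier on this page.

```shell
#!/bin/sh
# PACHCTL can be overridden (for example PACHCTL=echo) to preview commands.
PACHCTL="${PACHCTL:-pachctl}"

# Hypothetical wrapper: put_path <repo>@<branch>:</path> <local path or URL>
put_path() {
  dest="$1"
  src="$2"
  if [ -d "$src" ]; then
    # A local directory: add its contents recursively.
    "$PACHCTL" put file -r "$dest" -f "$src"
  else
    # A single local file or a URL: no recursion flag in this sketch.
    "$PACHCTL" put file "$dest" -f "$src"
  fi
}
```

Because a URL never passes the local directory test, it is always sent without -r here. Whether a remote location needs -r depends on your source, so treat that branch as an assumption to adjust.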

The following table provides examples of pachctl put file commands with various filepaths and data sources:

ℹ️

If you are configuring a local cluster to access an external bucket, make sure that Pachyderm has been given the proper access.

Loading Your Data Partially

Depending on your use case and the volume of your data, you might decide to keep your dataset in its original source and process only a subset in Pachyderm.

Add a metadata file to your repo that lists the URLs or paths of your external data.

Your pipeline code then retrieves the data by following those paths, without the need to preload it all. In this case, Pachyderm does not keep versions of the source files, but it does track the provenance of the resulting output commits.
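
On the pipeline side, reading such a metadata file could look like the sketch below. It assumes a format this page does not define: one URL per line, with blank lines and lines starting with # ignored. The names process_manifest and fetch_one are hypothetical, and curl is only one possible way to fetch an entry. Local paths would need a different fetch_one.

```shell
#!/bin/sh
# Hypothetical fetch step: download one URL into the current directory.
# curl flags: -f fail on HTTP errors, -sS quiet but show errors,
# -L follow redirects, -O save under the remote file name.
fetch_one() {
  curl -fsSL -O "$1"
}

# Hypothetical driver: process_manifest <manifest file>
process_manifest() {
  manifest="$1"
  # The "|| [ -n ... ]" part also handles a last line with no trailing newline.
  while IFS= read -r entry || [ -n "$entry" ]; do
    case "$entry" in
      ''|'#'*) continue ;;   # skip blank lines and comments
    esac
    fetch_one "$entry" || return 1
  done < "$manifest"
}
```

A pipeline would call process_manifest with the location of the metadata file in its input, then process whatever was downloaded. Only the small metadata file is versioned in the repo. The external data is fetched at run time.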