File

Learn about the concept of a file.

In Pachyderm, a file is a Unix filesystem object, either a directory or a regular file, that stores data. Unlike source code version-control systems, which are best suited to plain text, Pachyderm can store any type of file, including binary files. Data scientists often work with comma-separated values (CSV), JavaScript Object Notation (JSON), images, and other plain text and binary formats. Pachyderm supports all file sizes and formats and applies storage optimization techniques, such as deduplication, in the background.

To upload files to a Pachyderm repository, run the pachctl put file command. The command accepts both files and directories.
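As a rough sketch of the upload forms mentioned above, the commands below put a single file and then a directory into the images repository used later on this page. The local paths ./A.csv and ./data, the destination path /raw, and the wrapper function put_examples are hypothetical names chosen for illustration. The -r (recursive) flag is the one I understand pachctl to use for directories; confirm it with pachctl put file --help for your version.

```shell
# Sketch only: wrapped in a function so that nothing runs until you call it.
# Assumes a reachable Pachyderm cluster and an existing "images" repo.
# ./A.csv, ./data, and /raw are hypothetical paths.
put_examples() {
  # Upload one file to the root of the master branch.
  pachctl put file images@master -f A.csv

  # Upload one file to an explicit destination path in the repo.
  pachctl put file images@master:/raw/A.csv -f A.csv

  # Upload a directory and everything under it (-r: recursive).
  pachctl put file -r images@master:/raw -f ./data
}
```

Call put_examples only after checking that the repo and the local paths exist.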

⚠️
  • Directories are implied by the paths of the files. Directories are not stored on their own and exist only while they contain files.
  • Do not use regex metacharacters in a path or a file name.
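The two notes above can be illustrated with a small local sketch that does not call pachctl at all. The helper implied_dirs lists the directories that a set of file paths would imply, and has_regex_meta flags paths that contain metacharacters. Both function names are hypothetical, and the character set checked is an assumption: it covers common regex and glob metacharacters but deliberately leaves out the dot, because ordinary names such as A.csv contain one.

```shell
# Hypothetical helper: print the directories implied by the given file paths.
# A directory appears only because some file path passes through it.
implied_dirs() {
  for p in "$@"; do
    d=$(dirname "$p")
    while [ "$d" != "/" ] && [ "$d" != "." ]; do
      printf '%s\n' "$d"
      d=$(dirname "$d")
    done
  done | sort -u
}

# Hypothetical helper: succeed (exit 0) if the path contains a character from
# an assumed set of metacharacters: ] [ * ? + ^ $ | ( ) { } and backslash.
has_regex_meta() {
  printf '%s' "$1" | grep -q '[][*?+^$|(){}\\]'
}

implied_dirs /raw/2024/A.csv /raw/B.csv /C.csv
# prints:
# /raw
# /raw/2024

if has_regex_meta 'raw/*.csv'; then
  echo "rename before upload: raw/*.csv"
fi
```

Note that /C.csv implies no directory, which mirrors the rule that a directory exists only when a file path passes through it.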

File Processing Strategies

Pachyderm provides the following file processing strategies:

Overwriting Files

By default, when you put a file into a Pachyderm repository and a file with the same name already exists in the repo, Pachyderm overwrites the existing file with the new data. For example, suppose you have an A.csv file in a repository. If you upload the same file to that repository again, Pachyderm overwrites the existing file, so A.csv contains only the data from the most recent upload.

Example

  1. View the list of files:

    pachctl list file images@master

    System Response:

    NAME   TYPE SIZE
    /A.csv file 258B

  2. Add the A.csv file again:

    pachctl put file images@master -f A.csv

  3. Verify that the file size has not changed:

    pachctl list file images@master

    System Response:

    NAME   TYPE SIZE
    /A.csv file 258B

Appending to Files

When you enable append mode with the --append (or -a) flag, Pachyderm appends the new data to the existing file instead of overwriting it. For example, suppose you have an A.csv file in the images repository. If you upload the same file to that repository with the --append flag, Pachyderm appends the uploaded data to the existing file.

Example

  1. View the list of files:

    pachctl list file images@master

    System Response:

    NAME   TYPE SIZE
    /A.csv file 258B

  2. Add the A.csv file again:

    pachctl put file -a images@master -f A.csv

  3. Verify that the file size has doubled:

    pachctl list file images@master

    System Response:

    NAME   TYPE SIZE
    /A.csv file 516B
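The size arithmetic behind the two strategies can be reproduced locally without a cluster. The sketch below is only an analogy: an ordinary file stands in for /A.csv in the repo, shell redirection stands in for pachctl put file, and the names workdir, repo_file, upload, and size are hypothetical. It says nothing about how Pachyderm stores data internally.

```shell
# Local analogy only: no pachctl involved.
workdir=$(mktemp -d)
repo_file="$workdir/A.csv"      # stand-in for images@master:/A.csv
upload="$workdir/upload.csv"    # stand-in for the local file being uploaded

# Create 258 bytes of sample data, matching the size shown in the listings.
head -c 258 /dev/zero | tr '\0' 'x' > "$upload"
cp "$upload" "$repo_file"

# Print the size of a file in bytes, without padding.
size() { wc -c < "$1" | tr -d ' '; }

# Default strategy: the upload replaces the existing content.
cat "$upload" > "$repo_file"
overwrite_size=$(size "$repo_file")

# Append strategy: the upload is added after the existing content.
cat "$upload" >> "$repo_file"
append_size=$(size "$repo_file")

echo "overwrite: ${overwrite_size}B, append: ${append_size}B"
# prints: overwrite: 258B, append: 516B

rm -rf "$workdir"
```

Uploading identical data leaves the size at 258B when overwriting and doubles it to 516B when appending, which matches the two listings above.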