
File

A file is a Unix filesystem object, either a regular file or a directory, that stores data. Unlike version-control systems for source code, which work best with plain text files, Pachyderm can store any type of file, including binary files. Data scientists often work with comma-separated values (CSV), JavaScript Object Notation (JSON), images, and other plain text and binary file formats. Pachyderm supports all file sizes and formats and applies storage optimization techniques, such as deduplication, in the background.

To upload files to a Pachyderm repository, run the pachctl put file command. The command accepts both individual files and directories.
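
As a rough sketch of the two upload forms, the script below prints the commands instead of running them unless DRY_RUN is set to 0. The run wrapper, the images repository on the master branch, and the local names A.csv and ./raw-data are illustrative assumptions, not part of Pachyderm. The -r (--recursive) flag is assumed to be available in your pachctl version; check pachctl put file --help to confirm.

```shell
#!/bin/sh
# Sketch only: prints the pachctl commands by default.
# Set DRY_RUN=0 to execute them against a real cluster.
# "run" is a hypothetical wrapper, not a pachctl feature.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Upload a single local file to the root of the master branch.
run pachctl put file images@master -f A.csv

# Upload a local directory; -r (assumed flag) walks the directory tree.
run pachctl put file -r images@master -f ./raw-data
```

Printing first makes it easy to review the exact commands before they touch a repository.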

Warning

  • Directories are implied from the paths of the files. Directories are not stored and do not exist unless they contain files.
  • Do not use regex metacharacters in a path or a file name.
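
Both warnings can be illustrated with two small shell helpers. This is a minimal sketch: implied_dirs and is_safe_path are hypothetical names, not pachctl features, and the set of rejected characters is an assumption (the period is deliberately allowed, since file names such as A.csv use it). The script only inspects path strings and never contacts Pachyderm.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; neither is part of pachctl.

# implied_dirs prints every directory implied by a file path, outermost first.
implied_dirs() {
  dir=$(dirname "$1")
  out=""
  while [ "$dir" != "/" ] && [ "$dir" != "." ]; do
    out="$dir
$out"
    dir=$(dirname "$dir")
  done
  printf '%s' "$out"
}

# is_safe_path returns 0 if the path contains none of the characters below.
# The character set is an assumption, not Pachyderm's official list.
is_safe_path() {
  case "$1" in
    *'*'*|*'?'*|*'['*|*']'*|*'{'*|*'}'*|*'('*|*')'*|*'|'*|*'^'*|*'$'*|*'+'*|*'\'*)
      return 1 ;;
    *)
      return 0 ;;
  esac
}

# /data and /data/2022 exist only because a file lives under them.
implied_dirs "/data/2022/A.csv"

# Check candidate paths before uploading them.
for p in "/data/A.csv" "/data/A*.csv"; do
  if is_safe_path "$p"; then
    echo "ok: $p"
  else
    echo "rejected: $p"
  fi
done
```

A check like this could run before an upload script calls pachctl put file, so that unsafe names are caught early.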

File Processing Strategies

Pachyderm provides the following file processing strategies:

Overwriting Files

By default, when you put a file into a Pachyderm repository and a file with the same name already exists in the repo, Pachyderm overwrites the existing file with the new data. For example, suppose that you have an A.csv file in a repository. If you upload the same file to that repository again, Pachyderm overwrites the existing file, so A.csv contains only the data from the most recent upload.

Example

  1. View the list of files:

    pachctl list file images@master
    

    System Response:

    NAME   TYPE SIZE
    /A.csv file 258B
    
  2. Add the A.csv file once again:

    pachctl put file images@master -f A.csv
    
  3. Verify that the file size has not changed:

    pachctl list file images@master
    

    System Response:

    NAME   TYPE SIZE
    /A.csv file 258B
    

Appending to Files

When you enable append mode with the --append flag (or -a), Pachyderm appends the new data to the existing file instead of overwriting it. For example, suppose that you have an A.csv file in the images repository. If you upload the same file to that repository with the --append flag, Pachyderm appends the uploaded data to the existing file.

Example

  1. View the list of files:

    pachctl list file images@master
    

    System Response:

    NAME   TYPE SIZE
    /A.csv file 258B
    
  2. Add the A.csv file once again:

    pachctl put file -a images@master -f A.csv
    
  3. Verify that the file size has doubled:

    pachctl list file images@master
    

    System Response:

    NAME   TYPE SIZE
    /A.csv file 516B
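
The size arithmetic behind the two strategies (258B stays 258B when overwritten, and becomes 516B when appended) can be reproduced with a local analogy. The sketch below uses plain files in a temporary directory as a stand-in for a repository and never calls pachctl; the repo_A.csv name and the size_of helper are illustrative assumptions.

```shell
#!/bin/sh
# Local analogy only: plain files stand in for a repo; pachctl is never called.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# A 258-byte stand-in for A.csv, matching the size in the listings above.
printf '%258s' '' > "$workdir/A.csv"

# size_of prints a file's size in bytes with no padding.
size_of() {
  echo $(( $(wc -c < "$1") ))
}

# State of the "repository" after the first upload.
cp "$workdir/A.csv" "$workdir/repo_A.csv"

# Default mode: the second upload replaces the stored content.
cp "$workdir/A.csv" "$workdir/repo_A.csv"
overwrite_size=$(size_of "$workdir/repo_A.csv")

# Append mode: the second upload is added after the stored content.
cat "$workdir/A.csv" >> "$workdir/repo_A.csv"
append_size=$(size_of "$workdir/repo_A.csv")

echo "overwrite: ${overwrite_size}B"
echo "append: ${append_size}B"
```

This models only the resulting file sizes, not how Pachyderm stores or deduplicates the data internally.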
    

Last update: July 23, 2022