File

A file is a Unix filesystem object, which is a directory or file, that stores data. Unlike source code version-control systems that are most suitable for storing plain text files, you can store any type of file in Pachyderm, including binary files. Often, data scientists operate with comma-separated values (CSV), JavaScript Object Notation (JSON), images, and other plain text and binary file formats. Pachyderm supports all file sizes and formats and applies storage optimization techniques, such as deduplication, in the background.

To upload your files to a Pachyderm repository, run the pachctl put file command. By using the pachctl put file command, you can put both files and directories into a Pachyderm repository.

Appending and overwriting

When you add a file by using the pachctl put file command, Pachyderm can either append the file to the already existing file or overwrite the existing file.

Appending files By default, when you put a file into a Pachyderm repository, and a file by the same name already exists in the repo, Pachyderm appends the new data to the existing file. For example, you have an A.csv file in a repository. If you upload the same file to that repository, Pachyderm appends the data to the existing file, which results in the A.csv file having twice the data from its original size.

Example:

  1. View the list of files:

    $ pachctl list file [email protected]
    NAME   TYPE SIZE
    /A.csv file 258B
    
  2. Add the A.csv file once again:

    $ pachctl put file [email protected] -f A.csv
    
  3. Verify that the file has doubled in size:

$ pachctl list file [email protected]
NAME   TYPE SIZE
/A.csv file 516B

Overwriting files When you enable the overwrite mode by using the --overwrite flag or -o, the file replaces the existing file instead of appending to it. For example, you have an A.csv file in the images repository. If you upload the same file to that repository with the --overwrite flag, Pachyderm overwrites the whole file.

Example:

  1. View the list of files:

    $ pachctl list file [email protected]
    NAME   TYPE SIZE
    /A.csv file 258B
    
  2. Add the A.csv file once again:

    $ pachctl put file --overwrite [email protected] -f A.csv
    
  3. Check the file size:

    $ pachctl list file [email protected]
    NAME   TYPE SIZE
    /A.csv file 258B