Run Commands
Read the PPS series >

Input Group PPS

Group files stored in one or multiple repos.

Spec #

This is a top-level attribute of the pipeline spec.

{
  "pipeline": {...},
  "transform": {...},
  "input": {
    "group": [
      {
        "pfs": {
          "project": string,
          "name": string,
          "repo": string,
          "branch": string,
          "glob": string,
          "group_by": string,
          "lazy": bool,
          "empty_files": bool,
          "s3": bool
        }
      },
      {
        "pfs": {
          "project": string,
          "name": string,
          "repo": string,
          "branch": string,
          "glob": string,
          "group_by": string,
          "lazy": bool,
          "empty_files": bool,
          "s3": bool
        }
      }
    ]
  },
  ...
}

Attributes #

AttributeDescription
nameThe name of the PFS input that appears in the INPUT field when you run the pachctl list pipeline command. If an input name is not specified, it defaults to the name of the repo.
repoSpecifies the name of the Pachyderm repository that contains the input data.
branchThe branch to watch for commits. If left blank, Pachyderm sets this value to master.
globA wildcard pattern that defines how a dataset is broken up into datums for further processing. When you use a glob pattern in a group input, it creates a naming convention that Pachyderm uses to group the files.
group_byA parameter that is used to group input files by a specific pattern.
lazyControls how the data is exposed to jobs. The default is false which means the job eagerly downloads the data it needs to process and exposes it as normal files on disk. If lazy is set to true, data is exposed as named pipes instead, and no data is downloaded until the job opens the pipe and reads it. If the pipe is never opened, then no data is downloaded.
empty_filesControls how files are exposed to jobs. If set to true, it causes files from this PFS to be presented as empty files. This is useful in shuffle pipelines where you want to read the names of files and reorganize them by using symlinks.
s3Indicates whether the input data is stored in an S3 object store.

Behavior #

The group input in a Pachyderm Pipeline Spec allows you to group input files by a specific pattern.

To use the group input, you specify one or more PFS inputs with a group_by parameter. This parameter specifies a pattern or field to use for grouping the input files. The resulting groups are then passed to your pipeline as a series of grouped datums, where each datum is a single group of files.

You can specify multiple group input fields in a Pachyderm Pipeline Spec, each with their own group_by parameter. This allows you to group files by multiple fields or patterns, and pass each group to your pipeline as a separate datum.

The glob and group_by parameters must be configured.

When to Use #

You should consider using the group input in a Pachyderm Pipeline Spec when you have large datasets with multiple files that you want to partition or group by a specific field or pattern. This can be useful in a variety of scenarios, such as when you need to perform complex data analysis on a large dataset, or when you need to group files by some attribute or characteristic in order to facilitate further processing.

Example scenarios: