Run Commands
Read the PPS series >

Datum Set Spec PPS

Define how a pipeline should group its datums.

Spec #

This is a top-level attribute of the pipeline spec.

{
    "pipeline": {...},
    "transform": {...},
    "datum_set_spec": {
        "number": 0,
        "size_bytes": 0,
        "per_worker": 0,
    },
    ...
}

Attributes #

AttributeDescription
numberThe desired number of datums in each datum set. If specified, each datum set will contain the specified number of datums. If the total number of input datums is not evenly divisible by the number of datums per set, the last datum set may contain fewer datums than the others.
size_bytesThe desired target size of each datum set in bytes. If specified, Pachyderm will attempt to create datum sets with the specified size, though the actual size may vary due to the size of the input files.
per_workerThe desired number of datum sets that each worker should process at a time. This field is similar to number, but specifies the number of sets per worker instead of the number of datums per set.

Behavior #

The datum_set_spec attribute in a Pachyderm Pipeline Spec is used to control how the input data is partitioned into individual datum sets for processing. Datum sets are the unit of work that workers claim, and each worker can claim 1 or more datums. Once done processing, it commits a full datum set.

When to Use #

You should consider using the datum_set_spec attribute in your Pachyderm pipeline when you are experiencing stragglers, which are situations where most of the workers are idle but a few are still processing jobs. This can happen when the work is not divided up in a balanced way, which can cause some workers to be overloaded with work while others are idle.