Defining how your data is spread among workers is one of the most important aspects of distributed computation.
Pachyderm uses glob patterns to provide flexibility to define data distribution.
You can think of each input repository as a filesystem where the glob pattern is applied to the root. The files and directories that match the glob pattern constitute the datums that will be processed by the worker(s) that run your pipeline code.
You must configure a glob pattern for each PFS input of a pipeline specification.
The Pachyderm's concept of glob patterns is similar to the Unix glob patterns. For example, the
ls *.md command matches all files with the
.md file extension.
In Pachyderm, the
* indicators are most commonly used globs.
Let's list the glob patterns at your disposal. We will later illustrate their use in an example:
|Glob Pattern||Datum created|
| ||Pachyderm denotes the whole repository as a single datum and sends all input data to a single worker node to be processed together.|
| ||Pachyderm defines each top-level filesystem object, a file or a directory in the input repo, as a separate datum. For example, if you have a repository with ten files and no directory structure, Pachyderm identifies each file as a single datum and processes them independently.|
| ||Pachyderm processes each filesystem object in each subdirectory as a separate datum.|
| ||Pachyderm processes each filesystem object in all directories and subdirectories as a separate datum.|
Glob patterns also let you take only a particular directory or subset of directories as an input instead of the whole repo. We will elaborate on this more in the following example.
If you have more than one input repo in your pipeline, you can define a different glob pattern for each input repo. You can additionally combine the datums from each input repo by using the
group operator to create the final datums that your code processes. For more information, see Cross and Union, Join, Group.
Example of Defining Datums¶
Let's consider an input repo with the following structure where each top-level directory represents a US state with a
json file for each city in that state:
/California /San-Francisco.json /Los-Angeles.json ... /Colorado /Denver.json /Boulder.json ... /Washington /Seattle.json /Vancouver.json
Now let's consider what the following glob patterns would match respectively:
|Glob Pattern||Corresponding match||Example|
| ||This pattern matches ||If you add a new file |
| ||This pattern matches everything under the root directory. It defines one datum per state, which means that all the cities for a given state are processed together by a single worker, but each state is processed independently.||If you add a new file |
| ||This pattern matches files only under the ||If you add a new file |
| ||This pattern matches all files under the root directory that start with the character ||In the example, the |
| ||This pattern matches everything that's two levels deep relative to the root.||If we add County sub-directories to our states, |
| ||The match is applied at all levels of your directory structure. This is a recursive glob pattern. Let's look at the additional example below for more detail.|
Example: The case of the
/** glob pattern
Say we have the following repo structure:
/nope1.txt /test1.txt /foo-1 /nope2.txt /test2.txt /foo-2 /foo-2_1 /nope3.txt /test3.txt /anothertest.txt
teststarting from our input repo's root directory. In this case, the resulting datums will be:
- /test1.txt - /foo-1/test2.txt - /foo-2/foo-2_1/test3.txt - /foo-2/foo-2_1/anothertest.txt
Test a Glob pattern¶
You can use the
pachctl glob file command to preview which filesystem objects a pipeline defines as datums. This command helps you to test various glob patterns before you use them in a pipeline.
- If you set the
/, Pachyderm detects all top-level filesystem objects in the
trainrepository as one datum:
pachctl glob file train@master:/
NAME TYPE SIZE / dir 15.11KiB
- If you set the
/*, Pachyderm detects each top-level filesystem object in the
trainrepository as a separate datum:
pachctl glob file train@master:/*
NAME TYPE SIZE /IssueSummarization.py file 1.224KiB /requirements.txt file 74B /seq2seq_utils.py file 13.81KiB
Test your Datums¶
The granularity of your datums defines how your data will be distributed across the available workers allocated to a job. Pachyderm allows you to check those datums:
- for a pipeline currently being developed
- for a past job
Testing your glob pattern before creating a pipeline¶
You can use the
pachctl list datum -f <my_pipeline_spec.json> command to preview the datums defined by a pipeline given its specification file.
The pipeline does not need to have been created for the command to return the list of datums. This "dry run" helps you adjust your glob pattern when creating your pipeline.
pachctl list datum -f edges.json
ID FILES STATUS TIME - images@8c958d1523f3428a98ac97fbfc367bae:/g2QnNqa.jpg - - - images@8c958d1523f3428a98ac97fbfc367bae:/8MN9Kg0.jpg - - - images@8c958d1523f3428a98ac97fbfc367bae:/46Q8nDz.jpg - -
Running list datum on a past job¶
You can use the
pachctl list datum <job_number> command to check the datums processed by a given job.
pachctl list datum d10979d9f9894610bb287fa5e0d734b5
ID FILES STATUS TIME ebd35bb33c5f772f02d7dfc4735ad1dde8cc923474a1ee28a19b16b2990d29592e30 images@8c958d1523f3428a98ac97fbfc367bae:/g2QnNqa.jpg - - ebd3ce3cdbab9b78cc58f40aa2019a5a6bce82d1f70441bd5d41a625b7769cce9bc4 images@8c958d1523f3428a98ac97fbfc367bae:/8MN9Kg0.jpg - - ebd32cf84c73cfcc4237ac4afdfe6f27beee3cb039d38613421149122e1f9faff349 images@8c958d1523f3428a98ac97fbfc367bae:/46Q8nDz.jpg - -
Now that the 3 datums have been processed, their ID field is showing.
list datumon a given job execution of a pipeline that enables stats allows you to additionally display the STATUS (running, failed, success) and TIME of each datum.
- List datum is a costly operation that might impact the performance of your cluster. If your primary goal is to check which datums have failed or succeeded, a good tradeoff would be to use the
--status-onlyflag. In this case, you would only display the ID and STATUS of each datum. You might want to follow up with inspect datum
to detail the files that a specific datum includes.