A Cron pipeline is triggered at a set time interval rather than by new changes in the input repository.
About Cron Pipelines #
Use Cases #
Cron pipelines are great for tasks like:
- Scraping websites
- Making API calls
- Querying a database
- Retrieving a file from a location accessible through an S3 protocol or a File Transfer Protocol (FTP).
Behavior #
When you create a Cron pipeline, Pachyderm creates a new input data repository that corresponds to the cron input. Pachyderm then automatically commits a timestamp file to the cron input repository at the interval you specify, which triggers the pipeline.
By default, each cron trigger adds a new tick file to the cron input repository, accumulating more datums over time.
Optionally, you can set the overwrite flag to true to overwrite the timestamp file on each tick. To learn more about overwriting commits in Pachyderm, see Datum processing.
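As a rough illustration of how user code might react to a tick, the sketch below treats the tick file purely as a trigger and writes one output per tick it sees. It assumes the usual Pachyderm layout in which a cron input named `tick` is mounted at `/pfs/tick` and results are written to `/pfs/out`. The names `run_tick` and `fake_fetch` are hypothetical, and the demo uses temporary directories and a made-up tick file name so that it can run outside a pipeline.

```python
import os
import tempfile


def run_tick(input_dir, output_dir, fetch):
    """Write one output file per tick file visible in input_dir."""
    written = []
    for tick_name in sorted(os.listdir(input_dir)):
        # The tick file is only a trigger; the real work is the fetch() call.
        payload = fetch()
        out_path = os.path.join(output_dir, tick_name + ".txt")
        with open(out_path, "w") as handle:
            handle.write(payload)
        written.append(out_path)
    return written


def fake_fetch():
    # Hypothetical stand-in for scraping a site, calling an API,
    # or querying a database.
    return "payload"


if __name__ == "__main__":
    # Local simulation only. Inside a pipeline the directories would be
    # /pfs/tick and /pfs/out (assumed layout for an input named "tick").
    with tempfile.TemporaryDirectory() as tick_dir, \
            tempfile.TemporaryDirectory() as out_dir:
        # Made-up tick file name, used only for this demo.
        open(os.path.join(tick_dir, "2024-01-01T00-00-00Z"), "w").close()
        print(run_tick(tick_dir, out_dir, fake_fetch))
```

Naming each output after its tick keeps results from different ticks separate when ticks accumulate; with the overwrite flag set, only the most recent tick file would be present.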
Required Parameters #
At minimum, a Cron pipeline must include all of the following parameters:
Parameter | Description |
---|---|
"name" | A descriptive name of the cron pipeline. |
"spec" | The interval between scheduled cron jobs; accepts RFC 3339 inputs, Predefined Schedules (@daily), and Intervals (@every 1h30m20s). |
Callouts #
- Avoid intervals shorter than 1-5 minutes
- You can use `never` during development and manually trigger the pipeline
- If using jsonnet, you can pass arguments like: `--arg cronSpec="@every 5m"`
- You cannot update a cron pipeline after it has been created; instead, you must delete the pipeline and build a new one.
Examples #
Every 60 Seconds #
"input": {
  "cron": {
    "name": "tick",
    "spec": "@every 60s"
  }
}
Daily with Overwrites #
"input": {
  "cron": {
    "name": "tick",
    "spec": "@daily",
    "overwrite": true
  }
}
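If pipeline specs are generated by a script, the two input fragments above could be produced with a small helper like the one sketched here. The function name `cron_input` is hypothetical and not part of any Pachyderm client library. It checks only that the two required parameters are present and does not validate the schedule string itself.

```python
import json


def cron_input(name, spec, overwrite=False):
    """Build the "input" section of a pipeline spec for a cron input."""
    if not name or not spec:
        raise ValueError('a cron input needs both "name" and "spec"')
    cron = {"name": name, "spec": spec}
    if overwrite:
        # Emit the optional flag only when it is set.
        cron["overwrite"] = True
    return {"input": {"cron": cron}}


if __name__ == "__main__":
    # Mirrors the "Every 60 Seconds" and "Daily with Overwrites" fragments.
    print(json.dumps(cron_input("tick", "@every 60s"), indent=2))
    print(json.dumps(cron_input("tick", "@daily", overwrite=True), indent=2))
```

The returned dictionary holds only the input section; a complete pipeline spec would still need its other fields merged in before it is submitted.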
SQL Ingest with Jsonnet #
pachctl update pipeline --jsonnet https://raw.githubusercontent.com/pachyderm/pachyderm/2.3.x/src/templates/sql_ingest_cron.jsonnet \
--arg name=myingest \
--arg url="mysql://root@mysql:3306/test_db" \
--arg query="SELECT * FROM test_data" \
--arg hasHeader=false \
--arg cronSpec="@every 60s" \
--arg secretName="mysql-creds" \
--arg format=json
See Also: Periodic Ingress from MongoDB