A Data Quality pipeline can be executed in any supported processing environment. Every pipeline run or catalog connection is logged as a job in Data Integrity Suite.
Once you have created and tested a pipeline application using a sample dataset, you can set up a run configuration. A run configuration defines the processing environment and the input and output data assets used to process data from a data source. Each Data Quality pipeline job is defined by a pipeline run configuration. You create and edit pipeline run configurations on the Pipeline Editor page. A run configuration specifies:
- Connection
- Pipeline engine
- Source dataset and target dataset
- Target options
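As a rough illustration, the four parts of a run configuration listed above can be modeled as a simple record. The sketch below is hypothetical: the class name, field names, and sample values are assumptions made for illustration and are not part of the Data Integrity Suite API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RunConfiguration:
    """Hypothetical model of a pipeline run configuration (not the Suite's API)."""
    connection: str        # connection used to reach the data source
    pipeline_engine: str   # engine that executes the pipeline
    source_dataset: str    # input data asset
    target_dataset: str    # output data asset
    # Assumed to be free-form key/value settings for the target.
    target_options: dict = field(default_factory=dict)


# All values below are invented for illustration only.
config = RunConfiguration(
    connection="example-connection",
    pipeline_engine="example-engine",
    source_dataset="customers_raw",
    target_dataset="customers_clean",
    target_options={"write_mode": "overwrite"},
)
```

Keeping the record immutable (`frozen=True`) mirrors the idea that a job is defined by the run configuration it was started with, though that is a design choice of this sketch, not a documented behavior.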
View pipeline jobs
To see data quality jobs, navigate to from the main navigation menu and go to the Pipelines tab. On this tab, you can do the following:
- Click a column heading to reorder jobs in ascending or descending order by entries in the column. You can click the Filter button in a column heading to filter jobs by values in a column or to clear an existing filter.
- Select the check box next to the ID column to either filter or delete the jobs.
- Click the Refresh button on the toolbar to refresh entries in the table.
| Column | Description |
|---|---|
| ID | An integer assigned sequentially in the order in which quality jobs are started. Click the Ellipsis to either Delete the job or generate a Quick Run. |
| Pipeline | The pipeline on which a quality job was run. Click the pipeline name to preview the pipeline definition page. |
| Run Configuration | The environment in which a quality job was run. |
| Start Time | The date and time at which a quality job started. |
| Duration | The time that it took to complete the quality job (HH:MM:SS). |
| User | The user name associated with a quality job. |
| Status | The current status of a quality job. |
What causes a pipeline job to fail
Any of the following issues with a run configuration will cause a job to fail:
- The schema of the source dataset does not match the schema of the pipeline input.
- The schema of the target dataset does not match the schema of the pipeline output.
- The connection is deleted. The connection specified by the run configuration must be available when a job is run. When a connection is deleted, any run configurations that specify the connection are rendered invalid.
- The pipeline engine is deleted. The pipeline engine specified by a run configuration must be available when a job is run. When a pipeline engine is deleted, any run configurations that specify the pipeline engine are rendered invalid.
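The four failure causes above can be thought of as a pre-run check. The following is a minimal sketch under stated assumptions: `RunCheck`, `find_failure_causes`, and all sample values are hypothetical names introduced here, schemas are modeled as column-to-type dictionaries, and "match" is treated as exact equality. It does not reflect how Data Integrity Suite actually validates a job.

```python
from dataclasses import dataclass


@dataclass
class RunCheck:
    """Hypothetical inputs for a pre-run check (all names are assumptions)."""
    connection: str
    pipeline_engine: str
    source_schema: dict    # column name -> type for the source dataset
    target_schema: dict    # column name -> type for the target dataset
    pipeline_input: dict   # schema expected by the pipeline input
    pipeline_output: dict  # schema produced by the pipeline output


def find_failure_causes(check, connections, engines):
    """Return the reasons a job would fail; the list is empty if none apply."""
    causes = []
    if check.source_schema != check.pipeline_input:
        causes.append("source schema does not match pipeline input")
    if check.target_schema != check.pipeline_output:
        causes.append("target schema does not match pipeline output")
    if check.connection not in connections:
        causes.append("connection is deleted")
    if check.pipeline_engine not in engines:
        causes.append("pipeline engine is deleted")
    return causes


# Invented example: the source schema differs and the engine no longer exists.
check = RunCheck(
    connection="conn-a",
    pipeline_engine="engine-a",
    source_schema={"id": "int", "name": "string"},
    target_schema={"id": "int"},
    pipeline_input={"id": "int"},
    pipeline_output={"id": "int"},
)
causes = find_failure_causes(check, connections={"conn-a"}, engines=set())
print(causes)
```

Collecting every cause instead of stopping at the first one makes it easier to repair an invalid run configuration in a single pass.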