A Data Quality pipeline can be executed in any supported processing environment. Every pipeline run or catalog connection is logged as a job in Data Integrity Suite.
Once you have created and tested a pipeline application using a sample dataset, you can set up a run configuration. A run configuration defines a Data Quality pipeline job: it identifies the processing environment and the input and output data assets used to process data from a data source. Run configurations are created and edited on the Pipeline Editor page. A run configuration specifies:
- Connection
- Pipeline engine
- Source dataset and target dataset
- Target options
To see data quality jobs, navigate to from the main navigation menu. On this tab, you can view the following:
- Click a column heading to sort jobs in ascending or descending order by the entries in that column. Click the filter button in a column heading to filter jobs by the values in that column or to clear an existing filter.
- Select the check box next to the ID column to either filter or delete the jobs.
- Click the Refresh button on the toolbar to refresh entries on the table.
| Column | Description |
| --- | --- |
| ID | An integer that is assigned sequentially in the order that a quality job is started. Click the ellipsis to either Delete or generate a Quick Run. |
| Pipeline | The pipeline on which a quality job was run. Click the pipeline name to preview the pipeline definition page. |
| Run Configuration | The environment in which a quality job was run. |
| Start Time | The date and time at which a quality job started. |
| Duration | The time that it took to complete the quality job (HH:MM:SS). |
| User | The user name associated with a quality job. |
| Status | The current status of a quality job. The available statuses are: Ready, Pending, Running, Paused, Successful, Failed, Terminating, Cancelled, Unknown. |
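The Duration column reports elapsed time in HH:MM:SS form. As a rough illustration of that format, the following Python sketch converts a count of elapsed seconds into the same shape. The helper name `format_duration` is hypothetical and is not part of Data Integrity Suite. How the product itself displays very long durations is not covered by this topic.

```python
def format_duration(total_seconds: int) -> str:
    """Render elapsed seconds as HH:MM:SS (hypothetical helper, not a Suite API)."""
    if total_seconds < 0:
        raise ValueError("duration cannot be negative")
    # Split the total into whole hours, then whole minutes, then leftover seconds.
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


print(format_duration(3725))  # 1 hour, 2 minutes, 5 seconds -> 01:02:05
```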
What causes a quality job to fail
Any of the following issues with a run configuration will cause a job to fail:
- The schema of the source dataset does not match the schema of the pipeline input.
- The schema of the target dataset does not match the schema of the pipeline output.
- The connection is deleted. The connection specified by the run configuration must be available when a job is run. When a connection is deleted, any run configurations that specify the connection are rendered invalid.
- The pipeline engine is deleted. The pipeline engine specified by a run configuration must be available when a job is run. When a pipeline engine is deleted, any run configurations that specify the pipeline engine are rendered invalid.
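The four failure causes listed above can be pictured as a pre-run check. The Python sketch below is only an illustration under stated assumptions: the names `RunConfiguration` and `find_failure_causes` are hypothetical, a schema is modeled as an ordered tuple of (column name, data type) pairs, and "match" is assumed to mean exact equality. None of this is a Data Integrity Suite API, and the product may compare schemas differently.

```python
from dataclasses import dataclass

# Hypothetical model: an ordered tuple of (column name, data type) pairs.
Schema = tuple[tuple[str, str], ...]


@dataclass(frozen=True)
class RunConfiguration:
    """Hypothetical stand-in for a run configuration (not a Suite API)."""
    connection: str
    pipeline_engine: str
    source_schema: Schema
    target_schema: Schema


def find_failure_causes(
    config: RunConfiguration,
    pipeline_input: Schema,
    pipeline_output: Schema,
    connections: set[str],
    engines: set[str],
) -> list[str]:
    """List the reasons a job using this configuration would be expected to fail."""
    causes = []
    # Assumption: schemas "match" only when they are exactly equal.
    if config.source_schema != pipeline_input:
        causes.append("source dataset schema does not match pipeline input")
    if config.target_schema != pipeline_output:
        causes.append("target dataset schema does not match pipeline output")
    # A deleted connection or engine is modeled as absent from the available set.
    if config.connection not in connections:
        causes.append("connection is deleted")
    if config.pipeline_engine not in engines:
        causes.append("pipeline engine is deleted")
    return causes


schema = (("id", "integer"), ("email", "string"))
config = RunConfiguration("warehouse", "spark", schema, schema)
# The engine set is empty, so only the engine check reports a problem.
print(find_failure_causes(config, schema, schema, {"warehouse"}, set()))
# -> ['pipeline engine is deleted']
```

An empty result from such a check would mean that none of the four listed causes applies. It would not guarantee that the job succeeds, since a job can fail for reasons outside the run configuration.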