A Data Quality pipeline automates the movement and transformation of data. A pipeline ingests data from a dataset and applies a series of transformation steps.
Steps in a pipeline may include tasks such as data standardization, de-duplication, cleaning, validation, and reformatting. Each step uses the output of the preceding steps, so a pipeline performs a sequence of operations that transforms source data into the desired quality and format. The pipeline then outputs the clean data to a data sink such as the source dataset, a new dataset, or a file. Data Quality pipelines can ensure the accuracy, consistency, uniqueness, integrity, and validity of data before you upload it to its final destination.
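The step-chaining described above can be sketched as ordinary function composition. This is an illustrative example only; the step names mirror the tasks listed above, and the functions are assumptions, not the product's actual API.

```python
def standardize(records):
    # Trim whitespace and normalize casing in every field.
    return [{k: v.strip().lower() for k, v in r.items()} for r in records]

def deduplicate(records):
    # Keep only the first occurrence of each identical record.
    seen, out = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def run_pipeline(records, steps):
    # Feed each step the output of the preceding step.
    for step in steps:
        records = step(records)
    return records
```

For example, `run_pipeline(data, [standardize, deduplicate])` standardizes each record and then removes the duplicates that standardization exposes.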
Pipelines list page
The table on this page lists pipelines. From this page you can create a new pipeline, or edit, delete, or rename an existing one.
This page is displayed when you go to the Pipelines tab on the main navigation menu.
- Search: Type any part of a pipeline name to only show pipelines with matching names.
- Create Pipeline: Click this button to create a new pipeline.
- Name: Shows the name of a pipeline. Click the ellipsis to Edit, Delete, Rename, or Duplicate a pipeline. You can also choose Run configurations to create, edit, or run a configuration.
- Dataset: The dataset processed by the pipeline. You can click a dataset name to edit the dataset.
- Status: Indicates whether a pipeline is error-free and ready to run or contains errors that make it invalid.
- Modified By: Displays the name of the user who last updated the pipeline.
- Last Modified: Displays the date when the pipeline was last updated.
Pipeline data sample profiling
The Data Quality pipeline page provides profiling information for sample data. The profiling feature characterizes data in each field and checks for anomalies that may require cleanup before production data is delivered by a pipeline.
The system checks individual fields to confirm that their contents agree with their base and semantic types. For example, if the semantic type is Telephone Number, then alphabetic characters represent a problem. Similarly, if the semantic type is Email Address, then the absence of a domain name (such as @gmail.com) or consecutive dots (for example, John..Doe@gmail.com) may represent a problem.
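The semantic-type checks described above can be sketched as simple per-field rules. The rule names and patterns below are illustrative assumptions based on the examples in this section, not the product's actual validation logic.

```python
import re

# Hypothetical per-semantic-type checks: each returns True when a value
# agrees with its semantic type.
SEMANTIC_CHECKS = {
    # Telephone numbers should not contain alphabetic characters.
    "Telephone Number": lambda v: not re.search(r"[A-Za-z]", v),
    # Email addresses need a domain name and must not contain consecutive dots.
    "Email Address": lambda v: (
        re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None
        and ".." not in v
    ),
}

def find_anomalies(values, semantic_type):
    """Return the sample values that fail the check for the given semantic type."""
    check = SEMANTIC_CHECKS.get(semantic_type)
    if check is None:
        return []
    return [v for v in values if not check(v)]
```

Profiling a sample column then amounts to running `find_anomalies` over its values and flagging anything returned for cleanup.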
Pipeline suggestions
Data Quality pipelines offer suggestions to add steps based on columns in the sample dataset.
As you create or edit a pipeline, you can expand the Suggestions panel to view recommendations for columns and entities in the sample dataset. Suggestions for columns are based on their semantic types. Suggestions for an entity (delimited by the entity bar above the column headings) are based on the semantic types of the columns in the entity. By default, the Suggestions panel displays up to 10 suggestions for a pipeline.
Here are some of the suggestions you might encounter:
- If the data includes an address entity, the system suggests adding the Verify Address & Geocoding step.
- If the dataset includes a Full Name or Company Name column, the system suggests adding the Parse Name step.
- If the dataset includes an Email or Mobile Phone column, the system suggests adding the Parse Email or Parse Phone Number step.
- If a column has the First Name semantic type, the system suggests adding first name standardization in the Standardize Field step.
Each suggestion identifies a recommended step and the column to which it may be applied. Suggestions that do not apply to the selected column or entity are listed without grouping or categorization.
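The suggestion examples above boil down to a mapping from semantic types to recommended steps. The sketch below is a hypothetical illustration of that idea; the rule table and function names are assumptions, not the product's real suggestion engine.

```python
# Hypothetical mapping from a column's semantic type to a suggested step,
# paraphrasing the examples in this section.
COLUMN_RULES = {
    "Full Name": "Parse Name",
    "Company Name": "Parse Name",
    "Email": "Parse Email",
    "Mobile Phone": "Parse Phone Number",
    "First Name": "Standardize Field (first name standardization)",
}

def suggest_steps(columns, limit=10):
    """Return up to `limit` (column, step) suggestions for a sample dataset.

    `columns` maps column names to their detected semantic types; the
    default limit matches the panel's cap of 10 suggestions per pipeline.
    """
    suggestions = [
        (name, COLUMN_RULES[semantic_type])
        for name, semantic_type in columns.items()
        if semantic_type in COLUMN_RULES
    ]
    return suggestions[:limit]
```

A column whose semantic type has no matching rule simply produces no suggestion, which is why the panel can be empty for some datasets.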