Joining multiple inputs in a Data Quality pipeline allows you to relate different datasets to enrich records or compile a complete set of records from various sources.
Join operations in a pipeline enable the merging of records from multiple inputs based on specified keys, creating a single, coherent output view. This feature supports Inner, Left, Right, and Full joins, allowing users to define the nature of the relationship between the tables.
For example, a retail company may want to join customer transaction data with customer support records to provide a holistic view of customer interactions and purchasing behaviors. The transaction dataset contains purchase details, while the support dataset contains customer inquiries and issues. By joining these datasets on a common customer ID field, the company can analyze the data to improve customer service and tailor marketing strategies.
Suppose we have two datasets: one containing transaction data and the other containing customer support records.
| Customer ID | Transaction ID | Product Purchased | Purchase Amount |
|---|---|---|---|
| 001 | 1001 | Product A | $50 |
| 002 | 1002 | Product B | $30 |
| 003 | 1003 | Product C | $70 |
| 004 | 1004 | Product A | $40 |
| 005 | 1005 | Product D | $60 |
| Customer ID | Support Ticket ID | Issue | Resolution |
|---|---|---|---|
| 001 | 5001 | Shipping Delayed | Refunded |
| 002 | 5002 | Missing Parts | Replaced |
| 003 | 5003 | Product Quality | Replacement |
| 004 | 5004 | Wrong Item | Refunded |
| 005 | 5005 | Product damaged | Exchange |
To provide a holistic view of customer interactions and purchasing behaviors, we can join these datasets on the common field "Customer ID". After joining, we might get a combined dataset like this:
| Customer ID | Transaction ID | Product Purchased | Purchase Amount | Support Ticket ID | Issue | Resolution |
|---|---|---|---|---|---|---|
| 001 | 1001 | Product A | $50 | 5001 | Shipping Delayed | Refunded |
| 002 | 1002 | Product B | $30 | 5002 | Missing Parts | Replaced |
| 003 | 1003 | Product C | $70 | 5003 | Product Quality | Replacement |
| 004 | 1004 | Product A | $40 | 5004 | Wrong Item | Refunded |
| 005 | 1005 | Product D | $60 | 5005 | Product damaged | Exchange |
In this combined dataset, each row represents a customer's transaction along with any associated support tickets, providing a holistic view of both their purchasing behaviors and interactions with customer support.
To append multiple inputs.
- While building the pipeline, select Add Step on the canvas where you want to branch.
- In the Add Step dialog, select Join.Note: The join operator can only be applied to the open end of a branch, not to any point within the branch.
- Choose the datasets to join by selecting the left and right inputs in the properties panel.
- Select the datasets from either the branches in the current pipeline or from cataloged datasets.
- Define the join keys by mapping corresponding fields from both inputs.
- Select the fields to be used as join keys from the dropdown lists.
- Join keys are the fields based on which the datasets will be merged.
- Choose the type of join—Inner, Left, Right, or Full—according to the desired output.
- Click Configure Output to define the output view based on the input columns and the type of join.Tip: You can include additional fields such as Record Count, Source Identifier, and Source Record Counter as part of the output.
- Preview the join result by clicking Preview.Note: It displays all columns from both the left and right input tables. The columns are shown in the order they appear, starting with the join key columns from both the left and right fields.
- If the preview is satisfactory, click Save to add the join step to the pipeline.
The datasets are joined according to the specified keys and join type, resulting in a unified dataset for further processing. Review the joined data and proceed with additional transformations or analyses as required.
Step properties
Left Input: The primary dataset to be joined. Select the dataset that will act as the main reference for the join operation.
Right Input: The secondary dataset to be joined. Choose the dataset that will be combined with the primary dataset based on the join keys.
The fields used to match records from both datasets. Specify which fields should be used to align records from the left and right inputs.
- Left field: Specifies the key field in the left input dataset. This field will be matched against the corresponding key in the right dataset.
- Right field: Specifies the key field in the right input dataset. This field will be matched against the corresponding key in the left dataset.
Add Join Key: Adds additional fields to append. Use this to specify more fields that should be used to determine matches between the two datasets.
- Inner: Keeps only records with matching keys in both datasets, resulting in a dataset containing only the intersection of both inputs.
- Left: Keeps all records from the left dataset and matching records from the right dataset, filling in with nulls where there is no match.
- Right: Keeps all records from the right dataset and matching records from the left dataset, filling in with nulls where there is no match.
- Full: Keeps all records from both datasets, filling in with nulls where there are no matches, effectively creating a union of the datasets.
- All fields: Includes all fields from both datasets in the output, providing a comprehensive view of combined data.
- Left input: Includes only fields from the left input dataset in the output, focusing on the primary datasets structure.
- Right input: Includes only fields from the right input dataset in the output, focusing on the secondary datasets structure.Note: If a field name exists in both tables, the system automatically adds a _Right suffix to the field name from the right table (this is case-insensitive). Users can update or add new field names in the Rename column. To configure output columns, simply select the check-box next to the desired field names.
Select the output fields that you want to include in joined dataset