The Consolidate Matches step is used to consolidate inconsistent duplicate records by either creating a new golden record or replacing them with consistent records.
Enterprises have multiple versions of the same business records from different sources. These records are inconsistent and therefore unreliable. It is required to identify these duplicate records for a given business field (customer/organization/address) and then combine these duplicates to obtain the most reliable record. The accuracy of the resultant record is determined by the confidence level, which depends on these factors.
- Quality of source: refers to the reliability and trustworthiness of the data source itself.
- Chronology/recency: pertains to the time when it was collected or last updated.
- Frequency/plurality: frequent updates ensure that the data is up-to-date and reflects the current state of affairs.
- Completeness: refers to the extent to which all relevant information is present in the dataset.
- Validity: refers to whether the data accurately represents the real-world entities or events it intends to describe.
- Business rules: specific rules or guidelines that dictate how data should be handled and managed.
The Consolidate Matches step provides two approaches—best of breed and commonize.
Approach
Best of breed—The best of breed approach uses rules to create a golden record.
Commonize—The Commonize approach determines which fields from a collection of records to copy to the corresponding fields of all records in the collection.
Save
Click this button to close settings and save changes to the transformation settings.
Preview
Click this button to preview the results of the transformation settings.
Cancel
Click this button to close settings for this transformations without saving any changes.
Best of breed approach
The best of breed approach uses rules to create a golden record.
It consolidates duplicate records by selecting the best data in a duplicate record collection and creating a new consolidated record using the best data. You can define multiple conditions and actions for each rule as needed. When processing completes, the best of breed record is retained by the system.
The Best of Breed rule ensures data accuracy and reliability. It protects data-driven processes, eliminates inaccuracies, and aids in decision-making. Data professionals implement the rule and evaluate records based on predefined criteria. It applies in data warehouses, CRM systems, and more. The rule compares records and retains the highest-quality entry as the survivor record also called as the single source of truth record.
Group by
Specifies the column in your dataset that you want to use as the basis
for grouping. For example, you might have a dataset of sales data with
columns like product, region, and
sales_amount. If you want to apply the best of
breed rule to find the highest sales amount for each product, you would
group the data by the product field.
Sorts data within the group based on the selected Group by column. The order of records impacts the selection of the golden record when you use the first record as the starter record. The default sort may not yield the expected order, and it may select a starter record that is unhelpful, potentially impacting your data decisions. When you select Sort within group, it arranges the fields in the selected Group by column according to your specified sorting order.
Keep original records
Allows you to retain your input records along with the best of breed record. Not selecting the checkbox will remove other records and retain only the best of breed record.
Starter Record
The record that serves as the initial or reference point for defining how consolidation rules should be applied to other similar records. You can choose the starter record based on certain criteria, such as its quality, accuracy, or completeness.
First record
Starter record conditions
Specifies conditions that are used to determine which record should be selected as the starter record for defining consolidation rules. These conditions help identify the initial reference point from which consolidation rules are established to determine the golden record.
Consolidation Rules
Specifies the rules that help select the most accurate, relevant, or highest-quality data from multiple sources. Consolidation rules help determine the best or most reliable record from the data that has inconsistencies, duplicates, or conflicting information.
Add Consolidation Rules
Allows you to add multiple conditions and actions that you want to define as the best of breed rule. You can add more than one consolidation rules. You can also edit or delete an existing rule if needed.
Add consolidation rules
Conditions
Specifies single or multiple conditions to set rules. You can add group of conditions for a more complex rule. The operators are shown based on the column type selected.
Add Condition
Add Group Condition
Actions
Specifies actions for your provided conditions or group of conditions.
Add Action
Lets you add actions for your defined conditions. You can choose to copy the data from selected field or copy from string that you provide, to the field that you want.
Copy From Field: copies the selected field value to the best of breed record.
Copy From String: copies a constant value to the best of breed record.
Apply
Saves the provided conditions and actions.
Output field
- Primary: This represents the selected template record within the collection of records.
- Secondary: This denotes a record that is not the chosen template within the collection of records.
- BestOfBreed: This signifies the newly created best-of-breed
record within the collection of records.Note: The Primary and Secondary values are generated only when you define Starter record conditions.
The output field is created based on the action defined for a particular condition or group condition. You can use Copy From Field under Actions to copy the content of a specific column to a new column using the consolidation rules.
Commonize approach
The Commonize approach determines which fields from a collection of records to copy to the corresponding fields of all records in the collection.
You have to specify rules that records must satisfy to copy the field data to the other records in the collection. When processing completes, all records in the collection are retained.
The Commonize rule ensures standardization and unification of data from diverse sources. It aims to create a consistent and homogeneous dataset, preventing discrepancies. It applies to data integration, data consolidation, and data cleansing. Having consistent data, enhances data quality, and enables better decision-making processes, fostering efficiency and accuracy in their data-centric operations.
Level 1
Specifies the first pass for applying grouping and consolidation rules to the dataset.
Add Level
Allows you to add more levels that can be used to apply different grouping and consolidation rules for each level. The result of level 1 is used as the input for level 2, and so on. You can add multiple levels based on the duplicate or inconsistent fields present in your dataset.
Group by
Specifies the column that you want to use to group your resultant record. This is usually the column by which you group or categorize data items that share common characteristics. This grouping allows you to apply commonizing rules consistently within each group, ensuring that data is standardized in a coherent manner.
Sorts data within the group based on the selected Group by column. The order of records impacts the selection of the commonize record. The default sort may not yield the expected order, and it may select a template record that is unhelpful, potentially impacting your data decisions. When you select Sort within group, it arranges the fields in the selected Group by column according to your specified sorting order.
Consolidation Rules
Specifies the rules that help address issues such as inconsistent naming conventions, formats, or values across different datasets. Consolidation rules ensure that data is consistent and aligned based on predefined standards.
Add Consolidation Rules
Allows you to add multiple conditions and actions that you want to define as commonizing rule. You can add more than one consolidation rules. You can also edit or delete an existing rule if needed.
Add consolidation rules
Conditions
Specifies single or multiple conditions to set rules. You can add group of conditions for a more complex rule. The operators are shown based on the column type selected.
Add Condition
Add Group Condition
Actions
Specifies actions for your provided conditions or group of conditions.
Add Action
Lets you add actions for your defined conditions. You can choose to copy the data from selected field or copy from string that you provide, to the field that you want.
Copy From Field: copies a value from the selected field to the other records in the group.
Copy From String: copies a constant value to the other records in the group.
Apply
Saves the provided conditions and actions.
Output field
The output field is created based on the action defined for a particular condition or group condition. You can use Copy From Field under Actions to copy the content of a specific column to a new column using the consolidation rules.
Consolidation rule operators
Use these operators to define conditions within a condition group for the consolidation rules.
Numeric field operators
Use numeric operators to evaluate values in Integer, Long, Float, Date, DateTime, and Time fields.
| Operator | Expression true when |
|---|---|
| Equal | Field value is equal to a literal or field value. |
| Not equal | Field value is not equal to a literal or field value. |
| Highest | Field value is the highest than any other field values. |
| Lowest | Field value is the lowest than any other field values. |
| Greater than | Field value is greater than a literal or field value. |
| Less than | Field value is less than a literal or field value. |
| Greater than or equal | Field value is greater than or equal to a literal or field value. |
| Less than or equal | Field value is less than or equal to a literal or field value. |
| Empty | Field value contains no value. |
| Not Empty | Field value contains any value. |
String field operators
Use string operators to evaluate values in String fields.
| Operator | Expression true when |
|---|---|
| Equal | Field string value matches a literal or field value. |
| Not equal | Field string value does not match a literal or field value. |
| Contains | Field string value contains a literal or field value. |
| Empty | Field value contains no value. |
| Not empty | Field value contains any value. |
| Longest | Field string value has the longest (in bytes) value in the field. |
| Shortest | Field string value has the shortest (in bytes) value in the field. |
| Most Common | Field string value that occurs most frequently in this field among the records in the group. If two or more values are most common, no action is taken. |