Quality scores in the Data Integrity Suite are calculated using a hierarchical aggregation methodology that combines individual rule scores across fields and datasets. This topic explains how scores are computed at each level and how multiple rules and fields contribute to the overall quality assessment.
Scoring hierarchy overview
Quality scores are calculated in a hierarchical structure that flows from individual rules up to the datasource level. Understanding this hierarchy is essential for interpreting quality metrics across your data assets.
The scoring hierarchy follows this order:
- Rule score: The base level score generated when a single rule is run against a field or dataset.
- Field score: An aggregate of all rule scores applied to a specific field.
- Dataset score: An aggregate of all the quality scores associated with the fields in the respective dataset.
- Datasource score: An aggregate of all dataset scores within a datasource.
Quality rule score calculation
A quality rule score is the foundational metric generated when a data quality rule is run. The score represents the percentage of records or data elements that satisfy the rule's pass condition.
Rule score formula:
Rule score = (Number of records passing the rule / Total records evaluated) × 100
Example:
If a completeness rule checks for non-null values in a customer email field and finds that 950 out of 1000 records contain valid email addresses, the rule score is calculated as:
Rule score = (950 / 1000) × 100 = 95%
Rule scores range from 0% to 100%, where 100% indicates that all evaluated records satisfy the rule condition.
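The rule score formula above can be sketched as a small helper function; the function name and signature here are illustrative, not part of the product API:

```python
def rule_score(passing: int, total: int) -> float:
    """Percentage of evaluated records that satisfy the rule's pass condition."""
    if total <= 0:
        raise ValueError("total records evaluated must be positive")
    return passing / total * 100

# 950 of 1,000 customer email records contain non-null values:
print(rule_score(950, 1000))  # 95.0
```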
Multiple rules applied to a single field
When multiple quality rules are configured for the same field, the field score is calculated by aggregating all associated rule scores. This provides a comprehensive quality assessment for that field across multiple dimensions or criteria.
Field score formula (Multiple rules):
Field score = Average of all rule scores applied to the field
Example:
Consider a customer name field with three quality rules:
- Rule 1 (Completeness): Checks for non-null values = 98% score
- Rule 2 (Validity): Checks for valid name format = 96% score
- Rule 3 (Consistency): Checks for consistent capitalization = 94% score
The field score is calculated as:
Field score = (98 + 96 + 94) / 3 = 96%
This field score of 96% represents the overall quality of the customer name field across all three quality dimensions.
Multiple fields within a dataset
When a dataset contains multiple fields with quality rules, the dataset score is calculated by aggregating the field scores of those fields.
Dataset score formula (Multiple fields):
Dataset score = Average of all field scores in the dataset
Example:
Consider a customer dataset with four fields:
- Customer ID field score = 99%
- Customer Name field score = 96%
- Email field score = 92%
- Phone field score = 88%
The dataset score is calculated as:
Dataset score = (99 + 96 + 92 + 88) / 4 = 93.75%
This dataset score of 93.75% represents the overall quality of the customer dataset across all fields.
Datasource score calculation
The datasource score represents the overall quality of all datasets and schemas within a datasource. It is calculated by aggregating the scores of all datasets contained in that datasource.
Datasource score formula:
Datasource score = Average of all dataset scores in the datasource
Example:
Consider a Snowflake datasource containing three datasets:
- Customer dataset score = 93.75%
- Orders dataset score = 91.5%
- Products dataset score = 95.2%
The datasource score is calculated as:
Datasource score = (93.75 + 91.5 + 95.2) / 3 = 93.48%
This datasource score of 93.48% represents the overall quality of all data within the Snowflake datasource.
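Because each level of the hierarchy is a simple average of the level below it, the worked examples above can be reproduced with a short sketch. The function names are illustrative and not part of the product:

```python
from statistics import mean

def field_score(rule_scores):       # average of a field's rule scores
    return mean(rule_scores)

def dataset_score(field_scores):    # average of a dataset's field scores
    return mean(field_scores)

def datasource_score(dataset_scores):  # average of a datasource's dataset scores
    return mean(dataset_scores)

# Customer dataset example: four field scores roll up to 93.75%.
customer = dataset_score([99, 96, 92, 88])
print(customer)  # 93.75

# Snowflake datasource example: three dataset scores roll up to ~93.48%.
snowflake = datasource_score([customer, 91.5, 95.2])
print(round(snowflake, 2))  # 93.48
```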
Complex scoring scenarios
In real-world implementations, quality scoring often involves complex scenarios with multiple rules, fields, and dimensions. This section provides guidance on how scores are calculated in these situations.
Scenario 1: Multiple rules and multiple dimensions on a single field
When a field has multiple rules assigned to different dimensions:
- Each rule generates its own score based on rule evaluation.
- Rules are grouped by their assigned dimension.
- Dimension scores are calculated as the average of rules within that dimension.
- The field score is calculated as the average of all dimension scores for that field.
Example:
Customer Email field with rules across two dimensions:
- Completeness dimension:
  - Rule: Email not null = 98%
  - Completeness Score = 98%
- Validity dimension:
  - Rule: Valid email format = 96%
  - Rule: Email domain exists = 94%
  - Validity Score = (96 + 94) / 2 = 95%
Field Score = (98 + 95) / 2 = 96.5%
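The dimension-grouping steps above can be sketched as follows; the dictionary layout and variable names are illustrative assumptions, not a product data structure:

```python
from statistics import mean

# Rule scores for the Customer Email field, grouped by assigned dimension.
rules_by_dimension = {
    "Completeness": [98],       # Email not null
    "Validity": [96, 94],       # Valid email format; email domain exists
}

# Each dimension score is the average of the rules within that dimension.
dimension_scores = {dim: mean(scores) for dim, scores in rules_by_dimension.items()}

# The field score is the average of the dimension scores.
field = mean(dimension_scores.values())
print(field)  # 96.5
```

Note that grouping by dimension changes the weighting: the single Completeness rule counts as much as the two Validity rules combined, whereas a flat average of all three rules would give 96%.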
Scenario 2: Dataset with mixed field coverage
When not all fields in a dataset have rules assigned:
- Only fields with assigned rules contribute to the dataset score calculation.
- Fields without rules are not included in the average.
- The dataset score reflects the quality of evaluated fields only.
Example:
Customer dataset with 10 fields, but only 6 have quality rules:
- Field 1 score = 99%
- Field 2 score = 96%
- Field 3 score = 92%
- Field 4 score = 88%
- Field 5 score = 94%
- Field 6 score = 91%
- Fields 7-10 = No rules assigned (not included in calculation)
Dataset Score = (99 + 96 + 92 + 88 + 94 + 91) / 6 = 93.33%
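The mixed-coverage rule — only fields with assigned rules contribute — can be sketched like this; using `None` to mark unscored fields is an assumption for illustration:

```python
from statistics import mean

# None marks a field with no rules assigned; such fields are excluded.
field_scores = {
    "field_1": 99, "field_2": 96, "field_3": 92,
    "field_4": 88, "field_5": 94, "field_6": 91,
    "field_7": None, "field_8": None, "field_9": None, "field_10": None,
}

scored = [s for s in field_scores.values() if s is not None]
print(round(mean(scored), 2))  # 93.33 — averaged over 6 fields, not 10
```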
Score updates and refresh timing
Quality scores are updated when rules are run. The timing of score updates depends on how rules are configured:
- Manual runs: Scores update immediately when you manually run a rule from the Rules page or rule details page.
- Scheduled runs: Scores update according to the schedule configured for the rule. By default, rules follow the discovery schedule.
- Cascading updates: When a rule score updates, all dependent scores (field, dimension, dataset, and datasource) are automatically recalculated.
The timestamp displayed on the Scoring page indicates when the most recent rule evaluation occurred for that asset.
Interpreting quality scores
Quality scores provide a quantitative measure of data quality, but interpretation depends on your organization's data quality standards and requirements.
General guidelines:
- 90-100%: Excellent quality. Data is suitable for critical business operations and analytics.
- 80-89%: Good quality. Data is generally suitable for most operations but may require attention in specific areas.
- 70-79%: Fair quality. Data has notable quality issues that should be addressed before use in critical operations.
- Below 70%: Poor quality. Data requires significant remediation before use in production systems.
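The guideline bands above map naturally to a small classification helper; the thresholds follow the list above, and the function itself is an illustrative sketch rather than a product feature:

```python
def quality_band(score: float) -> str:
    """Map a quality score (0-100) to the general interpretation bands."""
    if score >= 90:
        return "Excellent"
    if score >= 80:
        return "Good"
    if score >= 70:
        return "Fair"
    return "Poor"

print(quality_band(93.75))  # Excellent
```

Your organization's own thresholds may differ; treat these bands as a starting point.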
Best practices for quality scoring
To maximize the effectiveness of quality scoring in your organization, follow these best practices:
- Comprehensive rule coverage: Assign quality rules to all critical fields and datasets to ensure complete quality assessment.
- Regular rule runs: Schedule rules to run regularly to keep quality scores current and relevant.
- Threshold monitoring: Establish quality score thresholds for your organization and monitor assets that fall below acceptable levels.
- Trend analysis: Track quality score changes over time to identify improving or deteriorating data quality trends.