Profiling in the Data Integrity Suite includes categories that provide insights into data characteristics. Each category displays relevant metrics, including counts and percentages of valid, invalid, and null entries, along with visual representations such as bar charts for data distribution and value counts.
The sample categories are always displayed first and should always have data. However, the API does not validate any required fields other than `profileSetDate`.
Sample summary
| Field | Description | Source |
|---|---|---|
| Effective Date | Date of the latest set of profiling information received for the asset. | `profileSetDate` |
| Total Row Count | Total number of rows in an entire data set. | `totalCount` |
| Sample Row Count | Number of rows that are profiled and a percentage of the total count. | `sampleCount` |
| Base Type | Type of data. | `type` |
| Type Confidence | Confidence level, shown as a percentage, that the profiling results accurately reflect the specified data type, such as a date field. The percentage is displayed with two decimal places. For instance, a value of .9753 from the API is shown as 97.53% in the user interface. | `confidence` |
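
The display rules in the table above can be sketched in a few lines. This is a minimal illustration, not the product's implementation: the function names are hypothetical, and the exact layout of the Sample Row Count text (count followed by a percentage in parentheses) is an assumption. Only the field names `confidence`, `sampleCount`, and `totalCount` come from the table.

```python
def format_confidence(confidence: float) -> str:
    """Render a 0-1 confidence value as a percentage with two decimal places."""
    return f"{confidence * 100:.2f}%"


def format_sample_row_count(sample_count: int, total_count: int) -> str:
    """Render the sample count with its share of the total row count.

    The "count (percent)" layout is an assumption made for illustration.
    """
    if total_count <= 0:
        # Avoid dividing by zero when the total is missing or empty.
        return str(sample_count)
    return f"{sample_count} ({sample_count / total_count * 100:.2f}%)"


# Hypothetical profiling payload using the field names from the table.
profile = {"confidence": 0.9753, "sampleCount": 500, "totalCount": 2000}

print(format_confidence(profile["confidence"]))
print(format_sample_row_count(profile["sampleCount"], profile["totalCount"]))
```
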
Sample quality
Next to each calculated percentage, a tooltip displays the total that the percentage is relative to. This total varies by metric; for example, it can be the sample total or the total of valid entries.
| Field | Description | Source |
|---|---|---|
| Quality bar | A single horizontal bar displays counts of valid, invalid, and not populated rows from the sample data. | |
| Valid | Counts and percentages of valid values in the sample, based on Type or Semantic Type. Percentage calculated as Valid Count divided by Sample Count. | `matchCount` |
| Invalid/Outliers | Counts and percentages of invalid or outlier values in the sample. Percentage calculated as Invalid/Outliers Count divided by Sample Count. | `outlierCount` |
| Null/Blank | Counts and percentages of null or blank entries in the sample. Percentage calculated as Not Populated Count divided by Sample Count. | `nullCount` + `blankCount` |
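
The percentage calculations described in the table can be reproduced as follows. This is a sketch under stated assumptions: the flat dictionary shape of the payload and the function name `quality_breakdown` are hypothetical, while the field names and the "count divided by Sample Count" formulas come from the table.

```python
def quality_breakdown(profile: dict) -> dict:
    """Return {label: (count, percent_of_sample)} for the quality bar."""
    sample_count = profile.get("sampleCount", 0)
    counts = {
        "Valid": profile.get("matchCount", 0),
        "Invalid/Outliers": profile.get("outlierCount", 0),
        # Not Populated Count is the sum of null and blank entries.
        "Null/Blank": profile.get("nullCount", 0) + profile.get("blankCount", 0),
    }
    result = {}
    for label, count in counts.items():
        # Guard against an empty sample rather than dividing by zero.
        percent = (count / sample_count * 100) if sample_count else 0.0
        result[label] = (count, round(percent, 2))
    return result


# Hypothetical sample data for illustration only.
example = {
    "sampleCount": 200,
    "matchCount": 150,
    "outlierCount": 30,
    "nullCount": 15,
    "blankCount": 5,
}
print(quality_breakdown(example))
```
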
Sample distribution
The bar chart illustrates the distribution of samples based on their data type. Here's how it represents different types of data:
- Date/Time: The chart displays the distribution across various time points.
- String: It shows the distribution according to different string values.
- Number: The chart presents the range distribution and includes the standard deviation and mean of the values.
- Boolean: It indicates whether the values are true or false.
The chart uses green bars to represent valid data points. It also marks invalid data or outliers with red bars and represents null or blank values with gray bars.
Top values, bottom values, invalid/outliers and shapes
These categories all function similarly, displaying only when data is available. Each category is represented as a bar chart that shows both the value and its count. Adjacent to the count, a percentage is shown. This percentage is calculated by dividing the value count by the total number of samples.
- Top values: These are derived from `topK` and include the count of each in `cardinalityDetail`.
- Bottom values: These originate from `bottomK` and also include the count of each in `cardinalityDetail`.
- Invalid/Outlier values: Both the values and their counts are detailed in `outlierDetail`.
- Shapes: Both the values and their counts are detailed in `shapesDetail`.
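
The per-value percentage used by these charts (value count divided by the total number of samples) can be sketched as below. The shape of the detail fields is not specified here, so this example assumes a simple mapping of value to count; that mapping and the function name `value_percentages` are hypothetical.

```python
def value_percentages(detail: dict, sample_count: int) -> list:
    """Return (value, count, percent_of_sample) rows for a bar chart.

    `detail` is assumed to map each value to its count, which may not
    match the real structure of fields such as outlierDetail.
    """
    rows = []
    for value, count in detail.items():
        # Guard against an empty sample rather than dividing by zero.
        percent = (count / sample_count * 100) if sample_count else 0.0
        rows.append((value, count, round(percent, 2)))
    return rows


# Hypothetical outlier values and counts for illustration only.
outlier_detail = {"N/A": 12, "??": 3}
for value, count, percent in value_percentages(outlier_detail, 200):
    print(f"{value}: {count} ({percent}%)")
```
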
Statistics
Several statistics are provided via the APIs. The availability of specific statistical values partly depends on the data type. For instance, if the data type is boolean, then only Blank Count, Null Count, and Validation Regular Expression will be displayed, provided they have values.
| Label | Source |
|---|---|
| Null Count | `nullCount` |
| Blank Count | `blankCount` |
| Minimum Value | `min` |
| Maximum Value | `max` |
| Minimum Length | `minLength` |
| Maximum Length | `maxLength` |
| Mean | `mean` |
| Standard Deviation | `standardDeviation` |
| Multiline | `multiline` |
| Leading Whitespace | `leadingWhiteSpace` |
| Trailing Whitespace | `trailingWhiteSpace` |
| Leading Zero Count | `leadingZeroCount` |
| Validation Regular Expression | `regExp` |
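
The label-to-source mapping and the boolean rule described above can be sketched as a simple filter. This is an illustration only: the function name `visible_statistics` is hypothetical, and treating a missing or `None` field as "has no value" is an assumption. Rules for data types other than boolean are not described here, so the sketch applies no type-specific filtering to them.

```python
# Source field -> display label, taken from the statistics table.
STATISTIC_LABELS = {
    "nullCount": "Null Count",
    "blankCount": "Blank Count",
    "min": "Minimum Value",
    "max": "Maximum Value",
    "minLength": "Minimum Length",
    "maxLength": "Maximum Length",
    "mean": "Mean",
    "standardDeviation": "Standard Deviation",
    "multiline": "Multiline",
    "leadingWhiteSpace": "Leading Whitespace",
    "trailingWhiteSpace": "Trailing Whitespace",
    "leadingZeroCount": "Leading Zero Count",
    "regExp": "Validation Regular Expression",
}

# Only these statistics are shown for boolean data.
BOOLEAN_FIELDS = {"blankCount", "nullCount", "regExp"}


def visible_statistics(profile: dict, base_type: str) -> list:
    """Return (label, value) pairs for statistics that have values."""
    rows = []
    for field, label in STATISTIC_LABELS.items():
        if base_type.lower() == "boolean" and field not in BOOLEAN_FIELDS:
            continue
        value = profile.get(field)
        if value is None:
            # Assumption: a missing or None field means "no value".
            continue
        rows.append((label, value))
    return rows


# Hypothetical boolean column: mean is present but is not shown.
print(visible_statistics({"nullCount": 4, "blankCount": 0, "mean": 0.5}, "boolean"))
```
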