Data profiling results help to analyze a profile for various parameters including completeness and uniqueness of data. It examines the data to identify potential issues compromising data integrity and trends help to identify a pattern in profiling thereby such as systematic errors or shifts in data quality, that could indicate underlying problems impacting data accuracy.
View profiling results
The Data Profiling results page displays a comprehensive analysis of a profile, including the completeness and uniqueness of the contained data. The report is generated at the profile level as well as table and column level. At the column level, the result is displayed as per the profiling rules.
To view profiling results:
- On the main navigation menu, click .
- Click the profile name that you want to view details.
- Profiling results are displayed for the selected Data Profile.
- Under table-level profiling details, you can click the name of the column to navigate to the respective column-level and view the profiling details.
Data profiling results page
The Data Profiling Results page can be divided into three parts:
- The page header: Displays profile details
- The left pane: Displays tables and columns included in it
- The report Pane: Displays a pictorial view of the analyzed data
The page header
- Profile Name: Name of the data profile.
- Description: A brief description of the profile.
- Run Date: The date (yyyy/mm/dd), time (hh:mm:ss), and period (AM or PM) when the profile was recently run.
- User: The user's name who created the data profile.
- Duration: The time that is taken for profiling to complete.
- Rows Processed: The total number of rows profiled.Tip: You can view profile run history from the results page.
The left pane displays the tables selected for profiling, along with the columns included in it.You can click an individual table or column to view the respective results in the report pane.The report pane displays the result for the selected table or column.
The report pane
- Table: Displays table-level details.
- Column:
Displays column-level details.
These details are displayed when you select a table in the left pane. It displays the name of the table along with these details:
Table level profiling details
These details are displayed when you select a table in the left pane. It displays the name of the table along with these details:
- Completeness (%): Displays the complete and incomplete rows (in percentage) in the table.
-
Asset Details: Dipalys details of the associated assets. Asset
details may vary based on the connection type.
- Schema: Name of the schema used
- Connection: Name of the data connection used in profiling
- Type: The data connection type used in profiling
- Host: The location or address of the data source on the network
- Account: The account name used for data connection
- Warehouse: Name of the warehouse used
- Database: Name of the database used for profiling
- Table popularity: Shows the number of times the table is used in the database queries in the last 24 hours. This is the count of both manual queries run on your database instance and queries run during Profile/Observer runs.Table and column popularity is available for the following data sources: Snowflake, Redshift, and BigQuery.
-
Table Summary: It displays these details for every column in the table.
- Column: Names of all the columns in the table. You can click the column name to view column-level profiling details.
- Data Type: The data type in the column.
- Completeness (%): Completeness of records in the column.
- Uniqueness: Uniqueness of the data contained in the column.
- Popularity (Last 24 Hours): Shows the number of times the table is used in the database queries in the last 24 hours. This is the count of both manual queries run on your database instance and queries run during Profile/Observer runs.
- Null Count: Displays the number of null values in every row of the table.
- Semantic Type: Displays the detected semantic type, such as email, phone, city, first name, last name, and so on in the string in this column.
- Stats: Displays various other statistics.
- For integer: Standard Deviation, Minimum value, Average, Maximum value, Variance.
- For string: Min Length, Max Length, and Text Patterns.
- Snowflake: The table and column popularity is shown for Snowflake users having accountadmin role or roles having access to the Snowflake database. The accountadmin can grant Snowflake database access to other roles. To grant Snowflake database access to other roles, refer to Enabling Snowflake database usage for other roles.
Understanding Popularity Count Information in Different Data Warehouse
- Redshift: In Redshift, the popularity count information includes details about tables accessed or queried during the current Observer run, along with the count of queries that ran for respective tables in the past 24 hours.
- BigQuery: In BigQuery, the popularity count information might not immediately include queries executed during the current Observer run, but it does include the count of queries that ran in the past 24 hours.
- Snowflake: In Snowflake, the popularity count information doesn't include details about tables accessed or queried during the current Observer run, but it does include the count of queries that ran in the past 24 hours.
Column level profiling details
The following details are displayed when you select a column in the left pane.
- Completeness (%): The percentage of Complete, Null, and Blank detected in the column. Click the hyperlink to view the records in each categories.
-
Uniqueness: These statistics are displayed here:
- Unique: Records with no duplicates in the data source.
- Non-unique: Records having duplicates in the data source.
- Distinct: A list of all records present in your data source irrespective of those being unique or non-unique records.
-
String Analysis: These statistics are displayed here:
- Semantic Type: Detected semantic type. It will not be displayed if the semantic type is not detected in the data.
- Min Length: Minimum length of the string in the column.
- Max Length: Maximum length of the string in the column.
- Frequency Analysis: Distribution of frequency of data value for any type of column. It shows the repetitions of the data value.
- String Length: The distribution of string lengths in the selected string field. String length is the number of characters in a string.
Example for frequency analysis
Roger Gigi Gigi Gigi Garey Elena Brad BradHere:
- Roger, Garey, and Elena are unique records.
- Gigiand Brad are non-unique records.
- Roger, Gigi, Garey, Brad, and Elena are distinct records.
Profiling results for numerical data
- Numerical Analysis: These statistics are displayed here:
- Minimum: It is the minimum value for any numerical data and date.
- Maximum: It is the maximum value for any numerical data and date.
- Standard Deviation: It is a statistic that measures the dispersion of a dataset relative to its mean.
- Variance: It is the average of the squared deviation from its mean.
- Average: It is an average of the numerical values present in the column data.
-
Percentile: It is a value where an observation falls in a range of other
observations. For example, if a score falls in the 30th percentile, this means that 30
percent of all the scores recorded are lower.
For a Percentile chart, the X-axis represents the actual data, and the Y-axis divides the data into 100 parts, representing percentiles from 0 to 100. This remains the same for every percentile chart. When you hover over a point, it displays the exact percentile for that data. For example, if you had a dataset of the publication years of books, and 2008 is at the 75th percentile, it means that 75% of the books were published before 2008, and only 25% were published in or after 2008.
-
Histogram: Represents the distribution of data. For a Histogram chart, the X-axis represents the range of data values, and the Y-axis represents the count or frequency of data. When you hover over a bar, it will display the frequency. For example, if you were analyzing historical population data, the value "115,654" for "1970-1975" on the y-axis would mean that there were 115,654 people, events, or occurrences related to that specific six-year period within your dataset.
Note: The percentile value may change with every profile run. This is because of the Snowflake function APPROX_PERCENTILE used to calculate the percentile, which returns an approximated value.The histogram is not shown if the minimum and maximum values for the numerical column are equal.
Semantic
- Semantic Type: Displays the list of detected semantic type in the selected column.
- Confidence: Displays the confidence level on the detected semantic type. It is the percentage of the surety or the possibility of the data present in the column.
For example, 98% confidence in the phone number means that there is a 98% possibility that the data contains phone numbers.