Use the Data Integrity Suite's data quality pipeline to transform and execute data: improve data accuracy, correct errors, and automate processing for reliable insights.
Add transformation steps
Transforms are configured as transformation steps in a pipeline. These steps change the format, structure, or values of the data uploaded by a dataset. A pipeline may perform any number of transforms, in the order determined by the sequence of its transformation steps.
View errors in the Data Quality pipeline
View data errors
Group transformation steps
Group steps in a pipeline. You can add steps to a group, or create groups within groups.
While cleansing complex data with a large number of steps involving various functions, it can be challenging to organize your data quality pipeline. Grouping steps logically can help you efficiently manage the pipeline. You can add a group from Add Step in a data quality pipeline.
You can simplify your pipeline by grouping steps based on their functions. This provides better control and a more focused view of your pipeline, allowing you to easily navigate through the steps.
For example, suppose you have a dataset containing customer information such as names, contact details, and addresses, and you want to clean this data using multiple steps. You might use parsing steps to clean customer information, general steps to amend certain fields, addressing steps to gain location-based insights, and finally an enrich step to enhance your data. With this approach, a significant number of steps is involved in curating your dataset, each serving a different function based on the desired result. To simplify your pipeline, you can group steps with similar functions into a single group. These groups clarify which transformations the pipeline uses and help you manage them more efficiently, eliminating the need to scroll horizontally back and forth through the pipeline steps.
You can also create groups with single or multiple conditions. For a conditional group, the steps added to the group will be applied only to the fields that satisfy the group's condition.
For instance, consider a scenario where you are familiar with your dataset and know which fields require transformations. In this case, when you create a group with a specific condition, the steps added to that group are applied exclusively to the fields that meet the group's condition. For example, you can add multiple parsing steps to a group with the condition Name EQUAL Literal John; all parsing steps in the group are then applied only to fields where the name is John.
You can also select multiple steps and perform actions on a collection of steps rather than performing actions on individual steps one by one.
With a group, you can perform the following actions:
- Create multiple groups.
- Select steps and move them to a group, or add them to a new group.
- Update or delete groups.
Note the following behaviors:
- A group must contain at least one step.
- Double-click a group to open it and drill down into a focused state.
- Use drag-and-drop to reorder steps or groups.
- The output of the previous step or group is used as the input for the next step or group.
- The output preview shown for a group is the preview of the last step in the group.
- You cannot group a step and a group together into another group.
- You cannot group steps that are located on the left and right sides of a group.
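The group behaviors above can be sketched in a few lines of Python. This is not the product's implementation, just a model of the stated semantics: a group is an ordered list of steps, each step's output feeds the next, and a group's output is the output of its last step.

```python
# Minimal sketch (not the product's implementation) of grouped pipeline
# execution. A "group" is modeled as a plain list of steps.

def run(pipeline, record):
    """Apply each step (or nested group) to the record, in order."""
    for step in pipeline:
        if isinstance(step, list):       # a group: recurse into it;
            record = run(step, record)   # its output is its last step's output
        else:
            record = step(record)        # a single transformation step
    return record

# Hypothetical steps for a "parsing" group and a "general" group.
parsing_group = [lambda r: {**r, "name": r["name"].strip()},
                 lambda r: {**r, "name": r["name"].title()}]
general_group = [lambda r: {**r, "email": r["email"].lower()}]

result = run([parsing_group, general_group],
             {"name": "  amy stamp ", "email": "Amy@Example.COM"})
print(result)  # {'name': 'Amy Stamp', 'email': 'amy@example.com'}
```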
Create a new group
Add steps to a group
- Hover over the step within the group and click the Add Step icon.
- Select the step that you want to add.
Select multiple steps or groups
Select multiple steps or groups to perform a set of actions on them at once.
Move steps or groups to a group
- On the top-right, click the vertical ellipsis to access the available actions.
- Use drag-and-drop to move steps or groups into another group. Ensure the target group is expanded before you drop the selected items.
Group existing steps or groups in a pipeline
- Point to the steps or groups that you want to group together and select the checkbox on the top-left corner.
- On the top-right of the canvas, select Group.
Edit a group
- Select the group that you want to edit.
- On the top-right of the canvas, click Edit.
Delete a group
- Select the group that you want to delete.
- On the top-right, click the vertical ellipsis and select Delete.
Other available options are:
Click the vertical ellipsis at the top-right. The actions displayed depend on your selection, whether it is a step or a group.
- Add step: Adds a new step.
- Reorder: Moves the selected items to a new position in the pipeline.
- Select All: Selects all visible items in the pipeline.
- Expand/Collapse Group: Expands or collapses a group to show or hide the steps within it.
- Ungroup: Breaks the group down into its constituent steps.
Apply conditions to transformation step or a group
Apply transformations to targeted fields in the dataset by using conditions within a step or a group.
Set conditions on a transformation step or a group in a data quality pipeline. This is an ideal way to apply a transformation only to the fields that satisfy the condition. Make sure you know your dataset well enough to target the fields that you want to transform. Use conditional transformation to produce output that reflects the applied conditions. The output table includes the transformed fields as well as the fields that have no conditions applied.
For example, suppose you are using Parse Email and you want to parse emails only for users from a specific state. In this case, you can apply a condition on the Parse Email step such as the State column Equal the literal value New York. You can either enter a literal value or select a field. The result is that emails are parsed only for users whose State is New York.
You can also apply conditions to a group of transformation steps. In this case, it will output the result for each transformation based on the condition applied to the group.
- Basic conditions are used when you want to perform straightforward data evaluation based on simple conditions. While basic conditions are efficient for simple tasks, more complex scenarios that involve multiple conditions, logical operations, or advanced data transformations may require the use of advanced conditions or custom queries.
- Advanced conditions are used in more complex data evaluation scenarios when basic conditions are insufficient to meet your specific data transformation needs. These conditions provide greater flexibility and capability to work with complex data.
The choice between basic and advanced conditioning depends on the specific requirements of your data.
- You can generate conditional output for all transformation steps, except these: Copy Field, Filter Field, Rename Field, and Filter Row.
- If a group of transformation steps contains one or more steps that do not support conditions, applying a condition to the group still produces conditional output based on the condition applied to the group.
- If you have conditions on both individual steps and on a group of steps, the group condition intersects with each individual step condition using the AND operator. For example, if the group has the condition Age = '30' and a step within the group has the condition FIRSTNAME=ucase('AMY'), the overall condition returns results only for Age = '30' AND FIRSTNAME=ucase('AMY').
- If you switch from typing the advanced condition to the basic condition, your advanced conditions are lost.
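The AND intersection of a group condition and a step condition can be sketched as two predicates that must both hold before the step applies. This is an assumed model of the described semantics, not product code:

```python
# Sketch: a step inside a conditional group applies only where
# group_cond AND step_cond both evaluate to true.

def applies(record, group_cond, step_cond):
    # Overall condition = group condition AND step condition
    return group_cond(record) and step_cond(record)

group_cond = lambda r: r["Age"] == "30"
step_cond = lambda r: r["FIRSTNAME"] == "AMY".upper()  # mirrors ucase('AMY')

print(applies({"Age": "30", "FIRSTNAME": "AMY"}, group_cond, step_cond))  # True
print(applies({"Age": "31", "FIRSTNAME": "AMY"}, group_cond, step_cond))  # False
```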
To generate conditional transformation output:
- Create a new pipeline or open an existing one.
- Open the transformation step or the group to which you want to apply the condition.
- On the right pane, select the Condition tab.
- Select the Basic or Advanced condition.
- Select the condition for basic evaluation or type the condition expression for advanced evaluation.
- Click Save to apply your changes.
Basic condition
Basic condition is used when you want to perform data evaluation based on simple conditions.
- Select the combination of <field>, <operator>, <literal> and type a value that should match the condition.
- Select the combination of <field>, <operator>, <field> and select a field that should match the condition. In this case, you can compare the selected field with a field of the same data type. You can use this for all available data types.
The available operators depend on the data type of the selected field.
For example, the condition Firstname Equal Joe, where Firstname is the field, Equal is the operator, and Joe is the literal value, applies the transformation wherever the Firstname value equals Joe.
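A basic condition is essentially a field, operator, literal (or field) triple evaluated against each record. The sketch below models that evaluation; the operator names are illustrative, not the product's internal identifiers:

```python
# Sketch of basic-condition evaluation: resolve a <field>, <operator>,
# <literal> triple against one record.

OPERATORS = {
    "Equal": lambda a, b: a == b,
    "Not equal": lambda a, b: a != b,
    "Greater than": lambda a, b: a > b,
    "Null": lambda a, _: a is None,       # unary: literal is ignored
}

def evaluate(record, field, operator, literal=None):
    return OPERATORS[operator](record.get(field), literal)

record = {"Firstname": "Joe", "Age": 42}
print(evaluate(record, "Firstname", "Equal", "Joe"))  # True
print(evaluate(record, "Age", "Greater than", 30))    # True
```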
Numeric and temporal field operators
Use numeric and temporal operators to evaluate values in Integer, Long, Float, Date, DateTime, and Time fields.
| Operator | Binary/Unary | Expression true when |
|---|---|---|
| Equal | Binary | Field value is equal to a literal or field value. |
| Not equal | Binary | Field value is not equal to a literal or field value. |
| Null | Unary | Field value is null. |
| Not null | Unary | Field value is not null. |
| Greater than | Binary | Field value is greater than a literal or field value. |
| Greater than or equal | Binary | Field value is greater than or equal to a literal or field value. |
| Less than | Binary | Field value is less than a literal or field value. |
| Less than or equal | Binary | Field value is less than or equal to a literal or field value. |
String field operators
Use string operators to evaluate values in textual fields.
| Operator | Binary/Unary | Expression true when |
|---|---|---|
| Equal | Binary | Field string value matches a literal or field value. |
| Not equal | Binary | Field string value does not match a literal or field value. |
| In | Binary | Field string value matches any one of two or more literal values. (Type and press Enter to add each literal value to the box.) |
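The In operator's semantics (match any one of several literal values) can be made concrete with a trivial sketch:

```python
# Sketch of the string "In" operator: true when the field value matches
# any one of two or more literal values.

def string_in(value, literals):
    return value in literals

states = ["New York", "California"]
print(string_in("New York", states))  # True
print(string_in("Texas", states))     # False
```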
Logical field operators
| Operator | Binary/Unary | Expression true when |
|---|---|---|
| Equal | Binary | Field value is equal either to a literal True or False value or to the value in another Boolean field. |
| Null | Unary | Field value is null. |
| Not null | Unary | Field value is not null. |
Advanced condition
- When using advanced conditions, you must provide both left and right operands along with the operator.
- Field names are case sensitive in advanced conditions. Incorrect field name casing does not flag an error in the expression, and the condition is silently not applied. For example, in ucase(FirstName)=true, the condition is applied when your dataset has a FirstName field. In ucase(FIRSTNAME)=true, if your dataset has no FIRSTNAME field, the condition is not applied and no error is flagged in the expression.
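The case-sensitivity pitfall can be illustrated with a case-sensitive dictionary lookup: a mis-cased field name simply yields no match, with no error raised. This is an analogy, not the product's code:

```python
# Sketch: a case-sensitive field lookup on a missing key silently yields
# no match, mirroring how a mis-cased field name leaves the condition
# unapplied without flagging an error.

record = {"FirstName": "amy"}

def condition_applies(rec, field):
    value = rec.get(field)             # case-sensitive lookup
    if value is None:                  # mis-cased name: no error, no match
        return False
    return value.upper() == "AMY"      # mirrors ucase(field) = 'AMY'

print(condition_applies(record, "FirstName"))  # True
print(condition_applies(record, "FIRSTNAME"))  # False (no error raised)
```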
Operators and their description
| Operator | Description |
|---|---|
| = (equals) | Selects records where a field's value is equal to a specified value. |
| != (not equal to) | Selects records where a field's value is not equal to a specified value. |
| > (greater than) | Selects records where a field's value is greater than a specified value. |
| < (less than) | Selects records where a field's value is less than a specified value. |
| >= (greater than or equal to) | Selects records where a field's value is greater than or equal to a specified value. |
| <= (less than or equal to) | Selects records where a field's value is less than or equal to a specified value. |
| IN | Selects records where a field's value matches any value in a specified list. |
| IS NULL | Selects records where the specified field has a null or undefined value. For example, if you have a database of customer information and want to find all records where the "Phone Number" field is not filled in, use the condition Phone Number IS NULL. |
| IS NOT NULL | Selects records where the specified field has a defined (non-null) value. For example, to find all records where the "Email" field is filled in (that is, not null), use the condition Email IS NOT NULL. |
| AND | Logical operator that combines multiple conditions and returns results that satisfy all of them. |
| OR | Logical operator that combines conditions and returns results that satisfy at least one of them. |
| ! | Logical operator that negates a condition, returning results that do not meet the specified criteria. |
Functions and their description
| Function | Description | Usage | Example |
|---|---|---|---|
| ucase | Selects a string with uppercase (capital) letters. | Standardize the case of characters in a text field. | FIRSTNAME=ucase('AMY') |
| lcase | Selects a string with lowercase (small) letters. | Standardize the case of characters in a text field. | LASTNAME=lcase('stamp') |
| tcase | Selects a string in title case, with the first letter of each word uppercased. | Update the text in a string to title case. | tcase('Mrs Mary Smith')=Name |
| trim | Removes leading and trailing spaces from a string. | Clean up data, especially user-entered text that may have unnecessary spaces. | trim('routley')=LastName |
| ltrim | Removes leading (left-side) spaces from a string. | Clean data, ensuring consistent formatting by eliminating leading spaces. | ltrim('AMY')=FirstName |
| rtrim | Removes trailing (right-side) spaces from a string. | Standardize data by eliminating trailing spaces that might affect comparisons or display. | rtrim('stamp')=LastName |
| lpack | Left-packs a string, removing leading spaces and shifting the remaining characters to the left. | Align text to the left while eliminating unnecessary leading spaces. | lpack('AMY')=FirstName |
| rpack | Right-packs a string, removing trailing spaces and shifting the remaining characters to the right. | Align text to the right, removing trailing spaces for consistent formatting. | rpack('stamp')=LastName |
| mask | Applies a mask to a string, replacing characters with type placeholders. | Verify the data types present in the generated pattern. For example, if the pattern generated for the input is AAANN, A indicates an alphabetic character and N a numeric character. | mask(Country)='AAAANN' |
| substr | Extracts a substring from a given string based on a specified starting position and length. | Extract specific portions of text from a longer string. | FIRSTNAME=substr('TAmy',2) |
| concat | Joins two or more strings without a space. | Combine information from multiple fields or sources into a single string. | concat('AMY',' ')=FirstName |
| regexp | Applies a regular expression pattern to a string for pattern matching. | Complex pattern recognition and extraction in textual data. | regexp(Email,'@.')='@g' |
| replace | Replaces occurrences of a specified substring or pattern in a string with another substring. | Clean and update data, replacing specific elements with desired values. | replace('Mircosoft','rco','cro')=Company |
| split | Splits a string into an array of substrings based on a specified separator character. | Separate text data into individual components for analysis or processing. | split('Sweden:BasicCountry',':',0)=Country |
| length | Selects a string having the specified length (number of characters). | Check the size of a string, often for validation. | length(CollectionNumber) > 1 |
| to_number | Selects a numeric string having a specified numerical value. | Transform string representations of numbers into numeric formats for mathematical operations. | to_number('100')=CustomerId |
| to_int | Selects a double value matching the specified integer value. | Maintain data consistency and facilitate numerical calculations. | to_int(22.53)=Currency |
| occstr | Returns the position of the first occurrence of a specified substring in a string. | Locate specific substrings within text data. | occstr(Email,'.','L')=3 |
| now | Selects the current date and time. | Provide a dynamic timestamp for real-time data capture or time-sensitive operations. | now()=Totaltime |
| isalpha | Selects a string containing only alphabetic characters. | Verify that a text field consists solely of letters. | isalpha('test') |
| isnumeric | Selects a string containing only numeric characters. | Validate numeric data integrity in text fields. | isnumeric('100')!=CustomerId |
| isnull | Selects fields with no value. | Check fields for missing or unknown data. | isnull(NAME) |
| isempty | Selects an empty string. | Check that text fields are not null or empty. | isempty('')!=PostalCode |
| contains | Selects a string containing a specific substring. | Search for a particular set of characters within a larger string. | contains('@', Email) != 0 |
| to_date | Selects the specified ISO format date (YYYY-MM-DD). | Transform date representations in string format for date-related operations. | to_date('1973-06-05') = Date |
| to_time | Selects the specified ISO format time (HH:MM:SS). | Transform time representations in string format for time-related calculations. | to_time('09:04:43')=Time |
| to_datetime | Selects the specified ISO format datetime (YYYY-MM-DDTHH:MM:SS). | Integrate date and time information from string formats into datetime values for temporal analysis. | to_datetime('1973-06-05T09:04:43')=DateTime |
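To make a few of the string functions concrete, here are approximate Python equivalents; names and exact semantics in the product may differ:

```python
# Approximate Python equivalents of several condition functions.
# These illustrate behavior only; they are not the product's implementation.

def ucase(s): return s.upper()          # uppercase, like ucase('amy')
def trim(s): return s.strip()           # remove leading/trailing spaces
def tcase(s): return s.title()          # approximate title case

def mask(s):
    # Replace letters with 'A' and digits with 'N', as in mask(Country)='AAAANN'
    return "".join("A" if c.isalpha() else "N" if c.isdigit() else c
                   for c in s)

print(ucase("amy"))             # AMY
print(trim("  stamp  "))        # stamp
print(tcase("mrs mary smith"))  # Mrs Mary Smith
print(mask("SE123"))            # AANNN
```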
After applying the condition, click Preview to review the results. If the results meet your expectations, click Save to preserve the conditions for the transformation step. The output will then be generated based on your selected transformation step and conditions.
AI Assist
- Configure any transformation step.
- Once the properties are configured, navigate to Condition and click the AI Assist icon.
- In the dialog box, you can enter the rule in simple text. For example: Date Created and Date Modified should not differ by more than 3650 days.
- Click Generate to produce the pass condition.
- Review the output; you can then apply it to the pipeline condition.
Add multiple inputs to quality pipeline
Adding multiple inputs to a Data Quality pipeline allows you to include multiple datasets from different sources or with different structures. By merging these inputs, you can process and combine the data as needed.
Add multiple inputs while building your Data Quality pipeline, as either simple or complex inputs. A simple input uses data directly from a database connection; a complex input requires additional steps to process the data before it is ready for merging. Multiple inputs are merged using either unions or joins. An output step is added for each input, and you set up runtime configurations for each output.
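As a rough illustration of the two merge styles, here is a plain-Python sketch using lists of dicts; in the product, the merge is configured in the pipeline rather than written as code:

```python
# Sketch: union stacks rows from inputs with a shared structure; join
# matches rows from structurally different inputs on a key. Field names
# here are hypothetical.

customers = [{"id": 1, "name": "Amy"}, {"id": 2, "name": "Joe"}]
more_customers = [{"id": 3, "name": "Mary"}]
orders = [{"id": 1, "total": 250}]

# Union: append rows from inputs that share the same columns.
union = customers + more_customers

# Join: combine rows from two inputs where the key values match.
join = [{**c, **o} for c in customers for o in orders if c["id"] == o["id"]]

print(len(union))  # 3
print(join)        # [{'id': 1, 'name': 'Amy', 'total': 250}]
```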
To add multiple inputs:
Select columns in pipeline step
There are several methods, using the mouse or keyboard, to select columns for a pipeline step. When you edit the settings for a transform step, use any of these methods to add columns to the Columns box.
Rename step
This procedure describes how to quickly rename a step in a pipeline without editing the step.
- In the quality pipeline, point to the step that you want to rename.
- Click the step name.
- Edit the name and press Enter.
Preview step
You can quickly preview the outcome of any step in a pipeline without opening the step settings.
Edit transformation step
Edit the transform settings for a step in a pipeline to change its behavior. You can also reorder or delete transformation steps in a pipeline.
You can drag a transformation step to the left or right to change the order in which it is performed. Alternatively, click the vertical ellipsis on the top-right and select the appropriate option.