When you configure Match and Group step settings, you can select from exact, fuzzy, phonetic, and numeric algorithms.
Different data matching algorithms are used depending on the nature of the data to be compared. Data matching algorithms are categorized as exact, fuzzy, phonetic, or numeric.
- Exact: Rejects any differences and matches only when two values are identical, character for character.
- Fuzzy: Tolerates differences and uses probabilistic matching to evaluate the likelihood that two strings are similar. These include industry-leading techniques such as Levenshtein Distance (Edit Distance), Jaro-Winkler Distance, and Metaphone 3.
- Phonetic: Matches words by their pronunciation. This is useful for matching similar strings. Phonetic strings can be exact or fuzzy.
- Numeric: Used to run a probabilistic match on numeric fields.
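The difference between the exact and fuzzy categories can be illustrated with a minimal Python sketch. It compares two values first with strict equality and then with Levenshtein (edit) distance. The function names and the 0 to 100 score normalization are assumptions made for illustration; they are not the scoring used by the Match and Group step.

```python
def exact_match(a: str, b: str) -> bool:
    """Exact matching: any difference at all is a non-match."""
    return a == b


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or substitutions to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def similarity_score(a: str, b: str) -> float:
    """Hypothetical 0-100 score: 100 means identical, 0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(exact_match("Jon", "John"))       # False
print(levenshtein("Jon", "John"))       # 1
print(similarity_score("Jon", "John"))  # 75.0
```

An exact comparison rejects "Jon" and "John" outright, while the fuzzy score leaves it to a threshold to decide whether one inserted character is close enough.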
Algorithms supported by the Match and Group step
| Algorithm | Category | Description |
|---|---|---|
| Acronym | String | Determines whether a business name matches its acronym by looking for acronym data; if none is found, it creates an acronym from the first character of every word. |
| Character Frequency | String | Determines the frequency of occurrence of each character in a string and compares the overall frequencies between two strings. |
| Consonant | Exact | Only consonants are compared. Vowels are removed from the comparison. It returns a match if consonants from two values match exactly. Results may not be accurate if the data contains multibyte characters. |
| Daitch-Mokotoff Soundex | Phonetic | Phonetic algorithm that allows greater accuracy in matching of Slavic and Yiddish surnames with similar pronunciation but differences in spelling. Coded names are six digits long, and multiple possible encodings can be returned for a single name. This option was developed to respond to limitations of Soundex in the processing of Germanic or Slavic surnames. |
| Date | Date | Compares date fields regardless of the date format in the input records. Click Edit in the Options column to specify the following: |
| Double Metaphone | Fuzzy | Determines the similarity between two strings based on a phonetic representation of their characters. Double Metaphone is an improved version of the Metaphone algorithm, and attempts to account for the many irregularities found in different languages. Metaphone 3 improves upon this algorithm. |
| Edit Distance | Similarity and Distance | Determines the similarity between two strings based on the number of deletions, insertions, or substitutions required to transform one string into another. |
| Euclidean Distance | Similarity and Distance | Provides a similarity measure between two strings using the vector space of combined terms as the dimensions. The strings are treated as points in that space, and the straight-line distance between them determines how similar they are. |
| Exact match | String | Determines if two strings are the same. |
| Initials | String | Used to match initials for parsed personal names. |
| Jaro-Winkler Distance | Similarity and Distance | Determines the similarity between two strings based on the number of character replacements it takes to transform one string into another. This option was developed for short strings, such as personal names. |
| Keyboard Distance | Similarity and Distance | Determines the similarity between two strings based on the number of deletions, insertions, or substitutions required to transform one string to the other, weighted by the position of the keys on the keyboard. Click Edit in the Options column to specify the type of keyboard you are using: QWERTY (U.S.), QWERTZ (Austria and Germany), or AZERTY (France). |
| Koeln | Phonetic | Indexes names by sound as they are pronounced in German. Allows names with the same pronunciation to be encoded to the same representation so that they can be matched, despite minor differences in spelling. The result is always a sequence of numbers; special characters and white spaces are ignored. This option was developed to respond to limitations of Soundex. |
| Kullback-Leibler Distance | Similarity and Distance | Determines the similarity between two strings based on the differences between the distributions of words in the two strings. |
| Metaphone | Fuzzy | Determines the similarity between two English-language strings based on a phonetic representation of their characters. This option was developed to respond to limitations of Soundex. |
| Metaphone (Spanish) | Phonetic | Determines the similarity between two strings based on a phonetic representation of their characters. This option was developed to respond to limitations of Soundex. |
| Metaphone 3 | Fuzzy | Improves upon the Metaphone and Double Metaphone algorithms with more exact consonant and internal vowel settings that allow you to produce words or names more or less closely matched to search terms on a phonetic basis. Metaphone 3 increases the accuracy of phonetic encoding to 98% by allowing for differences in spelling due to dialects or pronunciation. This algorithm was developed to respond to limitations of Soundex. |
| NGram Distance | Similarity and Distance | Calculates in text or speech the probability of the next term based on the previous n terms, which can include phonemes, syllables, letters, words, or base pairs and can consist of any combination of letters. This algorithm includes an option to enter the size of the NGram; the default is 2. |
| NGram Similarity | Similarity and Distance | Determines similarity between two strings based on the length of the longest common subsequence of phonemes, syllables, letters, words, or base pairs. The algorithm includes the following options: |
| Numeric String | String | Compares address lines by separating the numerical attributes of an address line from the characters. For example, in the string address 1234 Main Street Apt 567, the numerical attributes of the string (1234567) are parsed and handled differently from the remaining string value (Main Street Apt). The algorithm first matches numeric data in the string with the numeric algorithm. If the numeric data match is 100, the alphabetic data is matched using Edit Distance and Character Frequency. The final match score is calculated as follows: |
| Spanish Metaphone | Fuzzy | Determines the similarity between two Spanish-language strings based on a phonetic representation of their characters. This option was developed to respond to limitations of Soundex. |
| Nysiis | Phonetic | Phonetic code algorithm that matches an approximate pronunciation to an exact spelling and indexes words that are pronounced similarly. Part of the New York State Identification and Intelligence System. Say, for example, that you are looking for someone's information in a database of people. You believe that the person's name sounds like "John Smith", but it is in fact spelled "Jon Smath". If you conducted a search looking for an exact match for "John Smith", no results would be returned. However, if you index the database using the NYSIIS algorithm and search using the NYSIIS algorithm again, the correct match will be returned because both "John Smith" and "Jon Smath" are indexed as "JANSNATH" by the algorithm. This option was developed to respond to limitations of Soundex; it handles some multicharacter n-grams and maintains relative vowel positioning, whereas Soundex does not. Note: This algorithm does not process non-alphabetic characters; records containing them will fail during processing. |
| Phonix | Phonetic | The Phonix algorithm is a Soundex variant. While the Soundex phonetic property is restricted to the collection of similar-sounding consonants into different classes, the algorithm for computing Phonix codes uses elaborate substitution rules. It preprocesses name strings by applying more than 100 rules to single characters or to sequences of several characters: 19 of those rules are applied only at the beginning of the string, 12 only in the middle of the string, and 28 only at the end of the string. The name string is then encoded into a code that consists of a starting letter followed by three digits (removing zeros and duplicate numbers). Phonix is more sophisticated than Soundex, but also more complex and therefore slower. |
| Sonnex | Phonetic | Determines similarity for two French-language strings based on a phonetic representation of their characters. It returns a Sonnex coded key of the selected fields. |
| Soundex | Phonetic | Determines the similarity of two strings based on a phonetic representation of their characters. This is less sophisticated than Phonix. |
| SubString | Exact | Compares two values based on a substring within the values. This is useful for determining similarity based on a substring, particularly where those values contain long strings of characters or many words. |
| Syllable Alignment | Phonetic | Combines phonetic information with edit distance-based calculations. Converts the strings to be compared into their corresponding sequences of syllables and calculates the number of edits required to convert one sequence of syllables to the other. |
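
Several phonetic algorithms in the table are described as responses to limitations of Soundex, so it helps to see what Soundex itself does. The following Python sketch implements the classic American Soundex rules (first letter plus three digits). It is an illustration under that assumption and is not necessarily the exact variant used by the Match and Group step; the names `CODES` and `soundex` are hypothetical.

```python
# Consonant classes of classic American Soundex (assumed variant).
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return a four-character Soundex key, or "" if name has no A-Z letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    last_code = CODES.get(first, "")  # the first letter's class also suppresses repeats
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters of the same class
        code = CODES.get(c, "")  # vowels and Y have no code
        if code and code != last_code:
            digits.append(code)
        last_code = code  # a vowel resets the class, so a repeat after it is kept
    return (first + "".join(digits) + "000")[:4]  # pad with zeros, cut to 4 chars


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Smith"), soundex("Smyth"))    # S530 S530
print(soundex("Tymczak"))                    # T522
```

Two values are treated as a phonetic match when their keys are equal. Because the key depends only on coarse consonant classes, very different names can collide, which is the kind of limitation the later algorithms in the table try to address.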