4.9. Enrichment node
4.9.1. Description
Number of inputs: 2.
Number of outputs: 1.
- Definition
An enrichment node uses fuzzy matching to add new fields to a dataset (the “enriched dataset”, or dataset 1) from an enriching dataset (dataset 2, connected by a blue link).
- Configuration
At input, the enrichment node must be connected to exactly 2 nodes, in the following order:
The enriched dataset: to which new fields are to be added.
The enriching dataset (connected by a blue link): this is the dataset that will contribute the new fields.
In the configurator, click Add a match condition to configure each match condition for the enrichment.
Match conditions are treated as a logical AND and must all be met.
Strict equality:
the match will not be validated unless the values in the two matched cells are the same (the first cell must belong to the enriched dataset and the second to the enriching dataset).
Ignore case and accents:
the match will not be validated unless the values in the two matched cells are the same, apart from upper/lower case and accented character differences (both matched fields must be text type).
French phonetics:
the match will not be validated unless the values in the two matched cells are pronounced identically in French (both matched fields must be text type).
English phonetics:
the match will not be validated unless the values in the two matched cells are pronounced identically in English (both matched fields must be text type).
Fuzzy – max 1 difference:
the match will not be validated unless the values in the two matched cells have a Levenshtein [1] distance less than or equal to 1 (both matched fields must be text type).
Fuzzy – max 2 differences:
the match will not be validated unless the values in the two matched cells have a Levenshtein distance less than or equal to 2 (both matched fields must be text type).
Closest:
this condition, which cannot be used alone, breaks the tie between two enriching dataset records that both meet the other match conditions. Priority is given to the record whose value is closest to the value in the enriched dataset (both matched columns must be continuous type, either numeric or date).
Full text:
this condition only makes sense when both matched cells contain several words (usually a short text). The match algorithm will match a cell in the enriched dataset with the cell in the enriching dataset that shares the largest number of words with it (both matched fields must be text fields).
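The match conditions above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the product's implementation: the function names (fold, levenshtein, record_matches, best_full_text, closest) and the rule format are hypothetical, the fuzzy conditions are assumed to compare the raw text, and the full-text condition is assumed to split words on whitespace. The phonetic conditions are left out.

```python
import unicodedata


def fold(text):
    """Strip accents and case so that 'Hélène' and 'HELENE' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


# Hypothetical condition names; each takes (enriched value, enriching value).
CONDITIONS = {
    "strict": lambda a, b: a == b,
    "ignore_case_accents": lambda a, b: fold(a) == fold(b),
    "fuzzy_1": lambda a, b: levenshtein(a, b) <= 1,
    "fuzzy_2": lambda a, b: levenshtein(a, b) <= 2,
}


def record_matches(enriched, enriching, rules):
    """All rules must hold (logical AND).

    rules: list of (enriched_field, enriching_field, condition_name).
    """
    return all(CONDITIONS[name](enriched[f1], enriching[f2])
               for f1, f2, name in rules)


def best_full_text(text, candidates):
    """Full text: pick the candidate sharing the most words with text."""
    words = set(fold(text).split())
    return max(candidates, key=lambda c: len(words & set(fold(c).split())))


def closest(value, candidates):
    """Closest: tie-break on the numeric value nearest to value."""
    return min(candidates, key=lambda c: abs(c - value))
```

For instance, levenshtein("Jacky", "Jackie") is 2 (one substitution and one insertion), so that pair would pass "Fuzzy – max 2 differences" but not "Fuzzy – max 1 difference".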
Select at least one field in the enriching dataset to be retrieved into the enriched dataset.
Three additional settings are available when configuring an enrichment node:
The confidence score measures the reliability of a fuzzy join.
It ranges from 0 (unreliable join, because of major differences between the joined fields) to 1 (very reliable join, because all joined fields are identical).
The maximum number of matches is the maximum number of enriching dataset records that can be matched with one enriched dataset record. For example, with a value of 3, one enriched dataset record can produce up to 3 matched records in the result dataset, so that the 3 best matches found are kept.
Prefix for retrieved columns:
this option lets you specify a prefix text for all the new columns retrieved after enrichment.
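As a rough illustration of how these three settings interact, the sketch below ranks candidate records by a score, keeps the best ones, and prefixes the retrieved columns. The product's confidence score formula is not given here, so difflib.SequenceMatcher.ratio() is used purely as a stand-in that also returns 1 for identical values; the names score and enrich, the "enr_" prefix and the sample records are all hypothetical.

```python
from difflib import SequenceMatcher


def score(a, b):
    """Stand-in confidence score in [0, 1]; 1.0 when a and b are identical."""
    return SequenceMatcher(None, a, b).ratio()


def enrich(record, candidates, key, retrieved, max_matches=3, prefix="enr_"):
    """Return up to max_matches result rows for one enriched record.

    candidates: enriching records assumed to have already met the
    match conditions; they are ranked by score, best first.
    """
    ranked = sorted(candidates,
                    key=lambda c: score(record[key], c[key]),
                    reverse=True)
    rows = []
    for cand in ranked[:max_matches]:
        row = dict(record)                       # keep the original fields
        for field in retrieved:
            row[prefix + field] = cand[field]    # prefixed retrieved column
        row[prefix + "score"] = score(record[key], cand[key])
        rows.append(row)
    return rows


# Made-up sample data, for illustration only.
record = {"name": "Jacky Hubert"}
candidates = [
    {"name": "Jackie Hubert", "city": "Lyon"},
    {"name": "Jacky Hubert", "city": "Paris"},
    {"name": "Jack Hubert", "city": "Lille"},
    {"name": "J. Hubert", "city": "Nice"},
]
rows = enrich(record, candidates, key="name", retrieved=["city"])
```

With max_matches set to 3, only three of the four candidates produce a row, and the identical name comes first because its score is 1.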
At the bottom of the enrichment node configurator, click Enrichment Statistics for information on data enrichment performance.
The 3 tabs show a sample of enriched dataset records that have one, multiple or no match(es) in the enriching dataset.
Tip
If your enriching dataset is big (several hundred thousand to several hundred million records), use Tale of Data repositories rather than an enrichment node.
4.9.2. Example
In this example, dataset 1 is enriched by dataset 2.
The first record in the table above shows that Jacky Hubert was matched with Jackie Hubert (phonetic match) and that the similarity ratio is 0.8933 (because of the different spellings of Jacky and Jackie).
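To see why Jacky and Jackie can count as a phonetic match, here is a minimal sketch based on the classic Soundex code. The phonetic algorithm actually used by the enrichment node is not stated in this section, so Soundex is only an assumed stand-in for the English phonetics condition, and the helper names soundex and phonetic_equal are hypothetical.

```python
def soundex(word):
    """American Soundex code: first letter followed by three digits."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue              # h and w do not separate equal codes
        code = codes.get(ch, "")  # vowels (and y) get no code
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]   # pad or truncate to 4 characters


def phonetic_equal(a, b):
    """Compare two values word by word on their Soundex codes."""
    return [soundex(w) for w in a.split()] == [soundex(w) for w in b.split()]


print(soundex("Jacky"), soundex("Jackie"))              # J200 J200
print(phonetic_equal("Jacky Hubert", "Jackie Hubert"))  # True
```

Both spellings reduce to the same code, so the two names match phonetically even though the spellings differ, which is consistent with a similarity ratio below 1.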