4.9. Enrichment node

4.9.1. Description

Icon: image135

  • Number of inputs: 2.

  • Number of outputs: 1.

Definition

An enrichment node uses fuzzy matching to add new fields to a dataset (the “enriched dataset”, or dataset 1) from an enriching dataset (dataset 2, connected by a blue link).

Configuration

The enrichment node must have exactly 2 input nodes, connected in the following order:

  1. The enriched dataset: the dataset to which new fields are added.

  2. The enriching dataset (connected by a blue link): the dataset that contributes the new fields.

In the configurator, click Add a match condition image136 to configure each match condition image137 for the enrichment.

Match conditions are treated as a logical AND and must all be met.

Match options image138 are:

  • Strict equality:

    the match will not be validated unless the values in the two matched cells are the same (the first cell must belong to the enriched dataset and the second to the enriching dataset).

  • Ignore case and accents:

    the match will not be validated unless the values in the two matched cells are the same, apart from upper/lower case and accented character differences (both matched fields must be text type).

  • French phonetics:

    the match will not be validated unless the values in the two matched cells are pronounced identically in French (both matched fields must be text type).

  • English phonetics:

    the match will not be validated unless the values in the two matched cells are pronounced identically in English (both matched fields must be text type).

  • Fuzzy – max 1 difference:

    the match will not be validated unless the values in the two matched cells have a Levenshtein [1] distance below or equal to 1 (both matched fields must be text type).

  • Fuzzy – max 2 differences:

    the match will not be validated unless the values in the two matched cells have a Levenshtein distance below or equal to 2 (both matched fields must be text type).

  • Closest:

    this condition, which cannot be used alone, decides between several enriching dataset records that all meet the other match conditions. Priority is given to the record whose value is closest to the value in the enriched dataset (both matched columns must be of a continuous type, either numeric or date).

  • Full text:

    this condition only makes sense when both matched cells contain several words (usually a short text). The match algorithm will match a cell in the enriched dataset with the cell in the enriching dataset that shares the largest number of words with it (both matched fields must be text fields).
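
The text-based match options above can be sketched in Python as follows. This is only an illustration of the general techniques (accent stripping, Levenshtein distance, word overlap), not the product's actual implementation. All function names are hypothetical, and applying the Levenshtein distance to the raw cell values rather than to normalized values is an assumption.

```python
import unicodedata


def normalize(text: str) -> str:
    """Strip accents (combining marks) and fold case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def strict_equality(left, right) -> bool:
    return left == right


def ignore_case_and_accents(left: str, right: str) -> bool:
    return normalize(left) == normalize(right)


def fuzzy(left: str, right: str, max_differences: int) -> bool:
    # Assumption: the distance is computed on the raw cell values.
    return levenshtein(left, right) <= max_differences


def shared_words(left: str, right: str) -> int:
    """Number of distinct words two short texts have in common (full-text style)."""
    return len(set(normalize(left).split()) & set(normalize(right).split()))
```

With these definitions, `fuzzy("Jacky", "Jackie", 1)` is false and `fuzzy("Jacky", "Jackie", 2)` is true, because turning one spelling into the other takes one substitution and one insertion.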

Select at least one field from the enriching dataset to add to the enriched dataset image139.

Three additional settings are available when configuring an enrichment node:

  • The confidence score image140 measures the reliability of a fuzzy join.

    The score ranges from 0 (unreliable join, because of major differences between the joined fields) to 1 (very reliable join, because all joined fields are identical).

  • Max number of matches image141:

    the maximum number of enriching dataset records that can be matched with a single enriched dataset record. For example, with a value of 3, one enriched dataset record can produce up to 3 matched records in the result dataset, so that the 3 best matches found are kept.

  • Prefix for retrieved columns image142:

    this option lets you specify a prefix text for all the new columns retrieved after enrichment.
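
The interplay of AND-combined match conditions, the "Closest" priority rule, the max number of matches and the column prefix can be sketched as below. This is a minimal illustration under stated assumptions, not the node's actual algorithm: the `enrich` function, its parameters and the sample records are hypothetical, the ranking function stands in for the product's own scoring, and filling retrieved fields with `None` for unmatched records is an assumption.

```python
def enrich(enriched, enriching, conditions, retrieved, rank, max_matches=1, prefix=""):
    """Sketch of an enrichment join.

    conditions: list of (enriched_field, enriching_field, predicate), combined with AND.
    rank: function(record, candidate) -> number, smaller means a better match.
    """
    rows = []
    for record in enriched:
        # Keep only candidates that satisfy every match condition.
        candidates = [
            other for other in enriching
            if all(test(record[left], other[right]) for left, right, test in conditions)
        ]
        # Best candidates first, then keep at most max_matches of them.
        candidates.sort(key=lambda other: rank(record, other))
        best = candidates[:max_matches]
        if not best:
            # Assumption: unmatched records are kept, with empty retrieved fields.
            rows.append({**record, **{prefix + f: None for f in retrieved}})
        for other in best:
            rows.append({**record, **{prefix + f: other[f] for f in retrieved}})
    return rows


enriched = [
    {"name": "Hubert", "amount": 100},
    {"name": "Martin", "amount": 50},
]
enriching = [
    {"name": "Hubert", "amount": 90, "city": "Lyon"},
    {"name": "Hubert", "amount": 101, "city": "Paris"},
    {"name": "Hubert", "amount": 300, "city": "Lille"},
]

rows = enrich(
    enriched,
    enriching,
    conditions=[("name", "name", lambda a, b: a == b)],     # strict equality
    retrieved=["city"],
    rank=lambda rec, other: abs(rec["amount"] - other["amount"]),  # "closest" value wins
    max_matches=2,
    prefix="ref_",
)
for row in rows:
    print(row)
```

Here the Hubert record yields two result rows (the two candidates with the closest amounts), each carrying a new `ref_city` column, while the Martin record has no match.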

image143

At the bottom of the enrichment node configurator, click Enrichment Statistics image144 for information on data enrichment performance.

image145

The 3 tabs show a sample of enriched dataset records that have one, multiple or no match(es) in the enriching dataset.

Color code:

  • image146 enriched dataset field.

  • image147 enriching dataset field.

Tip

If your enriching dataset is big (several hundred thousand to several hundred million records), use Tale of Data repositories rather than an enrichment node.

4.9.2. Example

image148

In this example, dataset 1 is enriched by dataset 2.

image149

The first record in the table above shows that Jacky Hubert was matched with Jackie Hubert (phonetic match) and that the similarity ratio is 0.8933 (because of the different spellings of Jacky and Jackie).