4.9. Enrichment node

If you want to learn more about this feature, an e-learning tutorial is available below:

Enrich your data using fuzzy logic

../../_images/N_EN_12.png

This video shows the use of custom fuzzy logic in the enrichment node of Tale of Data.

Custom fuzzy matching in the enrichment node

../../_images/N_EN_12.png

This video shows how to use the custom fuzzy matching tool in the enrichment node within Tale of Data. This allows, for example, to detect duplicates in your data, or to group data together based on approximate matching criteria. You can overcome common issues with manually entered data or data based on verbal information.

4.9.1. Description

Icon: image135

  • Number of inputs: 2.

  • Number of outputs: 1.

Definition

An enrichment node uses fuzzy matching to add new fields to a dataset (“enriched dataset* or dataset 1) from an enriching dataset (= dataset 2, connected by a blue link).

Configuration

At input the enrichment node must be connected to precisely 2 nodes in the following order:

  1. The enriched dataset: to which new fields are to be added.

  2. The enriching dataset (connected by a blue link): this is the dataset that will contribute the new fields.

In the configurator, click Add a match condition image136 to configure each match condition image137 for the enrichment.

Match conditions are treated as a logical AND and must all be met.

Match options image138 are:

  • Strict Equal:

    the match will not be validated unless the values in the two matched cells are the same (the first cell must belong to the enriched dataset and the second to the enriching dataset).

  • Ignore Case & Diatrics:

    the match will not be validated unless the values in the two matched cells are the same, apart from upper/lower case and accented character differences (both matched fields must be text type).

  • French Phonetics:

    the match will not be validated unless the values in the two matched cells are pronounced identically in French (both matched fields must be text type).

  • English Phonetics:

    the match will not be validated unless the values in the two matched cells are pronounced identically in English (both matched fields must be text type).

  • Fuzzy – 1 Difference Max:

    the match will not be validated unless the values in the two matched cells have a Levenshtein [1] distance below or equal to 1 (both matched fields must be text type).

  • Fuzzy – 2 Differences Max :

    the match will not be validated unless the values in the two matched cells have a Levenshtein distance below or equal to 2 (both matched fields must be text type).

  • Nearest:

    this condition, which cannot be used alone, allows two enriching dataset records to be shared so long as the other matching conditions have been met. Priority will be given to the one closest to the values in the two matched cells (both matched columns must be continuous type, either numeric or date).

  • Full text:

    this condition only makes sense when both matched cells contain several words (usually a short text). The match algorithm will match a cell in the enriched dataset with the cell in the enriching dataset that shares the largest number of words with it (both matched fields must be text fields).

Select at least one field in the enriching dataset that is to be retrieved in the enriched dataset image139.

Three additional settings are available when configuring an enrichment node:

  • The confidence score image140 measures the reliability of a fuzzy join.

    This ranges from 0 (= unreliable join because of major differences between the joined fields) to 1 (= very reliable join because all joined fields are identical).

  • Max number of matches image141:

    this is the max number of enriching dataset records matched with an enriched dataset record. For example, if assigned a value of 3, one enriched dataset record can create up to 3 matched records in the result dataset so that the 3 best matches found can be kept.

  • Prefix for retrieved columns image142:

    this option lets you specify a prefix text for all the new columns retrieved after enrichment.

image143

At the bottom of the enrichment node configurator, click Enrichment Statistics image144 for information on data enrichment performance.

image145

The 3 tabs show a sample of enriched dataset records that have one, multiple or no match(es) in the enriching dataset.

Color code:

  • image146 enriched dataset field.

  • image147 enriching dataset field.

Tip

If your enriching dataset is big (several hundred thousand to several hundred million records), use Tale of Data repositories rather than an enrichment node.

Visual Example

image148

In this example, dataset 1 is enriched by dataset 2.

image149

The first record in the table above shows that Jacky Hubert was matched with Jackie Hubert (phonetic match) and that the similarity ratio is 0.8933 (because of the different spellings of Jacky and Jackie).

Practical examples

Note

Example 1: Enriching Customer Data with English Phonetic Matching

Before Operation :

  • Dataset 1 (Enriched Dataset - Products)

Customer ID

Customer Name

Country

001

John Doe

USA

002

Jane Smith

Canada

  • Dataset 2 (Enriching Dataset - Reviews):

Order ID

Full Name

Order Amount

Review Date

O001

John Doe

100

2023-03-01

R002

Jane Smith

200

2023-02-01

Node Configuration:

  • Fuzzy – 2 Differences Max :

  • Match Condition: Match “Product Name” from Dataset 1 with “Reviewed Product” from Dataset 2

  • Fields to Enrich: Add “Review ID”, “Rating”, “Review Date” to Dataset 1

After Operation:

  • Enriched Dataset (Products with Reviews) :

Customer ID

Customer Name

Country

Order ID

Order Amount

Review Date

001

John Doe

USA

O001

100

2023-03-01

002

Jane Smith

Canada

R002

200

2023-02-01

Note

Example 2: Enriching Product Listings Using Fuzzy Matching (Levenshtein Distance 2)

Before Operation :

  • Dataset 1 (Enriched Dataset - Products)

Product ID

Product Name

Price

P001

Portable Speaker

50

P002

Wireless Earbuds

70

  • Dataset 2 (Enriching Dataset - Reviews):

Review ID

Reviewed Product

Rating

Review Date

R001

Portable Speeker

4.5

2023-03-01

R002

Wireless Earbudz

4.7

2023-03-02

Node Configuration:

  • Fuzzy – 2 Differences Max :

  • Match Condition: Match “Product Name” from Dataset 1 with “Reviewed Product” from Dataset 2

  • Fuzzy Matching Parameters: Maximum Differences = 2

  • Fields to Enrich: Add “Review ID”, “Rating”, “Review Date” to Dataset 1

After Operation:

  • Enriched Dataset (Products with Reviews) :

Product ID

Product Name

Price

Review ID

Rating

Review Date

P001

Portable Speaker

50

R001

4.5

2023-03-01

P002

Wireless Earbuds

70

R002

4.7

2023-03-02