9. Tale of Data repositories

Repositories let you repair or enrich datasets using sophisticated matching algorithms.

9.1. Creating a repository

To create a repository you must create a flow.

A flow that contains a repository can include all the standard processors needed to prepare the repository data, but it can contain only one repository processor, which is a particular type of sink.

Repository processors let you create a repository that can be reused and shared to reconcile data.

Example of a flow creating a repository:

9.1.1. Configuring a repository

The name of the repository must be entered in the processor configurator.

A description is optional but is strongly advised because all repository users will be able to see it.

Repositories have three view levels:

Public.

Shared with my organization.

Private.

9.1.2. Search fields

Search fields are the fields that repository users reconcile their data against, so it is important to select them carefully when creating a repository.

For example, in a repository of persons, name will naturally be a search field.

To add a search field, click Add a search column in the repository node configurator.

You can configure the search field in the dialogue window that opens:

You must specify the search field (postcode in the example above).

Full text lets you specify whether searches will accept a subset of words in the field.

Fuzzy lets you specify whether typos are to be tolerated.

Phonetics lets you specify whether searches should also match words with similar pronunciations (3 options):

French.

English.

No phonetics.
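The full text and fuzzy options can be pictured with the short Python sketch below. It is only an illustration of the two ideas and not the matching logic Tale of Data actually uses: the function names and the 0.8 similarity threshold are assumptions, and the standard library's difflib stands in for whatever typo-tolerant comparison the product applies.

```python
from difflib import SequenceMatcher


def full_text_match(query: str, field_value: str) -> bool:
    """True when every word of the query appears among the words of the field."""
    query_words = set(query.lower().split())
    field_words = set(field_value.lower().split())
    return bool(query_words) and query_words <= field_words


def fuzzy_match(query: str, field_value: str, threshold: float = 0.8) -> bool:
    """True when the two strings are similar enough to tolerate a typo.

    The threshold is a hypothetical value chosen for this sketch.
    """
    ratio = SequenceMatcher(None, query.lower(), field_value.lower()).ratio()
    return ratio >= threshold
```

With these definitions, full_text_match("rue gare", "3 bis rue de la Gare") accepts the query because both words occur in the field, and fuzzy_match("Jonh Doe", "John Doe") accepts the transposed letters.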

9.1.3. Search groups

Search groups let repository users search for matches between one field in their own dataset and several repository fields. This is useful if, for example, you have a dataset containing a full name (= first name + surname) and want to search for matches in a repository that stores first names and surnames as two separate fields.

In the Create a group search window you can enter the name of the group, its description and its fields. Fuzzy and phonetics options are also available (see Search fields for an explanation of both concepts).
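As a rough illustration of the idea behind a search group, the Python sketch below joins several repository fields into one searchable value and compares a single full-name value against it. The field names (first_name, surname), the helper names and the word-based comparison are all assumptions made for this example; the product's own matching is not described here.

```python
def group_value(record: dict, group_fields: list[str]) -> str:
    """Join the values of the group's fields into one searchable string."""
    return " ".join(str(record[f]) for f in group_fields if record.get(f))


def search_group(query: str, records: list[dict], group_fields: list[str]) -> list[dict]:
    """Return the records whose combined group value contains every query word."""
    query_words = set(query.lower().split())
    hits = []
    for record in records:
        words = set(group_value(record, group_fields).lower().split())
        if query_words and query_words <= words:
            hits.append(record)
    return hits


# Hypothetical repository that stores first names and surnames separately.
persons = [
    {"first_name": "John", "surname": "Doe"},
    {"first_name": "Jane", "surname": "Smith"},
]
full_name_group = ["first_name", "surname"]
matches = search_group("John Doe", persons, full_name_group)
```

Here matches contains only the John Doe record, even though no single repository field holds the full name.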

9.1.4. Lists of replacements

These let you configure the word replacements you want to use when reconciling or enriching data.

For example, you can specify that if the word ‘car’ appears in your dataset it must trigger a search for the word ‘vehicle’ in the repository. To do this, just add the following row to the list of replacements:

“car;vehicle”.

Note

Replacements can only be made for whole words (separated from other words).
The ‘;’ (semicolon) character can only be used to separate the text to be replaced from its replacement text.
Replacements are case-insensitive.
You can specify regular expressions [3] for replacements by starting your row with \x.
- Example: \x(?i)(\d+)(?:B|BIS)\b;$1 bis will change the search request 3b rue de la Gare to 3 bis rue de la Gare.
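The rules in this note can be tried out with the Python sketch below, which applies a list of replacement rows to a search request. It is an approximation written for illustration, not the product's implementation: the function names are hypothetical, and the sketch assumes that the $1 group reference of the example has to be rewritten to Python's own syntax because Python's re module does not accept $1 in replacement text.

```python
import re


def compile_rule(row: str) -> tuple[re.Pattern, str]:
    """Turn one 'search;replacement' row into a (pattern, replacement) pair."""
    if row.startswith(r"\x"):
        # Regular-expression row: drop the \x prefix, then split on ';'.
        pattern_text, _, replacement = row[2:].partition(";")
        # Assumption: '$1'-style references are rewritten to Python's '\g<1>'.
        replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
        return re.compile(pattern_text), replacement
    word, _, replacement = row.partition(";")
    # Plain row: whole-word, case-insensitive match on the literal text.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    # Escape backslashes so the replacement text is inserted literally.
    return pattern, replacement.replace("\\", "\\\\")


def apply_replacements(search_request: str, rows: list[str]) -> str:
    """Apply every replacement row, in order, to the search request."""
    for row in rows:
        pattern, replacement = compile_rule(row)
        search_request = pattern.sub(replacement, search_request)
    return search_request


rows = ["car;vehicle", r"\x(?i)(\d+)(?:B|BIS)\b;$1 bis"]
```

With these rows, apply_replacements("3b rue de la Gare", rows) gives 3 bis rue de la Gare, "Red Car for sale" becomes "Red vehicle for sale", and "carpet" is left untouched because only whole words are replaced.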

Caution

To create the repository, you must run the flow.

You can view the available repositories in the catalog, which is divided into the following main zones:

Zone for browsing and selecting repositories.

Information zone of the selected repository.

Preview zone for repository data.

9.2. Using repositories

To reconcile data, use the Reconcile / Enrich with repository data transformation in the preparation editor:

A wizard will configure the transformation and let you choose:

The repository with which you want to reconcile your dataset.

The matches between the fields in your dataset and the repository search fields (or groups).

The repository fields with which you want to enrich your dataset (you can choose a field subset and then reorder the fields using drag and drop).

Note

Practical example: Reconcile / Enrich with repository data

Before Transformation:

CustomerID	Name	Email
1	John Doe	john.doe@example.com
2	Jane Smith	jane.smith@sample.com

Transformation Configuration:

Repository: “Account repository”
Reconciliation Model:
- Match the Name column from your dataset with the Name column from the Account repository. Match strategy: Fuzzy Match (to avoid missing matches because of spelling mistakes in the name).
Fetched Columns: ProfileID, AccountStatus
Match Count: 1 (only the best match is fetched)

After Transformation:

CustomerID	Name	Email	ProfileID	AccountStatus
1	John Doe	john.doe@example.com	101	Active
2	Jane Smith	jane.smith@sample.com	204	Inactive

In this example, the transformation enriches the original dataset by adding ProfileID and AccountStatus from the Account repository based on the best match found for each customer name.
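The same enrichment can be imitated outside the product with the Python sketch below. It is a simplified model under stated assumptions: the repository contents (including the slightly misspelled names) are hypothetical, the enrich function and its 0.8 threshold are invented for this example, and difflib's similarity ratio stands in for the fuzzy match strategy.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def enrich(dataset, repository, match_field, fetched_columns, threshold=0.8):
    """Append the fetched columns of the single best repository match to each row.

    Assumes a non-empty repository; the threshold is a hypothetical value.
    """
    enriched = []
    for row in dataset:
        # Match Count = 1: keep only the highest-scoring repository record.
        best = max(repository, key=lambda rec: similarity(row[match_field], rec[match_field]))
        score = similarity(row[match_field], best[match_field])
        # Below the threshold, leave the fetched columns empty.
        extra = {col: best[col] if score >= threshold else None for col in fetched_columns}
        enriched.append({**row, **extra})
    return enriched


dataset = [
    {"CustomerID": 1, "Name": "John Doe", "Email": "john.doe@example.com"},
    {"CustomerID": 2, "Name": "Jane Smith", "Email": "jane.smith@sample.com"},
]
# Hypothetical repository rows; the names deliberately differ by one letter.
account_repository = [
    {"Name": "Jon Doe", "ProfileID": 101, "AccountStatus": "Active"},
    {"Name": "Jane Smyth", "ProfileID": 204, "AccountStatus": "Inactive"},
]
result = enrich(dataset, account_repository, "Name", ["ProfileID", "AccountStatus"])
```

Each row of result keeps its original columns and gains ProfileID and AccountStatus from the closest repository name, so the spelling differences do not prevent the match.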