9. Tale of Data repositories
Repositories let you repair or enrich datasets using sophisticated matching algorithms.
9.1. Creating a repository
To create a repository you must create a flow.
A flow that creates a repository can contain all the standard processors needed to prepare the repository data, but it can contain only one repository processor, which is a particular type of sink.
Repository processors let you create a repository that can be reused and shared to reconcile data.
Example of a flow creating a repository:
9.1.1. Configuring a repository
The name of the repository must be entered in the processor configurator.
A description is optional but is strongly advised because all repository users will be able to see it.
Repositories have three view levels:
Public.
Shared with my organization.
Private.
9.1.2. Search fields
Search fields are the fields that repository users reconcile their data against, so it is important to select them carefully when creating a repository.
For example, in a repository of persons, name will naturally be a search field.
To add a search field, click Add a search column in the repository node configurator.
You can configure the search field in the dialogue window that opens:
You must specify the search field (postcode in the example above).
Full text lets you specify whether searches will accept a subset of words in the field.
Fuzzy lets you specify whether typos are tolerated.
Phonetics lets you specify whether searches also match words with similar pronunciations (3 options):
French.
English.
No phonetics.
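As a rough illustration of the Full text and Fuzzy options, the following Python sketch shows one possible reading of each. It is not Tale of Data's actual matching algorithm: the function names, the word-subset interpretation of Full text and the 0.8 similarity threshold are assumptions made for this example, and phonetic matching is left out.

```python
from difflib import SequenceMatcher


def full_text_match(query, field_value):
    """Assumed reading of 'Full text': every word of the query
    must appear among the words of the repository field."""
    query_words = set(query.lower().split())
    field_words = set(field_value.lower().split())
    return query_words <= field_words


def fuzzy_match(query, field_value, threshold=0.8):
    """Assumed reading of 'Fuzzy': tolerate typos by accepting values
    whose similarity ratio reaches a (hypothetical) threshold."""
    ratio = SequenceMatcher(None, query.lower(), field_value.lower()).ratio()
    return ratio >= threshold


# A subset of the words in the field is enough for a full-text hit.
print(full_text_match("gare rue", "3 rue de la Gare"))
# A one-letter typo is still accepted by the fuzzy comparison.
print(fuzzy_match("Dupont", "Dupond"))
```

The threshold is a trade-off: a lower value tolerates more typos but also lets through more false matches.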
9.1.3. Search groups
Search groups let repository users search for matches between one field in their own dataset and several repository fields. This is useful if, for example, you have a dataset containing a full name (= first name + surname) and want to search for matches in a repository that stores first names and surnames as two separate fields.
In the Create a group search window you can enter the name of the group, its description and its fields. Fuzzy and phonetics options are also available (see Search fields for an explanation of both concepts).
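The idea behind a search group can be sketched in Python as follows. This is only an illustration under stated assumptions, not the product's implementation: the field names first_name and surname, the helper names and the order-insensitive word comparison are all hypothetical.

```python
def group_value(record, group_fields):
    """Join the values of the grouped repository fields into one string."""
    return " ".join(str(record[f]) for f in group_fields if record.get(f))


def match_group(value, repository, group_fields):
    """Return the repository records whose grouped fields contain
    the same words as the dataset value, in any order."""
    wanted = sorted(value.lower().split())
    return [
        record
        for record in repository
        if sorted(group_value(record, group_fields).lower().split()) == wanted
    ]


# Hypothetical repository storing first names and surnames separately.
repository = [
    {"first_name": "Marie", "surname": "Curie"},
    {"first_name": "Pierre", "surname": "Curie"},
]

# One full-name field in the user's dataset is matched against both fields.
print(match_group("Curie Marie", repository, ["first_name", "surname"]))
```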
9.1.4. Lists of replacements
These let you configure the word replacements you want to use when reconciling or enriching data.
For example, you can specify that if the word ‘car’ appears in your dataset it must trigger a search for the word ‘vehicle’ in the repository. To do this, just add the following row to the list of replacements:
“car;vehicle”.
Note
Replacements can only be made for whole words (separated from other words). The semicolon (‘;’) can be used only to separate the text to be replaced from its replacement text.
Replacements are upper/lower case insensitive.
You can specify regular expressions [3] for replacements by starting your row with \x.
Example: the row \x(?i)(\d+)(?:B|BIS)\b;$1 bis will change the search request 3b rue de la Gare to 3 bis rue de la Gare.
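To make the replacement rules concrete, here is a minimal Python sketch that applies a list of replacement rows to a search request. It is an illustration, not the product's code: the function names are hypothetical, rows are assumed to be split at the first semicolon, and the $1 group reference is assumed to be Java-style syntax, which the sketch translates to Python's \g<1> form.

```python
import re


def compile_rule(row):
    """Turn one row of a replacement list into (compiled pattern, replacement)."""
    if row.startswith("\\x"):
        # Regular-expression row: the pattern follows the \x prefix.
        pattern, _, replacement = row[2:].partition(";")
        # Assumption: '$1' is a Java-style group reference; Python uses '\g<1>'.
        replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
        return re.compile(pattern), replacement
    # Plain row: whole words only, case-insensitive.
    source, _, target = row.partition(";")
    pattern = re.compile(r"\b" + re.escape(source) + r"\b", re.IGNORECASE)
    # Escape backslashes so the target is inserted literally.
    return pattern, target.replace("\\", "\\\\")


def apply_rules(text, rows):
    """Apply every replacement row, in order, to a search request."""
    for row in rows:
        pattern, replacement = compile_rule(row)
        text = pattern.sub(replacement, text)
    return text


ROWS = ["car;vehicle", r"\x(?i)(\d+)(?:B|BIS)\b;$1 bis"]

print(apply_rules("3b rue de la Gare", ROWS))  # 3 bis rue de la Gare
print(apply_rules("My Car is red", ROWS))      # My vehicle is red
```

Because the plain rule is anchored on word boundaries, a request such as "red carpet" is left unchanged.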
Caution
To create the repository, you must run the flow.
You can view the available repositories in the catalog, in the following main zones:
9.2. Using repositories
To reconcile data, use the Reconcile / Enrich with repository data transformation in the preparation editor:
A wizard will configure the transformation and let you choose:
The repository with which you want to reconcile your dataset.
The matches between the fields in your dataset and the repository search fields (or groups).
The repository fields with which you want to enrich your dataset (you can choose a field subset and then reorder the fields using drag and drop).
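The three choices made in the wizard can be pictured with the short Python sketch below. It only illustrates the principle under simplifying assumptions: matching is an exact, case-insensitive comparison (the real transformation relies on the repository's search options), and the function name, parameter names and sample fields are hypothetical.

```python
def _norm(value):
    """Normalise a value for comparison (trimmed, lower case)."""
    return str(value).strip().lower()


def enrich(dataset, repository, field_map, enrich_fields):
    """Add the chosen repository fields to each dataset row.

    field_map maps a dataset field to a repository search field;
    enrich_fields lists the repository fields to copy, in order.
    """
    enriched = []
    for row in dataset:
        # First repository record whose search fields all match the row.
        match = next(
            (
                record
                for record in repository
                if all(
                    _norm(row.get(src, "")) == _norm(record.get(dst, ""))
                    for src, dst in field_map.items()
                )
            ),
            None,
        )
        new_row = dict(row)
        for field in enrich_fields:
            # Rows without a match get an empty value for each new field.
            new_row[field] = match.get(field) if match else None
        enriched.append(new_row)
    return enriched


# Hypothetical data: the dataset field 'cp' is matched to 'postcode'.
repository = [{"postcode": "75001", "city": "Paris", "region": "Île-de-France"}]
dataset = [{"cp": "75001"}, {"cp": "99999"}]
print(enrich(dataset, repository, {"cp": "postcode"}, ["city"]))
```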