9. Tale of Data repositories

Repositories let you repair or enrich datasets using sophisticated matching algorithms.

9.1. Creating a repository

To create a repository you must create a flow.

A flow that creates a repository can contain all the standard processors needed to prepare the repository data, but it can contain only one repository processor, which is a particular type of sink image246.

Repository processors let you create a repository that can be reused and shared to reconcile data.

Example of a flow creating a repository:

image247

9.1.1. Configuring a repository

The name of the repository must be entered in the processor configurator.

A description is optional but is strongly advised because all repository users will be able to see it.

Repositories have three view levels:

  • Public.

  • Shared with my organization.

  • Private.

9.1.2. Search fields

These fields let repository users reconcile their data, so it is important to select your search fields carefully when creating a repository.

For example, in a repository of persons, name will naturally be a search field.

To add a search field, click Add a search column image248 in the repository node configurator.

image249

You can configure the search field in the dialogue window that opens:

image250

You must specify the search field (postcode in the example above).

  • Full text lets you specify whether searches will accept a subset of words in the field.

  • Fuzzy lets you specify whether typos are to be tolerated.

  • Phonetics lets you specify whether searches should also match words with similar pronunciations (3 options):

    • French.

    • English.

    • No phonetics.
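To make the Full text and Fuzzy options more concrete, here is a minimal Python sketch of what such matching could look like. It is an illustration only, not the algorithm Tale of Data uses: the function names, the word-splitting rule and the 0.8 similarity threshold are all assumptions, and phonetic matching is not shown.

```python
import difflib


def full_text_match(query: str, field_value: str) -> bool:
    """Return True when every word of the query appears among the words of the field."""
    query_words = set(query.lower().split())
    field_words = set(field_value.lower().split())
    # An empty query matches nothing; otherwise the query must be a subset of the field.
    return bool(query_words) and query_words <= field_words


def fuzzy_match(query: str, field_value: str, threshold: float = 0.8) -> bool:
    """Return True when the two strings are similar enough to tolerate typos.

    The threshold is a hypothetical value chosen for this sketch.
    """
    ratio = difflib.SequenceMatcher(None, query.lower(), field_value.lower()).ratio()
    return ratio >= threshold


if __name__ == "__main__":
    print(full_text_match("rue gare", "3 rue de la Gare"))
    print(fuzzy_match("Jonson", "Johnson"))
```

With full text enabled, a search for a few words of an address can match the whole address; with fuzzy enabled, a misspelt surname can still match the repository entry.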

9.1.3. Search groups

Search groups let repository users search for matches between one field in their own dataset and several repository fields. This is useful if, for example, you have a dataset containing a full name (= first name + surname) and want to search for matches in a repository that stores first names and surnames as two separate fields.

In the Create a group search window you can enter the name of the group, its description and its fields. Fuzzy and phonetics options are also available (see Search fields for an explanation of both concepts).

image251
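The idea behind a search group can be sketched in a few lines of Python. This is a simplified illustration, assuming that the group is matched as a bag of words regardless of word order; the field names (first_name, surname) and both functions are hypothetical and do not reflect the product's internal implementation.

```python
def group_value(record: dict, group_fields: list[str]) -> str:
    """Join the non-empty values of the grouped repository fields into one string."""
    return " ".join(str(record[f]).strip() for f in group_fields if record.get(f))


def search_group(query: str, repository: list[dict], group_fields: list[str]) -> list[dict]:
    """Return the repository records whose grouped fields contain the same words as the query."""
    wanted = sorted(query.lower().split())
    return [
        record
        for record in repository
        if sorted(group_value(record, group_fields).lower().split()) == wanted
    ]


if __name__ == "__main__":
    repository = [
        {"first_name": "Marie", "surname": "Curie"},
        {"first_name": "Pierre", "surname": "Curie"},
    ]
    # A single "full name" value is matched against two repository fields.
    print(search_group("Curie Marie", repository, ["first_name", "surname"]))
```

Comparing sorted word lists means that "Marie Curie" and "Curie Marie" both find the same record, which is one plausible way of handling datasets that store names in a different order.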

9.1.4. Lists of replacements

These let you configure the word replacements you want to use when reconciling or enriching data.

For example, you can specify that if the word ‘car’ appears in your dataset it must trigger a search for the word ‘vehicle’ in the repository. To do this, just add the following row to the list of replacements:

“car;vehicle”.

Note

  • Replacements apply only to whole words (words separated from other words).

  • The semicolon (‘;’) is reserved: it can only be used to separate the text to be replaced from its replacement text.

  • Replacements are case-insensitive.

  • You can specify regular expressions [3] for replacements by starting your row with \x.

    • Example: \x(?i)(\d+)(?:B|BIS)\b;$1 bis will change the search request 3b rue de la Gare to 3 bis rue de la Gare.
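The rules in this note can be sketched in Python as follows. This is only an approximation for illustration: the function names are hypothetical, the way rows are parsed is an assumption, and the sketch converts the $1 group reference to Python's own syntax because Python's re module does not use $1.

```python
import re


def compile_replacements(rows):
    """Turn 'text;replacement' rows into (compiled pattern, replacement) pairs."""
    rules = []
    for row in rows:
        if row.startswith("\\x"):
            # Regular-expression row: everything after \x up to the first ';' is the pattern.
            pattern_text, replacement = row[2:].split(";", 1)
            # Convert $1-style group references to Python's \g<1> syntax.
            template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
            rules.append((re.compile(pattern_text), template))
        else:
            # Plain row: whole-word, case-insensitive replacement.
            word, replacement = row.split(";", 1)
            pattern = re.compile(r"(?<!\w)" + re.escape(word) + r"(?!\w)", re.IGNORECASE)
            # A function is used so the replacement text is always inserted literally.
            rules.append((pattern, lambda match, text=replacement: text))
    return rules


def apply_replacements(query, rules):
    """Apply each rule once, in order, to the search request."""
    for pattern, replacement in rules:
        query = pattern.sub(replacement, query)
    return query


if __name__ == "__main__":
    rules = compile_replacements(["car;vehicle", r"\x(?i)(\d+)(?:B|BIS)\b;$1 bis"])
    print(apply_replacements("Red Car rental", rules))
    print(apply_replacements("3b rue de la Gare", rules))
```

In this sketch "carpet" is left untouched because the replacement only applies to the whole word "car", while "Car" is replaced because matching ignores case.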

Caution

To create the repository, you must run the flow.

You can view the repositories that are available in the catalog:

image252

The catalog view has the following main zones:

  • Zone for browsing image253 and selecting repositories.

  • Information zone image254 of the selected repository.

  • Preview zone image255 for repository data.

9.2. Using repositories

To reconcile data, use the Reconcile / Enrich with repository data transformation in the preparation editor:

image256

A wizard guides you through configuring the transformation and lets you choose:

  • The repository with which you want to reconcile your dataset.

  • The matches between the fields in your dataset and the repository search fields (or groups).

  • The repository fields with which you want to enrich your dataset (you can choose a field subset and then reorder the fields using drag and drop).
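Conceptually, the three choices made in the wizard amount to a lookup followed by a column selection. The Python sketch below illustrates this under simplifying assumptions: matching is exact (apart from case and surrounding spaces), only the first matching repository entry is used, and the function, field names and sample data are all hypothetical.

```python
def enrich(dataset, repository, field_map, enrich_fields):
    """Enrich each dataset row with selected fields from the matching repository entry.

    field_map maps a dataset field name to the repository search field it is matched with.
    enrich_fields lists the repository fields to add, in the order they should appear.
    """
    def key(record, fields):
        # Normalise values so that matching ignores case and surrounding spaces.
        return tuple(str(record.get(f, "")).strip().lower() for f in fields)

    dataset_fields = list(field_map.keys())
    repository_fields = list(field_map.values())

    # Index the repository by its search fields, keeping the first entry per key.
    index = {}
    for entry in repository:
        index.setdefault(key(entry, repository_fields), entry)

    result = []
    for row in dataset:
        match = index.get(key(row, dataset_fields))
        enriched = dict(row)
        for field in enrich_fields:
            # Rows without a match receive empty (None) values.
            enriched[field] = match.get(field) if match else None
        result.append(enriched)
    return result


if __name__ == "__main__":
    dataset = [{"zip": "75001"}, {"zip": "99999"}]
    repository = [{"postcode": "75001", "city": "Paris", "region": "Île-de-France"}]
    for row in enrich(dataset, repository, {"zip": "postcode"}, ["region", "city"]):
        print(row)
```

The order of enrich_fields determines the order of the added columns, which mirrors reordering the fields by drag and drop in the wizard.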