7.20. Tale of Data standard natures specification

The standard natures available in Tale of Data are described below. They are used mainly in the following parts of the platform:

  • in the Mass Data Discovery (MDD) module, which is used for data discovery and can automatically scan a collection of systems containing tables or data files. A nature is automatically inferred for a column when it seems plausible that the column contains data of that nature (for example, email addresses or country names). The inference approach is described for each nature in the specifications below.

  • in the preparation function, which allows:

    • visualizing the natures that have been inferred by Tale of Data on certain columns

    • assigning particular natures to certain fields

    • filtering, exploring, and transforming data according to whether values are valid or invalid for the nature carried by the relevant columns

  • in the configuration of the validation node, which can split rows according to whether one or more columns are valid or invalid for their natures, so that subsequent operations can be carried out on the valid and invalid datasets separately.

7.20.1. Banking Natures

7.20.1.1. IBAN

Definition

The International Bank Account Number (IBAN) is an internationally agreed system for identifying bank accounts across national borders. It facilitates the communication and processing of cross-border transactions with a reduced risk of transcription errors. An IBAN uniquely identifies the account of a customer at a financial institution.

Inference and validity check

After a simple length check, the validity principles described at https://en.wikipedia.org/wiki/International_Bank_Account_Number are followed. For the standard IBAN nature, the supported ISO country codes are FR and GB.
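As an illustration only, the check described above can be sketched in Python as follows. This is not the Tale of Data implementation: the function name and the `IBAN_LENGTHS` table are hypothetical, and the lengths used (27 characters for FR, 22 for GB) are the usual national IBAN lengths, assumed here to be what the length check expects.

```python
# Assumption: expected IBAN lengths for the two supported country codes.
IBAN_LENGTHS = {"FR": 27, "GB": 22}


def is_valid_iban(value: str) -> bool:
    """Length check followed by the standard mod-97 IBAN verification."""
    iban = value.replace(" ", "").upper()
    # Length check, which also rejects unsupported country codes.
    if IBAN_LENGTHS.get(iban[:2]) != len(iban):
        return False
    if not (iban.isascii() and iban.isalnum()):
        return False
    # Move the country code and check digits to the end of the string.
    rearranged = iban[4:] + iban[:4]
    # Replace each letter by its number (A=10 ... Z=35); digits are kept.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    # The resulting integer must leave a remainder of 1 when divided by 97.
    return int(digits) % 97 == 1
```

For example, `is_valid_iban("GB82 WEST 1234 5698 7654 32")` accepts the sample IBAN commonly used in documentation, while changing any single digit makes the check fail.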

Keyword to designate this nature in MDD exports

iban

7.20.2. Telecommunications

7.20.2.1. Email

Definition

An email address consists of:

  • a local part, generally identifying a person (john, Joe.Bloggs, joe123) or a service name (info, sales, postmaster);

  • the separator character @ (the at sign);

  • the server address, usually a domain name identifying the organization (company, association, town hall, university, or individual) hosting the mailbox (example.net, example.com, example.org).

More details are available at https://en.wikipedia.org/wiki/Email_address

Inference and validity check

Email validity primarily relies on:

  • the validity of the characters present (lowercase letters and symbols) and a field length between the minimum and maximum allowed lengths.

  • the separation into two parts using the at sign.

  • the a priori validity of the server domain address, for example through its overall format and the membership of its top-level domain (TLD) in the list of known TLDs.
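The three criteria above could be combined roughly as in the following Python sketch. It is a simplified illustration, not the platform's actual rule set: the length bounds, the accepted character classes, and the tiny `KNOWN_TLDS` set are all assumptions made for the example.

```python
import re

# Assumptions: illustrative subset of TLDs and arbitrary length bounds.
KNOWN_TLDS = {"com", "net", "org", "fr", "uk"}
MIN_LEN, MAX_LEN = 6, 254

# Assumed character rules for the local part and the domain labels.
LOCAL_RE = re.compile(r"[A-Za-z0-9.!#$%&'*+/=?^_`{|}~-]+")
DOMAIN_RE = re.compile(
    r"(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z]{2,}"
)


def is_plausible_email(value: str) -> bool:
    # 1. Field length between the minimum and maximum.
    if not MIN_LEN <= len(value) <= MAX_LEN:
        return False
    # 2. Exactly one at sign separating a non-empty local part and a domain.
    local, sep, domain = value.rpartition("@")
    if not sep or not local or "@" in local:
        return False
    # 3. Valid characters and overall domain format.
    if not LOCAL_RE.fullmatch(local) or not DOMAIN_RE.fullmatch(domain):
        return False
    # 4. The top-level domain must belong to the list of known TLDs.
    return domain.rsplit(".", 1)[1].lower() in KNOWN_TLDS
```

A production list of TLDs would be much longer; the structure of the check stays the same.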

Keyword to designate this nature in MDD exports

e_mail

7.20.2.2. IPv4 Address

Definition

IP address version 4. A complete description of this data nature can be found here: https://en.wikipedia.org/wiki/IPv4

Inference and validity check

The validity of the IPv4 address relies on:

  • the length of the field.

  • the presence of four numbers between 0 and 255 separated by periods.
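A minimal Python sketch of these two checks is shown below. The function name is hypothetical, and the length bounds are simply those of the shortest ("0.0.0.0") and longest ("255.255.255.255") dotted notations.

```python
def is_valid_ipv4(value: str) -> bool:
    # Length of the field: 7 characters ("0.0.0.0") to 15 ("255.255.255.255").
    if not 7 <= len(value) <= 15:
        return False
    parts = value.split(".")
    if len(parts) != 4:
        return False
    # Each part must be a decimal number between 0 and 255.
    return all(
        part.isascii() and part.isdigit() and int(part) <= 255
        for part in parts
    )
```

Note that this sketch accepts leading zeros (for example "192.168.001.001"); whether such forms should be valid is a design decision that the description above does not settle.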

Keyword to designate this nature in MDD exports

ipv4_address

7.20.2.3. URL

Definition

A URL (Uniform Resource Locator) is a character string that identifies a resource on the World Wide Web by its location and specifies the internet protocol used to retrieve it (e.g., http or https). A complete description of this data nature can be found here: https://en.wikipedia.org/wiki/URL.

Inference and validity check

The validity check relies on the plausibility of the string length, as well as the presence of the character : near the beginning of the string.
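The heuristic described above is deliberately light. A possible reading of it, with assumed thresholds (`MIN_URL_LEN`, `MAX_URL_LEN`, and `MAX_SCHEME_LEN` are hypothetical values, not documented ones), is sketched here in Python:

```python
# Assumptions: plausible length bounds and maximum position of the colon.
MIN_URL_LEN = 4
MAX_URL_LEN = 2048
MAX_SCHEME_LEN = 10


def is_plausible_url(value: str) -> bool:
    # Plausibility of the string length.
    if not MIN_URL_LEN <= len(value) <= MAX_URL_LEN:
        return False
    # The colon that ends the scheme must appear near the beginning,
    # but not as the very first character.
    colon = value.find(":")
    return 0 < colon <= MAX_SCHEME_LEN
```

Such a check favors speed over strictness, which suits large-scale inference; a stricter validation would parse the scheme and host explicitly.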

Keyword to designate this nature in MDD exports

url

7.20.3. Geographical Natures

7.20.3.1. Country

Definition

This nature represents the common name of a country. It is very close to the ISO 3166-1 alpha-2 standard. More information is available at https://en.wikipedia.org/wiki/ISO_3166-1.

Inference and validity check

An initial check verifies that the string length is valid. Then, the candidate form is compared to a list of countries from the ISO 3166-1 alpha-2 standard, omitting diacritics in country names.
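The comparison without diacritics can be illustrated with the Python sketch below. The `COUNTRIES` set is a tiny illustrative sample, the length bounds are assumptions, and the case-insensitive comparison is also an assumption that the description above does not state.

```python
import unicodedata

# Assumption: illustrative sample; a real reference list covers all countries.
COUNTRIES = {"France", "Côte d'Ivoire", "Türkiye", "United Kingdom"}


def normalize(name: str) -> str:
    """Strip diacritics and fold case so that 'Côte' and 'cote' compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()


NORMALIZED_COUNTRIES = {normalize(country) for country in COUNTRIES}


def is_country_name(value: str) -> bool:
    # Assumed length bounds for a plausible country name.
    if not 4 <= len(value) <= 60:
        return False
    return normalize(value) in NORMALIZED_COUNTRIES
```

With this approach, "Cote d'Ivoire" and "Côte d'Ivoire" are both recognized from a single reference entry.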

Keyword to designate this nature in MDD exports

country_name

7.20.3.2. Country Code (ISO 3166-1 alpha-2)

Definition

This nature represents a country code following the ISO 3166-1 alpha-2 standard. More information is available at https://en.wikipedia.org/wiki/ISO_3166-1.

Inference and validity check

A check is performed on the length, which must be 2 characters. The field is then compared to the database of valid country codes.

Keyword to designate this nature in MDD exports

iso31661alpha2

7.20.3.3. Country Code (ISO 3166-1 alpha-3)

Definition

This nature represents a country code following the ISO 3166-1 alpha-3 standard. More information is available at https://en.wikipedia.org/wiki/ISO_3166-1.

Inference and validity check

A check is performed on the length, which must be 3 characters. The field is then compared to the database of valid country codes.

Keyword to designate this nature in MDD exports

iso31661alpha3

7.20.4. GTIN and ISBN Codes

7.20.4.1. GTIN-8

Definition

Global Trade Item Number of length 8. The Global Trade Item Number (GTIN) is a code that identifies any commercial unit (consumer unit or standard grouping unit…) in an international and unique way. This category corresponds, for example, to codes present on barcodes of everyday consumer products. More information is available at https://en.wikipedia.org/wiki/Global_Trade_Item_Number.

Inference and validity check

The validity check follows the principles explained at http://www.gtin.info
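For reference, the standard GTIN check-digit rule can be sketched in Python as follows. The function name is hypothetical and the sketch only covers the length and check-digit verification; it applies equally to lengths 8, 12, 13, and 14.

```python
def is_valid_gtin(value: str, length: int) -> bool:
    """Verify the length and the modulo-10 check digit of a GTIN."""
    if len(value) != length or not (value.isascii() and value.isdigit()):
        return False
    total = 0
    # Starting from the check digit (rightmost), weights alternate 1, 3, 1, 3...
    for position, ch in enumerate(reversed(value)):
        weight = 1 if position % 2 == 0 else 3
        total += int(ch) * weight
    # The weighted sum, check digit included, must be a multiple of 10.
    return total % 10 == 0
```

For instance, `is_valid_gtin("73513537", 8)` returns `True`, and the same function validates a GTIN-13 such as "4006381333931" when called with `length=13`.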

Keyword to designate this nature in MDD exports

gtin.format.gtin_8

7.20.4.2. GTIN-12

Definition

Global Trade Item Number of length 12. The Global Trade Item Number (GTIN) is a code that identifies any commercial unit (consumer unit or standard grouping unit…) in an international and unique way. This category corresponds, for example, to codes present on barcodes of everyday consumer products. More information is available at https://en.wikipedia.org/wiki/Global_Trade_Item_Number.

Inference and validity check

The validity check follows the principles explained at http://www.gtin.info

Keyword to designate this nature in MDD exports

gtin.format.gtin_12

7.20.4.3. GTIN-13

Definition

Global Trade Item Number of length 13. The Global Trade Item Number (GTIN) is a code that identifies any commercial unit (consumer unit or standard grouping unit…) in an international and unique way. This category corresponds, for example, to codes present on barcodes of everyday consumer products. More information is available at https://en.wikipedia.org/wiki/Global_Trade_Item_Number.

Inference and validity check

The validity check follows the principles explained at http://www.gtin.info

Keyword to designate this nature in MDD exports

gtin.format.gtin_13

7.20.4.4. GTIN-14

Definition

Global Trade Item Number of length 14. The Global Trade Item Number (GTIN) is a code that identifies any commercial unit (consumer unit or standard grouping unit…) in an international and unique way. This category corresponds, for example, to codes present on barcodes of everyday consumer products. More information is available at https://en.wikipedia.org/wiki/Global_Trade_Item_Number.

Inference and validity check

The validity check follows the principles explained at http://www.gtin.info

Keyword to designate this nature in MDD exports

gtin.format.gtin_14

7.20.4.5. ISBN-10

Definition

The International Standard Book Number (ISBN) is an internationally recognized number created in 1970, uniquely identifying each edition of each book published after the introduction of the ISBN, regardless of its format. In 2007, the ISBN changed from 10 to 13 digits for compatibility with the GTIN-13 product code. More information is available at http://en.wikipedia.org/wiki/ISBN.

Inference and validity check

The validity check follows the principles outlined in http://en.wikipedia.org/wiki/ISBN.
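The ISBN-10 check-digit rule referenced above (weights 10 down to 1, sum divisible by 11, with "X" standing for 10 in the last position) can be sketched in Python as follows. The function name and the tolerance for hyphens and spaces are assumptions made for the example.

```python
def is_valid_isbn10(value: str) -> bool:
    # Assumption: hyphens and spaces are tolerated as separators.
    isbn = value.replace("-", "").replace(" ", "").upper()
    if len(isbn) != 10:
        return False
    body, check = isbn[:9], isbn[9]
    if not (body.isascii() and body.isdigit()):
        return False
    if not (check == "X" or (check.isascii() and check.isdigit())):
        return False
    # The final character may be "X", which stands for the value 10.
    digits = [int(ch) for ch in body] + [10 if check == "X" else int(check)]
    # Weights run from 10 (first digit) down to 1 (check character).
    total = sum(w * d for w, d in zip(range(10, 0, -1), digits))
    return total % 11 == 0
```

For example, "0-306-40615-2" passes the check, while "0306406153" does not.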

Keyword to designate this nature in MDD exports

isbn

7.20.5. Personal data

7.20.5.1. English civility

Definition

Detection of abbreviated civilities in English.

Inference and validity check

Comparison against the references “Miss”, “Ms” and “Mr”.

Keyword to designate this nature in MDD exports

english_civility

7.20.5.2. English long Civility

Definition

Detection of long-form civilities in English.

Inference and validity check

Comparison against the references “Miss”, “Mrs”, “Mister”, “Master” and “Mistress”.

Keyword to designate this nature in MDD exports

english_long_civility

7.20.5.3. English long Gender

Definition

Detection of genders in long format in English.

Inference and validity check

Comparison against the references “Male”, “Female” and “Undefined”.

Keyword to designate this nature in MDD exports

english_long_gender

7.20.5.4. Nationality in English

Definition

Detection of nationalities in English.

Inference and validity check

Comparison against a dictionary of over 200 nationalities (e.g., “Dutch” or “Sudanese”).

Keyword to designate this nature in MDD exports

english_nationality

7.20.5.5. English Gender

Definition

Detection of genders in abbreviated format in English.

Inference and validity check

Comparison against the references “M”, “F” and “U”.

Keyword to designate this nature in MDD exports

english_short_gender

7.20.5.6. First name

Definition

Detection of first names and their possible genders.

Inference and validity check

Comparison against a dictionary of English and French first names (totaling more than 35,000 first names), classified by gender.

Keyword to designate this nature in MDD exports

firstname

7.20.5.7. Full name

Definition

Detection of full names (containing a surname and a first name).

Inference and validity check

The value must be between 6 and 69 characters long and contain only alphabetic characters. It must include one or two first names and 1 to 5 significant words.

Keyword to designate this nature in MDD exports

fullname

7.20.5.8. Last name

Definition

Detection of last names.

Inference and validity check

Comparison with reference dictionaries of last names and terms not to be considered as a last name.

Keyword to designate this nature in MDD exports

lastname

7.20.5.9. British International Phone Number

Definition

Represents British phone numbers in international format, starting with +44.

Inference and validity check

Check that the string starts with +44 and that the field length is plausible.

Keyword to designate this nature in MDD exports

british_international_phone_number

7.20.5.10. Phone number (E.164 format)

Definition

Represents phone numbers in an international format. The list of applicable international dialing codes is established by the International Telecommunication Union in its ITU-T E.164 recommendation and its annexes, which are regularly updated. More information is available at https://en.wikipedia.org/wiki/E.164.

Inference and validity check

Check the plausibility of the length. Check the presence and validity of the dialing code.
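A rough Python sketch of these two checks follows. The `DIALING_CODES` set is a small illustrative sample rather than the full ITU list, and the minimum of 7 digits is an assumption; the maximum of 15 digits and the one-to-three-digit dialing codes come from the E.164 recommendation.

```python
import re

# Assumption: illustrative subset of international dialing codes.
DIALING_CODES = {"1", "33", "44", "49", "352"}

# A plus sign followed by 7 (assumed minimum) to 15 digits, no leading zero.
E164_RE = re.compile(r"\+[1-9][0-9]{6,14}")


def is_plausible_e164(value: str) -> bool:
    # Plausibility of the length and overall shape.
    if not E164_RE.fullmatch(value):
        return False
    digits = value[1:]
    # Presence and validity of the dialing code (1 to 3 leading digits).
    return any(digits[:n] in DIALING_CODES for n in (1, 2, 3))
```

The country-specific natures (for example +44 or +1) can be seen as special cases of this check with a single accepted dialing code.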

Keyword to designate this nature in MDD exports

e164_format_phone_number

7.20.5.11. US International Phone Number

Definition

Represents US phone numbers in international format, starting with +1.

Inference and validity check

Check that the string starts with +1 and that the field length is plausible.

Keyword to designate this nature in MDD exports

us_international_phone_number

7.20.5.12. US National Phone Number

Definition

Represents US phone numbers, in national format.

Inference and validity check

Check that the structure corresponds to that of a US phone number in national format, for example with an area code and a valid length.

Keyword to designate this nature in MDD exports

us_national_phone_number

7.20.6. Specifically French natures

7.20.6.1. French Basic Bank Account Number (RIB)

Definition

A French Basic Bank Account Number (BBAN), also known locally as a Relevé d’Identité Bancaire (RIB), is a sequence of digits issued by the bank that uniquely identifies a bank account at the national level. It is given to a debtor or creditor to facilitate bank transfers or direct debits from this account.

Inference and validity check

The check is performed on the number of characters and the nature of the characters that make up the field. More details on the validity principle are available at https://fr.wikipedia.org/wiki/Basic_Bank_Account_Number
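The Python sketch below illustrates the validity principle documented on the referenced page: a 23-character structure (5-character bank code, 5-character branch code, 11-character account number, 2-digit key) and a key computed modulo 97. Whether the platform applies the key computation in addition to the length and character checks is an assumption, and the function name is hypothetical.

```python
# Letters in the account number are converted to digits before the key
# computation: A,J=1  B,K,S=2  C,L,T=3  D,M,U=4  E,N,V=5  F,O,W=6 ...
LETTER_TO_DIGIT = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "12345678912345678923456789",
)


def is_valid_rib(value: str) -> bool:
    rib = value.replace(" ", "").upper()
    # Number and nature of the characters.
    if len(rib) != 23 or not (rib.isascii() and rib.isalnum()):
        return False
    if not rib[21:].isdigit():
        return False
    numeric = rib.translate(LETTER_TO_DIGIT)
    bank, branch = int(numeric[:5]), int(numeric[5:10])
    account, key = int(numeric[10:21]), int(numeric[21:])
    # RIB key: 97 minus the weighted sum modulo 97.
    expected = 97 - (89 * bank + 15 * branch + 3 * account) % 97
    return key == expected
```

For example, `is_valid_rib("30006 00001 12345678901 89")` returns `True`.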

Keyword to designate this nature in MDD exports

rib

7.20.6.2. French International Phone Number

Definition

Represents French phone numbers in international format, starting with +33.

Inference and validity check

Check that the string starts with +33 and that the field length is plausible.

Keyword to designate this nature in MDD exports

french_international_phone_number

7.20.6.3. French National Phone Number

Definition

Represents French phone numbers, in national format (e.g., starting with 06 and 10 digits long).

Inference and validity check

Check that the structure corresponds to that of a French phone number in national format.

Keyword to designate this nature in MDD exports

french_national_phone_number

7.20.6.4. French Street Number Complement

Definition

This nature represents a French street number supplement, for example “bis” or “ter”.

Inference and validity check

The admitted values (ignoring case) are:

  • bis

  • ter

  • quater

  • quinquies

Keyword to designate this nature in MDD exports

french_street_number_complement

7.20.6.5. French Street Type

Definition

This nature represents a type of French street, for example “rue” or “avenue”.

Inference and validity check

A comparison is made with a database of known types (for example “Terre-plein”, a total of about sixty) as well as their commonly observed abbreviations (typically 4 to 5 abbreviations per type).

Keyword to designate this nature in MDD exports

french_street_type

7.20.6.6. French Zip Code

Definition

This nature represents a French postal code.

Inference and validity check

A check is performed on the length, which must be 5 characters. The field is then compared to the database of valid postal codes.

Keyword to designate this nature in MDD exports

french_zip_code

7.20.6.7. French Social Security Number

Definition

The social security number in France, officially called numéro d’inscription au répertoire des personnes physiques (abbreviated as NIRPP or more simply NIR), is a numeric code used to uniquely identify a person in the national directory of identification of natural persons (RNIPP) managed by INSEE. More information is available at https://fr.wikipedia.org/wiki/Numéro_de_sécurité_sociale_en_France.

Inference and validity check

The following checks are necessary:

  • field length

  • validity/plausibility of the number’s sub-parts

  • algorithmic validity of the key at the end of the number
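These three checks can be sketched in Python as shown below. The sub-part rules encoded in `NIR_RE` (sex digit 1 or 2, month 01 to 12) are simplifying assumptions that ignore special cases such as temporary numbers; the key rule (97 minus the first 13 digits modulo 97, with the Corsican department codes 2A and 2B replaced by 19 and 18) is the commonly documented one.

```python
import re

# Assumed structure: sex, year, month, department, commune + order, key.
NIR_RE = re.compile(
    r"[12][0-9]{2}(?:0[1-9]|1[0-2])(?:[0-9]{2}|2A|2B)[0-9]{6}[0-9]{2}"
)


def is_valid_french_ssn(value: str) -> bool:
    nir = value.replace(" ", "").upper()
    # Field length and plausibility of the sub-parts.
    if len(nir) != 15 or not NIR_RE.fullmatch(nir):
        return False
    # Corsican department codes are mapped to numbers for the key computation.
    body = nir[:13].replace("2A", "19").replace("2B", "18")
    # Algorithmic validity of the key at the end of the number.
    return int(nir[13:]) == 97 - int(body) % 97
```

For example, the frequently cited sample number "2 69 05 49 588 157 80" passes this check.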

Keyword to designate this nature in MDD exports

french_ssn

7.20.6.8. INSEE Commune Code

Definition

The municipality code contains five digits or letters (concatenation of the department code and the three-digit coding of the municipality or municipal district in Paris, Lyon, and Marseille, or two digits for overseas municipalities). It is most frequently used in INSEE statistical data and is also used in identity and civil status numbers of individuals (see the French Social Security Number above). More information is available at https://fr.wikipedia.org/wiki/Code_Insee.

Inference and validity check

A length check is performed, then membership in the list of valid codes is verified.

Keyword to designate this nature in MDD exports

insee_commune_code

7.20.6.9. French Job Seeker Identifier (Jobseeker ID)

Definition

Composed of 8 to 12 characters, the job seeker identifier is created by France Travail (formerly Pôle emploi) and is unique to each job seeker. More information is available at https://www.netpublic.fr/travail/retrouver-identifiant-pole-emploi/.

Inference and validity check

The check is performed using the string length, the validity of the regional key expected in the first three characters, and the type of characters in the rest of the string.

Keyword to designate this nature in MDD exports

pole_emploi_id

7.20.6.10. SIREN

Definition

In France, the Système d’Identification du Répertoire des ENtreprises, or SIREN number, is a unique INSEE code used to identify a company, a public or private organization, a self-employed individual, or an association with activities in France. More information is available at https://fr.wikipedia.org/wiki/Syst%C3%A8me_d%27identification_du_r%C3%A9pertoire_des_entreprises.

Inference and validity check

The check relies on the string length and the numeric nature of its content. An algorithmic verification is performed using the Luhn algorithm.
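As an illustration, the Luhn verification mentioned above can be written in Python as follows. The function names are hypothetical; the expected length of 9 digits is the standard SIREN length.

```python
def luhn_is_valid(digits: str) -> bool:
    """Return True if a string of decimal digits passes the Luhn check."""
    total = 0
    # Starting from the rightmost digit, double every second digit.
    for position, ch in enumerate(reversed(digits)):
        d = int(ch)
        if position % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return total % 10 == 0


def is_valid_siren(value: str) -> bool:
    # String length and numeric nature, then the algorithmic verification.
    return (
        len(value) == 9
        and value.isascii()
        and value.isdigit()
        and luhn_is_valid(value)
    )
```

The `luhn_is_valid` helper does not depend on the length, so it could be reused for any other numeric identifier verified with the same algorithm.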

Keyword to designate this nature in MDD exports

siren

7.20.6.11. SIRET

Definition

The Système d’Identification du Répertoire des ÉTablissements, or SIRET number, is an INSEE code used to identify an establishment or a French company. More information is available at https://fr.wikipedia.org/wiki/Syst%C3%A8me_d%27identification_du_r%C3%A9pertoire_des_%C3%A9tablissements.

Inference and validity check

The check relies on the string length and the numeric nature of its content. An algorithmic verification is performed using the Luhn algorithm.

Keyword to designate this nature in MDD exports

siret

7.20.6.12. French long Civility

Definition

Detection of long-form civilities in French.

Inference and validity check

Comparison against the references “Madame”, “Mademoiselle” and “Monsieur”.

Keyword to designate this nature in MDD exports

french_long_civility

7.20.6.13. French long Gender

Definition

Detection of genders in long format in French.

Inference and validity check

Comparison (case-insensitive and accent-insensitive) against the references “feminin”, “masculin”, “femme”, “homme”, “inconnu”, “indefini” for detection. For validation, a capitalized format is checked.
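The difference between detection and validation can be illustrated with the Python sketch below. The function names are hypothetical, and the reading of "capitalized format" as "first letter uppercase, remaining letters lowercase" is an assumption.

```python
import unicodedata

REFERENCES = ("feminin", "masculin", "femme", "homme", "inconnu", "indefini")


def fold(text: str) -> str:
    """Remove accents and fold case, so that 'FÉMININ' becomes 'feminin'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def is_detected(value: str) -> bool:
    # Detection: case-insensitive and accent-insensitive comparison.
    return fold(value.strip()) in REFERENCES


def is_valid(value: str) -> bool:
    # Validation (assumed rule): a detected value written in capitalized form.
    return is_detected(value) and value == value.capitalize()
```

With this sketch, "FÉMININ" and "féminin" are both detected as French long genders, but only "Féminin" is considered valid.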

Keyword to designate this nature in MDD exports

french_long_gender

7.20.6.14. Nationality in French

Definition

Detection of nationalities in French.

Inference and validity check

Comparison against a dictionary of over 400 gendered nationalities (e.g., “Émirienne” or “Nicaraguayen”).

Keyword to designate this nature in MDD exports

french_nationality

7.20.6.15. French Civility

Definition

Detection of abbreviated civilities in French.

Inference and validity check

Comparison against the references “Mme”, “Mlle” and “M.”. Variants can also be detected during type inference.

Keyword to designate this nature in MDD exports

french_short_civility

7.20.6.16. French Gender

Definition

Detection of genders in abbreviated format in French.

Inference and validity check

Comparison against the references “H”, “F” and “I”.

Keyword to designate this nature in MDD exports

french_short_gender