7.20. Tale of Data standard natures specification
The standard natures available in Tale of Data are described below. Their usage appears mainly in the following sections of the platform:
in the Mass Data Discovery (MDD) module, which is used for data discovery and can automatically scan a collection of systems containing tables or data files. A nature is automatically inferred for a column when it seems plausible that the column contains data of that nature (for example, email addresses or country names). The inference approach is described for each nature in the specifications below.
in the preparation function, which allows:
visualizing the natures that have been inferred by Tale of Data on certain columns
assigning particular natures to certain fields
filtering, exploring, and transforming data under the condition that they are valid or not for the nature carried by the relevant columns
in the configuration of the validation node, which can separate rows according to whether one or more columns are valid or invalid for their natures, so that subsequent operations can be carried out on the valid and invalid datasets separately.
7.20.1. Banking Natures
7.20.1.1. IBAN
- Definition
The International Bank Account Number (IBAN) is an internationally agreed upon system of identifying bank accounts across national borders to facilitate the communication and processing of cross border transactions with a reduced risk of transcription errors. An IBAN uniquely identifies the account of a customer at a financial institution.
- Inference and validity check
After a simple length check, validation follows the principles described at https://en.wikipedia.org/wiki/International_Bank_Account_Number. For the standard IBAN nature, the supported ISO country codes are FR and GB.
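As an illustration of the principles on the linked page, the following is a minimal sketch of a length check followed by the mod-97 verification. The function name and the `IBAN_LENGTHS` table are hypothetical and do not describe the Tale of Data implementation; the lengths shown (27 for FR, 22 for GB) are those of the public IBAN registry.

```python
# Expected IBAN lengths for the two country codes listed in this section.
IBAN_LENGTHS = {"FR": 27, "GB": 22}

def is_valid_iban(value: str) -> bool:
    """Return True if the value passes the length and mod-97 checks."""
    iban = value.replace(" ", "").upper()
    # Length check, which also rejects unsupported country codes.
    if len(iban) < 4 or IBAN_LENGTHS.get(iban[:2]) != len(iban):
        return False
    if not (iban.isascii() and iban.isalnum() and iban[2:4].isdigit()):
        return False
    # Move the country code and check digits to the end of the string.
    rearranged = iban[4:] + iban[:4]
    # Replace each letter with two digits (A=10, ..., Z=35).
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    # A valid IBAN leaves a remainder of 1 when divided by 97.
    return int(digits) % 97 == 1
```

For example, `is_valid_iban("GB82 WEST 1234 5698 7654 32")` returns `True`, while changing any single digit makes the check fail.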
- Keyword to designate this nature in MDD exports
iban
7.20.2. Telecommunications
7.20.2.1. Email
- Definition
An email address consists of:
a local part, generally identifying a person (john, Joe.Bloggs, joe123) or a service name (info, sales, postmaster);
the separator character @ (the at sign);
the server address, usually a domain name identifying the organization (company, association, town hall, university, or individual) hosting the mailbox (example.net, example.com, example.org).
More details are available at https://en.wikipedia.org/wiki/Email_address
- Inference and validity check
Email validity primarily relies on:
the validity of the characters present (lowercase letters and symbols) and a field length between the minimum and maximum allowed lengths.
the separation into two parts using the at sign.
the a priori validity of the server domain address, for example through its overall format, and also the membership of its Top Level Domain to the list of known TLDs.
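The three criteria above can be sketched as follows. This is only an illustration: the length bounds, the accepted character classes (which here also allow uppercase letters and digits), and the tiny `KNOWN_TLDS` subset are assumptions, not the values used by the product.

```python
import re

# Illustrative subset only; a real check would use the full list of known TLDs.
KNOWN_TLDS = {"com", "net", "org", "fr", "uk"}
MIN_LEN, MAX_LEN = 6, 254  # assumed bounds
LOCAL_RE = re.compile(r"[A-Za-z0-9.!#$%&'*+/=?^_`{|}~-]+")
DOMAIN_RE = re.compile(
    r"(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z]{2,}"
)

def is_plausible_email(value: str) -> bool:
    # 1. Field length between the minimum and maximum.
    if not MIN_LEN <= len(value) <= MAX_LEN:
        return False
    # 2. Exactly two parts around the at sign.
    parts = value.split("@")
    if len(parts) != 2:
        return False
    local, domain = parts
    # Character validity of the local part and overall domain format.
    if not LOCAL_RE.fullmatch(local) or not DOMAIN_RE.fullmatch(domain):
        return False
    # 3. The top-level domain must belong to the known list.
    return domain.rsplit(".", 1)[1].lower() in KNOWN_TLDS
```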
- Keyword to designate this nature in MDD exports
e_mail
7.20.2.2. IPv4 Address
- Definition
IP address version 4. A complete description of this data nature can be found here: https://en.wikipedia.org/wiki/IPv4
- Inference and validity check
The validity of the IPv4 address relies on:
the length of the field.
the presence of four numbers between 0 and 255 separated by periods.
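A minimal sketch of these two checks, assuming dotted-decimal notation and the hypothetical function name `is_valid_ipv4`; whether the product accepts leading zeros (as this sketch does) is not stated in this section.

```python
def is_valid_ipv4(value: str) -> bool:
    # The shortest form is "0.0.0.0" (7 characters),
    # the longest is "255.255.255.255" (15 characters).
    if not 7 <= len(value) <= 15:
        return False
    parts = value.split(".")
    if len(parts) != 4:
        return False
    # Each part must be a decimal number between 0 and 255.
    return all(p.isascii() and p.isdigit() and int(p) <= 255 for p in parts)
```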
- Keyword to designate this nature in MDD exports
ipv4_address
7.20.2.3. Url
- Definition
A URL is a uniform character string that identifies a resource on the World Wide Web by its location and specifies the internet protocol to retrieve it (e.g., http or https). A complete description of this data nature can be found here: https://en.wikipedia.org/wiki/URL.
- Inference and validity check
The validity check relies on the plausibility of the string length, as well as the presence of the character : near the beginning of the string.
- Keyword to designate this nature in MDD exports
url
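The URL check described in this section can be sketched as below. The length bounds and the position limit for the colon are assumptions chosen for illustration, since the section gives no exact thresholds.

```python
MIN_LEN, MAX_LEN = 8, 2048  # assumed plausibility bounds
MAX_COLON_INDEX = 10        # assumed meaning of "near the beginning"

def is_plausible_url(value: str) -> bool:
    if not MIN_LEN <= len(value) <= MAX_LEN:
        return False
    colon = value.find(":")
    # The colon must exist, must not be the first character,
    # and must appear early in the string.
    return 0 < colon <= MAX_COLON_INDEX
```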
7.20.3. Geographical Natures
7.20.3.1. Country
- Definition
This nature represents the common name of a country. It is very close to the ISO 3166-1 alpha-2 standard. More information is available at https://en.wikipedia.org/wiki/ISO_3166-1.
- Inference and validity check
An initial check verifies that the string length is valid. Then, the candidate form is compared to a list of countries from the ISO 3166-1 alpha-2 standard, omitting diacritics in country names.
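The comparison without diacritics can be sketched as follows. The `COUNTRIES` set is a tiny illustrative subset, and the length bounds and case-insensitive matching are assumptions rather than documented behavior.

```python
import unicodedata

# Illustrative subset, stored without diacritics and in lowercase.
COUNTRIES = {"france", "germany", "cote d'ivoire", "sao tome and principe"}

def strip_diacritics(text: str) -> str:
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def is_country_name(value: str) -> bool:
    if not 4 <= len(value) <= 60:  # assumed length bounds
        return False
    return strip_diacritics(value).casefold() in COUNTRIES
```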
- Keyword to designate this nature in MDD exports
country_name
7.20.3.2. Country Code (ISO 3166-1 alpha-2)
- Definition
This nature represents a country code following the ISO 3166-1 alpha-2 standard. More information is available at https://en.wikipedia.org/wiki/ISO_3166-1.
- Inference and validity check
A check is performed on the length, which must be 2 characters. The field is then compared to the database of valid country codes.
- Keyword to designate this nature in MDD exports
iso31661alpha2
7.20.3.3. Country Code (ISO 3166-1 alpha-3)
- Definition
This nature represents a country code following the ISO 3166-1 alpha-3 standard. More information is available at https://en.wikipedia.org/wiki/ISO_3166-1.
- Inference and validity check
A check is performed on the length, which must be 3 characters. The field is then compared to the database of valid country codes.
- Keyword to designate this nature in MDD exports
iso31661alpha3
7.20.4. GTIN and ISBN Codes
7.20.4.1. GTIN-8
- Definition
Global Trade Item Number of length 8. The Global Trade Item Number (GTIN) is a code that identifies any commercial unit (consumer unit or standard grouping unit…) in an international and unique way. This category corresponds, for example, to codes present on barcodes of everyday consumer products. More information is available at https://en.wikipedia.org/wiki/Global_Trade_Item_Number.
- Inference and validity check
The validity check follows the principles explained at http://www.gtin.info
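As an illustration, the standard GTIN check-digit rule can be sketched as below; the same function applies to lengths 8, 12, 13, and 14. The function name is hypothetical and the sketch does not claim to reproduce the product's exact checks.

```python
def is_valid_gtin(value: str, length: int) -> bool:
    """Check the length, the numeric content, and the GTIN check digit."""
    if len(value) != length or not (value.isascii() and value.isdigit()):
        return False
    # Starting from the rightmost digit (the check digit), weights
    # alternate 1, 3, 1, 3, ... and the weighted sum must be a multiple of 10.
    total = sum(
        int(d) * (3 if i % 2 else 1) for i, d in enumerate(reversed(value))
    )
    return total % 10 == 0
```

For example, `is_valid_gtin("4006381333931", 13)` returns `True`.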
- Keyword to designate this nature in MDD exports
gtin.format.gtin_8
7.20.4.2. GTIN-12
- Definition
Global Trade Item Number of length 12. The Global Trade Item Number (GTIN) is a code that identifies any commercial unit (consumer unit or standard grouping unit…) in an international and unique way. This category corresponds, for example, to codes present on barcodes of everyday consumer products. More information is available at https://en.wikipedia.org/wiki/Global_Trade_Item_Number.
- Inference and validity check
The validity check follows the principles explained at http://www.gtin.info
- Keyword to designate this nature in MDD exports
gtin.format.gtin_12
7.20.4.3. GTIN-13
- Definition
Global Trade Item Number of length 13. The Global Trade Item Number (GTIN) is a code that identifies any commercial unit (consumer unit or standard grouping unit…) in an international and unique way. This category corresponds, for example, to codes present on barcodes of everyday consumer products. More information is available at https://en.wikipedia.org/wiki/Global_Trade_Item_Number.
- Inference and validity check
The validity check follows the principles explained at http://www.gtin.info
- Keyword to designate this nature in MDD exports
gtin.format.gtin_13
7.20.4.4. GTIN-14
- Definition
Global Trade Item Number of length 14. The Global Trade Item Number (GTIN) is a code that identifies any commercial unit (consumer unit or standard grouping unit…) in an international and unique way. This category corresponds, for example, to codes present on barcodes of everyday consumer products. More information is available at https://en.wikipedia.org/wiki/Global_Trade_Item_Number.
- Inference and validity check
The validity check follows the principles explained at http://www.gtin.info
- Keyword to designate this nature in MDD exports
gtin.format.gtin_14
7.20.4.5. ISBN-10
- Definition
The International Standard Book Number (ISBN) is an internationally recognized number created in 1970, uniquely identifying each edition of each book published after the introduction of the ISBN, regardless of its format. In 2007, the ISBN number changed from 10 to 13 digits for compatibility with the GTIN-13 product code. More information is available at http://en.wikipedia.org/wiki/ISBN.
- Inference and validity check
The validity check follows the principles outlined in http://en.wikipedia.org/wiki/ISBN.
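The ISBN-10 rule from the linked page can be sketched as follows: the ten characters are weighted from 10 down to 1, a final X counts as 10, and the weighted sum must be a multiple of 11. Tolerating hyphens and spaces is an assumption of this sketch.

```python
def is_valid_isbn10(value: str) -> bool:
    isbn = value.replace("-", "").replace(" ", "").upper()
    if len(isbn) != 10:
        return False
    body, last = isbn[:9], isbn[9]
    # The first nine characters are digits; the last may also be X.
    if not (body.isascii() and body.isdigit()) or last not in "0123456789X":
        return False
    digits = [int(d) for d in body] + [10 if last == "X" else int(last)]
    # Weights run from 10 (first character) down to 1 (check character).
    return sum(w * d for w, d in zip(range(10, 0, -1), digits)) % 11 == 0
```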
- Keyword to designate this nature in MDD exports
isbn
7.20.5. Personal data
7.20.5.1. English civility
- Definition
Detection of abbreviated civilities in English.
- Inference and validity check
Comparison against the references “Miss”, “Ms” and “Mr”.
- Keyword to designate this nature in MDD exports
english_civility
7.20.5.2. English long Civility
- Definition
Detection of long-form civilities in English.
- Inference and validity check
Comparison against the references “Miss”, “Mrs”, “Mister”, “Master” and “Mistress”.
- Keyword to designate this nature in MDD exports
english_long_civility
7.20.5.3. English long Gender
- Definition
Detection of genders in long format in English.
- Inference and validity check
Comparison against the references “Male”, “Female” and “Undefined”.
- Keyword to designate this nature in MDD exports
english_long_gender
7.20.5.4. Nationality in English
- Definition
Detection of nationalities in English.
- Inference and validity check
Comparison against a dictionary of over 200 nationalities (e.g., “Dutch” or “Sudanese”).
- Keyword to designate this nature in MDD exports
english_nationality
7.20.5.5. English Gender
- Definition
Detection of genders in abbreviated format in English.
- Inference and validity check
Comparison against the references “M”, “F” and “U”.
- Keyword to designate this nature in MDD exports
english_short_gender
7.20.5.6. First name
- Definition
Detection of first names and their possible genders.
- Inference and validity check
Comparison against a dictionary of English and French first names (totaling more than 35,000 first names), classified by gender.
- Keyword to designate this nature in MDD exports
firstname
7.20.5.7. Full name
- Definition
Detection of full names (containing a surname and a first name).
- Inference and validity check
The length must be between 6 and 69 characters, with only alphabetic characters. The value must contain one or two first names and 1 to 5 significant words.
- Keyword to designate this nature in MDD exports
fullname
7.20.5.8. Last name
- Definition
Detection of last names.
- Inference and validity check
Comparison with reference dictionaries of last names and terms not to be considered as a last name.
- Keyword to designate this nature in MDD exports
lastname
7.20.5.9. British International Phone Number
- Definition
Represents British phone numbers coded in +44, in international format.
- Inference and validity check
Check that the string starts with +44 and that the field length is plausible.
- Keyword to designate this nature in MDD exports
british_international_phone_number
7.20.5.10. Phone number (E.164 format)
- Definition
Represents phone numbers in an international format. The list of applicable international dialing codes is established by the International Telecommunication Union in its ITU-T E.164 recommendation and its annexes, which are regularly updated. More information is available at https://en.wikipedia.org/wiki/E.164.
- Inference and validity check
Check the plausibility of the length. Check the presence and validity of the dialing code.
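A minimal sketch of these two checks is given below. The `DIALING_CODES` set is a small illustrative subset, and the lower length bound and the tolerance for spaces are assumptions; E.164 itself limits a number to 15 digits.

```python
import re

# Illustrative subset of country calling codes.
DIALING_CODES = {"1", "33", "44", "49", "352"}
# A plus sign followed by 7 to 15 digits (the lower bound is an assumption).
E164_RE = re.compile(r"\+[1-9][0-9]{6,14}")

def is_plausible_e164(value: str) -> bool:
    number = value.replace(" ", "")
    if not E164_RE.fullmatch(number):
        return False
    digits = number[1:]
    # Country calling codes are one to three digits long.
    return any(digits[:n] in DIALING_CODES for n in (1, 2, 3))
```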
- Keyword to designate this nature in MDD exports
e164_format_phone_number
7.20.5.11. US International Phone Number
- Definition
Represents US phone numbers coded in +1, in international format.
- Inference and validity check
Check that the string starts with +1 and that the field length is plausible.
- Keyword to designate this nature in MDD exports
us_international_phone_number
7.20.5.12. US National Phone Number
- Definition
Represents US phone numbers, in national format.
- Inference and validity check
Check that the structure corresponds to that of a US phone number in national format, for example with an area code and a valid length.
- Keyword to designate this nature in MDD exports
us_national_phone_number
7.20.6. Specifically French natures
7.20.6.1. French Basic Bank Account Number (RIB)
- Definition
A French Basic Bank Account Number (BBAN), also known locally as a Relevé d’Identité Bancaire (RIB), is a sequence of digits issued by the bank that uniquely identifies a bank account at the national level. It is given to a debtor or creditor to facilitate bank transfers or direct debits from this account.
- Inference and validity check
The check is performed on the number of characters and the nature of the characters that make up the field. More details on the validity principle are available at https://fr.wikipedia.org/wiki/Basic_Bank_Account_Number
- Keyword to designate this nature in MDD exports
rib
7.20.6.2. French International Phone Number
- Definition
Represents French phone numbers coded in +33, in international format.
- Inference and validity check
Check that the string starts with +33 and that the field length is plausible.
- Keyword to designate this nature in MDD exports
french_international_phone_number
7.20.6.3. French National Phone Number
- Definition
Represents French phone numbers, in national format (e.g., starting with 06 and 10 digits long).
- Inference and validity check
Check that the structure corresponds to that of a French phone number in national format.
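One possible reading of this structure check is sketched below: a leading 0, a non-zero digit, then eight more digits. The pattern and the tolerance for spaces, dots, and hyphens as separators are assumptions, not documented behavior.

```python
import re

# Assumed structure: 0, then a digit from 1 to 9, then eight digits.
FR_NATIONAL_RE = re.compile(r"0[1-9][0-9]{8}")

def is_french_national_phone(value: str) -> bool:
    # Remove common separators before matching (assumption).
    compact = re.sub(r"[ .-]", "", value)
    return bool(FR_NATIONAL_RE.fullmatch(compact))
```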
- Keyword to designate this nature in MDD exports
french_national_phone_number
7.20.6.4. French Street Number Complement
- Definition
This nature represents a French street number supplement, for example “bis” or “ter”.
- Inference and validity check
The admitted values (ignoring case) are:
bis
ter
quater
quinquies
- Keyword to designate this nature in MDD exports
french_street_number_complement
7.20.6.5. French Street Type
- Definition
This nature represents a type of French street, for example “rue” or “avenue”.
- Inference and validity check
A comparison is made with a database of about sixty known types (for example “Terre-plein”) as well as their commonly observed abbreviations (typically 4 to 5 abbreviations per type).
- Keyword to designate this nature in MDD exports
french_street_type
7.20.6.6. French Zip Code
- Definition
This nature represents a French postal code.
- Inference and validity check
A check is performed on the length, which must be 5 characters. The field is then compared to the database of valid postal codes.
- Keyword to designate this nature in MDD exports
french_zip_code
7.20.6.8. Insee Commune Code
- Definition
The municipality code contains five digits or letters (concatenation of the department code and the three-digit coding of the municipality or municipal district in Paris, Lyon, and Marseille, or two digits for overseas municipalities). It is most frequently used in INSEE statistical data and is also used in identity and civil status numbers of individuals (see above). More information is available at https://fr.wikipedia.org/wiki/Code_Insee.
- Inference and validity check
A length check is performed, then membership in the list of valid codes is verified.
- Keyword to designate this nature in MDD exports
insee_commune_code
7.20.6.9. French Job Seeker Identifier (Jobseeker ID)
- Definition
Composed of 8 to 12 characters, the job seeker identifier is created by France Travail and is unique to each job seeker. More information is available at https://www.netpublic.fr/travail/retrouver-identifiant-pole-emploi/.
- Inference and validity check
The check is performed using the string length, the validity of the regional key expected in the first three characters, and the type of characters in the rest of the string.
- Keyword to designate this nature in MDD exports
pole_emploi_id
7.20.6.10. SIREN
- Definition
In France, the Système d’Identification du Répertoire des ENtreprises, or SIREN number, is a unique INSEE code used to identify a company, a public or private organization, a self-employed individual, or an association with activities in France. More information is available at https://fr.wikipedia.org/wiki/Syst%C3%A8me_d%27identification_du_r%C3%A9pertoire_des_entreprises.
- Inference and validity check
The check relies on the string length and the numeric nature of its content. An algorithmic verification is performed using the Luhn algorithm.
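The Luhn verification can be sketched as follows, assuming a 9-digit SIREN. The helper `luhn_ok` is generic, so the same function can be applied to a 14-digit string; the function names are hypothetical.

```python
def luhn_ok(digits: str) -> bool:
    """Return True if the digit string satisfies the Luhn algorithm."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        # Double every second digit, counting from the right.
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return total % 10 == 0

def is_valid_siren(value: str) -> bool:
    # Length check, numeric content check, then the Luhn check.
    return (
        len(value) == 9
        and value.isascii()
        and value.isdigit()
        and luhn_ok(value)
    )
```

For example, `is_valid_siren("732829320")` returns `True`, and `luhn_ok("73282932000074")` also returns `True` for the corresponding 14-digit number.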
- Keyword to designate this nature in MDD exports
siren
7.20.6.11. SIRET
- Definition
The Système d’Identification du Répertoire des ÉTablissements, or SIRET number, is an INSEE code used to identify an establishment or a French company. More information is available at https://fr.wikipedia.org/wiki/Syst%C3%A8me_d%27identification_du_r%C3%A9pertoire_des_%C3%A9tablissements.
- Inference and validity check
The check relies on the string length and the numeric nature of its content. An algorithmic verification is performed using the Luhn algorithm.
- Keyword to designate this nature in MDD exports
siret
7.20.6.12. French long Civility
- Definition
Detection of long-form civilities in French.
- Inference and validity check
Comparison against the references “Madame”, “Mademoiselle” and “Monsieur”.
- Keyword to designate this nature in MDD exports
french_long_civility
7.20.6.13. French long Gender
- Definition
Detection of genders in long format in French.
- Inference and validity check
Comparison (case-insensitive and accent-insensitive) against the references “feminin”, “masculin”, “femme”, “homme”, “inconnu”, “indefini” for detection. For validation, a capitalized format is checked.
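The difference between detection and validation can be sketched as below. Reading "capitalized format" as "first letter uppercase, remaining letters lowercase" is an assumption, and the function names are hypothetical.

```python
import unicodedata

REFERENCES = {"feminin", "masculin", "femme", "homme", "inconnu", "indefini"}

def _fold(text: str) -> str:
    # Remove accents, then ignore case.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

def detects_french_long_gender(value: str) -> bool:
    # Detection: case-insensitive and accent-insensitive comparison.
    return _fold(value) in REFERENCES

def is_valid_french_long_gender(value: str) -> bool:
    # Validation: the value must also be capitalized (assumed reading).
    return detects_french_long_gender(value) and value == value.capitalize()
```

With this reading, "féminin" and "FEMININ" are both detected, but only "Féminin" is valid.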
- Keyword to designate this nature in MDD exports
french_long_gender
7.20.6.14. Nationality in French
- Definition
Detection of nationalities in French.
- Inference and validity check
Comparison against a dictionary of over 400 gendered nationalities (e.g., “Émirienne” or “Nicaraguayen”).
- Keyword to designate this nature in MDD exports
french_nationality
7.20.6.15. French Civility
- Definition
Detection of abbreviated civilities in French.
- Inference and validity check
Comparison against the references “Mme”, “Mlle” and “M.”. Ability to detect variants during type inference.
- Keyword to designate this nature in MDD exports
french_short_civility
7.20.6.16. French Gender
- Definition
Detection of genders in abbreviated format in French.
- Inference and validity check
Comparison against the references “H”, “F” and “I”.
- Keyword to designate this nature in MDD exports
french_short_gender