12. Data Discovery and Data Observability (MDD module)

In Tale of Data, the Data Discovery module (MDD: Mass Data Discovery) and the Data Observability module are coupled.

Together, these two modules let you run analyses and configure automated monitoring across a large number of datasets.

It is possible to analyze all or part of the datasets:

  • of a database server,

  • of a cloud data platform (e.g., Snowflake, Databricks, …),

  • of a file system, local or cloud (all files containing structured data will be analyzed: Excel, CSV, XML, Parquet, etc.).

Analyses and checks can be scheduled at the frequency of your choice.

Typically, the Mass Data Discovery module is used first, because it produces an exhaustive map of the data by answering the following questions:

  • What datasets are available?

  • Where are they located?

  • What do they contain?

  • What is their quality score?

The following information is reported automatically:

  1. The number of rows and columns of each analyzed dataset

  2. For each analyzed column:

    1. The data type (date, integer, decimal, text, etc.).

    2. The nature of the data: whether it is a first name, a phone number, an email, a country, an IBAN, etc. Tale of Data offers dozens of pre-built natures, and you can add your own natures, which will also be recognized automatically.

    3. The number and percentage of missing or invalid data (e.g., a malformed phone number or email)

    4. Statistics for the column (mean, standard deviation, number of distinct values, min, max, percentiles, etc.)
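To make the per-column report more concrete, here is a minimal Python sketch of the kind of figures listed above (missing and invalid counts, distinct values, summary statistics). It is an illustration only, not Tale of Data's implementation: the function names, the email regular expression, and the sample values are all assumptions made for this example.

```python
import re
import statistics

# Hypothetical validity rule for an "email" nature (an assumption for this sketch).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or str(value).strip() == ""


def profile_column(values, is_valid=None):
    """Return basic counts for one column; is_valid is an optional per-value check."""
    present = [v for v in values if not is_missing(v)]
    invalid = [v for v in present if is_valid is not None and not is_valid(v)]
    total = len(values)
    bad = (total - len(present)) + len(invalid)
    return {
        "rows": total,
        "missing": total - len(present),
        "invalid": len(invalid),
        "missing_or_invalid_pct": 100.0 * bad / total if total else 0.0,
        "distinct": len(set(present)),
    }


def numeric_stats(values):
    """Return summary statistics for the non-missing numeric values."""
    present = [v for v in values if not is_missing(v)]
    if not present:
        return {}
    return {
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present) if len(present) > 1 else 0.0,
    }


# Made-up sample data, for illustration only.
emails = ["ann@example.com", "not-an-email", None, ""]
email_profile = profile_column(emails, lambda v: bool(EMAIL_RE.match(v)))
amount_stats = numeric_stats([1, 2, 3, None])
```

Each run of such a profile yields one data point per metric, which is what makes the results usable as time series later on.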

Data Observability is a proactive approach to data management that alerts you in real time when one of the following issues occurs in your data:

  • Freshness

    My data is not updated at the correct frequency.

  • Volume

    I have too many or too few rows following a process.

  • Quality

    I have missing, malformed, or aberrant data.

  • Schema

    There has been a change in the columns (addition, deletion, type change) that causes some processes to fail.

  • Lineage (also called Data Lineage)

    My processing chain is broken. I want to know where the problem is coming from and which downstream processes are impacted.
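As an illustration of the Schema category, the sketch below compares two snapshots of a dataset's columns and reports additions, deletions, and type changes. It is a hedged example, not the product's detection logic: the `diff_schema` function and the representation of a schema as a dictionary mapping column names to type names are assumptions made for this example.

```python
def diff_schema(previous, current):
    """Compare two {column_name: type_name} snapshots of the same dataset."""
    prev_cols, cur_cols = set(previous), set(current)
    return {
        "added": sorted(cur_cols - prev_cols),
        "removed": sorted(prev_cols - cur_cols),
        "type_changed": sorted(
            col for col in prev_cols & cur_cols if previous[col] != current[col]
        ),
    }


def has_schema_change(previous, current):
    """True if any column was added, removed, or changed type."""
    return any(diff_schema(previous, current).values())


# Made-up snapshots from two successive analyses.
yesterday = {"id": "integer", "email": "text", "amount": "integer"}
today = {"id": "integer", "amount": "decimal", "phone": "text"}
changes = diff_schema(yesterday, today)
```

Here `changes` lists `phone` as added, `email` as removed, and `amount` as a type change, which is the kind of detail an alert would need in order to point at the affected downstream processes.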

The Data Observability module of Tale of Data uses the information collected by the Mass Data Discovery module, at the frequency of your choice, to set up targeted monitoring and alerts. Because this information is collected repeatedly, it forms time series that can be analyzed and monitored.

For example, with a monitoring frequency of one day, you can choose to be alerted when the number of valid values in a dataset decreases by more than X% within 24 hours.
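The rule in this example can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the module's actual logic: `percent_drop` and `should_alert` are hypothetical names, and the series is assumed to hold one count of valid values per monitoring run, oldest first.

```python
def percent_drop(previous, current):
    """Percentage decrease from previous to current (negative if it increased)."""
    if previous <= 0:
        return 0.0  # no meaningful baseline to compare against
    return 100.0 * (previous - current) / previous


def should_alert(valid_counts, threshold_pct):
    """Alert when the latest run dropped by more than threshold_pct versus the run before."""
    if len(valid_counts) < 2:
        return False  # need at least two runs to measure a change
    return percent_drop(valid_counts[-2], valid_counts[-1]) > threshold_pct


# Made-up daily counts of valid values: 980 -> 850 is a drop of about 13.3%.
daily_valid_counts = [1000, 980, 850]
alert = should_alert(daily_valid_counts, threshold_pct=10)
```

With a daily monitoring frequency, two consecutive entries are 24 hours apart, so comparing the last two points corresponds to the "X% in 24 hours" rule, with X set to 10 here.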

Like all Tale of Data modules, the Mass Data Discovery and Data Observability modules can be used by non-technical users. Because the data is discovered automatically, setup time is minimal: checks can be launched out of the box within a few minutes, once the storage system credentials have been entered into the catalog.

In addition to mass checks, targeted checks (implemented by Tale of Data flows) can be scheduled at the frequency of your choice. These targeted checks let you:

  • perform advanced custom checks specific to a dataset (e.g., business rules involving multiple columns, fuzzy joins to identify differences between two or more datasets, configurable deduplication functions),

  • perform routing (i.e., separating out the records that do not meet the specified criteria).
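In Tale of Data these targeted checks are built as flows, not written as code. Purely to illustrate the idea, the Python sketch below applies a business rule involving two columns and routes the records that fail it to a separate list. The `route` and `end_not_before_start` names, the column names, and the sample records are assumptions made for this example.

```python
from datetime import date


def end_not_before_start(record):
    """Hypothetical multi-column business rule: a contract cannot end before it starts."""
    return record["end_date"] >= record["start_date"]


def route(records, rule):
    """Split records into those that satisfy the rule and those that do not."""
    accepted, rejected = [], []
    for record in records:
        (accepted if rule(record) else rejected).append(record)
    return accepted, rejected


# Made-up records, for illustration only.
contracts = [
    {"id": 1, "start_date": date(2024, 1, 1), "end_date": date(2024, 12, 31)},
    {"id": 2, "start_date": date(2024, 6, 1), "end_date": date(2024, 5, 1)},
    {"id": 3, "start_date": date(2024, 3, 1), "end_date": date(2024, 3, 1)},
]
valid_contracts, rejected_contracts = route(contracts, end_not_before_start)
```

Keeping the rejected records, rather than dropping them, makes it possible to review or correct them in a separate step.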

Hint

A full video tutorial on the MDD module is available here.