12.1. View of analyses
The list of analyses is found in the Mass Data Discovery section of Tale of Data, in the first entry of the submenu, within the first tab.
To create a new analysis, click on New Analysis at the top right. The next section details the configuration of the new analysis.
To modify an existing analysis, click the small pencil icon on its row.
To run a new analysis, click the Run Now button in the configuration window once the analysis is fully configured. An analysis that has already been configured can be started directly with the “play” button next to the small pencil.
12.2. Configuration of an analysis
12.2.1. General settings
The general settings let you define a name, a description, and a group for the analysis.
12.2.2. Data sources
In the data sources tab, you can select several sources at once from the user’s catalog of available sources. Hierarchies within a source can be expanded to pick a group of tables, a single table, or individual files. This gives Tale of Data great flexibility in defining the systems to analyze: each analysis can target datasets that vary widely in number and complexity.
12.2.3. Natures
The Natures tab lets you select the data natures to identify during the analysis. At least one nature must be selected.
12.2.4. Advanced statistics
Advanced statistics can optionally be calculated. This calculation puts load on the server that provides the data, and it makes the analysis slower and heavier, so enable it only if the results are useful to you. The generated statistics can be very useful for characterizing datasets at scale.
The collected statistics are:
- count_distinct
Count of distinct values in the column
- mean
Calculation of the average value for numeric columns
- stddev
Calculation of the standard deviation (variation or dispersion) of values for numeric columns
- min
Minimum value of the column
- percentile_5
Calculation of the 5th percentile for numeric columns, below which 5% of observations fall
- percentile_25
Calculation of the 25th percentile for numeric columns, below which 25% of observations fall
- percentile_50
Calculation of the 50th percentile for numeric columns, below which 50% of observations fall
- percentile_75
Calculation of the 75th percentile for numeric columns, below which 75% of observations fall
- percentile_95
Calculation of the 95th percentile for numeric columns, below which 95% of observations fall
- max
Maximum value of the column
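As an illustration, the statistics listed above can be reproduced for a single numeric column with the Python standard library. This is only a sketch: the function names are hypothetical, and the percentile method (linear interpolation) and the use of the sample standard deviation are assumptions, since the manual does not state which definitions Tale of Data applies.

```python
import math
import statistics


def percentile(sorted_values, p):
    """Percentile p (0..100) of an already sorted list, by linear interpolation.

    The interpolation method is an assumption made for this sketch.
    """
    if not sorted_values:
        raise ValueError("percentile() needs at least one value")
    rank = (len(sorted_values) - 1) * p / 100
    lower = math.floor(rank)
    upper = math.ceil(rank)
    weight = rank - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def advanced_statistics(values):
    """Compute the listed statistics for one numeric column (None = missing)."""
    data = sorted(v for v in values if v is not None)
    stats = {
        "count_distinct": len(set(data)),
        "mean": statistics.fmean(data),
        # Sample standard deviation; assumed, not confirmed by the manual.
        "stddev": statistics.stdev(data) if len(data) > 1 else 0.0,
        "min": data[0],
    }
    for p in (5, 25, 50, 75, 95):
        stats[f"percentile_{p}"] = percentile(data, p)
    stats["max"] = data[-1]
    return stats


column = [1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10]
result = advanced_statistics(column)
for name, value in result.items():
    print(f"{name}: {value}")
```

Running the same function on a local extract is a quick way to sanity-check the values reported after an analysis, keeping in mind that small differences can come from the percentile definition.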
Two options are available for counting distinct values:
- An estimation-based calculation of the number of distinct values
- An exact count of the distinct values
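To make the trade-off between the two options concrete, here is a minimal sketch that compares an exact count with an estimate. The estimator shown is a K-minimum-values sketch, chosen only for illustration: the manual does not say which estimation algorithm Tale of Data uses, and all names below are hypothetical.

```python
import hashlib
import heapq


def exact_distinct(values):
    """Exact count: memory grows with the number of distinct values."""
    return len(set(values))


def _unit_hash(value):
    """Map a value to a pseudo-uniform number in [0, 1)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimated_distinct(values, k=256):
    """K-minimum-values estimate: keeps only the k smallest hashes in memory."""
    heap = []     # max-heap (negated hashes) holding the k smallest hashes
    kept = set()  # hashes currently in the heap, to skip duplicates
    for value in values:
        h = _unit_hash(value)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct hashes: the count is exact
    # The k-th smallest hash of n uniform values is close to k / n.
    return int((k - 1) / -heap[0])


values = list(range(20000)) * 2  # 20000 distinct values, each seen twice
exact = exact_distinct(values)
estimate = estimated_distinct(values)
print(exact, estimate)
```

The estimate uses a fixed amount of memory whatever the column size, at the cost of a relative error of a few percent with `k=256`. The exact count has no error but must keep every distinct value.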
These statistics are accessible after the analysis in two ways:
- In the “anomaly mapping” section, in an Advanced Statistics section.
- Within the raw exported analysis data, if the option has been enabled.
12.2.5. Export of results
This tab lets you export the raw results to a table at the end of the analysis. If the target is a database, you can specify the table directly. If the target is a file system (such as “my workspace”), you must choose a CSV or Parquet format.
Two write modes are available:
- Append data to the end of the target table (the results are aggregated into a single table and added after the existing data).
- Overwrite the data in the target table (the results replace the existing data to create a new table).
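The difference between the two modes can be sketched with a CSV target. This is an illustration only, not Tale of Data’s export code: the function names, the file name, and the column names (`table`, `column`, `nature`) are hypothetical.

```python
import csv
import os
import tempfile


def export_results(path, rows, fieldnames, mode="append"):
    """Write result rows to a CSV file in 'append' or 'overwrite' mode."""
    if mode not in ("append", "overwrite"):
        raise ValueError(f"unknown write mode: {mode!r}")
    has_data = os.path.exists(path) and os.path.getsize(path) > 0
    open_mode = "a" if mode == "append" else "w"
    with open(path, open_mode, newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        # Write the header only when starting a new (or emptied) file.
        if open_mode == "w" or not has_data:
            writer.writeheader()
        writer.writerows(rows)


def read_rows(path):
    """Read the exported rows back as a list of dictionaries."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


target = os.path.join(tempfile.mkdtemp(), "mdd_results.csv")
fields = ["table", "column", "nature"]
first_run = [{"table": "customers", "column": "email", "nature": "email"}]
second_run = [{"table": "orders", "column": "phone", "nature": "phone"}]

export_results(target, first_run, fields, mode="append")
export_results(target, second_run, fields, mode="append")
after_append = read_rows(target)      # rows from both runs

export_results(target, second_run, fields, mode="overwrite")
after_overwrite = read_rows(target)   # rows from the last run only
```

Append suits a history of successive analyses in one table. Overwrite suits a target that should always reflect the latest run only.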
When an export target has been configured, it will be mentioned at the bottom right of the panel as shown in the screenshot below.
12.2.6. Flow or sequence of flows to trigger at the end of the analysis
In this tab, you can select existing flows or sequences of flows to trigger automatically at the end of the analysis, in order to automate certain tasks (for example, preparing data from the results obtained).
12.2.7. Scheduling
An analysis can also be scheduled to run automatically. Simply define the execution frequency directly in this tab.
Hint
The interval between two executions must be at least 10 minutes.
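If schedules are prepared outside the interface (for example, in a planning script), the 10-minute minimum can be checked up front. The sketch below is hypothetical: `validate_interval` is not a Tale of Data function, it only encodes the rule stated in the hint.

```python
from datetime import timedelta

# Minimum interval between two executions, as stated in the hint above.
MIN_INTERVAL = timedelta(minutes=10)


def validate_interval(interval):
    """Return the interval unchanged, or raise if it is under 10 minutes."""
    if interval < MIN_INTERVAL:
        raise ValueError(
            f"interval {interval} is too short; minimum is {MIN_INTERVAL}"
        )
    return interval


hourly = validate_interval(timedelta(hours=1))  # accepted
```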
12.3. View of executions
When an analysis is launched, Mass Data Discovery scans all the datasets in the selected data sources. The scan infers data types where necessary (for example, for CSV files) and infers natures (emails, phone numbers, etc.).
All executions, whether launched, in progress, interrupted, or completed, will be visible in the Analysis Executions table. Once the execution is complete, the results can be viewed by clicking the Open button, as shown below.
12.3.1. Stopping and resuming executions
A running execution can be stopped at any time by pressing the red square Stop Analysis button.
It can be resumed at any time by pressing the orange Resume Analysis button.
During the execution of an analysis, some tables may produce read errors. This can occur, for example, if:
- the table has formatting issues, or
- some columns of the table are not readable by Tale of Data, or
- the connection does not allow access to the table (read forbidden)
If this occurs, the MDD analysis will continue, and an error report listing the tables that failed to scan, along with the reason for each, will be immediately available for download, alongside the results that could be calculated.
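The "continue and report" behaviour described above follows a common pattern, sketched here in Python. This is not Tale of Data’s implementation: `scan_sources`, `fake_scan`, and the table names are hypothetical stand-ins used to show how one failing table does not stop the others.

```python
def scan_sources(tables, scan_table):
    """Scan every table; collect failures instead of aborting the run."""
    results = {}
    error_report = []
    for table in tables:
        try:
            results[table] = scan_table(table)
        except Exception as exc:  # any read error is recorded, not re-raised
            error_report.append({"table": table, "reason": str(exc)})
    return results, error_report


def fake_scan(table):
    """Stand-in scanner that fails on two tables, for demonstration only."""
    if table == "hr.salaries":
        raise PermissionError("read forbidden")
    if table == "legacy.export_2019":
        raise ValueError("formatting issue: inconsistent column count")
    return {"columns_scanned": 3}


tables = ["crm.customers", "hr.salaries", "legacy.export_2019", "shop.orders"]
results, error_report = scan_sources(tables, fake_scan)

for entry in error_report:
    print(f"{entry['table']}: {entry['reason']}")
```

In this sketch, `results` holds the tables that were scanned successfully and `error_report` lists each failed table with its reason, mirroring the downloadable report.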