2. Catalog view
2.1. Catalog view
The catalog can be accessed via the main Tale of Data menu.
The catalog is divided into two parts:
2.1.1. Accessing datasets
The Access to datasets view in the catalog lets you:
- Add new data sources (files, databases).
- Find the datasets you need through a single, easy-to-use graphical interface that stays the same whatever the underlying storage system (files, databases).
- Upload files from your computer.
- Import one or more datasets into a flow.
- Access the datasets produced by a flow (target).
- Preview data.
The catalog screen appears as shown below:
It is divided into the following main zones:
2.1.2. Create or Edit a CSV file
From the catalog, it is possible to edit the CSV files available in the user’s workspace.
The view will then adjust to allow the user to easily modify the desired field.
Creating an empty CSV file is also possible.
2.1.3. Accessing repositories
This screen gives you access to all the repositories available to the user.
It is divided into the following main zones:
2.2. Upload files into Tale of Data
2.2.1. Uploading a file
My workspace: the destination repository to which the file will be uploaded.
CSV file repair: this makes a CSV file compatible with processing in a distributed environment:
- Delete the rows before the first row of data or the header.
- Standardize the number of columns (the longest row acts as the benchmark, and blank cells are added where necessary).
- Delete cells with over 4,096 characters.
- Eliminate empty rows.
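The repair steps listed above can be approximated outside the tool with a short Python sketch. This is not Tale of Data's implementation: the function name `repair_csv_rows`, the `header_index` parameter (how the first useful row is located) and the choice to blank out oversized cells rather than drop them are all assumptions made for illustration.

```python
import csv
import io

MAX_CELL_CHARS = 4096  # limit quoted in the repair description


def repair_csv_rows(rows, header_index=0):
    """Apply the four repair steps to a list of rows (lists of strings).

    header_index is a hypothetical parameter: the position of the header
    (or first data row). Everything before it is discarded.
    """
    # Step 1: delete the rows before the header or first row of data.
    rows = rows[header_index:]
    # Step 3 (assumed behavior): replace oversized cells with blanks.
    rows = [[c if len(c) <= MAX_CELL_CHARS else "" for c in r] for r in rows]
    # Step 4: eliminate rows that have no cells or only blank cells.
    rows = [r for r in rows if any(c.strip() for c in r)]
    if not rows:
        return []
    # Step 2: pad every row to the width of the longest row.
    width = max(len(r) for r in rows)
    return [r + [""] * (width - len(r)) for r in rows]


raw = "exported by tool X\n\nid,name,city\n1,Ada\n\n2,Grace,NYC,extra\n"
parsed = list(csv.reader(io.StringIO(raw)))
repaired = repair_csv_rows(parsed, header_index=2)
for row in repaired:
    print(row)
```

After repair, every row has the same number of cells, which is what a row-by-row distributed reader generally expects.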
2.2.2. Alter the destination repository for upload
If a new file being uploaded has the same name as a file already in the My workspace upload repository, the new file will replace the old one. To change the destination repository, simply select it in the tree before uploading:
2.2.3. Creating, renaming and deleting repositories and files
The following zone is in the top left-hand corner of the catalog screen:
Rename the selected repository (and all its content) or file.
Configure the import of selected datasets to a (new or existing) flow.
Caution
Deleted repositories and files cannot be recovered.
2.3. Creating flows from the catalog
The start guide gives simple details on how to create a flow from the home screen, and how to add source data that can also be viewed in the catalog.
There are, however, two other ways of creating a flow:
- By selecting datasets in catalog view and using the add to a flow button (shown in the screen capture below).
- By going to flow view and adding a flow. This must be done before any dataset is added.
We detail below the first of these methods, which is usually the quickest and easiest as it uses datasets to create new flows and also to add them to existing flows.
2.3.1. Add one or more datasets to an (existing or new) flow
This method is the quickest of all. The following zone appears in the top left of the catalog screen.
Selection zone for the dataset(s) to be imported into a flow. To select several datasets, hold down CTRL while clicking each relevant dataset.
This button is available both on the catalog screen and in the configuration panel of each source and sink within a flow:
The following window will open once you have clicked the configuration button for an import into a flow:
You can quickly add selected datasets to an existing or new flow, while at the same time configuring how the datasets will interact with each other (Addition):
- Standalone dataset
The dataset will be added without any link in a flow.
- Prepared dataset
The dataset will be added to a flow with a link to a preparation.
- Union dataset
The dataset will be added to a unique union node. Dataset order will be maintained.
- Join dataset
The dataset will be added to a unique join node.
- Dataset for enrichment
The dataset will be added to a unique enrichment node as the dataset receiving new columns.
- Enrichment dataset
The dataset will be added to a unique enrichment node as the dataset donating new columns.
Tip
The parameters of each dataset can be altered in the flow if necessary.
You can reorder datasets using drag and drop. Order matters only for union datasets.
Once the operation has been validated, the view will switch to the (new or existing) flow concerned in Flow Designer.
2.3.2. Configuring the import of selected datasets to a (new or existing) flow
As part of using Tale of Data, there may be times when the user needs to bulk move tables from one system to another, or from one folder or group of tables to another within a given system. This operation can be broken down into a series of flows, each transferring a table from a source to a target. In practice, the work required to create these flows can become repetitive and tedious, which is why the Migrate Datasets function was created.
With just one click, it’s possible to generate the necessary flows to migrate the selected dataset(s) from the catalog to a target in a specific location within a database or a specific folder. A flow containing a source and a target will be automatically created for each selected dataset, along with a dedicated sequence of flows, allowing the complete migration to be carried out in a single operation. It’s also possible to schedule this sequence of flows so it runs regularly in the future. The more tables there are to migrate, the more time this feature can save.
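The structure produced by a migration (one source-to-target flow per selected dataset, grouped into a sequence) can be pictured with the following minimal sketch. It is a conceptual model only: `FlowSpec`, `plan_migration` and the dataset paths are hypothetical names chosen for illustration, not part of Tale of Data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowSpec:
    """Hypothetical description of one generated flow: a source and a target."""
    name: str
    source: str
    target: str


def plan_migration(datasets, target_location):
    """Return one source-to-target flow per dataset, in the order given.

    The returned list plays the role of the dedicated sequence of flows.
    """
    flows = []
    for dataset in datasets:
        table = dataset.rsplit("/", 1)[-1]  # keep only the table or file name
        flows.append(
            FlowSpec(
                name=f"migrate_{table}",
                source=dataset,
                target=f"{target_location}/{table}",
            )
        )
    return flows


# Illustrative dataset paths; real catalog paths will differ.
sequence = plan_migration(["crm/customers", "crm/orders"], "warehouse/staging")
for flow in sequence:
    print(f"{flow.name}: {flow.source} -> {flow.target}")
```

The sketch makes the time saving visible: the number of flows grows with the number of selected datasets, while the user action stays the same.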
Note
This migration operation can be extremely useful for decoupling the tables containing the original data from those used by generic flows for performing reusable processing operations. To load a different dataset, you simply need to launch a migration sequence, which will load the new data, and then run the generic sequence that handles the actual processing. This way, you can choose to load different versions of datasets (for example, snapshots taken at different times), training datasets, datasets containing only problematic cases (extracted by other flows), or shortened datasets to speed up processing development. The data lineage will allow you to see all the data sources feeding these generic tables and visually understand the processing logic. This approach makes it easier to reuse flows and to create generic, modular, and reusable tools.
2.4. Other catalog view functions
2.4.1. Deletion of relational database tables
To delete a relational database table (e.g. MariaDB, SQL Server, PostgreSQL, etc.), select it and click delete:
This button will delete the table, so long as the database account associated with the data source has the necessary authorizations.
2.4.2. Display the number of rows in a dataset
Right-click a dataset in the catalog to access information on this dataset:
Number of rows in the dataset: The time required to calculate this information will depend on storage type (file, database).
2.4.3. Display dataset statistics
You can access statistics for the values in the dataset (calculation time depends on the type of storage and on the number of columns and rows):
The following will be calculated for each column in a dataset:
- Count
number of non-empty values in the column.
- Mean
average column value (does not apply to text or Boolean data).
- Stddev
standard deviation of column values (does not apply to text or Boolean data).
- Min
lowest column value.
- Percentile 25
column value below which 25% of the dataset rows lie (does not apply to text or Boolean data).
- Percentile 50
column value below which 50% of the dataset rows lie (does not apply to text or Boolean data). This is therefore the median.
- Percentile 75
column value below which 75% of the dataset rows lie (does not apply to text or Boolean data).
- Max
highest column value.
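To make these definitions concrete, the sketch below computes the same eight figures for a single column using Python's standard library. It is an illustration, not the tool's algorithm: the function name `describe_column`, the use of the sample standard deviation and the "inclusive" percentile interpolation are assumptions, so results may differ slightly from those shown in the catalog.

```python
import statistics


def describe_column(values):
    """Summarize one column; None and empty strings count as empty values."""
    present = [v for v in values if v is not None and v != ""]
    summary = {"count": len(present)}
    if not present:
        return summary
    summary["min"] = min(present)
    summary["max"] = max(present)
    # Mean, stddev and percentiles do not apply to text or Boolean data.
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if numeric:
        summary["mean"] = statistics.mean(present)
        if len(present) >= 2:
            # Assumption: sample standard deviation (n - 1 denominator).
            summary["stddev"] = statistics.stdev(present)
            # Assumption: linear interpolation, "inclusive" method.
            p25, p50, p75 = statistics.quantiles(present, n=4, method="inclusive")
            summary.update({"percentile_25": p25, "percentile_50": p50, "percentile_75": p75})
    return summary


amounts = describe_column([10, 20, None, 30, 40, ""])
cities = describe_column(["Paris", "", "Lyon"])
print(amounts)
print(cities)
```

For the numeric column, the count is 4 because the two empty values are ignored, and Percentile 50 equals the median (25.0). The text column only reports count, min and max.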
2.4.4. Downloading a dataset from the catalog
When you select a dataset in the catalog tree, a configuration zone will appear.
Configuration options depend on the type of dataset selected. For example, if you select a CSV file but are not satisfied with the auto-detect results, you can change settings such as the separator, encoding, etc.
If you change the configuration, you must click Apply to confirm the new settings.
Click the button to the right of Apply to download the dataset.
You will receive a gzip-compressed CSV file (file.gz).
Note
Tale of Data lets you download up to 1,000,000 rows from the catalog (rows that exceed this limit will be ignored).
Depending on the number of rows and columns involved, download can take between a few seconds and a few minutes.
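A downloaded .gz file can be read without decompressing it by hand. The following minimal sketch uses only Python's standard library; the helper name `read_gzipped_csv` is hypothetical, and the comma separator and UTF-8 encoding are assumptions that should be adjusted to match the actual export.

```python
import csv
import gzip
import io


def read_gzipped_csv(path_or_file, delimiter=",", encoding="utf-8"):
    """Read a gzip-compressed CSV (path or binary file object) into a list of rows."""
    with gzip.open(path_or_file, mode="rt", encoding=encoding, newline="") as handle:
        return list(csv.reader(handle, delimiter=delimiter))


# In-memory stand-in for a downloaded file, so the sketch runs as-is.
fake_download = io.BytesIO(gzip.compress("id,name\n1,Ada\n2,Grace\n".encode("utf-8")))
rows = read_gzipped_csv(fake_download)
header, records = rows[0], rows[1:]
print(header, len(records))
```

With a real download, pass the path of the file on disk instead of the in-memory buffer.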
2.5. Flow, dataset and record lineage
Lineage lets you view a full data processing chain involving multiple flows, based on one dataset.
In catalog datasets view, select a dataset (in this example: customers_tod_final_particuliers_deduped.parquet).
The Lineage window will open. Two approaches will be offered:
Upstream Lineage.
Downstream Lineage.
2.5.1. Upstream lineage
This view lets you browse through the flows and datasets that helped create the selected dataset.
Access any upstream flow by selecting it and then clicking Open the Flow.
2.5.2. Downstream lineage
This view lets you browse through the flows and datasets fed by the selected dataset. Access any downstream flow by selecting it and then clicking Open the Flow.
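Conceptually, upstream and downstream lineage are two traversals of the same graph of flows and datasets, in opposite directions. The sketch below illustrates the idea with a breadth-first search; the `FLOWS` table, the flow and dataset names and the `lineage` function are hypothetical and do not reflect how Tale of Data stores or computes lineage.

```python
from collections import deque

# Hypothetical lineage data: flow name -> (input datasets, output dataset).
FLOWS = {
    "clean_customers": (["customers_raw.csv"], "customers_clean.parquet"),
    "dedupe_customers": (["customers_clean.parquet"], "customers_deduped.parquet"),
    "score_customers": (
        ["customers_deduped.parquet", "scores.csv"],
        "customers_scored.parquet",
    ),
}


def lineage(dataset, direction):
    """Return (flow, dataset) pairs reachable from dataset, nearest first.

    direction is "upstream" (what helped create it) or "downstream" (what it feeds).
    """
    if direction not in ("upstream", "downstream"):
        raise ValueError("direction must be 'upstream' or 'downstream'")
    seen, result, queue = {dataset}, [], deque([dataset])
    while queue:
        current = queue.popleft()
        for flow, (inputs, output) in FLOWS.items():
            if direction == "upstream" and output == current:
                neighbours = inputs
            elif direction == "downstream" and current in inputs:
                neighbours = [output]
            else:
                continue
            for neighbour in neighbours:
                if neighbour not in seen:
                    seen.add(neighbour)
                    result.append((flow, neighbour))
                    queue.append(neighbour)
    return result


upstream = lineage("customers_deduped.parquet", "upstream")
downstream = lineage("customers_deduped.parquet", "downstream")
print(upstream)
print(downstream)
```

Starting from the deduplicated dataset, the upstream traversal walks back through the two flows that produced it, while the downstream traversal finds the single flow that consumes it.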
Lineage can also be viewed in the flow via any source or sink node configurator: