2. Catalog view

2.1. Catalog view

The catalog can be accessed via the main Tale of Data menu.

The catalog is divided into two parts:

  • Access image2 to datasets

  • Access image3 to repositories

image1

2.1.1. Accessing datasets

The Access to datasets view in the catalog lets you:

  • Add new data sources (files, databases).

  • Find the datasets you need through a single, easy-to-use graphical interface that remains the same whatever the underlying storage system (files, databases).

  • Upload files from your computer.

  • Import one or more datasets into a flow.

  • Access the datasets produced by a flow (target).

  • Preview data.

The catalog screen appears as shown below:

image18

It is organized into the following main zones:

  • Zone for browsing image19 and selecting datasets.

  • Information zone image20 for the selected dataset.

  • Upload zone image21 for the dataset.

  • Preview zone image22 for the selected dataset.

2.1.2. Create or Edit a CSV file

From the catalog, it is possible to edit the CSV files available in the user’s workspace.

image584

The view will then adjust to allow the user to easily modify the desired field.

image586

Creating an empty CSV file is also possible.

image588

image590

2.1.3. Accessing repositories

This screen gives you access to all the repositories available to the user.

image23

It is organized into the following main zones:

  • Zone for browsing image24 and selecting repositories

  • Information zone image25 for the selected repository.

  • Preview zone image26 for data in the selected repository.

2.2. Upload files into Tale of Data

2.2.1. Uploading a file

image27

  • Open the file selection dialogue box image28.

  • Drag and drop files into this zone to upload them image29.

  • The My workspace destination repository to which the file will be uploaded image30.

  • CSV file repair image31: this makes a CSV file compatible with processing in a distributed environment (see the sketch after this list):

    • Deletes the rows that appear before the header or the first row of data.

    • Standardizes the number of columns (the longest row acts as the benchmark and blank cells are added where necessary).

    • Deletes cells containing more than 4,096 characters.

    • Removes empty rows.
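
The exact repair logic is internal to Tale of Data, but the following Python sketch illustrates the same four rules applied to a list of already-parsed CSV rows (the function name, the input file name and the handling of over-long cells are illustrative assumptions, not the product's actual code):

    # Illustrative sketch of the CSV repair rules listed above.
    import csv

    MAX_CELL_LEN = 4096  # cells longer than this are dropped

    def repair_rows(rows, header_index=0):
        """Apply the four repair rules to parsed CSV rows.

        Everything before `header_index` (the header or first data row) is dropped.
        """
        rows = rows[header_index:]                      # 1. drop rows before the header
        width = max((len(r) for r in rows), default=0)  # 2. the longest row sets the column count
        repaired = []
        for row in rows:
            row = row + [""] * (width - len(row))       #    pad short rows with blank cells
            row = ["" if len(c) > MAX_CELL_LEN else c for c in row]  # 3. drop oversized cells
            if any(c.strip() for c in row):             # 4. skip empty rows
                repaired.append(row)
        return repaired

    with open("input.csv", newline="", encoding="utf-8") as f:
        repaired = repair_rows(list(csv.reader(f)))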

2.2.2. Alter the destination repository for upload

If a new file being uploaded has the same name as a file already in the My workspace upload repository, the new file will replace the old one. To change the destination repository, simply select it in the tree before uploading:

image32

  • Select the “ToD Demo” destination repository in the tree image33.

  • Confirm the selected destination repository image34. Files will be uploaded to this repository.

2.2.3. Creating, renaming and deleting repositories and files

The following zone is in the top left-hand corner of the catalog screen:

image35

  • Creating a repository image36.

  • Renaming the selected repository or file image37.

  • Deleting the selected repository (and all its content) or file image38.

  • Configuring the import of selected datasets to a (new or existing) flow image39.

Caution

Deleted repositories and files cannot be recovered.

2.3. Creating flows from the catalog

The start guide explains how to create a flow from the home screen and how to add source data, which can also be viewed in the catalog.

There are, however, two other ways of creating a flow:

  • by selecting datasets in catalog view and using the add-to-a-flow button (image54 in the screen capture below);

  • by going via flow view and adding a flow; this must be done before any dataset is added.

We detail the first of these methods below. It is usually the quickest and easiest, as it lets you use datasets both to create new flows and to add them to existing flows.

2.3.1. Add one or more datasets to an (existing or new) flow

This method is the quickest of all. The following zone appears in the top left of the catalog screen.

image52

  • Selection zone for the dataset(s) to be imported into a flow. To select several datasets, hold down CTRL while clicking each relevant dataset image53.

  • Configuration button for imports into a flow image54.

This button is available both on the catalog screen and in the configuration panel of each source and sink within a flow:

image604

The following window will open once you have clicked the configuration button for an import into a flow:

image606

  • List of previously selected datasets image56.

  • Addition operations image57.

  • Fast configuration button for Additions: each dataset will get its own Addition operation image58.

  • Tab for switching between new and existing flows image59.

You can quickly add selected datasets to an existing or new flow, while at the same time configuring how the datasets will interact with each other (Addition). The available Addition operations are listed below; a sketch of the equivalent table operations follows the list:

Standalone dataset

The dataset will be added to the flow without any link.

Prepared dataset

The dataset will be added to a flow with a link to a preparation.

Union dataset

The dataset will be added to a single union node. Dataset order will be maintained.

Join dataset

The dataset will be added to a single join node.

Dataset for enrichment

The dataset will be added to a single enrichment node as the dataset receiving new columns.

Enrichment dataset

The dataset will be added to a single enrichment node as the dataset providing new columns.
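
If you think of each dataset as a table, the Addition operations map onto familiar table operations. The Python/pandas sketch below is only an analogy of what the union, join and enrichment nodes do within a flow; the column names and join key are illustrative assumptions:

    # Analogy only: table operations roughly corresponding to the Addition operations.
    import pandas as pd

    orders_2023 = pd.DataFrame({"id": [1, 2], "amount": [10.0, 20.0]})
    orders_2024 = pd.DataFrame({"id": [3], "amount": [30.0]})
    customers   = pd.DataFrame({"id": [1, 2, 3], "country": ["FR", "DE", "ES"]})

    # Union dataset: rows are stacked, and dataset order is preserved.
    union = pd.concat([orders_2023, orders_2024], ignore_index=True)

    # Join dataset: rows from both datasets are matched on a key.
    joined = orders_2023.merge(customers, on="id", how="inner")

    # Dataset for enrichment / enrichment dataset: the receiving table
    # gains the columns provided by the enriching table.
    enriched = orders_2023.merge(customers[["id", "country"]], on="id", how="left")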

Tip

The parameters of each dataset can be altered in the flow if necessary.

You can re-order datasets using drag and drop. Order is important only to union datasets.

Once the operation has been validated, the view will switch to the (new or existing) flow concerned in Flow Designer.

2.3.2. Configuring the import of selected datasets to a (new or existing) flow

As part of using Tale of Data, there may be times when the user needs to bulk move tables from one system to another, or from one folder or group of tables to another within a given system. This operation can be broken down into a series of flows, each transferring a table from a source to a target. In practice, the work required to create these flows can become repetitive and tedious, which is why the Migrate Datasets function was created.

image600

With just one click, it’s possible to generate the necessary flows to migrate the selected dataset(s) from the catalog to a target in a specific location within a database or a specific folder. A flow containing a source and a target will be automatically created for each selected dataset, along with a dedicated sequence of flows, allowing the complete migration to be carried out in a single operation. It’s also possible to schedule this sequence of flows so it runs regularly in the future. The more tables there are to migrate, the more time this feature can save.

image602

Note

This migration operation can be extremely useful for decoupling the tables containing the original data from those used by generic flows for performing reusable processing operations. To load a different dataset, you simply need to launch a migration sequence, which will load the new data, and then run the generic sequence that handles the actual processing. This way, you can choose to load different versions of datasets (for example, snapshots taken at different times), training datasets, datasets containing only problematic cases (extracted by other flows), or shortened datasets to speed up processing development. The data lineage will allow you to see all the data sources feeding these generic tables and visually understand the processing logic. This approach makes it easier to reuse flows and to create generic, modular, and reusable tools.

2.4. Other catalog view functions

2.4.1. Deletion of relational database tables

After selecting a relational database table image40 (e.g. MariaDB, SQL Server, PostgreSQL, etc.), click the delete button image41:

image42

This button will delete the table, so long as the database account associated with the data source has the necessary authorizations.
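
Behind the scenes this amounts to dropping the table in the source database. A minimal Python sketch of the equivalent action with SQLAlchemy is shown below; the connection URL and table name are placeholders, and the account must hold the DROP privilege:

    # Rough equivalent of the catalog's delete button for a relational table.
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://user:password@host:5432/mydb")  # placeholder connection
    with engine.begin() as conn:
        conn.execute(text('DROP TABLE "my_table"'))  # requires the DROP privilege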

2.4.2. Display the number of rows in a dataset

Right-click a dataset in the catalog to access information on this dataset:

image43

Number of rows image44 in the dataset: The time required to calculate this information will depend on storage type (file, database).

2.4.3. Display dataset statistics

Use image45 to access the statistics for the values in the dataset (calculation time will depend on the type of storage and also on the number of columns and rows):

image46

The following will be calculated for each column in a dataset (an illustrative sketch of the equivalent computation follows the list):

Count

number of non-empty values in the column.

Mean

average column value (does not apply to text or Boolean data).

Stddev

standard deviation of column values (does not apply to text or Boolean data).

Min

lowest column value.

Percentile 25

column value below which 25% of the dataset rows lie (does not apply to text or Boolean data).

Percentile 50

column value below which 50% of the dataset rows lie (does not apply to text or Boolean data). This is therefore the median.

Percentile 75

column value below which 75% of the dataset rows lie (does not apply to text or Boolean data).

Max

highest column value.
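
For reference, a similar per-column profile can be computed with pandas. The sketch below reproduces the same indicators on a small illustrative DataFrame; as in the catalog, the numeric indicators are not applied to text or Boolean columns:

    # Per-column statistics comparable to the catalog's dataset statistics.
    import pandas as pd

    df = pd.DataFrame({"amount": [10.0, 20.0, None, 40.0], "label": ["a", "b", "c", None]})

    count = df.count()                    # non-empty values per column
    numeric = df.select_dtypes("number")  # mean, stddev and percentiles only make sense for numbers
    stats = pd.DataFrame({
        "mean": numeric.mean(),
        "stddev": numeric.std(),
        "min": numeric.min(),
        "percentile_25": numeric.quantile(0.25),
        "percentile_50": numeric.quantile(0.50),  # the median
        "percentile_75": numeric.quantile(0.75),
        "max": numeric.max(),
    })
    print(count)
    print(stats)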

2.4.4. Downloading a dataset from the catalog

When you select a dataset in the catalog tree, a configuration zone image47 will appear.

Configuration options depend on the type of dataset selected. For example, if you select a CSV file and the auto-detected settings are not right, you can change them (separator, encoding, etc.).

If you change the configuration, you must click Apply image48 to confirm the new settings.

Click the button image49 image50 to the right of Apply to download the dataset.

You will download a gzip-compressed CSV file (.gz).

image51
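
Once downloaded, the compressed file can be read directly, for example with pandas, which handles gzip transparently. The file name and separator below are illustrative:

    # Reading a dataset downloaded from the catalog (gzip-compressed CSV).
    import pandas as pd

    df = pd.read_csv("my_dataset.csv.gz", sep=";")  # adjust sep to the file's separator
    print(df.head())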

Note

Tale of Data lets you download up to 1,000,000 rows from the catalog (rows that exceed this limit will be ignored).

Depending on the number of rows and columns involved, download can take between a few seconds and a few minutes.

2.5. Flow, dataset and record lineage

Lineage lets you view a full data processing chain involving multiple flows, based on one dataset.

image60

In catalog datasets view:

  • Select a dataset image61

    (in this example: customers_tod_final_particuliers_deduped.parquet).

  • Click the target icon image62 image63.

The Lineage window will open. Two approaches will be offered:

  • Upstream Lineage.

  • Downstream Lineage.

2.5.1. Upstream lineage

image64

This view lets you browse through the flows and datasets that helped create the selected dataset.

Access any upstream flow by selecting it and then clicking Open the Flow image65.

2.5.2. Downstream lineage

image66

This view lets you browse through the flows and datasets fed by the selected dataset. Access any downstream flow by selecting it and then clicking Open the Flow image67.

Lineage can also be viewed in the flow via any source or sink node configurator:

image68

Access Lineage from a flow image69.