2. Catalog view

2.1. Catalog view

The catalog can be accessed via the main Tale of Data menu.

The catalog is divided into two parts:

  • Access image2 to datasets

  • Access image3 to repositories

image1

2.1.1. Accessing datasets

The Access to datasets view in the catalog lets you:

  • Add new data sources (files, databases).

  • Find the datasets you need through a single, easy-to-use graphical interface that stays the same whatever the underlying storage system (files, databases).

  • Upload files from your computer.

  • Import one or more datasets into a flow.

  • Access the datasets produced by a flow (target).

  • Preview data.

The catalog screen appears as shown below:

image18

The screen is divided into the following main zones:

  • Zone for browsing image19 and selecting datasets.

  • Information zone image20 for the selected dataset.

  • Upload zone image21 for the dataset.

  • Preview zone image22 for the selected dataset.

2.1.2. Accessing repositories

This screen gives you access to all the repositories available to the user.

image23

The screen is divided into the following main zones:

  • Zone for browsing image24 and selecting repositories.

  • Information zone image25 for the selected repository.

  • Preview zone image26 for data in the selected repository.

2.2. Upload files into Tale of Data

2.2.1. Uploading a file

image27

  • Open the file selection dialogue box image28.

  • Drag and drop files into this zone to upload them image29.

  • My workspace: the destination repository to which the file will be uploaded image30.

  • CSV file repair image31: this makes a CSV file compatible with processing in a distributed environment:

    • Delete the rows before the first row of data or the header.

    • Standardize the number of columns (longest row will act as benchmark and blank cells will be added if necessary).

    • Delete cells with over 4,096 characters.

    • Eliminate empty rows.
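The four repair rules above can be illustrated with a short Python sketch. This is not Tale of Data's implementation: the function name and the header_index parameter are hypothetical, and treating "delete cells" as "replace the cell with a blank" is an assumption.

```python
import csv
import io

MAX_CELL_CHARS = 4096  # limit quoted in the repair rules


def repair_csv_rows(rows, header_index=0):
    """Apply the four repair rules to a list of parsed CSV rows.

    header_index is a hypothetical parameter giving the position of the
    header (or first data row); how the product detects it is not covered here.
    """
    # Delete the rows before the header or the first row of data.
    rows = rows[header_index:]
    # Eliminate empty rows (no cells, or only blank cells).
    rows = [row for row in rows if any(cell.strip() for cell in row)]
    if not rows:
        return []
    # Delete oversized cells (assumption: the cell is replaced by a blank).
    rows = [["" if len(cell) > MAX_CELL_CHARS else cell for cell in row]
            for row in rows]
    # Standardize the column count: the longest row is the benchmark.
    width = max(len(row) for row in rows)
    return [row + [""] * (width - len(row)) for row in rows]


# One junk line, a header, a short row, an empty line and a long row.
raw = "exported by tool X\nid,name,city\n1,Ada\n\n2,Bob,Paris,extra\n"
repaired = repair_csv_rows(list(csv.reader(io.StringIO(raw))), header_index=1)
```

After the repair, every row has four cells, which is what a distributed reader generally needs in order to split the file safely.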

2.2.2. Alter the destination repository for upload

If an uploaded file has the same name as a file already in the My workspace upload repository, the new file will replace the old one. To change the destination repository, select it in the tree before uploading:

image32

  • Select the “ToD Demo” destination repository in the tree image33.

  • Confirm the selected destination repository image34. Files will be uploaded to this repository.

2.2.3. Creating, renaming and deleting repositories and files

The following zone is in the top left-hand corner of the catalog screen:

image35

  • Creating a repository image36.

  • Renaming the selected repository or file image37.

  • Deleting the selected repository (and all its content) or file image38.

  • Configuring the import of selected datasets to a (new or existing) flow image39.

Caution

Deleted repositories and files cannot be recovered.

2.3. Creating flows from the catalog

The start guide gives simple details on how to create a flow from the home screen, and how to add source data that can also be viewed in the catalog.

There are, however, two other ways of creating a flow:

  • Selecting datasets in catalog view and using the add to a flow button (image54 in the screen capture below).

  • Going to flow view and adding a flow. This must be done before any dataset is added.

The first of these methods is detailed below. It is usually the quickest and easiest, because the same steps both create new flows and add datasets to existing flows.

2.3.1. Add one or more datasets to an (existing or new) flow

This method is the quickest of all. The following zone appears in the top left of the catalog screen.

image52

  • Selection zone for the dataset(s) to be imported into a flow. To select several datasets, hold down CTRL while clicking each relevant dataset image53.

  • Configuration button for imports into a flow image54.

The following window will open once you have clicked the configuration button for an import into a flow:

image55

  • List of previously selected datasets image56.

  • Addition operations image57.

  • Fast configuration button for Additions. All datasets will have their own Addition operation image58.

  • Tab for switching between new and existing flows.

You can quickly add selected datasets to an existing or new flow, while at the same time configuring how the datasets will interact with each other (Addition):

Standalone dataset

The dataset will be added without any link in a flow.

Prepared dataset

The dataset will be added to a flow with a link to a preparation.

Union dataset

The dataset will be added to a unique union node. Dataset order will be maintained.

Join dataset

The dataset will be added to a unique join node.

Dataset for enrichment

The dataset will be added to a unique enrichment node as the dataset receiving new columns.

Enrichment dataset

The dataset will be added to a unique enrichment node as the dataset donating new columns.

Tip

The parameters of each dataset can be altered in the flow if necessary.

You can reorder datasets using drag and drop. Order matters only for union datasets.

Once the operation has been validated, the view will switch to the (new or existing) flow concerned in Flow Designer.
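To make the difference between the Addition types more concrete, here is a conceptual Python sketch of a union and an enrichment. It only illustrates the ideas described above (row order kept for unions, one dataset receiving columns from another). The function names and the key parameter are hypothetical, and the actual node behavior is whatever you configure in Flow Designer.

```python
def union(*datasets):
    """Stack the rows of several datasets, keeping the datasets in the order given."""
    return [row for dataset in datasets for row in dataset]


def enrich(receiving, donating, key):
    """Copy the donating dataset's extra columns onto matching receiving rows.

    key is a hypothetical matching column used for this illustration only.
    """
    lookup = {row[key]: row for row in donating}
    enriched = []
    for row in receiving:
        extra = lookup.get(row[key], {})
        # Values already present in the receiving row win on conflicts.
        enriched.append({**extra, **row})
    return enriched


first_batch = [{"id": 1, "name": "Ada"}]
second_batch = [{"id": 2, "name": "Bob"}]
cities = [{"id": 1, "city": "Paris"}]

stacked = union(first_batch, second_batch)      # order of arguments is kept
with_city = enrich(stacked, cities, key="id")   # only id 1 gains a city
```

Swapping the arguments of union changes the row order of the result, which is why dataset order is relevant for union datasets and not for the other types.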

2.4. Other catalog view functions

2.4.1. Deletion of relational database tables

To delete a relational database table (e.g. MariaDB, SQL Server, PostgreSQL, etc.), select it image40 and click delete image41:

image42

This button deletes the table, provided that the database account associated with the data source has the necessary permissions.

2.4.2. Display the number of rows in a dataset

Right-click a dataset in the catalog to access information on this dataset:

image43

Number of rows image44 in the dataset: The time required to calculate this information will depend on storage type (file, database).

2.4.3. Display dataset statistics

Use image45 to access the statistics for the values in the dataset (calculation time will depend on the type of storage and also on the number of columns and rows):

image46

The following will be calculated for each column in a dataset:

Count

number of non-empty values in the column.

Mean

average column value (does not apply to text or Boolean data).

Stddev

standard deviation of column values (does not apply to text or Boolean data).

Min

lowest column value.

Percentile 25

column value below which 25% of the dataset rows lie (does not apply to text or Boolean data).

Percentile 50

column value below which 50% of the dataset rows lie (does not apply to text or Boolean data). This is therefore the median.

Percentile 75

column value below which 75% of the dataset rows lie (does not apply to text or Boolean data).

Max

highest column value.
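As an illustration of the statistics listed above, the following Python sketch computes them for one numeric column using only the standard library. The function name is hypothetical, None stands in for an empty value, and the choice of sample standard deviation and of the "inclusive" quantile method are assumptions; the product may use different conventions.

```python
import statistics


def describe_column(values):
    """Summarize one numeric column; None stands for an empty value."""
    present = [v for v in values if v is not None]
    # Quartile cut points (25%, 50%, 75%). The "inclusive" method is an assumption.
    q1, q2, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "count": len(present),                # non-empty values only
        "mean": statistics.mean(present),
        "stddev": statistics.stdev(present),  # sample standard deviation (assumption)
        "min": min(present),
        "percentile_25": q1,
        "percentile_50": q2,                  # the median
        "percentile_75": q3,
        "max": max(present),
    }


stats = describe_column([10, 20, None, 30, 40, 50])
```

Here the empty value is excluded from the count, so the count is 5 rather than 6, and the 50th percentile matches the median of the remaining values.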

2.4.4. Downloading a dataset from the catalog

When you select a dataset in the catalog tree, a download configuration zone image47 will appear.

Configuration options depend on the type of dataset selected. For example, if you select a CSV file and the auto-detected settings are not right, you can change them (separator, encoding, etc.).

If you change the configuration, you must click Apply image48 to confirm the new settings.

Click the button image49 image50 to the right of Apply to download the dataset.

You will download a gzip-compressed CSV file (file.gz).

image51

Note

Tale of Data lets you download up to 1,000,000 rows from the catalog (rows that exceed this limit will be ignored).

Depending on the number of rows and columns involved, download can take between a few seconds and a few minutes.
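Once downloaded, the gzip-compressed CSV can be read without unpacking it first. The sketch below uses Python's standard gzip and csv modules. The function name is hypothetical, and the comma delimiter and UTF-8 encoding are assumptions: use the separator and encoding that apply to your download. The sample file is written by the sketch itself so that it runs without a real download.

```python
import csv
import gzip
import os
import tempfile


def read_gzip_csv(path, delimiter=",", encoding="utf-8"):
    """Return (header, rows) from a gzip-compressed CSV file."""
    with gzip.open(path, mode="rt", encoding=encoding, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, [])  # empty list if the file has no rows
        return header, list(reader)


# Round trip on a tiny sample file standing in for the downloaded file.gz.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "file.gz")
    with gzip.open(sample, mode="wt", encoding="utf-8", newline="") as handle:
        handle.write("id,name\n1,Ada\n2,Bob\n")
    header, rows = read_gzip_csv(sample)
```

For downloads close to the row limit, iterating over the reader instead of calling list() avoids loading every row into memory at once.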

2.5. Flow, dataset and record lineage

Lineage lets you view a full data processing chain involving multiple flows, based on one dataset.

image60

In catalog datasets view:

  • Select a dataset image61

    (in this example: customers_tod_final_particuliers_deduped.parquet).

  • Click the target icon image62 image63.

The Lineage window will open. Two approaches will be offered:

  • Upstream Lineage.

  • Downstream Lineage.

2.5.1. Upstream lineage

image64

This view lets you browse through the flows and datasets that helped create the selected dataset.

Access any upstream flow by selecting it and then clicking Open the Flow image65.

2.5.2. Downstream lineage

image66

This view lets you browse through the flows and datasets fed by the selected dataset. Access any downstream flow by selecting it and then clicking Open the Flow image67.
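Conceptually, upstream and downstream lineage are two walks over the same graph of datasets and flows, in opposite directions. The Python sketch below illustrates this with a breadth-first traversal. The edge list, the dataset and flow names, and the function are all hypothetical examples, not Tale of Data's data model or API.

```python
from collections import deque

# Hypothetical lineage edges (input, output): flows consume and produce datasets.
EDGES = [
    ("customers_raw.csv", "flow: clean"),
    ("flow: clean", "customers_clean.parquet"),
    ("customers_clean.parquet", "flow: dedupe"),
    ("flow: dedupe", "customers_deduped.parquet"),
    ("customers_deduped.parquet", "flow: report"),
    ("flow: report", "report.csv"),
]


def lineage(start, edges, direction):
    """Breadth-first walk from start; direction is 'upstream' or 'downstream'."""
    if direction not in ("upstream", "downstream"):
        raise ValueError("direction must be 'upstream' or 'downstream'")
    neighbours = {}
    for source, target in edges:
        # Upstream follows edges backwards, downstream follows them forwards.
        a, b = (target, source) if direction == "upstream" else (source, target)
        neighbours.setdefault(a, []).append(b)
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


upstream = lineage("customers_deduped.parquet", EDGES, "upstream")
downstream = lineage("customers_deduped.parquet", EDGES, "downstream")
```

In this example the upstream walk returns the flows and datasets that produced the selected dataset, nearest first, and the downstream walk returns those that it feeds.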

Lineage can also be viewed in the flow via any source or target node configurator:

image68

Access Lineage from a flow image69