2. Catalog view
2.1. Catalog view
The catalog can be accessed via the main Tale of Data menu.
The catalog is divided into two parts:
2.1.1. Accessing datasets
The Access to datasets view in the catalog lets you:
Add new data sources (files, databases).
Find the datasets you need through a single, easy-to-use graphical interface, which stays the same whatever the underlying storage system (files, databases).
Upload files from your computer.
Import one or more datasets into a flow.
Access the datasets produced by a flow (target).
Preview data.
The catalog screen appears as shown below:
In the following main zones:
2.1.2. Accessing repositories
This screen gives you access to all the repositories available to the user.
In the following main zones:
2.2. Upload files into Tale of Data
2.2.1. Uploading a file
My workspace: the destination repository to which the file will be uploaded.
CSV file repair: this makes a CSV file compatible with processing in a distributed environment:
Delete the rows before the first row of data or the header.
Standardize the number of columns (longest row will act as benchmark and blank cells will be added if necessary).
Delete cells with over 4,096 characters.
Eliminate empty rows.
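The four repair steps above can be illustrated with a short Python sketch. This is not Tale of Data's implementation: the function name, the header_index parameter and the choice to blank out (rather than otherwise remove) oversized cells are assumptions made for the example.

```python
import csv
import io

MAX_CELL_CHARS = 4096  # limit quoted in the list above


def repair_csv(text, header_index=0, delimiter=","):
    """Return repaired CSV text.

    header_index is a hypothetical parameter: the 0-based position of the
    header (or first row of data) in the raw file.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))

    # Delete the rows before the header or first row of data.
    rows = rows[header_index:]

    # Eliminate empty rows (no cells at all, or only blank cells).
    rows = [row for row in rows if any(cell.strip() for cell in row)]

    # Blank out cells above the limit (assumed meaning of "delete").
    rows = [
        [cell if len(cell) <= MAX_CELL_CHARS else "" for cell in row]
        for row in rows
    ]

    # Standardize the column count: the longest row is the benchmark.
    width = max((len(row) for row in rows), default=0)
    rows = [row + [""] * (width - len(row)) for row in rows]

    out = io.StringIO()
    csv.writer(out, delimiter=delimiter, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

For instance, calling repair_csv on a file whose first line is a free-text banner, with header_index=1, drops the banner, removes blank lines and pads short rows with empty cells.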
2.2.2. Alter the destination repository for upload
If a file being uploaded has the same name as a file already in the My workspace upload repository, the new file will replace the old one. To change the destination repository, simply select it in the tree before uploading:
2.2.3. Creating, renaming and deleting repositories and files
The following zone is in the top left-hand corner of the catalog screen:
Rename the selected repository (and all its content) or file.
Configure the import of the selected datasets into a (new or existing) flow.
Caution
Note: Deleted repositories and files cannot be recovered.
2.3. Creating flows from the catalog
The start guide gives simple details on how to create a flow from the home screen, and how to add source data that can also be viewed in the catalog.
There are, however, two other ways of creating a flow:
by selecting datasets in catalog view and then using the add to a flow button.
going via flow view and adding a flow. This must be done before any dataset is added.
We detail below the first of these methods, which is usually the quickest and easiest as it uses datasets to create new flows and also to add them to existing flows.
2.3.1. Add one or more datasets to an (existing or new) flow
This method is the quickest of all. The following zone appears in the top left of the catalog screen.
Selection zone for the dataset(s) to be imported into a flow. To select several datasets, hold down CTRL while clicking each relevant dataset.
The following window will open once you have clicked the configuration button for an import into a flow:
Fast configuration button for Additions. All datasets will have their own Addition operation.
Tab for switching between new and existing flows.
You can quickly add selected datasets to an existing or new flow, while at the same time configuring how the datasets will interact with each other (Addition):
- Standalone dataset
The dataset will be added without any link in a flow.
- Prepared dataset
The dataset will be added to a flow with a link to a preparation.
- Union dataset
The dataset will be added to a single union node. Dataset order will be maintained.
- Join dataset
The dataset will be added to a single join node.
- Dataset for enrichment
The dataset will be added to a single enrichment node as the dataset receiving new columns.
- Enrichment dataset
The dataset will be added to a single enrichment node as the dataset donating new columns.
Tip
The parameters of each dataset can be altered in the flow if necessary.
You can re-order datasets using drag and drop. Order matters only for union datasets.
Once the operation has been validated, the view will switch to the (new or existing) flow concerned in Flow Designer.
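To make the difference between the union and enrichment Additions concrete, here is a minimal Python sketch on lists of dictionaries. It only illustrates the roles described above (ordered stacking for a union, a receiving and a donating dataset for an enrichment). The function names, the sample data and the matching on a single shared key column are assumptions, not the behavior of Tale of Data nodes.

```python
def union(datasets):
    """Stack datasets one after another, in the order given."""
    return [row for dataset in datasets for row in dataset]


def enrich(receiving, donating, key):
    """Add the donating dataset's columns to matching receiving rows.

    Matching on one shared key column is an assumption for this sketch.
    """
    lookup = {row[key]: row for row in donating}
    enriched = []
    for row in receiving:
        merged = dict(row)
        # Only add columns the receiving row does not already have.
        for column, value in lookup.get(row[key], {}).items():
            merged.setdefault(column, value)
        enriched.append(merged)
    return enriched


# Hypothetical sample data.
first = [{"id": 1, "name": "Ann"}]
second = [{"id": 2, "name": "Bob"}]
cities = [{"id": 1, "city": "Paris"}]

stacked = union([first, second])       # rows of first, then rows of second
with_city = enrich(first, cities, "id")  # first gains a "city" column
```

Swapping the two inputs of union changes the row order of the result, which is why the order set by drag and drop matters for union datasets only.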
2.4. Other catalog view functions
2.4.1. Deletion of relational database tables
Select a relational database table (e.g. in MariaDB, SQL Server or PostgreSQL), then click delete:
This button will delete the table, so long as the database account associated with the data source has the necessary authorizations.
2.4.2. Display the number of rows in a dataset
Right-click a dataset in the catalog to access information on this dataset:
Number of rows in the dataset: The time required to calculate this information will depend on storage type (file, database).
2.4.3. Display dataset statistics
Statistics for the values in the dataset are also available (calculation time will depend on the type of storage and on the number of columns and rows):
The following will be calculated for each column in a dataset:
- Count
number of non-empty values in the column.
- Mean
average column value (does not apply to text or Boolean data).
- Stddev
standard deviation of column values (does not apply to text or Boolean data).
- Min
lowest column value.
- Percentile 25
column value below which 25% of the dataset rows lie (does not apply to text or Boolean data).
- Percentile 50
column value below which 50% of the dataset rows lie (does not apply to text or Boolean data). This is therefore the median.
- Percentile 75
column value below which 75% of the dataset rows lie (does not apply to text or Boolean data).
- Max
highest column value.
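As a rough illustration of what these figures mean, the sketch below computes the same eight statistics for one numeric column with Python's standard library. The function name is hypothetical, and the use of the sample standard deviation and of inclusive quartile interpolation are assumptions: the product may compute these values differently, so results can differ slightly.

```python
import statistics


def column_summary(values):
    """Summarize one numeric column; None and "" count as empty."""
    data = sorted(v for v in values if v is not None and v != "")
    if not data:
        return {"count": 0}
    if len(data) > 1:
        q25, q50, q75 = statistics.quantiles(data, n=4, method="inclusive")
        stddev = statistics.stdev(data)  # sample standard deviation (assumed)
    else:
        q25 = q50 = q75 = data[0]
        stddev = None  # undefined for a single value
    return {
        "count": len(data),  # non-empty values only
        "mean": statistics.fmean(data),
        "stddev": stddev,
        "min": data[0],
        "percentile_25": q25,
        "percentile_50": q50,  # the median
        "percentile_75": q75,
        "max": data[-1],
    }


summary = column_summary([4, None, 1, 5, "", 2, 3])
```

In this example the two empty cells are ignored, so the count is 5 and the median is 3.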
2.4.4. Downloading a dataset from the catalog
When you select a dataset in the catalog tree, a download configuration zone will appear.
Configuration options depend on the type of dataset selected. For example, if you select a CSV file but are not satisfied with the auto-detected settings, you can change them (separator, encoding, etc.).
If you change the configuration, you must click Apply to confirm the new settings.
Click the button to the right of Apply to download the dataset.
You will download a gzip-compressed CSV file (file.gz).
Note
Tale of Data lets you download up to 1,000,000 rows from the catalog (rows that exceed this limit will be ignored).
Depending on the number of rows and columns involved, download can take between a few seconds and a few minutes.
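Once downloaded, the compressed file can be read directly without unzipping it first. The following Python sketch uses only the standard library. The function name is hypothetical, and the comma separator and UTF-8 encoding are assumptions to adjust to the settings you applied before downloading.

```python
import csv
import gzip


def read_downloaded_csv(path, delimiter=",", encoding="utf-8"):
    """Yield rows (lists of strings) from a gzip-compressed CSV file."""
    with gzip.open(path, mode="rt", encoding=encoding, newline="") as handle:
        yield from csv.reader(handle, delimiter=delimiter)
```

Because the function yields rows one at a time, even a download close to the row limit does not need to fit in memory at once, for example: for row in read_downloaded_csv("file.gz"): ...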
2.5. Flow, dataset and record lineage
Lineage lets you view a full data processing chain involving multiple flows, based on one dataset.
In the catalog datasets view, select a dataset (in this example: customers_tod_final_particuliers_deduped.parquet).
The Lineage window will open. Two approaches will be offered:
Upstream Lineage.
Downstream Lineage.
2.5.1. Upstream lineage
This view lets you browse through the flows and datasets that helped create the selected dataset.
Access any upstream flow by selecting it and then clicking Open the Flow.
2.5.2. Downstream lineage
This view lets you browse through the flows and datasets fed by the selected dataset. Access any downstream flow by selecting it and then clicking Open the Flow.
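Conceptually, upstream and downstream lineage are two traversals of the same graph of flows and datasets. The Python sketch below shows the idea with a breadth-first search. The flow names, every dataset name other than the example file above, and the FLOWS structure are hypothetical; this is not how Tale of Data stores or computes lineage.

```python
from collections import deque

DEDUPED = "customers_tod_final_particuliers_deduped.parquet"

# Hypothetical flows: name -> (source datasets, target datasets).
FLOWS = {
    "clean_customers": (["customers_raw.csv"], ["customers_clean.parquet"]),
    "dedupe_customers": (["customers_clean.parquet"], [DEDUPED]),
    "monthly_report": ([DEDUPED], ["report.csv"]),
}


def lineage(dataset, flows, direction="upstream"):
    """Return the flows reachable from `dataset`, nearest first."""
    found = []
    seen = {dataset}
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for name, (sources, targets) in flows.items():
            if direction == "upstream":
                # Flows that produced the current dataset; continue to their sources.
                touched, next_datasets = targets, sources
            else:
                # Flows fed by the current dataset; continue to their targets.
                touched, next_datasets = sources, targets
            if current in touched and name not in found:
                found.append(name)
                for nxt in next_datasets:
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return found


upstream = lineage(DEDUPED, FLOWS, "upstream")
downstream = lineage(DEDUPED, FLOWS, "downstream")
```

With this sample data, the upstream traversal reaches the deduplication flow first and then the cleaning flow, while the downstream traversal reaches only the reporting flow.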
Lineage can also be viewed in the flow via any "source" or "target" node configurator: