Matching Columns Across Multiple Datasets

Problem Description:

I want to match columns from multiple datasets and it seems that there are several options available. Since I want to perform exact matching, I can consider the following:

  1. Using a repository: Create a repository from one dataset; use the repository in a preparation node in another flow. The downside is having to manage multiple flows.

  2. Joining: Use an appropriate join method to identify empty cells. The advantage is that everything can be done in a single flow.

  3. Enrichment: Same advantage as joining for single flow approach.

What are the best practices for exact matching? Are there more efficient solutions than others?

Solution:

For exact matching, similarity functions are irrelevant.

Joins offer capabilities that are not available with enrichment or repositories. You can choose among the following set operations:

  • INNER JOIN

  • LEFT OUTER JOIN

  • RIGHT OUTER JOIN

  • FULL OUTER JOIN

Note

Repositories perform a left outer join but with a unique match by default (the transformation can now further be configured to accept the first N best matches). Enrichment performs a left outer join with one or more matches (in both cases, “left” refers to the dataset being enriched).

It is better to use joins for exact matching.

However, if a left outer join with a unique match is sufficient and the enrichment dataset needs to be reused across different flows, repositories are superior because they are faster (because they are pre-indexed) and shareable.

For exact matches, it is better to avoid enrichment nodes:

  • Enrichment Nodes:

    Faster to use, ideal when the enrichment dataset is less than tens or hundreds of thousands of lines. However, they are relatively slow for large complex joins and are not reusable from one flow to another.

  • Repositories:

    An additional flow is needed to create the repository. The advantage is that they are very fast (even with fuzzy search) and scalable (up to hundreds of millions of lines). Additionally, they are reusable from one flow to another.