Common Practice Joining

Hi,
I am running into an issue where I am trying to join two datasets. Joining on a Unique ID is not an option as whoever set up this dataset decided not to include one. My next thought process was to join on date and name. Here is the issue I am running into. Dataset one has the full name for example Fredrick. Dataset 2 just has Fred. Its for multiple names in the datasets. I was thinking I could join them on last name instead but there are multiple last names that are similar. Is there a way to do index matching where Id just take the 3 first indexes and match them? What other ways would yall approach this?
thank you in Advance

Best Answer

  • ColemenWilson
    edited January 23 Answer ✓

    You said it's for multiple names in the dataset - I'm interpreting that as not very many and you know which they are? I would suggest just cleaning up the names on the dataset in the ETL using the formula tile:

    CASE WHEN `FirstName` = 'Fred' AND `LastName` = 'Johnson' THEN 'Frederick' ELSE `FirstName` END

    Because even full names are not always unique, I suggest bringing in something like email or creating your own ID for each individual that is in your historic data and having a new ID generated for every name added later.

    If I solved your problem, please select "yes" above

Answers

  • ColemenWilson
    edited January 23 Answer ✓

    You said it's for multiple names in the dataset - I'm interpreting that as not very many and you know which they are? I would suggest just cleaning up the names on the dataset in the ETL using the formula tile:

    CASE WHEN `FirstName` = 'Fred' AND `LastName` = 'Johnson' THEN 'Frederick' ELSE `FirstName` END

    Because even full names are not always unique, I suggest bringing in something like email or creating your own ID for each individual that is in your historic data and having a new ID generated for every name added later.

    If I solved your problem, please select "yes" above

  • I believe Its only for a few, however, I was trying to future proof it in case new names are added.

  • I'd work to find a better unique identifier than name for the future. The best place to do this is the source systems. Otherwise maintenance could be a headache.

    If I solved your problem, please select "yes" above

  • Agreed, I didnt set up this dataset nor do I have access to go and change it at the source system.