Data integration best practice: Data duplication and ID linking

Jacob Horbulyk data integrations, know-how

Data duplication and ID linking

Let us first make it clear what ID linking is about and what it has to do with data duplication. To do that, let’s take CRM systems as an example.

It is likely there are more than one person in a system who share the same first and last names. And it is inevitable that sometimes, some pieces of information about these people will change. So, how would a CRM system “know” that it is the Jane Smith in Branchville, Virginia, whose billing address needs to be updated, and not the Jane Smith in Shoreline, Washington? Quite simple – the CRM has assigned a unique ID to each Jane Smith, by which it can “distinguish” one from the other.

But what if we have two CRM systems – one that is used in, say, a parent company and the other one in a daughter company? There can be a business scenario where both of them need to store the same information – both Jane Smiths – but each would assign its own ID just because they are designed this way.

This is where ID linking comes into play. When there is the same record in more than one system, there must be some way to correlate these records between systems. This is essential in the case when you have data that can change, because then you need to find a way to distinguish between an existing data that has changed and a completely new data that is just somewhat similar to the existing data. And here are some techniques to solve these data duplication problems.

Point to Point ID Linking

In this solution, if there is data duplication between systems, one (or more) of the systems is responsible for storing the other systems ID for the related object. The idea behind it is that, taking the scenario above, one CRM system would store the ID of another CRM. In addition to that, there is a separate system which sole purpose is to simply keep track of these ID links.

Pros of this strategy:

  • It’s easy to set up and frequently available
  • Provides convenient mechanism for end user to modify links
  • This is a natural way of storing links as link life cycle is logically the same as the life cycle of linked object

Cons of this strategy:

  • Conceptual complexity grows quite quickly as you add more use cases and additional systems
  • This method requires adding write permissions to integration flows that otherwise wouldn’t require it
  • It also requires the systems to have available fields for foreign IDs – that is, for the IDs of other systems. Or have configurable schemata, for that matter. This implies having space to store this additional information. However, you may not have this capability in terms of the way how the system is designed.
  • It is easy to inadvertently mess up data integrity
  • It must be designed per pair of systems

Dedicated System for Storing ID Links

In this solution, you would set up a separate database which is responsible for correlating IDs between objects.

Pros of this strategy:

  • It’s conceptually simple
  • Scales as number of systems grow

Cons of this strategy:

  • There is yet one more database to maintain
  • It’s difficult for users to visualize or modify linked IDs

Universal ID

In this solution, when a record such as a new customer or a new product is created, they are assigned a globally unique ID that is referenced by all systems. For this technique to be possible, there must be just one system that is responsible for generating IDs for all records. It would also be responsible for every other system to learn these IDs.

Pros of this strategy:

  • It’s conceptually simple

Cons of this strategy:

  • It requires LOTS of human coordination
  • Eliminating data duplication or redundant information becomes hard
  • The design of all systems must accommodate a universal ID

So, to recap, in this article we reviewed how to deal with data duplication effectively and how to keep data in such a scenario up-to-date. In the next article of our Data Integration Best Practices series, we will have a look at the problem that is related essentially to the question of business rules: Data transformation.

We will review the three categories into which systems can be classified based on how they handle their schema. In addition to that, we will review what master schemata are and how they are related to master data systems. Stay tuned by following us on Twitter and LinkedIn!


About the Author

Jacob Horbulyk

Twitter

Pre-sales & Professional Services Engineer at elastic.io. Casual board gamer, language learner, loves a good weekend road trip.


You might want to check out also these posts