Data integration best practices: Mechanisms to detect data changes

Jacob Horbulyk data integrations, know-how

Three mechanisms to detect data changes

In the previous article of our series Data Integration Best Practices, we defined that all problems related to data integration can be essentially broken down into two large categories: The ones that have something to do with technical mechanics and the ones that are connected to business rules.

In the second article of this series, we will review one of the purely technical questions, namely how to detect any data changes. Note that the technical decisions that we will cover in this and the next few articles are not necessarily dictated by what type of system you’re dealing with: cloud-based, on-premises systems or hybrid.

Why is detecting data changes important?

When synchronizing large amounts of data and/or working with dynamically changed data, it’s highly valuable to only synchronize the data that has actually changed. Otherwise, re-syncing all data records every time might, for example, consume a lot of computing resources and eventually lead to an unnecessary system overload.

This means that detection of only the data that was changed is a crucial property of the systems. The following sections describe the different strategies to do such detection.

Polling by timestamp

Here, the system that stores data, e.g. a CRM application, also stores and maintains a last modified date along. This date describes when a given record or entity has last been modified including creation and possibly, even deletion. We have covered this feature in a bit more detail in “6 Characteristics That Make APIs Fit for Application Integration”.

When this strategy applies, there is some integration flow somewhere that wakes up on a scheduled interval. It looks for all records that have been updated since this flow’s last run. It then reports only those records as changed.

For this strategy to work, the integration flow must be able to store the information about its last run. Alternatively, it should be able to otherwise reliably learn when this information.

Pros of this strategy:

  • Wide support for the strategy
  • Has built-in failure tolerance as it is possible to repeat queries if the request has failed for any reason

Cons of this strategy:

  • Not all systems support this feature
  • There is a delay between modification occurring and change detection. Unless, of course, you have configured the integration flow to wake up every millisecond
  • Not quite resource efficient as many checks for data changes will be empty if the dataset does not change often

Data push from place of change

In this strategy, the system that stores data sends modifications made to data to other systems as the modifications occur. For example, it could publish a REST/SOAP message over HTTP(S) or send an email.

Pros of this strategy:

  • You can detect data changes faster than by polling
  • You will use your resources more efficiently as there will be no “blank” runs

Cons of this strategy:

  • Not all systems support this feature
  • Lost publishes will not run again, therefore you need to know how to recover from failures in the push infrastructure

Bulk Extract with Delta Detection

This strategy is practically the last resort when the above two strategies do not apply. Here, in addition to all data records stored inside the system, there is a hashed copy of them with their IDs outside the system. A simple example would include some data stored inside a CRM application. Then there would be a copy of that data stored in some integration middleware – i.e. outside the CRM application.

At a regular interval, a scheduled integration flow will wake up, read all data records and compute their hash – in other words, their unique code. Then it will compare the value of each such hash with the stored hash for each corresponding record. If the hashes do not match, it will report the record as changed and store the new hash.

Pros of this strategy:

  • This feature has universal support
  • It might be the only way to detect deletion of a data record, because not many systems have a built-in mechanism for that
  • Can be extended to detect field level changes, meaning that it will detect only the data changes that are of a particular interest / relevance for the given integration scenario

Cons of this strategy:

  • It’s very resource intensive and, as a consequence, cannot be run as frequently
  • Requires an additional data store
  • There might be a risk of falsely reporting deletions if the actual system is down for some reason and the integration flow is trying to read records at exactly the same moment

In order to perform any of the aforementioned “extractions”, there must obviously be a way to connect to a system in the first place. Indeed, even though we often talk about disparate systems and data silos in the context of application integration, no system is designed as a sealed cocoon.

So, in the next article, we’ll briefly review the two ways to connect to a system, going through the pros and cons of each. Stay tuned by following us on Twitter and LinkedIn!


About the Author

Jacob Horbulyk

Twitter

Pre-sales & Professional Services Engineer at elastic.io. Casual board gamer, language learner, loves a good weekend road trip.


You might want to check out also these posts