Data integration best practices: Data Transformation

Jacob Horbulyk Integration best practices, know-how

Data Integration Best Practices - Data transformation

So far, we have addressed only the problems that concern the pure technical mechanics of integration. Now it’s time to review the challenges that are related to the correct application of business rules. These are, as we stated in the first article of our blog series, typically of higher level in nature and answer the question of “What?”. And the biggest “business-related” challenge is… data transformation.

Data Transformation

Almost all systems have some sort of schema, by which we mean the structure of the information that exists in a system. In general, schema includes two parts:

  • A list of object types within the system
  • A set of rules which describe the structure of a given object of a given type

Generally, systems can be classified into one of three categories based on how they handle their schema.

Zero customization

In the first category, all system customers get the same set of objects and the same set of rules that describe object structure. In this scenario, the software vendor defines and delivers this objects list and set of rules. A typical example of such a system would be credit card payment processing. All credit cards must contain the same set of information to allow for transactions between different banks. Hence, it doesn’t matter what system you use for credit card payment processing. You are never going to have something other than these specific objects and you’ll never have any customizations to those objects, such as “process this custom credit card”.

Partial customization

In the second category, all system customers have the same set of objects, but they can customize the structure of these objects. In this scenario, the software vendor, too, is the one which sets this objects list. A good example would be an eCommerce application. Here, you will have such objects as products, orders, customers, and so on. Therehave is no particular reason to invent some other type of object. However, you can customize certain information related to, say, the product object, depending on what you are selling. For example, a shop selling computer parts would have different adjectives to describe their products than a shop that sells soap. These adjectives will be part of the product information.

Full customization

Last but not least, in the third category, customers have the ability to customize literally everything. They can change the set of objects as well as the structures for these objects. An example of that would be a CRM system that a travel company initially bought for customer management. Since the software vendor ensures high levels of customization, the company decides to adapt it to its travel management needs as well. And so, they will create new objects such as a trip object, a destination object, and a ticket object.

Data mapping

When we need to sync data between systems that have been developed independently of each other, the schemata for these systems will be different. This means that some data transformation needs to occur to the structure as it moves between systems.

Understanding how this data needs to be transformed – in other words, what fields in the first schema need to map to the fields in the second schema –, is generally a business level problem. Furthermore, this problem needs to be solved per pair of systems.

Additionally, if either system does not fall into the category #1 described above and it has undergone some customizations, then the generated mapping rules will be applicable only to these specific systems purchased by this one specific company. In other words, no other company will be able to use the same mapping rules unless they customize the systems in question in exactly the same way.

Types of Fields To Map

One of the things that is often overlooked when describing the requirements for a mapping is that not all fields can be mapped as “simple” fields. By simple fields we mean information that is a series of strings, numbers, booleans, timestamps, and so on, that exist as pieces of information in their own right, and could be held in structures or arrays. An example of such information would be Name, Email or Address on a contact in a CRM system.

Now, in addition to these “simple” fields, let’s consider the following types of information that we can store as part of an object:

  • Static Enums: These contain information that indicates a selection from a fixed list of values. For example, consider Gender or Office Location Code on a contact. Knowing that a contact is “M” and works in the office “DE01” already provides some information about the contact. However, knowing the list of options and what options exist provides useful context.
  • Dynamic Enums: These are information that isn’t fully free form. However, you can select it from a range of values with the possibility of adding new values. Following the same example use case, on a contact in a CRM system, this could be Department or Job Title.
  • The ID of the object itself: For example, Contact ID.
  • The ID of related objects: For example, Manager ID, Company ID, Active Subscription(s) ID(s), and so on.

The mapping rules

We’ve already previously talked about the ID linking problem. Therefore, we will focus on mapping rules for simple fields, static enums and dynamics enums.

  1. Simple fields are easiest to map: You can simply move the data as is. Alternatively, you have data transformation based on some fixed rules that don’t require additional inputs.
  2. Static Enums often require an additional dictionary to perform the mapping. This dictionary helps convert the value of one possibility in one system to the value in the other system. A simple example would be converting “Female” to “Ms.”.
  3. Dynamic Enums are often transferred as they are, but here, data transformation is slightly trickier. Sometimes, one needs to add new options to the range of existing valid values before transferring data. For example, imagine we are moving profession information into a system where we need to explicitly add each assignable profession to a list of allowed options. So far, that system has only seen “engineer”, “accountant” and “product manager”. To add a contact with profession “lawyer”, we must first add “lawyer” to the list of allowed professions.

One Directional Data Transformation vs Bi-Directional Data Transformation

If data flows in only one direction, then the data transformation logic has to be defined in only one direction. A business case example would be an SMS push after the dispatchment of an order.

If data flows in two directions, then the data transformation logic has to be defined, of course, in both directions. Moreover, the transformations described in the logic must be reversible. Otherwise, there is only one way in which you can transport data. A very simple example of that would be two CRM systems. One of them accepts German letters and another one doesn’t. For a German address to be able to sync between these two, the data transformation logic for the address field must convert the letter “ß” in the German word “Strasse” (= “street”) into double “ss” and backwards.

Master Schemas & Master Data Systems

More often than not, the number of data systems to synchronize is larger than 2. If so, the number of connection pairs increases exponentially because of the many possible pairing combinations between these systems. In order to keep the number of pairs under control, one solution is to create a master schema. Or add a master data system. With this technique, you need to map each schema only once – to the master schema.

The idea behind a master schema is that one defines a schema that meaningfully represents objects in all systems. Transformations are then written between a given systems schema and the master schema. This reduces the number of transformation pairs considerably.

A step beyond master schema is to create a master data system. The idea behind a master data system is that in addition to data being stored in systems that manipulate or otherwise use the data, the data is also stored in the master data system. All integrations are then written to work between each system and the master data system.

So far, we have covered the problems that concern the pure technical mechanics of integration and the problems that depend on the correct application of business rules. In our next article of our Data Integration Best Practices series, we will have a look at the problems that combine both. To be more precise, we will focus on how to maintain data consistency. We will review the types of errors that can occur during an integration and the ways how to potentially handle them. Stay tuned by following us on Twitter and LinkedIn!


About the Author

Jacob Horbulyk

Twitter

Pre-sales & Professional Services Engineer at elastic.io. Casual board gamer, language learner, loves a good weekend road trip.


You might want to check out also these posts