In the previous article of the final chapter of our Data Integration Best Practices series, we looked at how to describe integrations so that everybody, from developers to business users, understands the requirements correctly. We also discussed why you might eventually need some type of integration layer to keep your integration projects under control.
In this article, we continue reviewing some tips that revolve around preparing for and running an integration project. As the title suggests, we will start with three types of project environments and what part of the project each type includes, and finish with some best practices for log collection.
Separation of Environments
In general, for each software system, there should be three types of environments: development, staging and production. The same is also true for an integration project. So, let’s dive deeper into what you need – or don’t need – to consider at each stage.
Development
- At this stage of an integration project, you are building the interactions that move data from inside the system to outside the system and vice versa. While that work is in progress, the system should not be connected to other systems in the environment. One possible exception is a connection to a staging or production authentication mechanism.
- The database should be mostly empty and resettable.
- This environment should be used to test new integrations, plugins, schema changes or other configuration tweaks.
- As the data isn’t real, there is no need to worry about data anonymization.
Staging
- In this environment, unlike in development, a system should be connected only to other staging systems. One possible exception is a connection to a production authentication mechanism.
- The database should be full of test records that are comparable in structure and volume to those in production systems
- In this environment, you can have more liberal access policies for external vendors, developers, or any other “parties” involved
- You should use this environment to verify the impact of changes on other systems
- You also should be able to copy the configuration to production
- As the data isn’t real, there is no need to be worried about data anonymization.
Production
- At this stage, only a few trusted people should have admin access. Should you need to grant additional access, it should be temporary and monitored.
- At this stage, a system should only talk to other production systems.
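The connection rules above can be written down as a small check, for example to review a planned connection before it is configured. This is only a minimal sketch in Python: the `Connection` class, the `is_allowed` function and the `target_is_auth` flag are hypothetical names made up for illustration, and the rules encode one reading of the bullet points above.

```python
from dataclasses import dataclass

# The three environment types described above.
ENVIRONMENTS = ("development", "staging", "production")


@dataclass(frozen=True)
class Connection:
    source_env: str               # environment of the system opening the connection
    target_env: str               # environment of the system being connected to
    target_is_auth: bool = False  # True if the target is an authentication mechanism


def is_allowed(conn: Connection) -> bool:
    """Return True if the connection respects the separation rules above."""
    if conn.source_env not in ENVIRONMENTS or conn.target_env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment in {conn}")
    if conn.source_env == "development":
        # Development is isolated, except for staging or production authentication.
        return conn.target_is_auth and conn.target_env in ("staging", "production")
    if conn.source_env == "staging":
        # Staging talks to staging, except for production authentication.
        if conn.target_env == "staging":
            return True
        return conn.target_is_auth and conn.target_env == "production"
    # Production only talks to production.
    return conn.target_env == "production"


if __name__ == "__main__":
    print(is_allowed(Connection("staging", "production")))                       # False
    print(is_allowed(Connection("staging", "production", target_is_auth=True)))  # True
```

A check like this could run as part of a deployment review. The important design choice is that the default answer is "not allowed" and every exception is spelled out explicitly.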
Log Collection
In an environment with multiple systems, each system produces its own logs, so log collection deserves special attention in any integration project. Here are a few key points to keep in mind.
File-based vs database-based logging
Computer systems generally produce log information in discrete log statements. Such statements can be written sequentially to a file or placed in a logging database. Log files are simpler and generally more reliable but have some drawbacks. They are harder to search, especially when each machine in a service writes its own log file.
Logging databases, on the other hand, are easier to search. They can also support anonymization, although this requires more setup. It is also possible to extract log files into a logging database, i.e. a database whose sole purpose is to store logs.
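To illustrate why a database is easier to search than a set of files, here is a minimal Python sketch that extracts a few log lines into an in-memory SQLite table and queries them. The log lines, their "timestamp level system message" layout and the table schema are all assumptions made for this example. A real logging database would be a dedicated product and not SQLite.

```python
import sqlite3

# Hypothetical log lines in an assumed "timestamp level system message" layout.
LOG_LINES = [
    "2024-01-01T10:00:00 INFO crm Exported order 42",
    "2024-01-01T10:00:05 ERROR erp Rejected order 42: missing field",
    "2024-01-01T10:01:00 INFO crm Exported order 43",
]


def load_logs(lines):
    """Parse log lines and store them in an in-memory SQLite table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE logs (ts TEXT, level TEXT, system TEXT, message TEXT)")
    # Split on the first three spaces only, so the message keeps its own spaces.
    rows = [line.split(" ", 3) for line in lines]
    db.executemany("INSERT INTO logs VALUES (?, ?, ?, ?)", rows)
    return db


db = load_logs(LOG_LINES)
errors = db.execute(
    "SELECT system, message FROM logs WHERE level = 'ERROR' ORDER BY ts"
).fetchall()
print(errors)
```

With plain files, the same question ("which errors occurred, in which system?") would mean searching every file on every machine.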
Log formats
Many systems can write their logs in several different formats. Since you will most likely need to integrate logs from various providers, configure the systems you use to produce their log statements in the same format. This makes integrating logs from different systems a lot easier.
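As an illustration, here is one way to make a Python-based system emit its log statements in a shared format, using the standard `logging` module. The JSON layout and the field names (`ts`, `level`, `system`, `message`) are assumptions chosen for this sketch. The point is only that every system agrees on the same fields, whatever they are.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every log record as one JSON object with a fixed set of fields."""

    def __init__(self, system: str):
        super().__init__()
        self.system = system  # name of the system that produces the log

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "system": self.system,
            "message": record.getMessage(),
        })


def make_logger(system: str) -> logging.Logger:
    """Return a logger for the given system that writes JSON lines to stderr."""
    logger = logging.getLogger(system)
    if not logger.handlers:  # avoid attaching a second handler on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(system))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    make_logger("crm").info("Exported order %s", 42)
```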
Log aggregation
In integration projects, a fair share of troubleshooting revolves around information that you’d expect to leave one system and arrive at another, but that never shows up. To find out why, you would typically pull the logs from both systems to check whether the data left the first system at all and, if so, why it didn’t reach the destination system.
If you can aggregate all those logs in a central location, it becomes considerably easier to search through them. Such aggregation is implicitly available in logging databases, since such a database is generally shared between servers and systems. For log files, you need some extra setup to pull them from the servers that produce them and push them to a central store where you can manage them more easily. It is also possible to extract log files into a logging database, such as Graylog.
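Once the logs sit in one place, the "where did my record get lost?" question becomes a simple lookup. The Python sketch below assumes that every aggregated log entry carries a `system` name and a `record_id` identifying the piece of data being moved. Both field names, the sample entries and the `first_missing_hop` function are hypothetical and only serve the example.

```python
from typing import Optional

# Hypothetical aggregated entries, as they might come out of a central log store.
ENTRIES = [
    {"system": "crm", "record_id": "order-42", "message": "exported"},
    {"system": "erp", "record_id": "order-42", "message": "imported"},
    {"system": "crm", "record_id": "order-43", "message": "exported"},
]


def first_missing_hop(entries, record_id: str, path: list) -> Optional[str]:
    """Return the first system on the expected path with no log entry for the
    record, or None if every system on the path logged it."""
    seen = {e["system"] for e in entries if e["record_id"] == record_id}
    for system in path:
        if system not in seen:
            return system
    return None


print(first_missing_hop(ENTRIES, "order-42", ["crm", "erp"]))  # None
print(first_missing_hop(ENTRIES, "order-43", ["crm", "erp"]))  # erp
```

This only works if the systems log a shared identifier for each record, which is another reason to agree on a common log format up front.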
Sensitive information in logs
Occasionally, logs contain sensitive information. Placing all logs in one central place can then become a security risk. To reduce this risk, you can either:
- Refrain from logging sensitive information in the first place, or
- Configure your logging database to anonymize data
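A practical way to approach the first option is to mask sensitive values before a log statement ever leaves the system. The following Python sketch does this with a filter for the standard `logging` module. The `RedactEmails` class and the deliberately simple email pattern are assumptions for illustration. Real projects would need to cover more kinds of sensitive data than email addresses.

```python
import logging
import re

# Deliberately simple pattern for illustration; it will not catch every valid address.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class RedactEmails(logging.Filter):
    """Replace email addresses in a log record before any handler sees it."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the final message first, so values passed as arguments are masked too.
        record.msg = EMAIL_RE.sub("<redacted>", record.getMessage())
        record.args = ()
        return True  # keep the record, only its content is changed


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("crm")
    logger.addFilter(RedactEmails())
    logger.info("User %s signed up", "jane.doe@example.com")
```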
Log retention
Logs take up space, and space costs money. The General Data Protection Regulation (GDPR) also stipulates that personal information needs to be deleted at some point, whether after some predefined time period or upon request, and deleted completely, including from logs. Therefore, you need a clearly defined strategy for log retention. Organizations often have automated processes to delete certain logged information after a specific period of time.
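A retention strategy usually boils down to two operations: dropping entries that are older than a fixed period, and erasing everything related to one person on request. Here is a minimal Python sketch of both. The 30-day period is an arbitrary assumption and not a number taken from the GDPR, and the `ts` and `subject_id` fields are hypothetical.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=30)  # assumed retention period, purely for illustration


def apply_retention(entries, now: datetime, max_age: timedelta = MAX_AGE):
    """Keep only the entries that are not older than max_age."""
    return [e for e in entries if now - e["ts"] <= max_age]


def erase_subject(entries, subject_id: str):
    """Drop every entry that refers to the given person (deletion upon request)."""
    return [e for e in entries if e.get("subject_id") != subject_id]


if __name__ == "__main__":
    now = datetime(2024, 3, 1)
    entries = [
        {"ts": datetime(2024, 1, 1), "subject_id": "u1", "message": "old entry"},
        {"ts": datetime(2024, 2, 20), "subject_id": "u1", "message": "recent entry"},
        {"ts": datetime(2024, 2, 25), "subject_id": "u2", "message": "recent entry"},
    ]
    entries = apply_retention(entries, now)
    entries = erase_subject(entries, "u1")
    print([e["subject_id"] for e in entries])  # ['u2']
```

In a real setup, the same logic would run as a scheduled job against the central log store, so that deletion does not depend on someone remembering to do it.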
With this article, our blog series on Data Integration Best Practices comes to an end. To summarize, we have covered the various problems that can occur in data integration projects. We have also addressed the different types of integration, the systems that move data and even the pricing aspect of such a project. Last but not least, we reviewed some practical tips for preparing and running an integration project.
This is not the actual end, though. We are going to combine all separate articles and bring them all out in one ebook, most likely towards the end of August. So, stay tuned by following us on Twitter and LinkedIn!