Sunday, 25 October 2020

DevOps series: Working with teams in a data warehouse (part I)


Perhaps you have been in this situation: you have a data warehouse that is shared between a large number of teams. You have implemented version control and a branching strategy like Gitflow or Microsoft flow, and you think you have it all, but somehow code is not flowing into production at a steady cadence. The symptoms are that developers complain that their code breaks because other teams deployed code as well, or that nobody knows which team is responsible for which data objects.

This blogpost is a collection of my experiences at several customers over the last couple of years. I've been working in companies with multiple teams, mostly with Microsoft technology in data environments. These were DevOps environments in which we worked as a team, together with other teams, to deliver information products to our customers in the business. This blogpost is a mix of knowledge and experience of data, DevOps and Lean principles.

Continuous delivery and a data warehouse

We, as data professionals, work in data environments (e.g. data warehouses) where we want a common data platform in which entities are well designed and well structured. We think about the business and we try to model the data according to the processes and the way of working of (a part of) the company. This results in a common data model implemented in a database.

I've seen multiple teams working on separate parts of the database, but sometimes teams depend on the work of other teams because they use data entities owned by those teams. Some entities are hotspots that many teams use. Think about entities like Customer, Employee or Organization. These are commonly used entities in a data warehouse. But there are also entities that only one team uses in its own part of the data architecture.

Take an example with three teams: Team 3 depends on objects that are maintained by Team 1 and Team 2. This could be an ETL dependency. Now, if one of those teams changes a referenced object, Team 3 has a problem: its ETL logic will fail. This is a very common situation when teams work in a common data architecture, and it is an annoying problem that is hard to fix.
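To make the dependency problem concrete, here is a minimal sketch in Python. It assumes that object ownership and ETL dependencies are recorded somewhere as simple mappings; the object names (`dim_customer`, `dim_employee`, `fact_sales`), the team names and both functions are hypothetical and only serve the illustration. Given such a record, it is possible to list the dependencies that cross team boundaries and to ask which teams are affected when an object changes.

```python
# Hypothetical ownership map: which team maintains which data object.
OWNERS = {
    "dim_customer": "team1",
    "dim_employee": "team2",
    "fact_sales": "team3",
}

# Hypothetical ETL dependencies: object -> objects it reads from.
DEPENDS_ON = {
    "fact_sales": ["dim_customer", "dim_employee"],
}


def cross_team_dependencies(owners, depends_on):
    """List dependencies where consumer and provider belong to different teams.

    Returns tuples of (consumer_team, consumer, provider_team, provider).
    Objects missing from the ownership map are reported as "unowned".
    """
    result = []
    for consumer, providers in depends_on.items():
        consumer_team = owners.get(consumer, "unowned")
        for provider in providers:
            provider_team = owners.get(provider, "unowned")
            if consumer_team != provider_team:
                result.append((consumer_team, consumer, provider_team, provider))
    return result


def teams_affected_by(changed_object, owners, depends_on):
    """Return the teams, other than the owner, that read from changed_object."""
    owner = owners.get(changed_object, "unowned")
    affected = set()
    for consumer, providers in depends_on.items():
        if changed_object in providers:
            team = owners.get(consumer, "unowned")
            if team != owner:
                affected.add(team)
    return sorted(affected)


if __name__ == "__main__":
    for dep in cross_team_dependencies(OWNERS, DEPENDS_ON):
        print(dep)
    print(teams_affected_by("dim_customer", OWNERS, DEPENDS_ON))
```

In this sketch, a change to `dim_customer` by Team 1 shows up as affecting Team 3. An object that is missing from the ownership map is labelled "unowned", which is one way to surface the "who is responsible" question.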

Final thoughts

What if we could find a way for teams to deliver their code to production as independently from other teams as possible, and, where dependencies do exist, a way to manage them? Are there ideas on how to solve the ambiguity about who is responsible for what, and can we use methods like branching by abstraction to manage dependencies between teams?
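As a rough illustration of what branching by abstraction could look like for a shared entity, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `CustomerInterface` class, the two implementations and the column names are hypothetical. The idea is that consuming teams only read through the abstraction, while the owning team builds a new implementation next to the old one and switches over when it is ready.

```python
from typing import Dict, List

Row = Dict[str, str]


def customers_v1() -> List[Row]:
    """Current implementation (hypothetical): a single 'name' column."""
    return [{"customer_id": "1", "name": "Ada Lovelace"}]


def customers_v2() -> List[Row]:
    """New implementation under construction (hypothetical): name is split."""
    return [{"customer_id": "1", "first_name": "Ada", "last_name": "Lovelace"}]


def v2_as_v1_contract(rows: List[Row]) -> List[Row]:
    """Map the new structure back to the contract that consumers rely on."""
    return [
        {
            "customer_id": r["customer_id"],
            "name": f'{r["first_name"]} {r["last_name"]}',
        }
        for r in rows
    ]


class CustomerInterface:
    """The abstraction: consumers read from here, never from v1 or v2 directly."""

    def __init__(self, use_new: bool = False) -> None:
        # The toggle is controlled by the owning team.
        self.use_new = use_new

    def read(self) -> List[Row]:
        if self.use_new:
            return v2_as_v1_contract(customers_v2())
        return customers_v1()


if __name__ == "__main__":
    old = CustomerInterface(use_new=False).read()
    new = CustomerInterface(use_new=True).read()
    # Consumers see the same contract before and after the switch.
    print(old == new)
```

In a database, a view over the underlying tables could play the role of this abstraction, but that is only one possible design and not something this post prescribes.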

How can we achieve a complete data platform (data warehouse) with the advantages Inmon describes (subject-oriented, integrated, time-variant and non-volatile) and still deliver products to the business as fast as possible? Data warehouses had a bad name in the past, and Data Vault and other flexible techniques helped to deliver faster to the business. An extra challenge is the number of teams that work on a data warehouse. How should they balance their priorities between speed of delivery and the integrity of the data warehouse? If they focus on speed of delivery, less attention is paid to integrity. If they focus only on integrity, less and less is delivered to the business. Of course the integrity of the data warehouse is important, but delivering code to production is just as important.

This is the first blogpost about DevOps and lean principles and working in data environments like data warehouses. I'm planning to write more about this topic in the future. 

