Sunday, August 20, 2023

Medallion architecture

Introduction

If you're working in the field of data engineering, analytics, or data warehousing, you have probably heard the phrase "Medallion Architecture". As a data guy I was, and still am, curious about what is executed in every layer, so I did a little investigation into this subject. I searched the internet and, for every layer, copied and pasted the phrases that I found important, so the credit goes to the people and companies that are mentioned in the resources section. So what are the layers, what is the goal of each layer, and what is and is not done in each one? I wrote it all down below.


Landing Zone Layer

Key points:

  • The landing zone is an optional layer; it's not necessary to include it in your data architecture.
  • This layer is often needed in situations in which it's hard to extract data from the source system.
    • For example, when working with external customers or SaaS vendors.
  • In such scenarios you have a dependency, and you sometimes receive data in an unpreferred (file) format or structure. File formats could be CSV, JSON, XML, Parquet, Delta, and so on.


Bronze Layer

Key points:

  • Goal:
    • Long-term storage and recovery. Organizations often need a place to dump huge amounts of structured and unstructured data for processing, and Azure Data Lake Storage is often the cheapest and most efficient way to do so. The Raw layer is designed to act effectively as a sink for further processing, as well as a store for data for however long it is required. It can act as a backup for other systems as well.
    • Auditability. Since we don’t do much in terms of additional processing on the data we bring into the Raw layer, the data is unfiltered and unpurified, which is a good thing in terms of auditability. This raw data is always immutable and can act as a reference of what the source system looked like at a given point in time. We usually keep the data in its native format, but this holds true even if we decide to store it in a more modern data format.
  • The data is stored in its raw format.
  • The layer is used for ingestion of data.
  • Is appended incrementally and grows over time; transactional data, in particular, can be appended incrementally.
  • Provides the ability to recreate any state of a given data system. Maintains the raw state of the data source in the structure "as-is" and retains the full (unprocessed) history of each dataset in an efficient storage format, for example Parquet or Delta. Retaining this full history is what makes it possible to recreate any state of the system.
  • Can be any combination of streaming and batch transactions.
  • Files should be stored in an efficient format such as Parquet or Delta (columnar) or Avro (row-oriented).
  • Additional metadata, such as schema information, source file names, or the time the data was processed, may be added on ingest for enhanced discoverability, description of the state of the source dataset, and optimized performance in downstream applications; see the sketch after this list.
  • Additional fields may be added, e.g. for finding duplicates.
  • Data is immutable (read-only).
  • Managed using interval partitioned tables, for example, using a YYYYMMDD or datetime folder structure.
  • Data types are always string. Comment: this seems to be the same advice as in a standard data warehouse solution.
  • Organized by Source.
  • Data Landed in Native Format:
    • In an effort to load data as-is, we often keep the data format in the Raw layer the same as it is in the source. This means our Raw layer often contains CSV and TXT files, depending on the type of source.
    • That being said, if we manage to bring in the data in Parquet format instead, that would be an added bonus.
  • Additional Checks:
    • Schema Validated.
    • Lake Hierarchy Applied.
    • Timeliness Assessed.
    • Elements Tagged.
    • Completeness and Accuracy Accepted.
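
To make this more concrete, below is a minimal PySpark sketch of a Bronze ingest, assuming a Delta-capable environment such as Databricks (where spark is the ambient SparkSession). The paths and column names are hypothetical; the pattern is what matters: read the landed files as-is with string data types, add ingest metadata, and append to a date-partitioned raw table.

    from pyspark.sql import functions as F

    # Hypothetical lake locations; adjust to your own layout.
    landing_path = "abfss://landing@mylake.dfs.core.windows.net/sales/orders/"
    bronze_path  = "abfss://bronze@mylake.dfs.core.windows.net/sales/orders/"

    # Read the landed CSV files as-is: no schema inference, every column a string.
    raw = (spark.read
           .option("header", "true")
           .option("inferSchema", "false")
           .csv(landing_path))

    # Add ingest metadata for auditability and discoverability.
    bronze = (raw
              .withColumn("_source_file", F.input_file_name())
              .withColumn("_ingested_at", F.current_timestamp())
              .withColumn("_ingest_date",
                          F.date_format(F.current_timestamp(), "yyyyMMdd")))

    # Append-only and immutable in spirit: partitioned by ingest date
    # (the YYYYMMDD structure mentioned above), never updated in place.
    (bronze.write
     .format("delta")
     .mode("append")
     .partitionBy("_ingest_date")
     .save(bronze_path))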


Silver Layer

Key points:

  • Goal:
    • Enrichment. The Silver layer is often where most of the effort is spent on wrangling the data into a meaningful state. Multiple sources are brought together, the data is refined, and we make sure that it is stored in the most efficient way possible so that the following layers can access data effectively.
    • Discovery. Since the data in the Silver layer is cleaner and more organized, this is generally where most of the data discovery is done. This involves data engineers, data scientists, data analysts, and sometimes even business users digging into the data directly to gain more context on the data available in the higher layers. The Silver layer provides a balance between the perks of long-term data retention (Bronze) and cleaner, more organized data (Gold).
    • Reusability. The Silver layer is also where reusable data sets are created and stored, so that different teams creating different higher layers (data marts or warehouses) can reuse the same data sets without having to go through the trouble of creating them again themselves.
  • Apply Delta Format to the Silver Layer.
  • Perform consolidation and standardization.
  • Filter the data.
  • Clean the data.
  • Augment the data. In a machine-learning context, the goal of data augmentation is to increase the diversity and variability of the training data, which can lead to improved model performance and generalization on unseen data.
  • Standardize the data.
  • The data is validated.
  • The data is normalized. Data modeling is 3NF-like or Data Vault-like.
  • This layer is write-optimized.
  • Organized in domains. In the Silver layer the data is organized by domain, such as Customer or Product. Comment: these seem to me to be entities rather than domains.
  • Files in columnar format (preferably Parquet or Delta). A minimal sketch of a typical Silver transformation follows below.
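
As an illustration of the points above, here is a minimal PySpark sketch of a Silver transformation, again with hypothetical paths and column names: read the raw Bronze data, cast the all-string columns to proper data types, standardize and validate values, deduplicate on the business key, and write the result as Delta.

    from pyspark.sql import functions as F
    from pyspark.sql.types import DecimalType

    bronze_path = "abfss://bronze@mylake.dfs.core.windows.net/sales/orders/"
    silver_path = "abfss://silver@mylake.dfs.core.windows.net/sales/orders/"

    orders = spark.read.format("delta").load(bronze_path)

    cleaned = (orders
        # Cast the all-string Bronze columns to proper data types.
        .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
        .withColumn("amount", F.col("amount").cast(DecimalType(18, 2)))
        # Standardize values, e.g. trim and upper-case the country code.
        .withColumn("country", F.upper(F.trim(F.col("country"))))
        # Validate/filter: drop records without a business key.
        .filter(F.col("order_id").isNotNull())
        # Deduplicate, keeping one row per order.
        .dropDuplicates(["order_id"]))

    (cleaned.write
     .format("delta")
     .mode("overwrite")
     .save(silver_path))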


Gold Layer

Key points:

  • The data is structured for read optimization, using consumer-specific data models. One of the ways of getting the data ready for consumption is molding it into a data model that allows for quick and efficient access while also being scalable.
  • Business-level aggregates are applied; see the sketch at the end of this section.
  • Files in columnar format: data is stored in an efficient storage format, preferably Delta (or Parquet).
  • Gold tables represent data that has been transformed for consumption or use cases.
  • Gold uses versioning for rolling back processing errors.
  • Historization is applied only for the set of use cases or consumers. So, Gold can be a selection or aggregation of data that’s found in Silver.
  • In Gold you apply complex business rules. So, it uses many post-processing activities, calculations, enrichments, use-case specific optimizations, etc.
  • Data is highly governed and well-documented.
  • Enforce quality. Although some standardization is already applied in the Silver layer, it's important to remember that the Gold layer will be the version of truth that is exposed to customer dashboards and other downstream reporting systems. So we want to make sure that the data is consumable and clean, not just in terms of the data quality of specific fields but in terms of the entire dataset presented. We can do things like adding filters to reduce the amount of data being pushed forward if it does not benefit that consumer's use cases, and we can aggregate data where applicable too.
  • Security. Since the Gold layer will be the go-to place for most of the organization to get their data, it's a good idea to focus on the security of this data too. Things like row-level security and data masking are often used here, in addition to the normal role-based access control mechanisms; a minimal example follows below.
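
To close the loop, here is a sketch of a Gold-layer step that molds the Silver data into a consumer-specific, read-optimized table, in this case a simple business-level aggregate (names and paths remain hypothetical):

    from pyspark.sql import functions as F

    silver_path = "abfss://silver@mylake.dfs.core.windows.net/sales/orders/"
    gold_path   = "abfss://gold@mylake.dfs.core.windows.net/sales/monthly_revenue/"

    orders = spark.read.format("delta").load(silver_path)

    # Business-level aggregate: monthly revenue per country,
    # shaped for a specific reporting use case.
    monthly_revenue = (orders
        .withColumn("order_month", F.date_format("order_date", "yyyy-MM"))
        .groupBy("order_month", "country")
        .agg(F.sum("amount").alias("total_revenue"),
             F.countDistinct("order_id").alias("order_count")))

    (monthly_revenue.write
     .format("delta")
     .mode("overwrite")
     .save(gold_path))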
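
For the security point, a common pattern on Databricks is a dynamic view that applies row-level security with the built-in is_member() function. This is just a sketch, assuming the aggregate above has been registered as a table called gold.monthly_revenue and that a group named sales_all exists:

    # Row-level security via a dynamic view (Databricks-specific:
    # is_member() checks the current user's group membership).
    spark.sql("""
        CREATE OR REPLACE VIEW gold.monthly_revenue_secure AS
        SELECT order_month, country, total_revenue, order_count
        FROM gold.monthly_revenue
        WHERE is_member('sales_all')   -- members of the group see everything
           OR country = 'NL'           -- everyone else only sees NL rows
    """)

Data masking works along the same lines: wrap sensitive columns in a CASE WHEN is_member(...) expression instead of exposing them directly.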


Resources

For this blog post I've used the following resources. I took the key points from these resources and deduplicated the items or combined them into new key points:

  • https://www.mssqltips.com/sqlservertip/7689/data-lake-medallion-architecture-to-maintain-data-integrity/
  • https://www.jamesserra.com/archive/2019/10/databricks-delta-lake/
  • https://www.advancinganalytics.co.uk/blog/medallion-architecture
  • https://www.databricks.com/glossary/medallion-architecture
  • https://microsoft.github.io/WhatTheHack/060-SynapseAndDatabricks/
  • ChatGPT


Final thoughts

After years in which people claimed that the "Data Warehouse is dead", it seems to me that a lot is coming back to the Data Warehouse principles according to the definition of Inmon. Even Data Vault is mentioned as a proposed integration solution in the Silver layer.


Some people define a Data Warehouse as a technical platform or solution (for instance, in Microsoft Fabric the SQL engine is called Data Warehouse), but a Data Warehouse is not a technical solution; it is a solution that you build on top of platforms like Databricks, Snowflake, or Microsoft Fabric.


Hennie
