Updated on 21st July, 2024
Introduction
Since people started building software systems for data analysis, the logical architecture and the composition of the teams building such systems have remained more or less centralised. Typically, data from various business domains (products, finance, sales, etc.) is moved to a central “place”, such as a data warehouse, data lake, or lakehouse, where a single team transforms it for analytical and reporting purposes.
These systems work fine while their complexity is low. However, they become difficult to evolve and less useful when implemented across a large, complex organisation, or when organisational complexity increases due to changes in the business context.
Data mesh tries to solve these problems using a decentralised approach to analytical data systems. It is a sociotechnical approach to building decentralised systems for analysing and sharing data at scale.
The concept was introduced by Zhamak Dehghani in 2019. It challenges the centralised data monoliths prevalent in traditional data architectures and proposes a decentralised, domain-oriented approach instead. The core idea is to treat data as a product, with each domain responsible for its own data products, ensuring scalability, autonomy, and ease of management.
In a data mesh, each domain team becomes responsible for producing and sharing its own analytical data; this is the first principle of data mesh: domain ownership of data. However, decentralisation introduces additional complexity, especially around the interoperability and governance of data. To address this, data mesh adds three further principles: data as a product, the self-serve data platform, and federated computational governance.
A critical capability required for building data products is to ingest and transform (ETL) domain data sets into read-optimised output data sets. From the data mesh perspective, support for ingesting data from varied types of sources based on interface-specification files written in commonly used formats like JSON and XML is a significant advantage, because it makes data product interoperability easier to implement.
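As a minimal sketch of the idea, the snippet below parses and validates a small interface-specification file; the spec structure and field names here are hypothetical, invented for illustration, not an established standard.

```python
import json

# A hypothetical interface-specification file describing a source data set.
# The schema shown here is illustrative, not a standard format.
spec_text = """
{
  "source": {"type": "csv", "path": "sales/2024/07/orders.csv"},
  "columns": [
    {"name": "order_id", "type": "string"},
    {"name": "amount",   "type": "decimal"}
  ],
  "output": {"format": "parquet", "partition_by": ["order_date"]}
}
"""

def load_spec(text: str) -> dict:
    """Parse and minimally validate an ingestion spec."""
    spec = json.loads(text)
    for key in ("source", "columns", "output"):
        if key not in spec:
            raise ValueError(f"missing required section: {key}")
    return spec

spec = load_spec(spec_text)
print([c["name"] for c in spec["columns"]])  # ['order_id', 'amount']
```

A self-serve platform could drive its ingestion tooling from such files, so domain teams declare sources and outputs instead of writing pipeline code.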
While building a centralised data platform in Azure, you can use many services to implement data ingestion and transformation capabilities. For example: Azure Synapse pipelines and dataflows, Azure Data Factory pipelines and dataflows, Azure Synapse Spark, Azure Databricks Spark, etc. These are either managed services that run a programming language like Python (as with PySpark in Azure Databricks) or low-code, GUI-based services like the Azure Synapse/ADF pipelines and dataflows. You can also use a combination of these capabilities.
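To make the ingest-and-transform step concrete, here is a tool-agnostic sketch in plain Python; the data and column names are invented, and in practice this logic would run inside one of the managed services above (e.g., as a PySpark job).

```python
import csv
import io
from collections import defaultdict

# Raw domain data as it might arrive from an operational system (illustrative).
raw = io.StringIO(
    "order_id,order_date,amount\n"
    "1,2024-07-01,10.50\n"
    "2,2024-07-01,4.25\n"
    "3,2024-07-02,7.00\n"
)

# Transform: aggregate to a read-optimised shape (daily revenue per date),
# the kind of output an analytical consumer would query directly.
daily_revenue = defaultdict(float)
for row in csv.DictReader(raw):
    daily_revenue[row["order_date"]] += float(row["amount"])

print(dict(daily_revenue))  # {'2024-07-01': 14.75, '2024-07-02': 7.0}
```

The same shape of job (read operational records, emit a read-optimised aggregate) is what the managed services express at scale, whether through code or a GUI.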
Data products are built by a cross-functional team of domain experts, analysts, data engineers, and others. Not all of them will have the specialised skills required to work with some of these services and tools. Therefore, understandability, ease of use, and interoperability are at least as important as scalability, performance, and cost, which are often the most prioritised factors when selecting services for a centralised data platform.
In centralised analytical data systems, achieving the required scale in terms of the volume of data that can be stored, processed, and retrieved is often a critical concern. However, as with data ingestion and transformation capabilities, interoperability, understandability, and ease of use are equally important. One of the self-serve data platform’s core concerns is enabling data product teams of “technology generalists”, e.g., people with SQL and Excel skills, to create and manage data products. Therefore, data mesh envisions polyglot data storage and retrieval services, giving the data product team the flexibility to choose an appropriate one for their context while staying aligned with governance objectives through global policies and platform APIs.
Ultimately, the selection of storage services depends on their fit for the data product’s use cases and their alignment with cross-cutting governance objectives like data encryption and access control. Generally speaking, cloud platforms like Azure offer uniform services and APIs, such as Azure Policy, Azure Monitor, and Azure RBAC, for implementing security, access control, logging, and monitoring. Therefore, it is possible to offer more varied storage services to the data product teams without significantly increasing the development complexity of the self-serve data platform or of computational governance.
A data governance tool is usually a good starting point for extracting data lineage and classifying data according to predefined classification rules. For example, using Azure Purview, you can automatically scan data sources like Azure Data Lake and Azure SQL DB and extract metadata from data assets such as files and tables. If confidential information like credit card numbers or social security numbers is present, Purview classifies the data assets and their columns accordingly. This is very useful for implementing governance concerns like data privacy across data products.
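Purview’s built-in classifiers do this scanning automatically; as a rough illustration of the underlying idea, the sketch below matches two common sensitive-data types with regular expressions. The patterns are deliberately simplified and are not Purview’s actual classification rules.

```python
import re

# Simplified patterns for two common sensitive-data types.
# Real classifiers (e.g. in Purview) are far more robust than these.
CLASSIFIERS = {
    "credit_card":     re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
    "social_security": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def classify(value: str) -> list[str]:
    """Return the classification labels whose pattern matches the value."""
    return [label for label, pattern in CLASSIFIERS.items() if pattern.search(value)]

print(classify("card: 4111-1111-1111-1111"))  # ['credit_card']
print(classify("ssn: 123-45-6789"))           # ['social_security']
```

In a data mesh, such labels can then feed global access-control policies, e.g., masking classified columns for consumers without the right entitlement.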
Support for both operational and analytical applications, and uniformity of capabilities and APIs, are crucial considerations when selecting logging and monitoring services for a data mesh. Azure Monitor is an ideal choice here: it provides the same capabilities and API across operational and analytical services, excellent reporting and dashboard support, and the ability to query and analyse metrics and logs using Kusto, a query language similar in spirit to SQL. You can also send external log data to Azure Monitor using its REST API.
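Sending external logs this way goes through the HTTP Data Collector API, which authenticates each request with an HMAC-SHA256 signature over the request headers. Below is a sketch of the signature computation only (no network call); the workspace ID and key are fabricated placeholders, and you should check the current API documentation before relying on the exact string-to-sign format.

```python
import base64
import hashlib
import hmac

def build_signature(workspace_id: str, shared_key: str,
                    date: str, content_length: int) -> str:
    """Build the SharedKey authorization header for the Data Collector API."""
    # The string-to-sign covers the method, body length, content type,
    # request date, and resource path of the log-ingestion request.
    string_to_sign = (
        f"POST\n{content_length}\napplication/json\n"
        f"x-ms-date:{date}\n/api/logs"
    )
    digest = hmac.new(
        base64.b64decode(shared_key),  # the workspace key is base64-encoded
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    return f"SharedKey {workspace_id}:{base64.b64encode(digest).decode()}"

# Placeholder credentials for illustration only.
auth = build_signature(
    workspace_id="00000000-0000-0000-0000-000000000000",
    shared_key=base64.b64encode(b"not-a-real-key").decode(),
    date="Mon, 22 Jul 2024 10:00:00 GMT",
    content_length=42,
)
print(auth.startswith("SharedKey "))  # True
```

The resulting value goes into the request’s Authorization header alongside an x-ms-date header carrying the same timestamp.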
An approach based on modern auth protocols is essential for interoperability between data products within the mesh and outside it. Azure Active Directory is the cornerstone of identity and access control in Azure. It has excellent support for the modern authentication and authorisation protocols OpenID Connect and OAuth 2.0, as well as legacy protocols like SAML 2.0. Together with Azure RBAC, Azure AD provides the necessary foundation for identity and access control in a data mesh.
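For example, a service consuming another domain’s data product might authenticate via the OAuth 2.0 client credentials flow against Azure AD’s v2.0 token endpoint. The sketch below only constructs the request (no network call); the tenant, client ID, secret, and scope are placeholders.

```python
from urllib.parse import urlencode

def token_request(tenant_id: str, client_id: str,
                  client_secret: str, scope: str) -> tuple[str, str]:
    """Build the URL and form body for an OAuth 2.0 client credentials request."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    })
    return url, body

# Placeholder identifiers; a real client would POST this body with
# Content-Type: application/x-www-form-urlencoded and parse the JSON
# response for the access_token.
url, body = token_request(
    tenant_id="contoso.onmicrosoft.com",
    client_id="00000000-0000-0000-0000-000000000000",
    client_secret="placeholder-secret",
    scope="https://storage.azure.com/.default",
)
print(url)
```

The returned bearer token is then presented to the data product’s API, where Azure RBAC decides what the caller may read.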
Implementing Data Mesh in Azure requires a thoughtful approach, leveraging the rich set of services provided by the platform. By embracing decentralised data ownership, treating data as a product, and adopting domain-oriented architectures, organisations can unlock the true potential of their data while harnessing the scalability and flexibility offered by Azure. The journey towards Data Mesh in Azure is not just a technical transformation but a cultural shift that empowers domains to take ownership of their data, fostering innovation and agility in the ever-evolving world of data management.