
2. Data engineering

Working with data infrastructures within an organization might be relevant even if you are using third-party solutions that promise to do all of this for you. Creating and managing these infrastructures falls under the umbrella of data engineering.

If you use a tool like Google Analytics 4, chances are that you can get all of your data-related work done within its user interface.

However, tools always have an idiosyncratic approach to organizing the data you send to them (schemas were discussed in the previous Topic). What if this approach doesn’t match what your organization requires? What if you want to join the data produced by this third-party service with other data your organization generates and processes?

Whether it’s keeping the data in a data warehouse for archiving or infrequent exploration, forwarding it to enrichment and validation pipelines, joining it with other data sources, or simply taking full control of the data with proprietary mechanisms, these types of data infrastructures are commonplace and fairly inexpensive to build and maintain.

As a technical marketer, the terminology and concepts you’ll learn about in this Topic are fundamental to any data-oriented process within an organization. Particularly in digital marketing, you’ll frequently end up working with multiple data sources. These data sources need to converge somehow to give you a comprehensive understanding of data efforts in your organization.

Data infrastructure

Data infrastructure refers to the hardware (virtual or physical) that forms the backbone of the organization’s data practices. It also comprises the software that runs on this hardware and provides the technological framework for handling the data flows.

Example

A data engineer might be tasked with building a system for pulling data from Google Analytics 4 and patching it into the organization’s data pipelines to join it with other data sources. This is a common use case when you want to see, for example, how user interactions with your Google Ads campaign correlate with other activities on your website measured with GA4.

Think of data infrastructure, then, as the actual nuts and bolts of the pipeline itself. It determines the applications you’ll need to deploy, the virtual machines you’ll need to rent, the disk space you’ll need to reserve for storage, the firewalls and encryption you’ll need to establish for networking, and the maintenance processes you’ll need to observe.

Deep Dive

Components of a Google Analytics 4 data pipeline

With your Google Analytics 4 data, you could build a simple pipeline in a cloud platform with these components.

  1. Building a scheduler mechanism for pulling data from Google Analytics 4 periodically (daily, for example).
  2. Utilizing virtual computation resources for transforming this data so that it no longer uses the Google Analytics 4 schema but is instead flattened or split into a set of relational tables that match what the organization uses elsewhere.
  3. Setting up storage systems at each juncture. Data might need to be stored at different parts of the pipeline for different purposes. 
  4. Ensuring that networking resources are utilized to allow the data to be transported from one system to the next securely and efficiently.
  5. Configuring integration platforms that make the data available in real-time (row-by-row as the data comes in) and in batches (periodic dumps of the full data set).
  6. Continuously engineering the system for efficiency, availability, security, and low latency.
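To make step 2 more concrete, here is a minimal sketch of flattening a nested event into a single flat row. The sample event is made up for illustration; its shape (an event_params list of key/value records with typed value fields) loosely mirrors the Google Analytics 4 export to BigQuery, and the function name flatten_event and the param_ column prefix are assumptions, not part of any official tooling.

```python
from typing import Any

# Hypothetical sample event, shaped loosely like a row in the GA4 BigQuery export.
SAMPLE_EVENT = {
    "event_date": "20240115",
    "event_name": "page_view",
    "user_pseudo_id": "123.456",
    "event_params": [
        {"key": "page_location", "value": {"string_value": "https://example.com/"}},
        {"key": "ga_session_id", "value": {"int_value": 1705312800}},
    ],
}

# Each parameter value is stored in exactly one of these typed fields.
VALUE_FIELDS = ("string_value", "int_value", "float_value", "double_value")


def flatten_event(event: dict[str, Any]) -> dict[str, Any]:
    """Turn one nested event into a flat row with one column per parameter."""
    # Copy the top-level columns, leaving out the nested parameter list.
    row = {key: value for key, value in event.items() if key != "event_params"}
    for param in event.get("event_params", []):
        value = param.get("value", {})
        # Use the first typed field that is populated, or None if there is none.
        row[f"param_{param['key']}"] = next(
            (value[field] for field in VALUE_FIELDS if value.get(field) is not None),
            None,
        )
    return row


flat = flatten_event(SAMPLE_EVENT)
print(flat)
```

In a real pipeline this kind of transformation would more likely be written in SQL or in a tool like Dataform, but the idea is the same: nested, tool-specific structures are turned into columns that the rest of the organization can query.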

Many tools exist for each part of the pipeline, and all the major cloud services offer proprietary solutions as well.

A relatively easy way to get started is to use the automatic export of Google Analytics 4 data to Google BigQuery with a simple data management system like Dataform to help build and develop transformations.

Data architecture

While data infrastructure is more concerned with how the physical environments for data services are built, data architecture is about the high-level structuring of data to meet the needs of the organization.

Example

Whenever data is stored, it needs to be stored in some format. Determining the naming conventions for files and how they are logically ordered in a file system is an important part of data architecture.

Similarly, defining the data model for the data in terms of what columns, keys, and value types the data should be aligned with is an architectural effort that requires quite a bit of planning.

Data architecture is thus concerned with how the data is structured and ordered at different parts of the pipeline. It defines the quality of the data as it passes through different junctions, it outlines the governance conditions for the data, it establishes how the data flows between different services, and it overlaps with data infrastructure in terms of making sure the data is stored and processed in optimal and scalable ways.
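As a rough illustration of what a data model can look like in practice, the sketch below defines the columns, key, and value types for a session record and validates raw input against it. The SessionRecord fields and the parse_session helper are hypothetical names chosen for this example, not a prescribed model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SessionRecord:
    """Hypothetical data model: one row per website session."""

    session_id: str      # primary key of the table
    user_id: str         # key used for joining with other tables
    session_date: date   # stored as a real date, not free-form text
    page_views: int


def parse_session(raw: dict) -> SessionRecord:
    """Validate a raw dict against the model; raises if a field is missing or malformed."""
    record = SessionRecord(
        session_id=str(raw["session_id"]),
        user_id=str(raw["user_id"]),
        session_date=date.fromisoformat(raw["session_date"]),
        page_views=int(raw["page_views"]),
    )
    if record.page_views < 0:
        raise ValueError("page_views must not be negative")
    return record


record = parse_session(
    {"session_id": "s-1", "user_id": "u-42", "session_date": "2024-01-15", "page_views": "3"}
)
print(record)
```

Rejecting malformed rows at this point is one simple way to define data quality at a single junction of the pipeline.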

Together, data infrastructure and architecture provide a comprehensive outline of an organization’s data ecosystem.

Deep Dive

Data architecture components

There’s a lot of overlap between data infrastructure and data architecture. The goal of both is to establish how data flows within an organization in efficient, secure, compliant, and meaningful ways.

Data architecture goes beyond the physical attributes of the data pipeline and is more concerned with data models, schemas, and logical mapping of how the data flows between pipeline components.

  1. Whenever data is stored, it needs to be stored in some format. This means that there needs to be a data model in place that establishes the columns, keys, and values that are populated with the raw data.
  2. Files need to be named and stored in a structured fashion. Naming conventions and file organization patterns are a classic and never-ending debate.
  3. When data is transported from one part of the organization to another, the data flows must be mapped. How does data move from collection to storage? How and when is it processed and transformed? Where can it be consumed? These decisions need to be mapped so that the data flows themselves can be traceable.
  4. Different conditions for data governance and quality apply to different parts of the pipeline. An important aspect of data architecture is to make sure that schemas and data models are kept up-to-date, and that data lineage can be traced from the end product all the way back to its origin.
  5. Data architecture and infrastructure engineering overlap when it comes to optimizing and scaling the system. The data might need to be partitioned, indexed, and cached to avoid placing undue stress on computation and storage resources. The system needs to be flexible so that it can handle unexpected surges, and so that it can scale down when computation is not in high demand.
  6. Security and compliance are increasingly important aspects of data architecture design. Access controls, encryption policies, and security and protection measures are all critical to avoid data breaches and security incidents. Furthermore, data storage and usage must comply with regulations and legal standards.
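The naming and partitioning points in the list above can be sketched with a small helper that builds a date-partitioned file path. The layout (source, table, then year/month/day folders) and the function name partition_path are assumptions made for illustration; your organization's convention may well look different.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_path(source: str, table: str, day: date, part: int = 0) -> str:
    """Build a date-partitioned path so one day's data can be located or replaced on its own."""
    return str(
        PurePosixPath(source)
        / table
        / f"year={day:%Y}"
        / f"month={day:%m}"
        / f"day={day:%d}"
        / f"{table}_{day:%Y%m%d}_{part:04d}.jsonl"
    )


path = partition_path("ga4", "events", date(2024, 1, 15))
print(path)
# ga4/events/year=2024/month=01/day=15/events_20240115_0000.jsonl
```

Whatever convention you choose, generating paths from a single function rather than typing them by hand keeps the naming consistent and makes the data easier to trace.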

Don’t miss this fact!

Where data infrastructure is concerned with the components, physical structures, and applications of the pipeline, data architecture focuses on the strategic design and organization of data across this infrastructure. Both are equally essential in a data-informed organization.

Databases, data warehouses, and data lakes

While they sound like they are just different names for the same thing, databases, data warehouses, and data lakes each serve a unique purpose in a modern data ecosystem.

A database is generally a structured collection of data that can be easily accessed, managed, and updated.

A typical database would serve a singular purpose. It could store the financial records of the company, or the company’s customer data, or the access logs collected by the company’s web servers.

A data warehouse is a large, centralized repository of data. It consolidates data from different sources (such as different databases), and it is typically designed with denormalized structures to optimize for query speed and analytical processing.

Example

If you wanted to join Google Analytics 4 and Google Ads data together, you could store them in the same data warehouse. You’d build schemas and data models that allow you to query these two different data sets, using the user or session or ad campaign as the key components of the queries.

A data lake is a storage repository that hosts vast amounts of “raw” data in its native format. This data remains untouched until it’s needed. Unlike with a data warehouse, there wouldn’t be a schema applied to the data when it’s stored. Instead, the schema is applied at query time.

Data lakes can also store data in any format – structured or unstructured.
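The idea of applying a schema at query time ("schema-on-read") can be sketched in a few lines. The raw lines, the column names, and the read_with_schema helper below are all made up for illustration; real data lake query engines do far more than this, but the principle is the same.

```python
import json

# Raw lines as they might sit in a data lake: nothing was validated when they were stored.
RAW_LINES = [
    '{"user": "u-1", "amount": "19.90", "channel": "email"}',
    '{"user": "u-2", "amount": 5, "note": "extra field, no channel"}',
]


def read_with_schema(lines, schema):
    """Schema-on-read: pick columns and cast types only when the data is queried."""
    for line in lines:
        raw = json.loads(line)
        # Fields missing from the raw record become None; present fields are cast.
        yield {col: (cast(raw[col]) if col in raw else None) for col, cast in schema.items()}


# Two different "queries" impose two different schemas on the same raw data.
revenue_rows = list(read_with_schema(RAW_LINES, {"user": str, "amount": float}))
channel_rows = list(read_with_schema(RAW_LINES, {"user": str, "channel": str}))
print(revenue_rows)
print(channel_rows)
```

Note how the second record is stored without complaint even though it lacks a channel. In a data warehouse, a mismatch like this would typically be caught when the data is loaded.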

Modern hybrid approaches such as open table formats (e.g. Apache Iceberg and Delta Lake) introduce abstractions on top of the data lake to make the process more similar to that of a data warehouse. In these “transactional” data lakes, the data is modelled and queried as with a data warehouse but with the storage cost benefits of a data lake.

Example

A company might use a data lake to dump financial data together with behavior data, customer data, and advertising data. The storage itself wouldn’t place restrictions on how the data can be queried, which means that it allows for a more ad hoc approach than a data warehouse.

Ready for a quick break?

Take a break and imagine you are sitting on the shore of a calm and tranquil (data) lake. Remember that there’s more to life than figuring out the appropriate structure for data services in your organization.

Here’s a very rough overview of the three systems. These terms are often used interchangeably, and there’s a lot of overlap between the use cases in particular.

| Feature | Database | Data warehouse | Data lake |
|---|---|---|---|
| Primary purpose | Manage transactional data for a single purpose | Support complex queries and analysis across different data sources | Store large amounts of raw data |
| Data type | Structured | Structured (typically) | Can be structured and/or unstructured |
| Schema | Defined when the database is created | Defined when the data warehouse is created | Defined when the data is being queried |
| Storage | Optimized for frequent updating of records (transactional) | Optimized for complex analytics queries | Data is stored “raw” in its native format |
| Query performance | High for transactional queries | High for analytical queries | Varies by use case, and what data / tools are needed |
| Typical user | Database administrator, developer, financial controller | Digital marketers, business analysts, data scientists | Business intelligence analysts, data scientists, data engineers |

Each type of structure has its uses. A modern data-informed organization would typically rely on databases and data warehouses to serve both transactional and analytical needs. However, when the organization needs to handle vast quantities of data for big data processing, analytics, and complicated machine learning use cases, a data lake might become relevant.

Integrations and transformations

All the data wrangling, infrastructure planning, and architecture design amount to nothing if the data isn’t actually used for anything.

In digital marketing, for example, companies have a multitude of different data sources to contend with. These include sources like:

  • Web analytics data (Google Analytics, Adobe Analytics, etc.)
  • Ad platforms (Google Ads, Facebook Ads, LinkedIn Ads, etc.)
  • CRM systems (Salesforce, HubSpot, Microsoft Dynamics, etc.)
  • Email marketing (Mailchimp, ConvertKit, SendGrid, etc.)
  • Social media (Instagram, Pinterest, TikTok, etc.)

Even though these represent different channels, the same user is often behind the data in each of them. A robust data pipeline should therefore support integrations that bring these sources together, and transformations that produce as comprehensive a view into all the user’s touchpoints as possible.

The pressure this places on data engineering is immense. Not only does the pipeline need to be able to pull in data from vastly different types of data sources, it also needs to handle transformations to make sure that the different schemas can be observed together as different facets of the user’s interactions with the company brand.

From the vantage of digital marketing, integrations make data from different marketing touchpoints accessible in a centralized location. Transformations, on the other hand, keep the data clean, meaningful, and structured in a way that makes it usable in analysis and reporting.

Example

You have your web analytics data in one table, and you have ad click data in another table. While they come from two completely separate systems, they have one thing in common: the ad click identifier, which is stored in the ad data table when the ad is clicked, and in the web analytics data table when the user lands on the site.

With these two tables, you could run a transformation, which joins this information together into a new table. The new table could align the ads that were clicked with the corresponding session data from the website. This gives you a comprehensive view into how an ad click translated into different (hopefully meaningful) interactions on the website itself.
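Here is a minimal sketch of such a transformation, using Python’s built-in sqlite3 module as a stand-in for a data warehouse. The table names, column names, and sample rows are all invented for illustration; in practice the click identifier column would be whatever your ad platform and analytics tool actually record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ad_clicks (click_id TEXT PRIMARY KEY, campaign TEXT, cost REAL);
    CREATE TABLE web_sessions (
        session_id TEXT PRIMARY KEY, click_id TEXT, page_views INTEGER, purchased INTEGER
    );
    INSERT INTO ad_clicks VALUES
        ('c1', 'spring_sale', 0.40), ('c2', 'spring_sale', 0.55), ('c3', 'brand', 0.20);
    INSERT INTO web_sessions VALUES
        ('s1', 'c1', 5, 1), ('s2', 'c3', 1, 0), ('s3', NULL, 2, 0);
""")

# The transformation: join the two sources on the shared click identifier and
# store the result as a new table. LEFT JOIN keeps ad clicks that never produced
# a recorded session (c2); sessions without an ad click (s3) are left out.
conn.execute("""
    CREATE TABLE ad_sessions AS
    SELECT a.click_id, a.campaign, s.session_id, s.page_views, s.purchased
    FROM ad_clicks AS a
    LEFT JOIN web_sessions AS s ON s.click_id = a.click_id
""")

rows = conn.execute("SELECT * FROM ad_sessions ORDER BY click_id").fetchall()
for row in rows:
    print(row)
```

The choice of a left join is a design decision: it also surfaces the clicks you paid for that never turned into a measured session, which is often as interesting as the ones that did.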

What any analyst ultimately wants is a way to understand the journey of a key data point (such as a user) across different systems and data sources that contain traces of that data point.

Being able to quickly find answers to questions that span different data sources and services is one of the goals of a data-informed organization.

As a technical marketer, you might be tasked with understanding and sometimes even building the capabilities to help answer these questions. The more you can liaise between the “askers” and the “builders”, the more vital your role in the data organization will be.

Key takeaway #1: Data infrastructure is the structural system of the pipeline

Data infrastructure determines the applications, servers, virtual machines, disk space, firewalls, encryption methods, and maintenance processes you need for the data pipeline. It’s the physical setup of the pipeline, but it overlaps with data architecture: decisions about the storage data model, for example, can directly affect what type of hardware to use.

Key takeaway #2: Data architecture

Data architecture is concerned with how data is actually structured, ordered, and stored at different parts of the pipeline. It sets parameters for data quality, establishes governance policies for the data, and defines how data flows between different services. Data architecture is the glue that binds the collected data to the data infrastructure in the most effective way.

Key takeaway #3: Databases, data warehouses, and data lakes

While there are other storage options, too, databases, data warehouses, and data lakes are the ones you’ll most often come across when working with analytics systems. Databases are typically single-purpose storage systems for transactional data that needs to be frequently read and written. Data warehouses combine data from different sources, transforming it at storage time to match a specific schema. Data lakes also combine data from different sources, but the data is stored in its native, source format. Schemas are applied either at query time or via abstractions that run on top of the data lake itself.
