2. Data engineering

If you use a tool like Google Analytics 4, chances are that you can get all of your data-related work done within its user interface.

However, every tool has its own idiosyncratic approach to organizing the data you send to it (schemas were discussed in the previous Topic). What if this approach doesn’t match what your organization requires? What if you want to join the data produced by this third-party service with other data your organization generates and processes?

Whether it’s keeping the data in a data warehouse for storage or infrequent exploration, forwarding it to enrichment and validation pipelines, joining it with other data sources, or just taking full control of the data with proprietary mechanisms, these types of data infrastructures are commonplace and fairly inexpensive to build and maintain.

As a technical marketer, the terminology and concepts you’ll learn about in this Topic are fundamental to any data-oriented process within an organization. Particularly in digital marketing, you’ll frequently end up working with multiple data sources. These data sources need to converge somehow to give you a comprehensive understanding of data efforts in your organization.

Data infrastructure

Data infrastructure refers to the hardware (virtual or physical) that forms the backbone of the organization’s data practices. It also comprises the software that runs on this hardware and provides the technological framework for handling the data flows.

Example

A data engineer might be tasked with building a system for pulling data from Google Analytics 4 and patching it into the organization’s data pipelines to join it with other data sources. This is a common use case when you want to see, for example, how user interactions with your Google Ads campaign correlate with other activities on your website measured with GA4.

Think of data infrastructure, then, as the actual nuts and bolts of the pipeline itself. It determines the applications you’ll need to deploy, the virtual machines you’ll need to rent, the disk space you’ll need to reserve for storage, the firewalls and encryption you’ll need to establish for networking, and the maintenance processes you’ll need to observe.

Deep Dive

Components of a Google Analytics 4 data pipeline

With your Google Analytics 4 data, you could build a simple pipeline in a cloud platform with these components.

Building a scheduler mechanism for pulling data from Google Analytics 4 periodically (daily, for example).
Utilizing virtual computation resources for transforming this data so that it no longer uses the Google Analytics 4 schema but is instead flattened or turned into a set of relational data tables to match what the organization uses elsewhere.
Setting up storage systems at each juncture. Data might need to be stored at different parts of the pipeline for different purposes.
Ensuring that networking resources are utilized to allow the data to be transported from one system to the next securely and efficiently.
Configuring integration platforms that make the data available in real time (row-by-row as the data comes in) and in batches (periodic dumps of the full data set).
Continuously engineering the system for efficiency, availability, security, and latency.
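
To make the transformation step more concrete, here is a minimal sketch of flattening one nested event into a single flat row. It assumes the event looks roughly like a row in the Google Analytics 4 BigQuery export, where event_params is a list of key/value structs with typed value fields. The function name flatten_event, the param_ column prefix, and the sample values are made up for illustration.

```python
from typing import Any


def flatten_event(event: dict[str, Any]) -> dict[str, Any]:
    """Turn one nested GA4-style event into a single flat row."""
    row = {
        "event_date": event["event_date"],
        "event_name": event["event_name"],
        "user_pseudo_id": event["user_pseudo_id"],
    }
    for param in event.get("event_params", []):
        # Each parameter value is a struct where only one typed field is set.
        value = next((v for v in param["value"].values() if v is not None), None)
        # Hypothetical convention: prefix flattened parameters with "param_".
        row[f"param_{param['key']}"] = value
    return row


# Illustrative sample data, not taken from a real property.
sample_event = {
    "event_date": "20240101",
    "event_name": "page_view",
    "user_pseudo_id": "123.456",
    "event_params": [
        {"key": "page_location", "value": {"string_value": "https://example.com/", "int_value": None}},
        {"key": "ga_session_id", "value": {"string_value": None, "int_value": 1704067200}},
    ],
}

flat_row = flatten_event(sample_event)
print(flat_row)
```

In a real pipeline this kind of logic would more likely live in SQL inside the data warehouse, but the idea is the same: nested, tool-specific structures are turned into plain columns that match what the organization uses elsewhere.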

There are lots of different tools for all these different parts of the pipeline, and all the major cloud services offer proprietary solutions for these, too.

A relatively easy way to get started is to use the automatic export of Google Analytics 4 data to Google BigQuery with a simple data management system like Dataform to help build and develop transformations.

Data architecture

While data infrastructure is more concerned with how the physical environments for data services are built, data architecture is about the high-level structuring of data to meet the needs of the organization.

Example

Whenever data is stored, it needs to be stored in some format. Determining the naming conventions for files and how they are logically ordered in a file system is an important part of data architecture.

Similarly, defining the data model for the data in terms of what columns, keys, and value types the data should be aligned with is an architectural effort that requires quite a bit of planning.
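
As a rough illustration of what a data model and a naming convention can look like in practice, the sketch below checks incoming records against an agreed set of columns and value types, and flags column names that are not in snake_case. The model itself (PURCHASE_MODEL), the column names, and the validate_record helper are all hypothetical.

```python
import re

# Hypothetical data model: column name -> expected Python type.
PURCHASE_MODEL = {
    "user_id": str,
    "order_id": str,
    "order_value": float,
    "purchase_date": str,
}

# Naming convention assumed here: lowercase snake_case column names.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def validate_record(record: dict, model: dict) -> list[str]:
    """Return a list of problems; an empty list means the record fits the model."""
    problems = []
    for column in record:
        if not SNAKE_CASE.match(column):
            problems.append(f"column '{column}' is not snake_case")
        if column not in model:
            problems.append(f"column '{column}' is not in the data model")
    for column, expected_type in model.items():
        if column not in record:
            problems.append(f"missing column '{column}'")
        elif not isinstance(record[column], expected_type):
            problems.append(f"column '{column}' should be {expected_type.__name__}")
    return problems


good = {"user_id": "u1", "order_id": "o1", "order_value": 49.9, "purchase_date": "2024-01-01"}
bad = {"userId": "u1", "order_id": "o1", "order_value": "49.9", "purchase_date": "2024-01-01"}

print(validate_record(good, PURCHASE_MODEL))
print(validate_record(bad, PURCHASE_MODEL))
```

The point is not the specific checks but that the model is written down somewhere explicit, so that every part of the pipeline can be tested against the same definition.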

Data architecture is thus concerned with how the data is structured and ordered at different parts of the pipeline. It defines the quality of the data as it passes through different junctions, it outlines the governance conditions for the data, it establishes how the data flows between different services, and it overlaps with data infrastructure in terms of making sure the data is stored and processed in optimal and scalable ways.

Together, data infrastructure and architecture provide a comprehensive outline of an organization’s data ecosystem.

Deep Dive

Data architecture components

There’s a lot of overlap between data infrastructure and data architecture. The goal of both is to establish how data flows within an organization in efficient, secure, compliant, and meaningful ways.

Data architecture goes beyond the physical attributes of the data pipeline and is more concerned with data models, schemas, and logical mapping of how the data flows between pipeline components.

When data is stored at any given time, it needs to be stored in some format. This means that there needs to be a data model in place that establishes the columns, keys, and values that are populated with the raw data.
Files need to be named and stored in a structured fashion. Naming conventions and file organization patterns are a classic and never-ending debate.
When data is transported from one part of the organization to another, the data flows must be mapped. How does data move from collection to storage? How and when is it processed and transformed? Where can it be consumed? These decisions need to be mapped so that the data flows themselves can be traceable.
Different conditions for data governance and quality apply to different parts of the pipeline. An important aspect of data architecture is to make sure that schemas and data models are kept up-to-date, and that data lineage can be traced from the end product all the way back to its origin.
Data architecture and infrastructure engineering overlap when it comes to optimizing and scaling the system. The data might need to be partitioned, indexed, and cached to avoid placing undue stress on computation and storage resources. The system needs to be flexible so that it can handle unexpected surges, and so that it can scale down when computation is not in high demand.
Security and compliance are increasingly important aspects of data architecture design. Access controls, encryption policies, and security and protection measures are all critical to avoid data breaches and security incidents. Furthermore, data storage and usage must comply with regulations and legal standards.
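
File organization and partitioning can be sketched with very little code. The example below assumes a made-up layout where raw files are stored under a folder per source, per table, and per ingestion date, so that a query for one day only needs to touch one partition. The folder names, the ingestion_date= prefix, and both helper functions are illustrative assumptions, not a standard you must follow.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_path(source: str, table: str, day: date) -> PurePosixPath:
    """Build a date-partitioned storage path for one day's raw file."""
    partition = f"ingestion_date={day.isoformat()}"
    return PurePosixPath("raw") / source / table / partition / "part-000.json"


def partitions_to_scan(paths: list[PurePosixPath], wanted: date) -> list[PurePosixPath]:
    """Keep only the files that sit in the partition for the wanted day."""
    marker = f"ingestion_date={wanted.isoformat()}"
    return [p for p in paths if marker in p.parts]


all_paths = [partition_path("ga4", "events", date(2024, 1, d)) for d in (1, 2, 3)]
selected = partitions_to_scan(all_paths, date(2024, 1, 2))
print([str(p) for p in selected])
```

Because the date is encoded in the path, the two other days are skipped entirely. Data warehouses and query engines apply the same principle to avoid reading data that a query doesn’t need.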

Don’t miss this fact!

Where data infrastructure is concerned with the components, physical structures, and applications of the pipeline, data architecture focuses on the strategic design and organization of data across this infrastructure. Both are equally essential in a data-informed organization.

Databases, data warehouses, and data lakes

While they sound like they are just different names for the same thing, databases, data warehouses, and data lakes each serve a unique purpose in a modern data ecosystem.

A database is generally a structured collection of data that can be easily accessed, managed, and updated.

A typical database would serve a singular purpose. It could store the financial records of the company, or the company’s customer data, or the access logs collected by the company’s web servers.

A data warehouse is a large, centralized repository of data. It consolidates data from different sources (such as different databases), and it’s designed with denormalized structures to optimize for query speed and analytical processing.
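
Denormalization is easier to grasp with a small example. The sketch below uses Python’s built-in sqlite3 module as a stand-in for a data warehouse. It starts with two normalized tables (users and purchases) and builds a wide table where the user’s country is copied onto every purchase row. The table names, columns, and values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id TEXT PRIMARY KEY, country TEXT);
    CREATE TABLE purchases (order_id TEXT PRIMARY KEY, user_id TEXT, order_value REAL);
    INSERT INTO users VALUES ('u1', 'FI'), ('u2', 'SE');
    INSERT INTO purchases VALUES ('o1', 'u1', 20.0), ('o2', 'u1', 35.0), ('o3', 'u2', 10.0);

    -- Denormalized table: user attributes are copied onto every purchase row.
    CREATE TABLE purchases_wide AS
    SELECT p.order_id, p.order_value, u.user_id, u.country
    FROM purchases AS p
    JOIN users AS u ON u.user_id = p.user_id;
""")

# Analytical query that no longer needs a join at read time.
revenue_by_country = conn.execute(
    "SELECT country, SUM(order_value) FROM purchases_wide GROUP BY country ORDER BY country"
).fetchall()
print(revenue_by_country)
```

The country value for user u1 is now stored twice, which is redundant, but the analytical query can read from a single table. That trade-off is what optimizing for read-heavy workloads means here.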

Example

If you wanted to join Google Analytics 4 and Google Ads data together, you could store them in the same data warehouse. You’d build schemas and data models that allow you to query these two different data sets, using the user or session or ad campaign as the key components of the queries.

A data lake is a storage repository that hosts vast amounts of “raw” data in its native format. This data remains untouched until it’s needed. Unlike with a data warehouse, there wouldn’t be a schema applied to the data when it’s stored. Instead, the schema is applied at query time.

Data lakes can also store data in any format, whether structured or unstructured.
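
Applying the schema at query time can be sketched like this. The raw lines below stand in for files in a data lake: they come from different sources and have different shapes, and nothing is validated when they are stored. Only when you read the data do you decide which fields you care about and what types they should have. The field names and the read_purchases helper are hypothetical.

```python
import json

# Raw records exactly as they arrived: different sources, different shapes.
raw_lines = [
    '{"source": "ga4", "event_name": "purchase", "value": "49.90", "user": "u1"}',
    '{"source": "crm", "customer": "u1", "lifetime_value": 300}',
    '{"source": "ga4", "event_name": "page_view", "user": "u2"}',
]


def read_purchases(lines):
    """Apply a purchase schema at read time; the stored lines are not modified."""
    for line in lines:
        record = json.loads(line)
        if record.get("source") == "ga4" and record.get("event_name") == "purchase":
            # The schema (column names and types) is imposed here, not at storage time.
            yield {"user_id": str(record["user"]), "order_value": float(record["value"])}


purchases = list(read_purchases(raw_lines))
print(purchases)
```

A different question, for example one about customer lifetime value, could read the very same raw lines with a completely different schema.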

Example

Modern hybrid approaches such as Open Table Formats (e.g. Apache Iceberg and Delta Lake) introduce abstractions on top of the data lake to make the process more similar to that of a data warehouse. In these “transactional” data lakes, the data is modelled and queried as with a data warehouse but with the storage cost benefits of a data lake.

A company might use a data lake to dump financial data together with behavior data, customer data, and advertising data. The storage itself wouldn’t place restrictions on how the data can be queried, which means that it allows for a more ad hoc approach than a data warehouse.

Ready for a quick break?

Take a break and imagine you are sitting on the shore of a calm and tranquil (data) lake. Remember that there’s more to life than figuring out the appropriate structure for data services in your organization.

Here’s a very rough overview of the three systems. These terms are often used interchangeably, and there’s a lot of overlap between the use cases in particular.

Feature	Database	Data warehouse	Data lake
Primary purpose	Manage transactional data for a single purpose	Support complex queries and analysis across different data sources	Store large amounts of raw data
Data type	Structured	Structured (typically)	Can be structured and/or unstructured
Schema	Defined when the database is created	Defined when the data warehouse is created	Defined when the data is being queried
Storage	Optimized for frequent updating of records (transactional)	Optimized for complex analytics queries	Data is stored “raw” in its native format
Query performance	High for transactional queries	High for analytical queries	Varies by use case, and what data / tools are needed
Typical user	Database administrator, developer, financial controller	Digital marketers, business analysts, data scientists	Business intelligence analysts, data scientists, data engineers

Each type of structure has its uses. A modern data-informed organization would typically rely on databases and data warehouses to serve both transactional and analytical needs. However, when the organization needs to process vast quantities of data for big data processing, analytics, and complicated machine learning use cases, a data lake might become relevant.

Integrations and transformations

All the data wrangling, infrastructure planning, and architecture design amount to nothing if the data isn’t actually used for anything.

In digital marketing, for example, companies have a multitude of different data sources to contend with. These include sources like:

Web analytics data (Google Analytics, Adobe Analytics, etc.)
Ad platforms (Google Ads, Facebook Ads, LinkedIn Ads, etc.)
CRM systems (Salesforce, HubSpot, Microsoft Dynamics, etc.)
Email marketing (Mailchimp, ConvertKit, SendGrid, etc.)
Social media (Instagram, Pinterest, TikTok, etc.)

Even though these represent different channels, the user who generates the data is often the same person. Thus, a robust data pipeline should support integrations where the data it collects can be appropriately transformed to produce as comprehensive a view into all the user’s touchpoints as possible.

The pressure this places on data engineering is immense. Not only does the pipeline need to be able to pull in data from vastly different types of data sources, it also needs to handle transformations to make sure that the different schemas can be observed together as different facets of the user’s interactions with the company brand.

From the vantage of digital marketing, integrations make data from different marketing touchpoints accessible in a centralized location. Transformations, on the other hand, keep the data clean, meaningful, and structured in a way that makes it usable in analysis and reporting.

Example

You have your web analytics data in one table, and you have ad click data in another table. While they come from two completely separate systems, they have one thing in common: the ad click identifier, which is stored in the ad data table when the ad is clicked, and in the web analytics data table when the user lands on the site.

With these two tables, you could run a transformation, which joins this information together into a new table. The new table could align the ads that were clicked with the corresponding session data from the website. This gives you a comprehensive view into how an ad click translated into different (hopefully meaningful) interactions on the website itself.
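
The join described above could look something like the following sketch, again using Python’s built-in sqlite3 module in place of a real data warehouse. The table names, the click_id column, and the sample rows are assumptions made for illustration. Real ad platforms and analytics tools use their own column names and identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ad_clicks (click_id TEXT, campaign TEXT);
    CREATE TABLE web_sessions (session_id TEXT, click_id TEXT, pageviews INTEGER);
    INSERT INTO ad_clicks VALUES ('abc', 'spring_sale'), ('def', 'brand'), ('ghi', 'brand');
    INSERT INTO web_sessions VALUES ('s1', 'abc', 5), ('s2', 'def', 1), ('s3', NULL, 3);

    -- Transformation: align every ad click with the session it led to, if any.
    CREATE TABLE ad_click_sessions AS
    SELECT a.click_id, a.campaign, s.session_id, s.pageviews
    FROM ad_clicks AS a
    LEFT JOIN web_sessions AS s ON s.click_id = a.click_id;
""")

joined = conn.execute(
    "SELECT click_id, campaign, session_id, pageviews FROM ad_click_sessions ORDER BY click_id"
).fetchall()
print(joined)
```

A LEFT JOIN is used here so that ad clicks without a matching session stay in the new table with empty session columns. Those rows are often interesting in their own right, because they show clicks that never turned into a measured visit.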

What any analyst ultimately wants is a way to understand the journey of a key data point (such as a user) across different systems and data sources that contain traces of that data point.

Being able to quickly find answers to questions that span different data sources and services is one of the goals of a data-informed organization.

As a technical marketer, you might be tasked with understanding and sometimes even building the capabilities to help answer these questions. The more you can liaise between the “askers” and the “builders”, the more vital your role in the data organization will be.

Key takeaway #1: Data infrastructure is the structural system of the pipeline

Data infrastructure determines the applications, servers, virtual machines, disk space, firewalls, encryption methods, and maintenance processes you need for the data pipeline. It’s the physical setup of the pipeline, but it does overlap with data architecture, since decisions about the storage data model can directly affect what type of hardware to use, for example.

Key takeaway #2: Data architecture

Data architecture is concerned with how data is actually structured, ordered, and stored at different parts of the pipeline. It sets parameters for data quality, establishes governance policies for the data, and defines how data flows between different services. Data architecture is the glue that binds the collected data to the data infrastructure in optimal ways.

Key takeaway #3: Databases, data warehouses, and data lakes

While there are other storage options, too, databases, data warehouses, and data lakes are the ones you’ll most often come across when working with analytics systems. Databases are typically single-purpose storage systems for transactional data that needs to be frequently read and written. Data warehouses combine data from different sources, transforming it at storage time to match a specific schema. Data lakes also combine data from different sources, but the data is stored in its native, source format. Schemas are applied either at query time or via abstractions that run on top of the data lake itself.
