
1. How does an analytics system work?

An analytics system is a combination of different modules that work seamlessly together. Each module participates in the effort to deliver data that can be used for measurement planning and insight generation.

For many people who work with data on a daily basis, generating and viewing reports are their main data-related activities.

There’s nothing as wholesome as building a nice-looking report with labels and tips on how best to interpret the data, and with toggles and selectors for further refining the queries.

But how did the data in those reports come to be? Who compiled it? Can it be trusted? Was it collected fairly? Is it representative?

It’s no great exaggeration to say that most people don’t consider these things. They never stop to think that a huge, complicated system of interconnected parts is generating the data in those reports.

The data needs to be collected, processed, and stored before it can be made available to other processes in the organization.

As someone involved in technical marketing, you don’t need to know every detail of this system. But understanding the lifecycle of a single data point – from collection to activation – is crucial.

Decisions made at every milestone of this lifecycle have cascading consequences on all the subsequent steps and, due to the cyclical nature of many data-related processes, often on the preceding steps, too.

In the following sections, we’ll take a look at how different parts of a prototypical analytics system introduce decisions and friction points that might have a huge impact on the interpretations that can be drawn from the data.

Our exploration begins with the tracker, a piece of software designed to capture event data from the visitor’s browser or app. The tracker compiles this information and dispatches it to the collector, which is the entry point of the analytics server. The collector hands the validated data to a processor, where the information is aligned with a schema, and finally the data is stored for access via queries, integrations, and APIs.
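
To make that flow concrete, here is a minimal sketch of the four stages as plain TypeScript functions. Every name, type, and parameter below is invented for illustration; no vendor’s pipeline looks exactly like this.

```typescript
// Illustrative types only; real systems use far richer structures.
type RawHit = { url: string };                            // what the tracker dispatches
type ValidatedHit = RawHit & { receivedAt: number };      // what the collector forwards
type ProcessedEvent = { name: string; params: Record<string, string> };

// The collector timestamps and accepts the hit.
const collect = (hit: RawHit): ValidatedHit => ({ ...hit, receivedAt: Date.now() });

// The processor aligns the hit with a (very small) schema.
const alignWithSchema = (hit: ValidatedHit): ProcessedEvent => {
  const params = new URL(hit.url).searchParams;
  return { name: params.get("en") ?? "unknown", params: Object.fromEntries(params) };
};

// Storage is just a stand-in for a warehouse insert here.
const store = (event: ProcessedEvent): void => {
  console.log("writing to storage:", event);
};

// A single data point travelling through the pipeline:
store(alignWithSchema(collect({ url: "https://collector.example.com/g?en=page_view&dl=https%3A%2F%2Fshop.example" })));
```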

Deep Dive

Directed vs. cyclical systems

While it’s customary to describe an analytics system as a directed graph, where data flows from one component to the next, the reality is often more complex.

For example, if the collector detects data quality issues, it might not forward the hit to the next component (processing), but instead store it in a temporary repository where the event waits to be “fixed”. After fixing, it can be fed back to the original pipeline.

Similarly, if at any point of the pipeline the hit is qualified as spam or otherwise unwanted traffic, this information can be returned to the tracker software so that hits that match the signature will no longer be collected at all.

Analytics systems tend to have feedback loops where decisions made at the end of the pipeline can inform processes at the beginning.

The tracker runs in the client

The purpose of the tracker is to offer a software interface for collecting event data and dispatching it to the collector.

Typically, the tracker would be a JavaScript library (web analytics) or an SDK (app analytics) that is downloaded and installed from the vendor’s server into the user’s browser or app. 

The tracker adds listeners to the web page or the app, designed to collect information autonomously from the user and dispatch it to the collector server.

Example

When you navigate to a web page, the tracker activates and detects a page_view event from you. It collects metadata such as the page address, your browser and device details, the marketing campaign that brought you to the page, and even your geographic location, before dispatching the event to the vendor.

When you then scroll down the page a little to read the content below the fold, the tracker detects a scroll event, and this, too, is dispatched to the collector server. Additional metadata would include how deep you scrolled (as a percentage or in pixels) and perhaps how long you’ve spent on the page.

Once the tracker has detected an event and gathered all the data it needs, it is ready to dispatch the event to the vendor.

This is usually an HTTP network request to the collector server, with the data payload included either in the request URL itself or in the request body.
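
A bare-bones version of such a tracker might look something like the sketch below. The collector endpoint and parameter names are invented for the example; navigator.sendBeacon is a standard browser API for sending small payloads without blocking navigation.

```typescript
// Minimal browser tracker sketch. The endpoint and parameter names are hypothetical.
const COLLECTOR_URL = "https://collector.example.com/collect";

function dispatch(eventName: string, params: Record<string, string>): void {
  const payload = new URLSearchParams({
    en: eventName,              // event name
    dl: location.href,          // page address
    ua: navigator.userAgent,    // browser and device details
    ts: Date.now().toString(),  // event timestamp
    ...params,
  });
  // Send the payload in the request body; it could equally be appended to the URL.
  navigator.sendBeacon(COLLECTOR_URL, payload);
}

// page_view as soon as the tracker loads
dispatch("page_view", { dr: document.referrer });

// scroll event with scroll depth as a percentage
document.addEventListener(
  "scroll",
  () => {
    const depth = Math.round(
      ((window.scrollY + window.innerHeight) / document.documentElement.scrollHeight) * 100
    );
    dispatch("scroll", { sd: String(depth) });
  },
  { once: true } // report only the first scroll to keep the example small
);
```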

Deep Dive

Autonomous trackers

Because trackers are often designed by the analytics system vendors and because they have autonomous access to their execution environment (the web page or the app), they tend to collect more information than is necessary.

Typically, trackers collect information such as:

  1. Unique identifiers that distinguish users from others
  2. Details about the device, browser, and operating system of the user
  3. Information about the marketing campaign that initiated the current visit
  4. Metadata about the page or screen the user is currently viewing
  5. Additional metadata about the user, often stored in databases accessible with cookie tokens
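
As a rough illustration, a single hit’s payload might carry fields along these lines. The shape and field names below are invented; every vendor uses its own naming and encoding.

```typescript
// Hypothetical shape of a single tracked hit; real field names vary per vendor.
interface TrackedHit {
  clientId: string;                         // 1. unique identifier that distinguishes the user
  device: {                                 // 2. device, browser, and operating system details
    userAgent: string;
    screenResolution?: string;
  };
  campaign?: {                              // 3. marketing campaign that initiated the visit
    source: string;                         //    e.g. "newsletter"
    medium: string;                         //    e.g. "email"
    name?: string;
  };
  page: {                                   // 4. metadata about the page or screen being viewed
    url: string;
    title: string;
    referrer?: string;
  };
  userProperties?: Record<string, string>;  // 5. additional user metadata, often joined via cookie tokens
}

const exampleHit: TrackedHit = {
  clientId: "1660912331.1694157514",
  device: { userAgent: "Mozilla/5.0 (hypothetical)" },
  campaign: { source: "newsletter", medium: "email" },
  page: { url: "https://shop.example/", title: "Home" },
  userProperties: { status: "loyal" },
};
```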

Naturally, vendors want as much data as they can collect. But it might surprise their customers just how much detail can be collected using browser and mobile app technologies.

The most dangerous trackers, from a privacy and data protection point of view, are those that collect a log of everything the user does on a page and that scrape sensitive data (such as email addresses) from the page automatically. These practices run afoul of many data protection laws.

The collector validates and pre-processes

In general terms, the collector is a server. It’s designed to collect the network requests from trackers, and it’s often the entry point into a larger server-side data pipeline for handling that data.

What the collector does immediately after receiving the network request really depends on the analytics system and on what role validation and pre-processing should have in the overall pipeline.

In some cases, the collector can be very simple. It writes all incoming requests to log storage, from where analytics tools can then parse the information.

But this would not be a very efficient analytics system. It would just be a glorified logging tool, and analysts working in the organization would not be very grateful for having to parse thousands of log entries just to glean some insights from the dataset.

Instead, collectors typically do initial distribution, pre-processing, and validation of the data based on the request footprint alone. For example:

  1. A request that has ad signals in the URL, for example click identifiers or campaign parameters, could be forwarded to a different processing unit than one that doesn’t.
  2. A request that originates in the European Economic Area could be forwarded to a more GDPR-friendly pipeline than a request that originated elsewhere.
  3. A request that seems to come from a virtual machine environment could be flagged as spam, as it could have been generated by automated bot software.
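
A minimal sketch of this kind of footprint-based routing could look like the following. The signals and pipeline names are invented for illustration; a real collector would combine many more signals and make these decisions far more carefully.

```typescript
// Hypothetical request footprint, as seen by the collector.
interface IncomingHit {
  url: string;                // full request URL, including query parameters
  countryCode: string;        // e.g. resolved from the IP address at the network edge
  fromDatacenterIp: boolean;  // crude bot signal (datacenter or VM IP range); real systems use many more
}

type Pipeline = "ads" | "gdpr" | "spam" | "default";

// A tiny, incomplete stand-in for the real list of EEA countries.
const EEA_COUNTRIES = new Set(["DE", "FR", "FI", "SE", "NL"]);

// Route the hit based on its footprint alone, before any heavy processing.
function routeHit(hit: IncomingHit): Pipeline {
  const params = new URL(hit.url).searchParams;
  if (hit.fromDatacenterIp) return "spam";                              // likely automated traffic
  if (params.has("gclid") || params.has("utm_campaign")) return "ads";  // ad signals in the URL
  if (EEA_COUNTRIES.has(hit.countryCode)) return "gdpr";                // more GDPR-friendly pipeline
  return "default";
}

// Example: an ad click from Germany is routed to the "ads" pipeline.
console.log(routeHit({
  url: "https://collector.example.com/collect?en=page_view&gclid=abc123",
  countryCode: "DE",
  fromDatacenterIp: false,
}));
```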

While it’s not wise to have the collector perform too much pre-processing, as that introduces unnecessary latency to the pipeline, measures such as mitigating spam traffic and complying with regional data protection regulations might need to be introduced already at this stage, and sometimes even earlier, in the tracker software itself.

Once the collector is satisfied that the request has been adequately validated, the request is forwarded to processing.

Ready for a quick break?

Now’s a good time to take a small break – walk around for 5 minutes and have a glass of water. 😊

The processor associates the data with a schema

The processing of the data is a complicated activity.

An analytics system makes use of schemas to align the data received from the collector into the format required by its different parts.

The schema is essentially a blueprint that determines the structure and utility of the collected data.

At processing time, data enrichment can also happen. For example, the user’s IP address could be used to add geolocation signals to the data, the user properties sent with the hit can be scoped to all hits from the user, and monetary values could be automatically converted between currencies.
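
As a sketch of the first and last of those enrichments, a processing step might look roughly like this. The lookup helpers are stubs invented for the example; in a real pipeline they would query a geolocation database and a currency rate service.

```typescript
// Hypothetical event shape; field names are invented for the example.
interface ProcessedEvent {
  name: string;
  ipAddress: string;
  currency: string;
  value?: number;
  geo?: { country: string; city: string };
}

// Stub lookups: a real processor would query a geolocation database and a rates service.
const lookupGeo = (ipAddress: string) => ({ country: "FI", city: "Helsinki" });
const toEuros = (value: number, currency: string) => (currency === "USD" ? value * 0.93 : value);

// Enrich the event with geolocation signals and a normalized currency.
function enrich(event: ProcessedEvent): ProcessedEvent {
  return {
    ...event,
    geo: lookupGeo(event.ipAddress),
    value: event.value !== undefined ? toEuros(event.value, event.currency) : undefined,
    currency: "EUR",
  };
}

console.log(enrich({ name: "purchase", ipAddress: "203.0.113.7", currency: "USD", value: 100 }));
```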

Data processing is arguably the most significant part of any analytics system, and it’s what causes the biggest differentiation between different analytics products.

When a vendor decides upon a schema, they are making a decision for all their users on how to interpret the data. Even though many analytics systems offer ways of modifying these schemas or exporting “raw” data where the schema is only minimally applied, most tool users will likely rely on the default settings, which might color their analyses a great deal.

For this reason, if you want to use an analytics system efficiently, you need to familiarize yourself with the various schemas it uses.

Deep Dive

The schema establishes the grammar of the data pipeline

The schema issues instructions for translating the events and hits into the semantic structure required by the overall data pipeline. The schema is not just descriptive – it can also include validation criteria for whether fields are required or optional, and what types of values they should have.

Often, the schema instructs how the data should be stored. In data warehouses, for example, the schema is applied at storage time. Thus the schema dictates how the data can ultimately be queried.

For example, let’s imagine this simple data payload in the request URL:

&cid=12345&et=1694157514167&up.status=loyal

Here we have a payload of three key-value pairs. The schema could, then, instruct the following:

Parameter name | Value                 | Column name             | Required
cid            | Numeric               | client_id               | True
et             | Timestamp (Unix time) | event_timestamp         | False
up.status      | Alphanumeric          | user_properties.status  | False
...

If the processor finds a mismatch against the schema, for example a missing cid parameter or an et parameter that is not a valid timestamp, it could flag the event as malformed.
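
Below is a sketch of how a processor might apply this schema to the example payload. The column names and validation rules mirror the table above; everything else is invented for illustration.

```typescript
// One schema entry per incoming parameter, mirroring the table above.
interface FieldSchema {
  param: string;                          // parameter name in the payload
  column: string;                         // column name in storage
  required: boolean;
  validate: (value: string) => boolean;
}

const SCHEMA: FieldSchema[] = [
  { param: "cid", column: "client_id", required: true, validate: (v) => /^\d+$/.test(v) },
  { param: "et", column: "event_timestamp", required: false, validate: (v) => Number.isFinite(Number(v)) && Number(v) > 0 },
  { param: "up.status", column: "user_properties.status", required: false, validate: (v) => /^[a-z0-9]+$/i.test(v) },
];

// Returns the aligned row, or a list of problems if the hit is malformed.
function applySchema(payload: string): { row?: Record<string, string>; errors: string[] } {
  const params = new URLSearchParams(payload);
  const row: Record<string, string> = {};
  const errors: string[] = [];

  for (const field of SCHEMA) {
    const value = params.get(field.param);
    if (value === null) {
      if (field.required) errors.push(`missing required parameter: ${field.param}`);
      continue;
    }
    if (!field.validate(value)) errors.push(`invalid value for ${field.param}: ${value}`);
    else row[field.column] = value;
  }
  return errors.length ? { errors } : { row, errors };
}

console.log(applySchema("&cid=12345&et=1694157514167&up.status=loyal"));
// { row: { client_id: '12345', event_timestamp: '1694157514167', 'user_properties.status': 'loyal' }, errors: [] }
```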

Storage makes the data available for queries

Storage sounds simple enough. It’s a data warehouse, where the logged and processed event data is available for queries.

Example

When you open your analytics tool of choice and look at a report, the data in that report is stored somewhere with the specific purpose of populating the report you are looking at.

If you then open a real-time report that shows you data as it’s collected by the vendor server, the data in that report most likely comes from a different storage location, designed to make the data available for the type of streaming access that real-time reports necessitate.

Storage is more than just dumping files on a drive somewhere in the cloud. Analytics systems need to take extensive precautions to appropriately encrypt and protect storage so that data ownership and governance clauses are respected. The data needs to be protected against data breaches, too, to avoid leaking potentially sensitive information to malicious attackers.

Additionally, data protection regulations might dictate conditions for how and where the data needs to be stored, and how it can be accessed in order to comply with various types of data subject access requests, for example when a user wants all the personal data the analytics system has collected about them to be deleted.

Finally, storage is about utility. The data needs to be stored for some purpose. An important purpose in the data and analytics world is reporting and analysis.

When you use the reporting interface of an analytics tool, or when you use a connector with Google Looker Studio or Tableau, you’re actually looking at the end product of a storage query. 

These connected systems utilize integrations and APIs that allow them to pull data from storage for displaying in a visually appealing format.

Deep Dive

Different storage models for different purposes

An analytics system could distribute data to storage in many different ways, depending on what types of interfaces it offers for accessing the data. For example:

  • HTTP log storage from the collector for ingestion-time analysis
  • Pre-processed data for real-time analysis
  • Short-term storage for daily data (small query size)
  • Long-term storage for historical data (larger query size)
  • Interim storage for hits that need extra processing before they are stored, for example hits that didn’t pass initial validation tests

Even though storage is usually inexpensive, querying that storage isn’t. That’s why large analytics systems typically optimize storage access by using machine learning to return a representative sample of the data rather than the full query result.

By paying a bit more, you can have access to the unsampled datasets, with the caveat that compiling the reports for this data will take longer than when working with samples.

Reports and integrations put the data to use

For many end users, reporting is the main use case for an analytics system.

Some analytics tools offer a built-in reporting suite, which has privileged access to storage for displaying the data in predetermined ways.

However, these days it’s very common to use tools specifically designed for reporting, such as Google Looker Studio, Tableau, and Power BI, together with the data in the analytics system.

These external reporting tools need to be authenticated for access to the storage in the analytics system. They then compile queries against this storage and display the results in graphs, charts, and other reporting user interface components.

To save on costs and computation, the analytics system might not make the entire storage capacity available for queries, instead exposing just a layer for these integrations.

For this reason, analytics systems might also offer a way to write the raw data directly into an external data warehouse so that users can build their own query systems without having to worry about the limitations and restrictions of proprietary storage access.

Broadly speaking, the three types of access described in this section could be categorized as follows:

  1. Reports in the analytics system itself are useful for quick ad hoc analysis with access to the full, processed data set. These reports might be subject to limitations and restrictions for data access, and the types of reports that can be built are predetermined by the analytics system itself.
  2. Reports in an external tool allow you to choose how to build the visualizations yourself. This is useful if you want to use software that your organization is already familiar with, and if you want to keep access to these reports separate from access to the analytics system.
  3. Reports against an external data warehouse are the most powerful way of taking control over reporting. This approach also requires the most work and know-how, because to query the data in the data warehouse you need to be familiar with both the schema of the analytics system and the schema of your own data store. It is often also the most costly option, because you will need to cover the costs of storage and querying for the data warehouse yourself (see the sketch after this list).
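
As a sketch of that third option, the snippet below runs a daily event count against a raw export, assuming the external warehouse is Google BigQuery and using the official Node.js client. The project, dataset, table, and column names are hypothetical; they depend entirely on the analytics system’s export schema.

```typescript
// npm install @google-cloud/bigquery
import { BigQuery } from "@google-cloud/bigquery";

// Hypothetical table and column names; check your own export schema before querying.
const QUERY = `
  SELECT event_date, COUNT(*) AS page_views
  FROM \`my-project.analytics_export.events\`
  WHERE event_name = 'page_view'
  GROUP BY event_date
  ORDER BY event_date DESC
  LIMIT 7
`;

async function main(): Promise<void> {
  const bigquery = new BigQuery();          // uses your local Google Cloud credentials
  const [rows] = await bigquery.query({ query: QUERY });
  for (const row of rows) {
    console.log(`${row.event_date}: ${row.page_views} page views`);
  }
}

main().catch(console.error);
```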

As a technical marketer, the better you understand the journey of a single data point – collected by a tracker, pre-processed by a collector, aligned against a schema by the processor, deposited in storage, and made available for reporting and integrations – the better you will be able to organize the data generated by your organization.

Key takeaway #1: Data flows through the pipeline

The combination of data infrastructure and data architecture for an analytics system is called the “data pipeline”. It usually comprises a tracker for collecting information in the client, a collector for pre-processing and validating the tracker’s output, a processor for aligning the data with a specific schema, and storage for making the data available for later use. There can be many other components in the pipeline, but on a general level these four are what you’d typically encounter.

Key takeaway #2: Schema is the backbone of a data model

The schema is a blueprint that determines the structure and utility of the collected data. It’s what turns the parameter-based data from the tracker into units that resemble each other in type, form, and function. The schema can be used to validate the incoming data, so that events missing required values are flagged as problematic and either discarded or fixed at a later time. The schema also instructs how the data is stored for later use in queries and integrations.

Key takeaway #3: Activation is the most difficult part of the pipeline

Activation, or how the data is actually utilized, is difficult to encode into the pipeline. That’s because it’s dependent on the business questions, integrated tools, skills of the people who query the data, and the context of how the data is intended to be utilized. Nevertheless, reports and integrations need to be built to support the existing use cases as well as possible future use cases that haven’t yet been thought of.
