1. Crawling, indexing, and querying

A search engine for the web needs to crawl content independently to build its index. This index is then made available to queries.

When you use a search engine like Google or Bing, you are working with an interface that sits in front of an enormous corpus of text, video, sound, and other media.

Search engines send a never-tiring army of crawlers around the web, whose job is to collect information about the web pages they visit.

This information is stored in huge databases, where the content is further parsed for keywords, structured data, topics, headings, and other metadata. These databases form the index of the search engine.

When you type words into the search engine input bar, your text is converted into a query which looks for relevant content in the search engine’s index.

How a search engine figures out which content matches the query, which content is most relevant, and which content is returned first to the user (these three things might not always correlate!) is a closely guarded secret, in part to prevent content from being manipulated in hopes of better search engine reach.

In digital marketing, improving content availability in search engines is a strong strategy for improving brand awareness and generating leads. This type of organic visibility is typically steadier and more consistent than the type of traffic you’d get from ads and social media nudges.

The discipline of adjusting a website’s technical infrastructure, content, and relationships with other websites to improve its visibility in search results is known as Search Engine Optimization (SEO).

When the focus is primarily on making the website easier to crawl and index, we usually talk about technical SEO.

Crawling the web

The basic operational principle of a web crawler is simple.

The web crawler loads a web page, collects as much information and metadata about the page as possible, and then adds the hyperlinks it finds on the page to a queue. This queue is then prioritized based on the perceived importance of each link.

When the crawler visits the next page in the queue, it again collects the metadata and then follows the links on that page, and on and on it goes.

Crawlers thus follow the interconnected structure of the World Wide Web without pause. 
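
As a rough illustration, here is a minimal sketch of that crawl loop in Python, using only the standard library. The seed URL is a placeholder, and real crawlers layer politeness rules, deduplication, prioritization, robots.txt checks, and JavaScript rendering on top of a loop like this.

```python
# Minimal sketch of the crawl loop described above: fetch a page,
# note some basic information about it, queue the links found on it,
# and repeat. The seed URL below is just a placeholder.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=10):
    queue = deque([seed_url])  # links waiting to be visited
    seen = {seed_url}          # every URL queued so far
    crawled = 0

    while queue and crawled < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to load
        crawled += 1
        print(f"Crawled {url} ({len(html)} characters)")

        parser = LinkCollector()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)     # follow the link later


crawl("https://example.com/")
```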

It’s not as easy as it sounds, though. Consider the following situations:

Can a web page prevent crawlers from crawling it and indexing it?

Yes. Tools like the robots.txt file and meta robots directives can be used to tell crawlers what they may or may not crawl and index.
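
For example, a robots.txt file can disallow whole sections of a site, and a meta robots tag such as <meta name="robots" content="noindex"> can ask search engines not to index an individual page. Python’s standard library even includes a small robots.txt parser, so a polite crawler’s check might look roughly like this (the URLs and crawler name are placeholders):

```python
# Sketch of how a well-behaved crawler consults robots.txt before
# fetching a URL. The URLs and user agent name are placeholders.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()  # download and parse the robots.txt file

# Ask whether a crawler identifying itself as "MyCrawler" may fetch a path.
if robots.can_fetch("MyCrawler", "https://example.com/private/page.html"):
    print("robots.txt allows crawling this URL")
else:
    print("robots.txt disallows crawling this URL")
```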

What if a page has no inbound links? How can crawlers find it?

If the page is not linked to by any other page on the web, then crawlers will not be able to find it organically. Instead, the crawlers need to be instructed to visit that page directly. 

With search engines, this can usually be done through the search engine’s own maintenance tools. 

You can also use something called a sitemap.xml file, which is a structured list of all the web pages on any given site. Sometimes crawlers go through pages in the sitemap directly without having to follow links.
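
Because a sitemap is plain XML, it is also easy to read programmatically. As a sketch (the sitemap URL is a placeholder), the listed pages can be extracted with Python’s standard library:

```python
# Sketch: read a sitemap.xml and print the page URLs it lists.
# The sitemap URL below is a placeholder.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

with urlopen("https://example.com/sitemap.xml", timeout=10) as response:
    tree = ET.parse(response)

# Each <url> entry contains a <loc> element with the page address.
for url_element in tree.getroot().findall(f"{SITEMAP_NS}url"):
    loc = url_element.find(f"{SITEMAP_NS}loc")
    if loc is not None:
        print(loc.text)
```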

What if the links on the page are generated dynamically?

For efficiency, many crawlers simply load the HTML source file from the web server and parse its markup for content and links.

However, a huge portion of the web is generated dynamically with JavaScript after the HTML source file is loaded.

In recent years, search engine crawlers have developed the ability to execute and render JavaScript. This means that in most cases, the crawler should end up seeing what regular web visitors see.

However, if content sits behind complicated interactions such as login or purchase flows, the crawler will most likely not be able to replicate them, and any such content will not get crawled.

How often does a crawler recrawl the content?

Naturally, content needs to be recrawled periodically. Think of a site like Wikipedia where pages are being constantly updated. If the crawler visited Wikipedia just once per month, for example, its index would be severely outdated at all times.

Content is recrawled periodically, but not all content is recrawled equally.

Example

Your little blog site might only get recrawled once a month unless you specifically instruct the search engine to revisit the site. But a popular site like Wikipedia or Amazon will get recrawled far more often, because it’s in the interests of the search engine, too, to make sure the metadata represents the content accurately.

In general, what’s good for visitors is good for search engine crawlers. If you can make sense of the website’s structure and information hierarchy with ease, crawler bots most likely can, too.

Deep Dive

How to improve the crawlability of a page

From a technical SEO perspective, you can make the crawlers’ work easier with the following:

  • Even though many search engine crawlers are JavaScript literate, it might still be a good idea to avoid using JavaScript-generated content (especially links!) as much as possible.
  • Use links generously, both links to other pages on the same site and links to other websites. By having a healthy navigation and linking structure, you give positive signals to crawlers and make their decision tree a simpler one.
  • Make it easy for the bot to determine the metadata of a page. Use structured data, clear headings, relevant metadata tags, and make sure you don’t unintentionally block content from crawlers.
  • Avoid duplicating the same content on different pages, because crawlers will have a hard time figuring out which content they should add to the index.
  • Use a clear URL structure for your pages, because that is often another helpful signal for crawlers.

Building the index

The data that crawlers collect is used to populate the search engine’s index.

The purpose of the index is to associate the content with the keywords and key phrases that users tend to search for.

The clearer it is to the search engine what your content is about, the better it will be able to index the content.

Indexing is a multistep process. Search engines need to build their indexes so that they are both comprehensive and fast when facing a multitude of different query types.

Index creation involves building a huge “table of contents” for the web, where the processed content is stored across enormous databases on gigantic server clusters. This index includes information about what content appears on which pages, how often, in what position, and in what context.
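
One common way to describe the core data structure behind this “table of contents” is an inverted index: a mapping from each term to the documents (and positions) where it appears. A toy sketch with made-up pages:

```python
# Toy sketch of an inverted index: for each term, record which pages
# it appears on and at which word positions. Real search engine
# indexes are vastly larger and far more sophisticated.
from collections import defaultdict

documents = {
    "page1": "samsung dishwasher installation guide",
    "page2": "best dishwasher deals this week",
    "page3": "samsung support documentation",
}

# term -> {page id -> list of positions where the term occurs}
inverted_index = defaultdict(lambda: defaultdict(list))

for doc_id, text in documents.items():
    for position, term in enumerate(text.split()):
        inverted_index[term][doc_id].append(position)

print(dict(inverted_index["dishwasher"]))
# {'page1': [1], 'page2': [1]}
```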

Example

The index also needs to include metadata about link analysis. The core of many search engines is the link graph, which records which sites link to which and how valuable those links are, based on site authority. A link from a well-known information site like Wikipedia is more valuable than a link from a regular blog.
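
The exact formulas are proprietary, but the general idea can be illustrated with a heavily simplified, PageRank-style calculation over a made-up link graph, where links from pages that are themselves well linked pass on more value:

```python
# Heavily simplified, PageRank-style authority scores over a made-up
# link graph. Real link analysis is proprietary and far more elaborate;
# this only illustrates that links from authoritative pages count more.
links = {
    "wikipedia.org": ["myblog.com", "shop.com"],
    "myblog.com": ["shop.com"],
    "shop.com": ["wikipedia.org"],
}

pages = list(links)
authority = {page: 1.0 / len(pages) for page in pages}  # start out equal
damping = 0.85

for _ in range(20):  # iterate until the scores settle
    new_authority = {}
    for page in pages:
        incoming = sum(
            authority[other] / len(links[other])
            for other in pages
            if page in links[other]
        )
        new_authority[page] = (1 - damping) / len(pages) + damping * incoming
    authority = new_authority

for page, score in sorted(authority.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```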

The index needs to handle duplicate content so that it always contains unique and relevant information. Duplicate content often dilutes the value of strong, unique content pieces.

Just as content is recrawled, it also needs to be reindexed constantly. This is because the web is a living thing: a single piece of content might change its position in the index simply by acquiring new links from new websites.

While search engine results are not personalized to the individual user, they are often localized (based on variables like the user’s current location), and they might incorporate data based on the user’s previous browsing behavior. Metadata such as the geographical relevance of content needs to be encoded in the index, too.

Deep Dive

Processing of content for indexing

Content needs to be parsed for meaningful information. This includes things like the title of the page, its meta tags, the main body of content, links within the content, and so forth. If the page includes multimedia or structured data, this is also parsed.

The text content needs to be processed so that it can be logically organized. Instead of storing complicated sentence structures or semantically similar variants, the text is tokenized (broken down into smaller constituent parts), normalized (standardized in form), and stemmed (reduced to the base form of a word, e.g. “running” -> “run”), and stop words like “and”, “the”, and “an” are removed because they are rarely relevant for search.
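
As a rough sketch of what such a preprocessing pipeline might look like (the suffix-stripping “stemmer” below is deliberately naive; production systems use much more robust linguistic tooling):

```python
# Rough sketch of the preprocessing steps described above: tokenize,
# normalize, drop stop words, and stem. The suffix-stripping "stemmer"
# is deliberately naive and only meant to illustrate the idea.
import re

STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def naive_stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            if len(word) > 2 and word[-1] == word[-2]:  # "runn" -> "run"
                word = word[:-1]
            return word
    return word


def preprocess(text):
    # Tokenize and normalize: lowercase the text and split it into words.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Drop stop words and stem whatever remains.
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("The dog is running in the park"))
# ['dog', 'run', 'park']
```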

Don’t miss this fact!

Building and maintaining a search engine index is a very complex task. It requires enormous amounts of storage and processing power to ensure adaptability to changes in the index and fast retrieval times for queries.

Querying the index

The principle of a search engine sounds simple: for any given query, fetch the most relevant content.

However, deciphering the user’s search intent from the query, and pairing it with content that hopefully matches that intent, is an extremely difficult task.

Consider a search engine query like this:

Samsung dishwasher

What is the user’s intent?

Are they looking for cheap deals for the appliance? That would be a commercial query.

Are they looking for support documentation for their dishwasher? That would be an informational query.

Are they trying to find Samsung’s product page for their dishwashers? That would be a navigational query.

In this case the query itself lacks any kind of signal about the user’s intent, so the search engine needs to do a lot of guesswork.

Even if the user’s intent is clear(er) from the query, the search engine still needs to determine which pages to surface first – in other words, how to rank the results the user sees.
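
Leaving intent aside, the mechanical part of matching and ranking can be illustrated with a toy lookup against a tiny inverted index like the one sketched earlier, scoring pages simply by how many query terms they contain. Real ranking blends hundreds of additional signals.

```python
# Toy query against a small in-memory inverted index: look up each
# query term and rank pages by how many of the terms they contain.
# Real search engines combine hundreds of additional ranking signals.
from collections import Counter

inverted_index = {
    "samsung": ["page1", "page3"],
    "dishwasher": ["page1", "page2"],
    "deals": ["page2"],
}


def search(query):
    scores = Counter()
    for term in query.lower().split():
        for page in inverted_index.get(term, []):
            scores[page] += 1  # one point per matching query term
    return [page for page, _ in scores.most_common()]


print(search("Samsung dishwasher"))
# ['page1', 'page3', 'page2'] -- page1 matches both terms, so it ranks first
```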

Getting your content ranked for high-volume queries is one of the key activities in SEO, because research shows that users tend to be interested only in the top entries of the search engine results page (SERP).

It’s a common joke that if you want to hide something for good you bury it in the second page of search engine results, because no one ever ventures there.

Deep Dive

Ranking in search engine query results

The ranking algorithms that a search engine uses are among the most closely guarded secrets of the industry.

SEO researchers spend hours upon hours researching patents from search engine companies to figure out what technologies they use for ranking content for any given query.

A search engine like Google has hundreds upon hundreds of variables and features in their ranking algorithms. The most industrious SEO experts work with correlation studies to see how the top results for different queries change over time, with hopes of understanding how these algorithms change.

To improve the chances of ranking well for your content, there are certain things to keep in mind:

  1. Make sure that the content is fresh and up-to-date.
  2. Think of the queries that the content could help with and design the content around these topics.
  3. Avoid duplicate content.
  4. Make sure the content is crawlable, properly structured, and equipped with comprehensive metadata.
  5. Share the content actively to increase its chances of getting inbound links from other, related content.

In addition to content signals, user behavior impacts search engine results, too. The SERP is a living thing – the same query issued by two different users often yields different results, because users interact with SERP results in different ways, signalling their intent with varying patterns.

In the end, from a technical point of view, the categories of “What’s good for your site visitors” and “What’s good for search engines” overlap a great deal. If you want your content to be crawlable, indexable, and discoverable, you should write it so that it is clear to both your visitors and to search engines what the content is about and how it relates to similar content on your website and elsewhere on the web.

Key takeaway #1: Crawlers eat links

A web crawler needs a starting point – some resource on the web that it downloads and parses for information. If that resource is a web page, the crawler can then follow any links on the page to find additional content to crawl. This way it could theoretically crawl the entire interconnected web. If a page has no links pointing to it, it cannot be crawled unless crawlers are specifically instructed to do so.

Key takeaway #2: The index is the content database of the web

A search engine’s index is where all the crawled content is parsed, organized, structured, and stored. There can be many different indexes, such as an image index, a video index, and a mobile-only index. The purpose of the index is to make the content of the web available for speedy query and retrieval, so that a search engine can respond to a user’s query without any delay.

Key takeaway #3: Search engines try to fetch relevant content

Search engines try to fetch relevant content for any given query from their index. This is sometimes very difficult to do based on the query alone, which means that search engines might make use of other signals, too. For example, previous queries the user made might inform future ones, and the user’s search preferences might impact what the search engine determines to be relevant content when the query itself is ambiguous.

Quiz: Crawling, Indexing, And Querying

Ready to test what you've learned? Dive into the quiz below!

1. Which of the following are useful guidelines for helping content rank better in search engines?

2. What should you focus on when optimizing content for search engines?

3. Why are hyperlinks so significant for search engine optimization?
