1. Crawling, indexing, and querying

A search engine for the web needs to crawl content independently to build its index. This index is then made available to queries.

When you use a search engine like Google or Bing, you are working with an interface that sits in front of an enormous corpus of text, video, sound, and other media.

Search engines send a never-tiring army of crawlers around the web, whose job is to collect information about the web pages they visit.

This information is stored in huge databases, where the content is further parsed for keywords, structured data, topics, headings, and other metadata. These databases form the index of the search engine.

When you type words into the search engine input bar, your text is converted into a query which looks for relevant content in the search engine’s index.

How a search engine figures out which content matches the query, which content is most relevant, and which content is returned first to the user (these three things might not always correlate!) is a closely guarded secret, in part to prevent content from being manipulated in hopes of better search engine reach.

In digital marketing, improving content availability in search engines is a strong strategy for improving brand awareness and generating leads. This type of organic visibility is typically steadier and more consistent than the type of traffic you’d get from ads and social media nudges.

The discipline of adjusting a website’s technical infrastructure, content, and relationships with other websites to improve its visibility in search results is known as Search Engine Optimization (SEO).

When the focus is primarily on making the website easier to crawl and index, we usually talk about technical SEO.

Crawling the web

The basic operational principle of a web crawler is simple.

The web crawler loads a web page, collects as much information and metadata about the page as possible, and then adds the hyperlinks it finds on the page to a queue. This queue is then prioritized based on the perceived importance of each link.

When the crawler visits the next page in the queue, it again collects the metadata and then follows the links on that page, and on and on it goes.

Crawlers thus follow the interconnected structure of the World Wide Web without pause. 
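
As a rough illustration, here is a minimal sketch of that crawl loop in Python, using only the standard library. The seed URL is a placeholder, and real crawlers layer politeness rules, deduplication, prioritization, robots.txt checks, and JavaScript rendering on top of a loop like this.

```python
# Minimal sketch of the crawl loop described above: fetch a page,
# note some basic information about it, queue the links found on it,
# and repeat. The seed URL below is just a placeholder.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=10):
    queue = deque([seed_url])  # links waiting to be visited
    seen = {seed_url}          # every URL queued so far
    crawled = 0

    while queue and crawled < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to load
        crawled += 1
        print(f"Crawled {url} ({len(html)} characters)")

        parser = LinkCollector()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)     # follow the link later


crawl("https://example.com/")
```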

It’s not as easy as it sounds, though. Consider the following situations:

Can a web page prevent crawlers from crawling it and indexing it?

Yes. Tools like the robots.txt file and meta robots directives can be used to tell crawlers what they may or may not crawl and index.
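
For example, a robots.txt file can disallow whole sections of a site, and a meta robots tag such as <meta name="robots" content="noindex"> can ask search engines not to index an individual page. Python’s standard library even includes a small robots.txt parser, so a polite crawler’s check might look roughly like this (the URLs and crawler name are placeholders):

```python
# Sketch of how a well-behaved crawler consults robots.txt before
# fetching a URL. The URLs and user agent name are placeholders.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()  # download and parse the robots.txt file

# Ask whether a crawler identifying itself as "MyCrawler" may fetch a path.
if robots.can_fetch("MyCrawler", "https://example.com/private/page.html"):
    print("robots.txt allows crawling this URL")
else:
    print("robots.txt disallows crawling this URL")
```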

What if a page has no inbound links? How can crawlers find it?

If the page is not linked to by any other page on the web, then crawlers will not be able to find it organically. Instead, the crawlers need to be instructed to visit that page directly. 

With search engines, this can usually be done through the search engine’s own maintenance tools. 

You can also use something called a sitemap.xml file, which is a structured list of all the web pages on any given site. Sometimes crawlers go through pages in the sitemap directly without having to follow links.
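
Because a sitemap is plain XML, it is also easy to read programmatically. As a sketch (the sitemap URL is a placeholder), the listed pages can be extracted with Python’s standard library:

```python
# Sketch: read a sitemap.xml and print the page URLs it lists.
# The sitemap URL below is a placeholder.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

with urlopen("https://example.com/sitemap.xml", timeout=10) as response:
    tree = ET.parse(response)

# Each <url> entry contains a <loc> element with the page address.
for url_element in tree.getroot().findall(f"{SITEMAP_NS}url"):
    loc = url_element.find(f"{SITEMAP_NS}loc")
    if loc is not None:
        print(loc.text)
```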

What if the links on the page are generated dynamically?

For efficiency, many crawlers simply load the HTML source file from the web server and parse its markup for content and links.

However, a huge portion of the web is generated dynamically with JavaScript after the HTML source file is loaded.

In recent years, search engine crawlers have developed the ability to execute and render JavaScript. This means that in most cases, the crawler should end up seeing what regular web visitors see.

However, if content sits behind complicated interactions such as login or purchase flows, the crawler will most likely not be able to replicate them, and any such content will not get crawled.

How often does a crawler recrawl the content?

Naturally, content needs to be recrawled periodically. Think of a site like Wikipedia where pages are being constantly updated. If the crawler visited Wikipedia just once per month, for example, its index would be severely outdated at all times.

Content is recrawled periodically, but not all content is recrawled equally.

Example

Your little blog site might only get recrawled once a month unless you specifically instruct the search engine to revisit the site. But a popular site like Wikipedia or Amazon will get recrawled far more often, because it’s in the interests of the search engine, too, to make sure the metadata represents the content accurately.

In general, what’s good for visitors is good for search engine crawlers. If you can make sense of the website’s structure and information hierarchy with ease, crawler bots most likely can, too.

Deep Dive

How to improve the crawlability of a page

From a technical SEO perspective, you can make the crawlers’ work easier with the following:

  • Even though many search engine crawlers are JavaScript literate, it might still be a good idea to avoid using JavaScript-generated content (especially links!) as much as possible.
  • Use links generously, both links to other pages on the same site and links to other websites. By having a healthy navigation and linking structure, you give positive signals to crawlers and make their decision tree a simpler one.
  • Make it easy for the bot to determine the metadata of a page. Use structured data, clear headings, relevant metadata tags, and make sure you don’t unintentionally block content from crawlers.
  • Avoid duplicating the same content on different pages, because crawlers will have a hard time figuring out which content they should add to the index.
  • Use a clear URL structure for your pages, because that is often another helpful signal for crawlers.

Building the index

The data that crawlers collect is used to populate the search engine’s index.

The purpose of the index is to associate the content with the keywords and key phrases that users tend to search for.

The clearer it is to the search engine what your content is about, the better it will be able to index the content.

Indexing is a multistep process. Search engines need to build their indexes so that they are both comprehensive and fast when facing a multitude of different query types.

Index creation involves building a huge “table of contents” for the web, where the processed content is stored across enormous databases on gigantic server clusters. This index includes information about what content appears on which pages, how often, in what position, and in what context.
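
One common way to describe the core data structure behind this “table of contents” is an inverted index: a mapping from each term to the documents (and positions) where it appears. A toy sketch with made-up pages:

```python
# Toy sketch of an inverted index: for each term, record which pages
# it appears on and at which word positions. Real search engine
# indexes are vastly larger and far more sophisticated.
from collections import defaultdict

documents = {
    "page1": "samsung dishwasher installation guide",
    "page2": "best dishwasher deals this week",
    "page3": "samsung support documentation",
}

# term -> {page id -> list of positions where the term occurs}
inverted_index = defaultdict(lambda: defaultdict(list))

for doc_id, text in documents.items():
    for position, term in enumerate(text.split()):
        inverted_index[term][doc_id].append(position)

print(dict(inverted_index["dishwasher"]))
# {'page1': [1], 'page2': [1]}
```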

Example

The index also needs to include metadata about link analysis. The core of many search engines is the link graph, which records which sites link to which and how valuable those links are, based on site authority. A link from a well-known information site like Wikipedia is more valuable than a link from a regular blog.
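
The exact formulas are proprietary, but the general idea can be illustrated with a heavily simplified, PageRank-style calculation over a made-up link graph, where links from pages that are themselves well linked pass on more value:

```python
# Heavily simplified, PageRank-style authority scores over a made-up
# link graph. Real link analysis is proprietary and far more elaborate;
# this only illustrates that links from authoritative pages count more.
links = {
    "wikipedia.org": ["myblog.com", "shop.com"],
    "myblog.com": ["shop.com"],
    "shop.com": ["wikipedia.org"],
}

pages = list(links)
authority = {page: 1.0 / len(pages) for page in pages}  # start out equal
damping = 0.85

for _ in range(20):  # iterate until the scores settle
    new_authority = {}
    for page in pages:
        incoming = sum(
            authority[other] / len(links[other])
            for other in pages
            if page in links[other]
        )
        new_authority[page] = (1 - damping) / len(pages) + damping * incoming
    authority = new_authority

for page, score in sorted(authority.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```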

The index needs to handle duplicate content so that it always contains unique and relevant information. Duplicate content often dilutes the value of strong, unique content pieces.

Just as content is recrawled, it also needs to be reindexed constantly. This is because the web is a living thing: a single piece of content might change its position in the index simply by acquiring new links from new websites.

While search engine results are not personalized to the individual user, they are often localized (based on variables like the user’s current location), and they might incorporate data based on the user’s previous browsing behavior. Metadata such as the geographical relevance of content needs to be encoded in the index, too.

Deep Dive

Processing of content for indexing

Content needs to be parsed for meaningful information. This includes things like the title of the page, its meta tags, the main body of content, links within the content, and so forth. If the page includes multimedia or structured data, this is also parsed.

The text content needs to be processed so that it can be logically organized. Instead of storing complicated sentence structures or semantically similar variants, the text is tokenized (broken down into smaller constituent parts), normalized (standardized in form), and stemmed (reduced to the base form of a word, e.g. “running” -> “run”), and stop words like “and”, “the”, and “an” are removed because they are rarely relevant for search.
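
As a rough sketch of what such a preprocessing pipeline might look like (the suffix-stripping “stemmer” below is deliberately naive; production systems use much more robust linguistic tooling):

```python
# Rough sketch of the preprocessing steps described above: tokenize,
# normalize, drop stop words, and stem. The suffix-stripping "stemmer"
# is deliberately naive and only meant to illustrate the idea.
import re

STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def naive_stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            if len(word) > 2 and word[-1] == word[-2]:  # "runn" -> "run"
                word = word[:-1]
            return word
    return word


def preprocess(text):
    # Tokenize and normalize: lowercase the text and split it into words.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Drop stop words and stem whatever remains.
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("The dog is running in the park"))
# ['dog', 'run', 'park']
```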

Don’t miss this fact!

Building and maintaining a search engine index is a very complex task. It requires enormous amounts of storage and processing power to ensure adaptability to changes in the index and fast retrieval times for queries.

Querying the index

The principle of a search engine sounds simple: for any given query, fetch the most relevant content.

However, deciphering the user’s search intent from the query, and pairing it with content that hopefully matches that intent, is an extremely difficult task.

Consider a search engine query like this:

Samsung dishwasher

What is the user’s intent?

Are they looking for cheap deals for the appliance? That would be a commercial query.

Are they looking for support documentation for their dishwasher? That would be an informational query.

Are they trying to find Samsung’s product page for their dishwashers? That would be a navigational query.

In this case the query itself lacks any kind of signal about the user’s intent, so the search engine needs to do a lot of guesswork.

Even if the user’s intent is clear(er) from the query, the search engine still needs to determine which pages to surface first – in other words, how to rank the results the user sees.
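
Leaving intent aside, the mechanical part of matching and ranking can be illustrated with a toy lookup against a tiny inverted index like the one sketched earlier, scoring pages simply by how many query terms they contain. Real ranking blends hundreds of additional signals.

```python
# Toy query against a small in-memory inverted index: look up each
# query term and rank pages by how many of the terms they contain.
# Real search engines combine hundreds of additional ranking signals.
from collections import Counter

inverted_index = {
    "samsung": ["page1", "page3"],
    "dishwasher": ["page1", "page2"],
    "deals": ["page2"],
}


def search(query):
    scores = Counter()
    for term in query.lower().split():
        for page in inverted_index.get(term, []):
            scores[page] += 1  # one point per matching query term
    return [page for page, _ in scores.most_common()]


print(search("Samsung dishwasher"))
# ['page1', 'page3', 'page2'] -- page1 matches both terms, so it ranks first
```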

Getting your content ranked for high-volume queries is one of the key activities in SEO, because research shows that users tend to be interested only in the top entries of the search engine results page (SERP).

It’s a common joke that if you want to hide something for good you bury it in the second page of search engine results, because no one ever ventures there.

Deep Dive

Ranking in search engine query results

The ranking algorithms that a search engine uses are among the most closely guarded secrets of the industry.

SEO researchers spend hours upon hours researching patents from search engine companies to figure out what technologies they use for ranking content for any given query.

A search engine like Google has hundreds upon hundreds of variables and features in their ranking algorithms. The most industrious SEO experts work with correlation studies to see how the top results for different queries change over time, with hopes of understanding how these algorithms change.

To improve the chances of ranking well for your content, there are certain things to keep in mind:

  1. Make sure that the content is fresh and up-to-date.
  2. Think of the queries that the content could help with and design the content around these topics.
  3. Avoid duplicate content.
  4. Make sure the content is crawlable, properly structured, and equipped with comprehensive metadata.
  5. Share the content actively to increase its chances of getting inbound links from other, related content.

In addition to content signals, user behavior impacts search engine results, too. The SERP is a living thing – the same query issued by two different users often yields different results, because users interact with SERP results in different ways, signalling their intent with varying patterns.

In the end, from a technical point of view, the categories of “What’s good for your site visitors” and “What’s good for search engines” overlap a great deal. If you want your content to be crawlable, indexable, and discoverable, you should write it so that it is clear to both your visitors and to search engines what the content is about and how it relates to similar content on your website and elsewhere on the web.

Key takeaway #1: Crawlers eat links

A web crawler needs a starting point – some resource on the web that it downloads and parses for information. If that resource is a web page, the crawler can then follow any links on the page to find additional content to crawl. This way it could theoretically crawl the entire interconnected web. If a page has no links pointing to it, it cannot be crawled unless crawlers are specifically instructed to do so.

Key takeaway #2: The index is the content database of the web

A search engine’s index is where all the crawled content is parsed, organized, structured, and stored. There can be many different indexes, such as an image index, a video index, and a mobile-only index. The purpose of the index is to make the content of the web available for speedy query and retrieval, so that a search engine can respond to a user’s query without any delay.

Key takeaway #3: Search engines try to fetch relevant content

Search engines try to fetch relevant content for any given query from their index. This is sometimes very difficult to do based on the query alone, which means that search engines might make use of other signals, too. For example, previous queries the user made might inform future ones, and the user’s search preferences might impact what the search engine determines to be relevant content when the query itself is ambiguous.

Quiz: Crawling, Indexing, And Querying

Ready to test what you've learned? Dive into the quiz below!

1. Which of the following are useful guidelines for helping content rank better in search engines?

2. What should you focus on when optimizing content for search engines?

3. Why are hyperlinks so significant for search engine optimization?
