In the past, we have written about site indexing and the changes Google introduced by indexing the mobile version of a site before the desktop one. This text can be considered a sequel to that article, and it discusses the term “Crawl Budget”. To make things clearer, we will begin by explaining how indexing works in general.
How Does Google Index Pages?
A Google program called Googlebot visits pages in the order they appear on its list of URLs to crawl. Every link Googlebot comes across is added to that same list, and those linked pages are then crawled in turn.
This process is called crawling (Googlebot is also called a spider). Of course, Google does not do this one page at a time; it crawls through many parallel Googlebot connections.
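The process described above is essentially a breadth-first traversal of the link graph: visit a page, add its unseen links to the queue, repeat. A minimal sketch in Python, using a hypothetical in-memory link graph instead of real HTTP requests (a real crawler would fetch each page and extract links from its HTML):

```python
from collections import deque

# Hypothetical link graph: page URL -> list of URLs it links to.
LINK_GRAPH = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/blog/post-2"],
    "/blog/post-1": ["/blog"],
    "/blog/post-2": ["/blog", "/about"],
}

def crawl(start_url):
    """Visit pages in discovery order; every new link joins the queue."""
    queue = deque([start_url])
    seen = {start_url}
    crawled = []
    while queue:
        url = queue.popleft()
        crawled.append(url)            # "index" the page
        for link in LINK_GRAPH.get(url, []):
            if link not in seen:       # only enqueue links not seen before
                seen.add(link)
                queue.append(link)
    return crawled

print(crawl("/"))
# → ['/', '/about', '/blog', '/blog/post-1', '/blog/post-2']
```

The `seen` set is what keeps the crawler from looping forever on pages that link back to each other.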
All the information that Googlebot picks up from a page is stored in Google's database, called the index, and used to enable Google to display results for a given search.
In this article, you can read how to get a new page crawled faster and more easily, as well as how to check whether a page is indexed.
The crawl budget determines how many pages Google can and wants to crawl on your site during a single crawl session. The crawl budget differs from site to site and depends on the crawl rate limit and crawl demand.
The crawl rate limit is the number of simultaneous parallel connections Googlebot uses while crawling your site. Google wants neither to overload your server nor to waste its own resources, and the crawl rate limit serves exactly this purpose. It depends on the limit set in Search Console and on the speed of your server.
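The crawl rate limit can be pictured as a cap on how many fetches run at once. A rough sketch, with a simulated fetch instead of real requests and a semaphore holding parallel "connections" at a chosen limit (the limit value and URLs are illustrative):

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

CRAWL_RATE_LIMIT = 2                      # max parallel connections (illustrative)
limiter = threading.Semaphore(CRAWL_RATE_LIMIT)
lock = threading.Lock()
active = 0
peak = 0                                  # highest concurrency actually observed

def fetch(url):
    """Simulated page fetch that respects the connection cap."""
    global active, peak
    with limiter:                         # wait for a free connection slot
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.02)                  # pretend to download the page
        with lock:
            active -= 1
    return url

urls = [f"/page-{i}" for i in range(8)]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, urls))

print(peak <= CRAWL_RATE_LIMIT)           # the cap was never exceeded
```

Even though eight worker threads are available, the semaphore ensures no more than two fetches are ever in flight, which is the essence of a rate limit: demand can be high, but the cap protects the server.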
Crawl demand, on the other hand, depends on two factors: the site's popularity and how fresh its content is. Consider a site like Airbnb.com, which is very popular and constantly adds new pages and information, against some new local site that also has fresh information but far less popularity. Airbnb will certainly be crawled more often than the less popular portal, with a higher potential crawl budget.
Or take an average dentist's office website, which looks the same today as it did two years ago, against a site displaying the current Euroleague basketball results and standings. The Euroleague site changes once or twice a week, which signals to Google that it needs to be recrawled as often as possible.
Why is a crawl budget important?
The crawl budget is not particularly important for most sites. It is not a ranking factor and will not affect your keyword rankings, and most sites will never exhaust their crawl budget. In truth, the crawl budget only becomes interesting once your site has roughly six figures of indexed pages or more.
What if you have a site with millions of indexed pages?
While a million indexed pages may sound unusual, it is quite common for larger online shops, news portals, and similar sites. For example, the aforementioned Airbnb has 11.6 million indexed pages, while eBay has 270 million, and so on.
All of these sites should be mindful of how to get the most out of their crawl budget. The fact that a site has a large number of indexed pages does not necessarily mean it has a problem of this kind: perhaps Airbnb really has 11 million different accommodation units, and those pages are legitimately indexed. The problem usually shows up when a site's indexed page count is a multiple of its actual page count, which is the case with almost all major domestic and foreign online stores.
The main problem with such sites is that the crawl budget can be spent crawling pages we do not want indexed at all, such as hacked pages or pages generated by various filters, instead of useful pages.
How to optimize your crawl budget?
What you want in situations like this is for Google not to crawl these unwanted pages but to spend its resources on useful pages, that is, the ones you want indexed. Even though you do not want certain pages indexed, adding noindex or canonical tags alone will not be enough, because Googlebot still has to crawl those pages to see the tags.
What you can do in these cases is mark all internal links pointing to such pages as “nofollow”. Often, removing a particular navigation element or filter from the site will be recommended or even necessary, and as a further solution, the URL patterns of these unwanted pages can be disallowed in the robots.txt file. The most difficult task here is usually identifying the pages you do not want crawled. Also, remember that server speed and page load times affect the crawl budget. Page speed is also a ranking factor on Google, so be sure to pay special attention to it.
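Before shipping robots.txt rules, it is worth verifying locally which paths they actually exclude. Python's standard `urllib.robotparser` can test a candidate robots.txt without touching the server; the `Disallow` paths below are hypothetical examples of filter and internal-search pages (note that this parser matches plain path prefixes, not wildcard patterns):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt blocking filter and internal-search pages
# while leaving normal product pages open to crawling.
robots_txt = """\
User-agent: *
Disallow: /filter/
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for url in ["/products/shoes", "/filter/color-red", "/search/"]:
    allowed = parser.can_fetch("Googlebot", url)
    print(url, "->", "crawl" if allowed else "blocked")
# → /products/shoes -> crawl
# → /filter/color-red -> blocked
# → /search/ -> blocked
```

A quick check like this helps avoid the opposite mistake: a rule that is too broad and accidentally blocks the useful pages you want crawled.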