Gary Illyes published an article in January on the Google Webmaster Central Blog, What Crawl Budget Means for Googlebot, which brought a lot of attention to the topic of a site's crawl budget. That same week, I started recovery work on a site whose crawl budget had been decreasing since November 2016.
Using the latest news on crawl budget and data from Google Search Console, I was able to trace the problem to an increase in 404 errors that was hurting the site's crawl health and dragging down its crawl rate limit.
What Is a Crawl Rate Limit & Why Does It Matter
Googlebot crawls the web to discover pages to store in Google’s index. If your pages aren’t crawled by Googlebot, then searchers won’t be able to find them, since only pages stored in the index are returned by a user’s query.
Every site is given a crawl rate limit, which limits the maximum fetching rate for a given site — or the amount of data that Googlebot will crawl. You want Googlebot to crawl the maximum number of pages possible, so more of your site’s content with ranking potential is indexed and can compete to rank in the SERPs.
How to Diagnose Crawl Rate Limit Problems
Crawl rates fluctuate naturally for a few different reasons:
- Someone manually set a crawl rate limit in Google Search Console
- A subfolder was added to or removed from your ROBOTS.TXT file
- A large number of pages was added to or removed from your site
- Anecdotal and often unexplained spikes in Googlebot crawls on your site
If you see a dip in your crawl rate, rule out the first three causes by checking your Search Console settings, verifying the contents of your ROBOTS.TXT file, and checking for an increase in HTTP errors.
Verify Crawl Rate Limit Settings in Google Search Console
1. Select the SETTINGS GEAR in the upper right-hand corner.
2. From the dropdown menu, select SITE SETTINGS.
3. Under Crawl Rate, verify that LET GOOGLE OPTIMIZE FOR MY SITE is selected.
Common Issues with ROBOTS.TXT Files That Hurt Crawl Rate
To view the contents of your robots.txt file, navigate to your root domain with the ROBOTS.TXT filepath — MYDOMAIN.COM/ROBOTS.TXT.
A properly structured ROBOTS.TXT file is a key component of any site, especially sites with hundreds, thousands, or even millions of pages. To prevent unwanted changes to your ROBOTS.TXT file, make sure that you limit the number of people who can edit the document.
If, for some reason, you or someone on your team made a change to the file that hurt your crawl rate limit, it's easy to revert the change. You don't have to be an SEO expert to read a ROBOTS.TXT file; all you need to know for this exercise is that DISALLOW means a robot will not crawl the filepath and ALLOW means it will.
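If you want to check how a crawler will interpret a set of ALLOW and DISALLOW rules rather than reading them by eye, you can test a file with Python's built-in `urllib.robotparser`. This is a minimal sketch; the rules and URLs below are hypothetical examples, not any real site's file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules: block one subfolder, allow everything else
rules = """User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# A normal page is crawlable; anything under /private/ is not
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))      # True
print(parser.can_fetch("Googlebot", "https://example.com/private/report")) # False
```

This is handy before deploying a robots.txt change, since a single misplaced slash can block far more than you intended.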
Look for the disallow lines in your ROBOTS.TXT file to see if your whole site or large sections of your site have been blocked from robots, which would prevent Google from crawling them and reduce the total number of pages crawled. The most important ones to watch out for:
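The patterns to watch out for look something like these (the subfolder name is a placeholder for illustration):

```
User-agent: *
Disallow: /
```

```
User-agent: *
Disallow: /subfolder/
```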
In the first example, someone would have blocked your whole site, and in the second, an entire subfolder of your site that could consist of a large number of pages. If either happens, all you have to do is remove the unwanted statement from your ROBOTS.TXT file and monitor your Search Console crawl analytics for an improvement in the number of pages and bytes crawled. Unfortunately, for the site I was working on, neither the site settings nor the ROBOTS.TXT file could explain the drop in the site's crawl rate limit.
Case Study: How HTTP Errors Can Reduce Overall Crawl Rate Limit
The article on Google Webmaster Blog confirmed that crawl health influences a site’s crawl rate limit. In diagnosing the site, it became clear to me just how much crawl health matters for all sites, even smaller ones.
In the article, Gary Illyes of Google states:
“First, we’d like to emphasize that crawl budget, as described below, is not something most publishers have to worry about. If new pages tend to be crawled the same day they’re published, crawl budget is not something webmasters need to focus on. Likewise, if a site has fewer than a few thousand URLs, most of the time it will be crawled efficiently.”
The key phrase here is "most of the time," and I was looking for an explanation for the fraction of the time when this is not the case. The last place left to look was the site's crawl health. In the Top Questions portion of the article, we learn:
“…a significant number of 5xx errors or connection timeouts signal the opposite, and crawling slows down. We recommend paying attention to the Crawl Errors report in Search Console and keeping the number of server errors low.”
In Crawl Errors, I found an increase in 404 errors, not 500s, during late October and early November 2016, right around when the crawl budget decreased.
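Search Console's Crawl Errors report is the easiest place to spot this, but you can also confirm it from your own server logs. A minimal sketch in Python (the log lines below are fabricated for illustration) counts response codes for Googlebot requests:

```python
import re
from collections import Counter

# Hypothetical access-log lines in the common combined format
log_lines = [
    '66.249.66.1 - - [28/Oct/2016:10:00:00 +0000] "GET /old-url HTTP/1.1" 404 209 "-" "Googlebot"',
    '66.249.66.1 - - [28/Oct/2016:10:01:00 +0000] "GET /products HTTP/1.1" 200 5120 "-" "Googlebot"',
    '66.249.66.1 - - [28/Oct/2016:10:02:00 +0000] "GET /old-url-2 HTTP/1.1" 404 209 "-" "Googlebot"',
]

status_counts = Counter()
for line in log_lines:
    match = re.search(r'" (\d{3}) ', line)  # the status code follows the quoted request
    if match and "Googlebot" in line:
        status_counts[match.group(1)] += 1

print(status_counts)
```

A sudden jump in the 404 count for Googlebot traffic, like the one this site saw, is exactly the kind of signal to investigate.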
The site had recently been migrated to a new platform, and some of the URLs changed during the migration and were now serving 404 errors. We implemented 301 redirects to the appropriate new pages and saw the crawl rate limit increase by 107% over 4 weeks.
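The fix itself is typically a one-line rule per moved URL in your server configuration. In Apache, for example, a permanent redirect looks like this (the paths are placeholders, not the site's actual URLs):

```
Redirect 301 /old-migrated-path /new-path
```

Once the old URLs return 301s pointing at live pages instead of 404s, Googlebot consolidates its crawling on the new URLs.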
This site, at just under 500 pages, shows that even small sites' crawl rates are influenced by poor crawl health. Aside from server errors, client errors like 404s also seem to have an impact on the crawl rate limit.
Having a process for diagnosing crawl rate limit drops makes it faster to diagnose sitewide issues. By limiting who has access to both your Search Console account and your ROBOTS.TXT file, you can prevent unwanted changes or quickly isolate the changes that could have impacted your site.
Keep an eye on your Crawl Errors report by setting up a monthly process to check in on your site’s crawl health — which is great for both users and Googlebot.