Google Search Console training: Crawl budget and crawl stats

Team TypeStack
Dec 01, 2022 · 7 min read

This blog will provide you with a short introduction to how Google crawls pages and definitions for terms such as crawl rate, crawl demand, and crawl budget. In addition to that, you'll learn about the Search Console Crawl Stats report, which provides data on crawl requests, average response time, and more.

The report contains information for several Google crawlers, but we will focus mainly on the search side. As a disclaimer, this article is more relevant if you work with a large website; if your site has fewer than a few thousand pages, crawl budget is usually not something you need to worry about.

However, it never hurts to learn something new, right? Who knows, your site may become the next big thing.

The crawling process begins with a list of web addresses and site maps provided by site owners. Google uses web crawlers to visit these addresses, read the information in them, and follow links on those pages. The crawlers will revisit pages already in the list to check if they have changed and also crawl new pages they discovered. During this process, the crawlers must make important decisions, such as prioritizing when and what to crawl, while ensuring the website can handle the server requests made by Google.

Crawl demand is how much Google wants to crawl your content. It is affected by URLs that Google hasn't crawled before and by Google's estimation of how often content changes on URLs it already knows. Successfully crawled pages are processed and passed to Google's indexing systems to prepare the content for serving in Google Search results. Google also wants to make sure it doesn't overload your servers, as a good citizen would.

With that in mind, Google computes your site's crawl rate, which represents the maximum number of concurrent connections a crawler may use to crawl your site. This value is calculated by Google periodically based on your site's responsiveness, or, in other words, how much crawling traffic it can handle. If the site is quick and consistent in responding to crawlers, the rate increases when there is demand from indexing. If the site slows down or responds with server errors, the rate decreases and Google crawls less.

In rare cases where Google's crawlers overload your servers, you can set a crawl rate limit using the crawl rate settings in Search Console. Taking crawl rate and crawl demand together, we can define crawl budget as the number of URLs Google can and wants to crawl.
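
If you ever need to slow crawling down urgently, for example during an outage, Google also documents that sustained 503 or 429 responses temporarily reduce the crawl rate. Below is a minimal sketch of that idea, assuming a Python Flask app; `server_is_overloaded()` is a hypothetical placeholder for whatever load signal you actually use, and this kind of throttling should only stay in place for a short time.

```python
# Minimal sketch: ask Googlebot to back off while the server is overloaded.
# Assumes Flask; server_is_overloaded() is a hypothetical placeholder check.
from flask import Flask, request

app = Flask(__name__)

def server_is_overloaded() -> bool:
    # Placeholder: plug in your own signal (load average, queue depth, etc.).
    return False

@app.before_request
def throttle_googlebot_when_overloaded():
    user_agent = request.headers.get("User-Agent", "")
    if "Googlebot" in user_agent and server_is_overloaded():
        # A 503 with Retry-After signals a temporary problem; Google slows
        # crawling and retries later. Don't serve this for more than a day or two.
        return "Service temporarily unavailable", 503, {"Retry-After": "3600"}

@app.route("/")
def index():
    return "Hello, crawler."
```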

To find out how often Google crawls your site and what the responses are, use the Search Console Crawl Stats report, which provides statistics about Google's crawling behavior. Here are some questions you can answer with the data provided: What's your site's general availability? What's the average response time for a crawl request? And how many requests did Google make to your site in the last 90 days? Let's dive in to learn more about the report. Log in to Search Console and find the Settings page, where you can open the Crawl Stats report.

Note that this report is only available for properties at the domain level. On the summary page, you'll find a lot of information. The main elements are the crawling trends chart, the host status details, and the crawl request breakdown. The chart shows trends for three metrics. The first is the total crawl requests for URLs on your site, whether successful or not. Requests for resources hosted outside of your site are not counted, so if your images are served from another domain, such as a content delivery network (CDN), they will not appear here.

The second is the total download size from your site during crawling. If Google caches a page resource that is used by multiple pages, the resource is only requested the first time. The third is the average response time for a crawl request to retrieve the page content. This metric does not include retrieving page resources such as scripts, images, and other linked or embedded content, and it does not account for page rendering time. Look for major spikes, drops, and trends over time in your data.
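
As a rough cross-check of the average response time shown in the chart, you can time a few of your own key pages from outside your network. A minimal sketch, assuming the third-party `requests` library and placeholder URLs; note that `elapsed` is measured up to the arrival of the response headers, so it only approximates what Googlebot sees.

```python
# Sketch: spot-check response times for a few pages, as a rough comparison
# point for the Crawl Stats "average response time" metric.
# Assumes the third-party `requests` library; the URLs are placeholders.
import requests

urls = [
    "https://www.example.com/",
    "https://www.example.com/blog/",
]

for url in urls:
    response = requests.get(url, timeout=30)
    # `elapsed` runs from sending the request to parsing the response headers,
    # so treat it as an approximation, not Googlebot's exact number.
    print(f"{url}: HTTP {response.status_code} "
          f"in {response.elapsed.total_seconds() * 1000:.0f} ms")
```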

For example, if you see a significant drop in total crawl requests in the chart, make sure no one added a new robots.txt file or new rules blocking parts of your site (a quick way to check this is sketched below). Or maybe your site is responding slowly to Googlebot, which Google might read as a sign your server cannot handle all the requests. Another example would be a consistent increase in average response time. This might not affect your crawl rate immediately, but it's a good indicator that your servers might not be handling all the load. Ultimately, this may affect user experience, too.
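
Here is the quick robots.txt check mentioned above: a minimal sketch using Python's standard `urllib.robotparser` to test whether the live robots.txt blocks Googlebot from a few important URLs. The host and paths are placeholders, and Python's parser doesn't match Google's robots.txt handling exactly, so treat it as a first pass.

```python
# Sketch: check whether the live robots.txt blocks Googlebot from key URLs.
# Standard library only; the host and paths below are placeholders.
from urllib import robotparser

SITE = "https://www.example.com"

parser = robotparser.RobotFileParser(f"{SITE}/robots.txt")
parser.read()  # fetches and parses the file

for path in ["/", "/products/", "/blog/latest-post"]:
    allowed = parser.can_fetch("Googlebot", f"{SITE}{path}")
    print(f"{path}: {'allowed' if allowed else 'BLOCKED for Googlebot'}")
```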

The host status is an easy way to check your site's general availability in the last 90 days. If you verified your website as a domain property, you will see all child hosts separately here, which can be very handy for evaluating all your hosts' performance in one place. When you click through to the host status details, you'll find three categories. Errors in this section mean Google could not crawl your website for a technical reason. Robots.txt fetch tells you the failure rate when crawling your robots.txt file. Your site is not required to have a robots.txt file, but it must return a successful 200 response or a 404 when asked for this file; if Googlebot has a connection issue or keeps getting an error such as a 503 when requesting it, it will stop crawling your site. DNS resolution tells you when your DNS server didn't recognize your hostname or didn't respond during crawling. A couple of basic checks along these lines are sketched below.
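
If you want to reproduce the first two categories outside of Search Console, a minimal sketch along these lines (standard library only, with `www.example.com` as a placeholder host) checks that the hostname resolves and that robots.txt answers with a 200 or 404.

```python
# Sketch: basic DNS resolution and robots.txt fetch checks, mirroring two of
# the host status categories. Standard library only; the host is a placeholder.
import socket
import urllib.error
import urllib.request

HOST = "www.example.com"

# DNS resolution: can the hostname be resolved at all?
try:
    addresses = sorted({info[4][0] for info in socket.getaddrinfo(HOST, 443)})
    print(f"DNS OK: {HOST} -> {addresses}")
except socket.gaierror as exc:
    print(f"DNS FAILED for {HOST}: {exc}")

# robots.txt fetch: a 200 or 404 is fine; 5xx errors or timeouts can halt crawling.
try:
    with urllib.request.urlopen(f"https://{HOST}/robots.txt", timeout=10) as resp:
        print(f"robots.txt status: {resp.status} (fine)")
except urllib.error.HTTPError as exc:
    print(f"robots.txt status: {exc.code}", "(fine)" if exc.code == 404 else "(problem)")
except (urllib.error.URLError, TimeoutError) as exc:
    print(f"robots.txt fetch failed: {exc}")
```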

If you see a DNS issue, check with your registrar to make sure your site is correctly set up and that your server is properly connected to the internet. Server connectivity tells you when your server was unresponsive or did not provide a full response for a URL during a crawl. If you're seeing spikes or consistent connectivity issues here, you might need to talk to your provider about increasing capacity or fixing availability issues. Also, if Google found at least one of those errors on your site in the last week, you will see a warning alerting you of problems last week.

Dig deeper into the error and fix it. If Google found at least one error on your site in the last 90 days, but it occurred more than a week ago, you will see a warning that you had problems in the past. You should check your server logs or contact the developer to review what the problems were and decide whether you need to take any action.

And if Google didn't find any crawl availability issues on your site in the past 90 days, you're all green. The crawl request cards show several breakdowns to help you understand what Google's crawlers found on your website. There are four available breakdowns. Crawl response shows the responses Google received when crawling your site; common response types are 200, 301, 404, or server errors. Crawl file type shows the file types returned by requests; common file types are HTML, image, video, or JavaScript. Crawl purpose can be discovery, when a URL is new to Google, or refresh, for a re-crawl of a known page. And Googlebot type shows the user agent used to make the crawl request, for example smartphone, desktop, image, and others. Click any row to drill down to that value and review its trend over time and specific sample URLs.
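
You can also approximate these breakdowns from your own server logs, which is handy when you want more than 90 days of history or per-URL detail. A minimal sketch, assuming an nginx/Apache "combined" log format and a placeholder `access.log` path; matching on the user-agent string alone doesn't verify that a request really came from Google (that would need a reverse DNS check), so treat the numbers as approximate.

```python
# Sketch: approximate the crawl request breakdowns from your own access logs.
# Assumes an nginx/Apache "combined" log format; access.log is a placeholder path.
# Matching the user-agent string does not verify the request came from Google.
import re
from collections import Counter

LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

by_status, by_type, by_agent = Counter(), Counter(), Counter()

with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = LOG_LINE.search(line)
        if not match or "Googlebot" not in match["agent"]:
            continue
        by_status[match["status"]] += 1
        by_agent[match["agent"]] += 1
        # Rough file-type guess from the URL extension; pages often have none.
        filename = match["path"].split("?")[0].rsplit("/", 1)[-1]
        by_type[filename.rsplit(".", 1)[-1].lower() if "." in filename else "(no extension)"] += 1

print("By response:", by_status.most_common())
print("By file type:", by_type.most_common(10))
print("By user agent:", by_agent.most_common(5))
```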

Conclusion

In a nutshell, the crawl budget is the number of URLs Google can and wants to crawl on websites every day. It is relevant to large websites where Google needs to prioritize what to crawl first, how much to crawl, and how frequently to re-crawl. To help you understand and optimize Google's crawling, you should use the Search Console Crawl Stats report. Use the summary page chart to analyze crawling volume and trends. Use the host status details to check your site's general availability. And use the crawl request breakdown to understand what Googlebot is finding when crawling your website. Hopefully, you'll find this article helpful.
