The first episode in the new series of the Click Intelligence Notebook, focusing on the “Business Owner’s Guide to SEO”. In this first episode, James Owen talks about Screaming Frog, the desktop-based web crawler.
Free download of Screaming Frog
00:00 Hi everyone, and welcome to the new series of the Click Intelligence Notebook. In the new series, we’re focusing on the business owner’s guide to SEO.
In this video, I’m going to be talking about Screaming Frog.
Screaming Frog is a desktop-based web crawler that allows you to crawl, explore, and investigate website URL data. In this video I’m going to talk you through the basic features of Screaming Frog, along with some advanced features. So, let’s jump right into it.
Screaming Frog has two license levels: a free version and a paid version. The free version allows you to crawl up to five-hundred pages, but doesn’t let you get into any of the interesting configuration settings. So, it’s great to get started with if you’re just starting out on your journey in SEO or online marketing, but you’re going to need a full license, the paid version, if you want the advanced features.
In this video, I’m going to be looking at the paid version. For this example, I’m using pcworld.co.uk. To start crawling, you copy and paste the URL you want to crawl into the search box and click Start. After a few seconds, you will start to see some of the URLs being crawled and the data starting to be collected. This is the main dashboard, and the backdrop for everything you are going to see throughout the crawl process.
So, let me talk us through this. Along the top navigation, we’ve got Address, Content, Status Code, Status Type, Title 1, Title 1 Length, Title Pixel Width, Meta Description, Meta Description Length, and so on and so forth. We’ve also got keywords, H1s, H2s, and meta robots, which, again, are very important, plus canonical information. And down here, any redirect information, such as the redirect URL.
On the right-hand side we have totals for each data type we are collecting, and just below, if you highlight a URL, you are presented with its information here, in terms of 301s and anything else Screaming Frog has been able to pick up. So, we’ll stop it right there, as we’ve got some data to play with.
So, at the moment, we’ve got this set to internal only, so it’s only going to show us URLs that are internal to that domain; only internal URLs that start with pcworld.co.uk/. If we want to see which crawled URLs point to other websites, as you can see here, PC World are serving assets from CDN URLs on dixons.com, which is going to be an image, so good use of CDNs there for page load speed.
If you go down here you’ve got Currys, and you get the idea: you can see exactly where they’ve got links pointing from their domain to other domains. Perfect.
We’ve also got tabs for Protocol, Response Codes, URL, Page Titles if you want those, Meta Descriptions, Meta Keywords, H1s, H2s, Images, Hreflang, AJAX, Custom, and Search Console, if you want that information.
You can also export this information, literally the table you can see on screen, at any time by clicking on the Export button here.
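For anyone who wants to work with that export in a script, here is a minimal sketch that filters the exported table for non-200 URLs. The column names and sample rows are assumptions based on the columns shown in the video, not an actual PC World export:

```python
import csv
import io

# Two sample rows in the shape of a Screaming Frog "Internal" export
# (column names are an assumption based on the columns shown on screen).
export = io.StringIO(
    "Address,Status Code,Title 1,Meta Description\n"
    "https://www.pcworld.co.uk/,200,PC World,Shop online\n"
    "https://www.pcworld.co.uk/old-page,301,,\n"
)

rows = list(csv.DictReader(export))

# Pull out anything that isn't a straight 200 for a closer look.
non_200 = [r["Address"] for r in rows if r["Status Code"] != "200"]
print(non_200)
```

In practice you would open the exported CSV file instead of the in-memory sample, but the filtering logic is the same.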
So, let’s go down and start looking at some of the more advanced features. At the moment I’ve got this as a straight crawl; I’m not asking Screaming Frog to do anything special, just a very basic crawl. So what happens if, say, we don’t want to view images or any of the other files, just the raw HTML, the kind of landing pages you would view on a website?
So, to do that, we’ve got to make sure we have a sitemap. PC World, kindly enough, have a sitemap, so we’re going to clear this here, and then we go to Mode, List, and we have some options: from file, enter manually, paste, download sitemap, download sitemap index.
PC World don’t actually have a sitemap index; they just have a download sitemap URL, which is this. In this scenario, let’s say you did have lots of individual sitemaps for your categories: you would then be able to select a category sitemap and put it in here to crawl just the category you wanted. PC World don’t have that; their sitemaps aren’t split out in a granular fashion, so we just have the one sitemap.xml URL. We click on this, and it brings back the twelve-thousand, nine-hundred and thirty-four URLs that are in that sitemap. This is just the sitemap URLs; it’s not the images or any other file URLs.
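What Screaming Frog’s “download sitemap” does under the hood is read the standard sitemaps.org XML format and pull out the URL list. A small sketch of that parsing step, using a tiny made-up sitemap in place of the real pcworld.co.uk one:

```python
import xml.etree.ElementTree as ET

# A tiny sitemap in the standard sitemaps.org format, standing in for
# the real pcworld.co.uk sitemap downloaded in the video.
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.pcworld.co.uk/</loc></url>
  <url><loc>https://www.pcworld.co.uk/tv-dvd-blu-ray</loc></url>
</urlset>"""

# Sitemap elements live in the sitemaps.org namespace, so we have to
# register it before querying for <url><loc> entries.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)
```

A sitemap index file works the same way, except the entries are `<sitemap><loc>` elements pointing at child sitemaps rather than pages.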
So, we’re going to go like that again. If I was PC World I could easily sort out my sitemaps: you can see some 301s there, some 302s and some 404s in there. As everyone knows, you want to make sure you only have 200s, status code 200, wherever you can.
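The clean-up job described here is just a matter of grouping the crawled URLs by status code and flagging everything that is not a 200. A sketch with made-up sample data (a real run would take the codes from the crawl export):

```python
from collections import Counter

# Status codes as Screaming Frog might report them for sitemap URLs
# (made-up sample data; a real run would come from the crawl export).
crawled = {
    "https://www.pcworld.co.uk/": 200,
    "https://www.pcworld.co.uk/old-offer": 301,
    "https://www.pcworld.co.uk/flash-sale": 302,
    "https://www.pcworld.co.uk/discontinued": 404,
}

summary = Counter(crawled.values())
# Only 200s belong in a sitemap; everything else needs fixing or removing.
to_fix = sorted(url for url, code in crawled.items() if code != 200)
print(dict(summary), to_fix)
```

The `to_fix` list is effectively the worksheet for updating the sitemap: redirect targets get swapped in for 301s and 302s, and 404s get removed.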
So this is a good example of how you would give structured information for Screaming Frog to crawl. Another example would be where, let’s say, you’ve got your sitemap, like so, let’s move it over, in that Excel format, and you only wanted to crawl, say, the top hundred. So, you’d select the top hundred. Well, there’s probably more than a hundred now, but it doesn’t really matter, a hundred odd. Let’s close that down.
I’m going to stop this again and clear it. I’m going to upload; we’re going to go Paste this time. I’ve just pasted three or four hundred URLs from Excel, and, as you can see, that might be useful if, say, you’ve taken your full crawl download and you’re splitting your URLs out into sitemaps yourself. You do that, click OK, and let it crawl the three hundred. It might take a bit of time; I’ll show you in a little while how to speed up Screaming Frog for this process. So, let’s say that’s fine, we can stop there.
Let’s say you want to create a sitemap out of that data now. You click on Sitemaps, Create XML Sitemap. Yeah, that’s fine; the settings are cool. Save to the desktop, call it sitemap one, for instance, or you could name it after your product pages, whatever you want to call it, and then you click Save. That then goes onto the desktop in a nice, clean XML file for you. So, that’s really useful.
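The file Screaming Frog writes out here is the same sitemaps.org format as before, just built from the crawled URL list. A minimal sketch of generating one yourself (the URLs are example placeholders, not real PC World pages):

```python
import xml.etree.ElementTree as ET

# Build a sitemaps.org-format document from a crawled URL list, much like
# Screaming Frog's "Create XML Sitemap" option (URLs here are examples).
urls = [
    "https://www.pcworld.co.uk/product-1",
    "https://www.pcworld.co.uk/product-2",
]

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for u in urls:
    # Each page gets a <url><loc>...</loc></url> entry.
    ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = u

xml_out = ET.tostring(urlset, encoding="unicode")
print(xml_out)
```

To save it like the video does, you would write `xml_out` to a `sitemap.xml` file, optionally adding the `<?xml version="1.0" encoding="UTF-8"?>` declaration on the first line.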
Let’s jump into some additional features here. So let’s go over to Clear, Mode, Spider. Let’s grab PC World again, because I want to start looking at how, okay, we can do a normal crawl, but actually we can introduce some additional features. So, let that run and get some data there, and let’s go to Configuration.
So, let’s go over it again and rethink the configuration. Let’s just click past the basic stuff, because that’s fine. We can talk about limits as well: Limit Crawl Total, Limit Crawl Depth, so if you only want to go down three or four categories on the whole website, you can configure it to do what you need to do. Limit Max URL Length to Crawl, Limit Max Folder Depth, Limit Number of Query Strings, and so on and so forth. Rendering, so you can change the rendering there. And then a lot more under Advanced, such as Respect Noindex and Respect Next/Prev. Y’know, if you’ve got a fairly standard website, the default settings will probably do you really well. If you’ve got more than ten-thousand URLs, and you want to start getting really advanced with your crawling, then you would delve into this and make some configuration changes.
So, if I was you, I’d normally hang around in the basics and play around a bit with the limits. For a couple of clients I’d probably work in Advanced, but normally the basic limits will get the job done for you.
One recent example of using the basic configuration is where a client came along and said, “Look, we’ve got a new website. It’s on a staging server; Google can’t view it, but can you start on the migration job?” That’s, y’know, completely fine.
So, at that point, if the site is blocked from Google and you put it in Screaming Frog, you’re basically going to get nothing back, zero, because by default Screaming Frog won’t crawl anything that Google can’t crawl. However, what you can do in here is tick Follow Internal ‘nofollow’ and Follow External ‘nofollow’, to say that you actually want to follow them. Then, when you click Start, Screaming Frog will ignore all the nofollows and crawl the whole site. So that’s a really handy tip: when you have a staging site and you think, “Well, I can’t crawl this because Google can’t crawl it,” actually you can configure Screaming Frog so it can crawl it, get all the data you want, export it, and from there start working on your 301 migration document. So, that’s really useful. So, we’ve talked about the basic, limits and advanced preferences within the spider configuration.
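The directive that typically blocks a staging site like this is a robots meta tag in the page head. A small sketch of spotting it yourself, using Python’s standard-library HTML parser on a made-up example page:

```python
from html.parser import HTMLParser

# Spot a <meta name="robots" content="noindex, nofollow"> tag, the kind
# of directive a staging site carries and that Screaming Frog can be
# configured to ignore.
class RobotsMetaFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives += [d.strip() for d in a.get("content", "").split(",")]

page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
finder = RobotsMetaFinder()
finder.feed(page)
print(finder.directives)
```

If `directives` contains `nofollow`, a default crawl stops following that page’s links, which is exactly why the follow-nofollow options matter for staging sites.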
Let’s jump into speed, which is quite important. If you’ve got five-hundred URLs, you can probably set it off, go and get a cup of tea, come back, and it’s done. If you’ve got more, say between ten-thousand and a million URLs, it’s going to take some time. You’ve probably already split out your sitemaps, and then you work by crawling each sitemap as you go along. But if you want to speed things up, you can increase the number of threads, say up to fifty, and it will be a lot, lot quicker, and you’ll see down here… well, there you go, it speeds up nicely for you.
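The thread setting here is standard concurrent fetching: more workers means more URLs in flight at once. A sketch of the same idea with a simulated fetch so it runs offline (a real crawler would make an HTTP request inside `fetch`):

```python
from concurrent.futures import ThreadPoolExecutor

# Simulated "fetch" so this runs offline; a real crawler would issue an
# HTTP request here and return the actual status code.
def fetch(url):
    return url, 200

urls = [f"https://www.pcworld.co.uk/page-{i}" for i in range(50)]

# max_workers plays the role of Screaming Frog's thread count: raising it
# lets more URLs be fetched concurrently.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = dict(pool.map(fetch, urls))

print(len(results))
```

The same caveat from the video applies: past a point, more threads just means hammering the target server harder, so raise the count with care.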
Again, of course, on an average website it should be fine. If there are any issues with the URLs, or there are lots of redirects or lots of canonicals, it’s going to slow things down. But there are things you can do with speed, as I’ve said there. So, that’s useful.
So, you’ve got the speed; let’s go into HTTP Header. Let’s say you want to change the user agent: I want to be Googlebot; I want to be Bingbot; I want to be Slurp. You have the option to change your user agent, so you can check how your website responds to different user agents, so that’s useful.
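Switching the user agent just means sending a different `User-Agent` header with each request. A sketch of building such a request with Python’s standard library; no request is actually sent, we just inspect the header that would go out (the UA string is Google’s published Googlebot one):

```python
import urllib.request

# Build a request that identifies itself as Googlebot, the same idea as
# changing Screaming Frog's user agent setting.
req = urllib.request.Request(
    "https://www.pcworld.co.uk/",
    headers={
        "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                      "+http://www.google.com/bot.html)"
    },
)

# Nothing is fetched here; we only inspect the header that would be sent.
# (urllib stores header names capitalised as "User-agent".)
print(req.get_header("User-agent"))
```

Crawling a site under different user agents like this is how you spot server configurations that serve different content, or different status codes, to search engine bots than to browsers.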
And the last thing, I think, we want to discuss is the proxies. If you feel that you’re hitting a website quite often, you can always put the proxy address, the IP address, in there as well.
Another thing to note is that it’s pretty much unlimited, up to a point. If you’re on a website that has something like ten million URLs, you’re going to want to split down the sitemaps so that Screaming Frog is able to crawl them. It will have issues with memory, but there is a really good FAQ section on the Screaming Frog website that tells you how to increase your memory allocation. You will have to save, then export, then restart from where you left off, so that your PC doesn’t freeze.
So, I think that covers the advanced features and basic features of Screaming Frog. It’s a great piece of crawling software. I use it on a daily basis; it’s great for finding major problems, and, yeah, it saves me a huge amount of time.
So, thank you for listening, and we’ll catch you on the next video. Thank you. 15:13