Assignments

Grep Data Baby

Grepping for Googlebot

 

The commands you've just read about can be used in a myriad of ways, but as we're SEOs and ranking in search engines is what we care about, I want to give a practical example of just one of the commands available to us.

Grep.

Grep is a command line utility that is very good at finding strings in text files. Wikipedia describes it as:

grep is a command-line utility for searching plain-text data sets for lines matching a regular expression. Its name comes from the ed command g/re/p (globally search a regular expression and print), which has the same effect: doing a global search with the regular expression and printing all matching lines. Grep was originally developed for the Unix operating system, but is available today for all Unix-like systems.
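For example, to print every line containing the word "error" from a text file, you would type something like the following (the filename server.txt is purely for illustration):

username$ grep "error" server.txt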

As SEOs, we know there are four basic steps to ranking.
 


This small section is a simple way for you to see if, and when, Googlebot last visited your site once your content is live. We'll then expand it a little further, turning it into a small shell script that runs daily and emails you a report on how well you've been visited by Google's spiders.

So without further ado...



Log Files


Depending on how your server is configured, the log files for your website may be placed in a different directory, or even called something different to the name I am using in this chapter.

However, with your new command line fu and some common sense, you should be able to find where they are. The file may have the name of your domain in the filename, such as the.domain.name.log or www.the.domain.name_month_year etc.

You'll need to use that magic ingredient that all SEOs have... common sense and detective skills to find the naming convention and/or location on your server :)


Try typing:

username$ find / -name "access.log" 2>/dev/null

or:

username$ locate access.log

or simply look for a directory called logs in your home directory.

I'm positive you'll find them quickly and easily enough, but if not.... keep on searching :)

Normally logs live in the /var/log directory so we'll be looking in there for the access.log file, which is where Apache will record all visitors to all the pages and assets on your website.

So... 

username$ cd /var/log

username$ ls | grep "access.log"

 

You may have noticed the vertical line above; it's called a pipe, and we covered it in an earlier section. What we're doing is running the ls command to list all the files in the /var/log directory and then using grep on the results to tell us if any of the file names contain the text string access.log.

So now we know that access.log exists in the /var/log/ directory, we need to find out if Googlebot has ever visited and, if so, when.

So let's use grep again:

username$ grep "Googlebot" access.log

Hopefully we're going to get a lot of lines of data running down our screen.

Let's use a pipe again to see how many times a browser with the user agent Googlebot visited us.

username$ grep "Googlebot" access.log | wc -l

 

wc is a command that counts. It literally stands for word count, and by adding the -l parameter it tells us how many lines are in the response.

If you had Googlebot visits, this command should tell you how many.

Grep itself has a count option, which you can use with the -c parameter. Neither approach is right or wrong; you simply have options :)
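For example, this should give you the same number as the wc -l version above:

username$ grep -c "Googlebot" access.log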

The problem with simply checking the user agent of visiting browsers is that not everyone who visits claiming to be Googlebot is really from Google.

Keep this in mind when you check your data but also realise there are ways to confirm whether the bots are true spiders from Google or are simply pretending to be from Google, for whatever reasons they may have.

To check whether each visit is truly Googlebot, we need to check that the IP has a PTR record pointing to Google.

"A PTR record?", I hear you say.... what is that?

You can think of a PTR as the opposite of an A name record in the DNS system. An A record points a hostname to an IP address and a PTR points an IP address to a hostname.

If both the IP-address-to-hostname and hostname-to-IP-address lookups match, and the hostname is within the google.com or googlebot.com domains, then it is a verified Googlebot web spider.

To verify this, we need to extract the IP address of each of the visits that say they are Googlebot. To do that, we'll be using grep again:

username$ grep "Googlebot" access.log | grep -oE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'

If Googlebot has visited, you should get a long list of IP addresses cascading down your screen.
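As an aside, if you'd rather see each distinct address only once, you can pipe the output through sort with its -u (unique) flag:

username$ grep "Googlebot" access.log | grep -oE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | sort -u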

Now copy one of them and run another command on it; this time we'll be using the host command on the IP address. I am using a real Googlebot IP address from my logs.

username$ host 66.249.73.148

148.73.249.66.in-addr.arpa domain name pointer crawl-66-249-73-148.googlebot.com.

 

The PTR is included in the response. For this IP address the PTR is crawl-66-249-73-148.googlebot.com, and as the domain is within either google.com or googlebot.com, we know that this could be a legitimate Google spider.

To fully verify, we now need to check the A record associated with the PTR hostname and see if it matches the IP address we extracted earlier.

username$ host crawl-66-249-73-148.googlebot.com

Which gives us the response of:
 

crawl-66-249-73-148.googlebot.com has address 66.249.73.148

 

As the hostname resolves back to the same IP address we initially checked... WOOHOO, we know that this bot is indeed a legitimate Googlebot.

But... we can't go through the log line by line doing this manually. On the next page we'll take the commands we've just used and build upon them to create a shell script that will work through every IP address, do the verification, and email us the results daily in an automated manner.
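To give you a taste of where we're heading, here is a minimal sketch of the kind of loop that script will be built around. It assumes your log lives at /var/log/access.log, and it skips the email part for now; the full version follows on the next page.

#!/bin/bash
# Rough sketch: extract the IPs claiming to be Googlebot and verify each one.
LOG="/var/log/access.log"

grep "Googlebot" "$LOG" | grep -oE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | sort -u | while read ip
do
    # Reverse lookup: pull the hostname the IP points to (note host prints it with a trailing dot).
    hostname=$(host "$ip" | awk '/domain name pointer/ {print $NF}')

    # Only trust hostnames ending in googlebot.com or google.com.
    case "$hostname" in
        *.googlebot.com.|*.google.com.)
            # Forward lookup: check the hostname resolves back to the same IP.
            forward=$(host "$hostname" | awk '/has address/ {print $NF}')
            if [ "$forward" = "$ip" ]; then
                echo "$ip is a verified Googlebot"
            else
                echo "$ip FAILED forward verification"
            fi
            ;;
        *)
            echo "$ip is NOT Googlebot (PTR: $hostname)"
            ;;
    esac
done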