Search engine spider visits

"How do I know whether my website is visited by search engine spiders, or when a web page was visited by a search engine spider / crawler / bot?" You can find this question almost everywhere. This page explains how to pull that data out of your website / server log.

Step One - How to recognize search engine spiders

Search engine spider visits can be found with a simple search for the User Agent (browser) string in your website log. Use these keywords: googlebot for Google, msnbot for MSN, slurp for Yahoo. That is enough for a simple search, but we must not forget that website content scrapers use fake User Agents to avoid being blocked.

To confirm a real search engine spider, we need to know its IP or its DNS name (the host name the IP resolves to):

Google**: 66.249.64.* to 66.249.95.*, crawl-66-249-* , *.googlebot.com
Yahoo: 72.30.* , 74.6.* , 67.195.* , 66.196.* , *.crawl.yahoo.net , *.inktomisearch.com
MSN/LIVE/BING***: 65.54.* , 65.55.* , msnbot.msn.com , *.search.live.com

MSN crawlers are known for playing games. Stealth crawling (fake User Agent) and referral spam (fake referrer from what is now the BING search engine) from these IPs are nothing unusual.

The IP ranges above are examples only; check WhoIs first for any other IPs you spot.
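The check against the Google entries listed above could be sketched in Python as follows. This is only a minimal sketch: the function name is made up for illustration, the range 66.249.64.* to 66.249.95.* is written as the equivalent network 66.249.64.0/19, and the host name is assumed to have been resolved already (by the server or by a separate lookup).

```python
import ipaddress
from fnmatch import fnmatch

# Example values copied from the list above; check WhoIs before relying on them.
GOOGLE_NETWORKS = [ipaddress.ip_network("66.249.64.0/19")]  # 66.249.64.* to 66.249.95.*
GOOGLE_HOST_PATTERNS = ["crawl-66-249-*", "*.googlebot.com"]


def is_known_google_spider(ip, hostname=None):
    """Return True if the IP or its resolved host name matches the known lists."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        addr = None  # the log field was not a valid IP address
    if addr is not None and any(addr in net for net in GOOGLE_NETWORKS):
        return True
    if hostname:
        host = hostname.lower()
        return any(fnmatch(host, pattern) for pattern in GOOGLE_HOST_PATTERNS)
    return False
```

A host name such as googlebot.com.evil.example does not pass, because the patterns must match the whole name and not just a part of it.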

Step Two - start with programming

First we need a simple text database (three files per search engine). The Google spider is used for this example:

googlenew.txt contains two fields: page (file) name and date
googleindex.txt contains more fields (your choice); in this example, the page name and the last two dates (older and newest)
googleips.txt contains known Google spider IPs (shortened) and DNS names (see above), one per line
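To make the three files concrete, here is a small Python sketch for reading their records. The field separator is not specified above, so a tab is assumed here, and the function names are invented for this example.

```python
def parse_new_line(line):
    """googlenew.txt record (assumed layout): page <TAB> date."""
    page, date = line.rstrip("\n").split("\t")
    return page, date


def parse_index_line(line):
    """googleindex.txt record (assumed layout): page <TAB> older date <TAB> newest date."""
    page, older, newest = line.rstrip("\n").split("\t")
    return page, older, newest


def parse_ips(text):
    """googleips.txt: one shortened IP or DNS pattern per line; blank lines are skipped."""
    return [entry.strip() for entry in text.splitlines() if entry.strip()]
```

Any other separator works just as well, as long as it cannot appear inside a page name.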

Step Three - main script / program

The script or program is in fact simple. User Agents are used as keywords for searching the website log. When a keyword matches (in this case googlebot), the visit needs to be confirmed by looking up the IP in googleips.txt (Google spider IPs). When confirmed, the page name is added to googlenew.txt along with the date from the log.
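A minimal Python sketch of such a scan is shown here. It assumes the log is in the common "combined" format used by Apache and similar servers, and that googleips.txt holds shortened IPs that can be compared as prefixes; the function name and the default prefix are illustrative only.

```python
import re
from datetime import datetime

# Assumed "combined" log format: ip - - [time] "METHOD page PROTO" status size "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def scan_log(lines, keyword="googlebot", ip_prefixes=("66.249.",)):
    """Return (confirmed, suspicious) visits for one spider keyword."""
    confirmed = []   # (page, date) pairs destined for googlenew.txt
    suspicious = []  # (ip, page) pairs: spider User Agent, but unknown IP
    for line in lines:
        m = LOG_PATTERN.match(line)
        if not m or keyword not in m["agent"].lower():
            continue  # unparsable line or not this spider's User Agent
        if m["ip"].startswith(tuple(ip_prefixes)):
            when = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
            confirmed.append((m["page"], when.date().isoformat()))
        else:
            suspicious.append((m["ip"], m["page"]))
    return confirmed, suspicious
```

The suspicious list collects requests that use the spider's User Agent from an IP that is not on the known list, which are the candidates for a WhoIs check.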

During this scan, warnings can be displayed. For example, a correct User Agent with a different IP could mean that a scraper or somebody else is scanning your website with the googlebot User Agent, or that googlebot is spidering from a different IP. It is a good idea to check WhoIs for the IP in question to be sure.

Another warning can be displayed when the server response code is different. Say, instead of code 200 (OK), it could be 404, which means that googlebot is requesting a page that does not exist on your website. 206 means partial content; you should check the search engine cache to see where the problem with that page is (where the spider stopped parsing; maybe there is a problem with the code).

As you can see, this scan can be used to spot errors too. After the log scan, pages with correct response codes are added to googlenew.txt and duplicates (if any) are deleted. If duplicates exist (same page and date), a warning can also be displayed, so you can check why that page was visited two or more times on the same day.
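The response code and duplicate warnings could look something like the following Python sketch. It assumes the confirmed spider requests are already available as (page, date, status) tuples and treats only code 200 as correct; the function name and the warning texts are made up for this example.

```python
from collections import Counter

# Notes for the response codes mentioned in this article.
STATUS_NOTES = {
    404: "page not found on this website",
    206: "partial content, check the search engine cache",
    500: "server error",
}


def filter_visits(visits):
    """Split confirmed visits into googlenew.txt records and warning messages."""
    warnings = []
    ok = []
    for page, date, status in visits:
        if status == 200:
            ok.append((page, date))
        else:
            note = STATUS_NOTES.get(status, "unexpected response code")
            warnings.append(f"{page} on {date}: {status} ({note})")
    counts = Counter(ok)
    for (page, date), n in counts.items():
        if n > 1:  # same page and date seen more than once
            warnings.append(f"{page} visited {n} times on {date}")
    return sorted(counts), warnings  # duplicates removed from the records
```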

The next step is to compare googlenew.txt with googleindex.txt. As mentioned above, googleindex.txt contains the last two or more dates (three recommended) when a page was visited by googlebot. Compare the two files: move the current newest date in googleindex.txt to the older-date field, then replace the newest date with the date from googlenew.txt. This way googleindex.txt holds at least two dates when each web page was visited by the search engine spider.

For a fresh new page, the first visit is from the crawler checking the page; the second visit, usually on the same or the next day, is from the indexing robot. If the second visit is missing, something is wrong with the page (duplicate content or errors). The same goes for a modified page: two visits in a short time frame. That is why at least three dates are recommended.
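The date shuffle between the two files can be sketched in Python like this. It assumes both files have already been loaded into memory, that dates are stored in ISO form (YYYY-MM-DD) so they sort correctly as text, and that keep is at least 1; the function name is illustrative.

```python
def update_index(index, new_visits, keep=3):
    """Merge googlenew.txt records into the googleindex.txt data.

    index: dict mapping page -> list of ISO dates, oldest first.
    new_visits: iterable of (page, date) pairs.
    keep: how many of the most recent dates to retain per page.
    """
    for page, date in sorted(new_visits, key=lambda visit: visit[1]):
        dates = index.setdefault(page, [])
        if date not in dates:
            dates.append(date)
            dates.sort()
            del dates[:-keep]  # drop everything older than the last `keep` dates
    return index
```

With keep=2 this matches the two-date layout (older and newest); keep=3 gives the recommended three dates.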

Additional website log scanning

The list of IPs and DNS names can be used for an additional scan of the web log, to find visits from a search engine IP range where the spider User Agent is NOT used (a different User Agent). This way, you can spot human review visits or changes in the spider User Agent. Either way, these are early warnings if you are playing in "black hat".
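This reverse check is the mirror image of the main scan and could be sketched in Python as follows. It assumes the log has already been parsed into (ip, agent, page) tuples and that the known IPs are stored as shortened prefixes; the names are hypothetical.

```python
def find_unlabelled_visits(records, ip_prefixes=("66.249.",), keyword="googlebot"):
    """Return visits from a known spider IP range that do NOT use the spider User Agent."""
    prefixes = tuple(ip_prefixes)
    return [
        (ip, agent, page)
        for ip, agent, page in records
        if ip.startswith(prefixes) and keyword not in agent.lower()
    ]
```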

Ways to use the collected data

The collected data can be used in different ways. googleindex.txt (or the file for any other search engine) stores at least two dates (you can set more) when pages were visited by googlebot. You will be surprised to see that some pages are spidered on a daily basis, while others are spidered once a month, some even once a year.

Why is this important? When you update content on your website, you surely want the changed or new page to be updated in the search engine index as fast as possible. In this case, looking in the googleindex.txt database, you can spot frequently visited pages and add a link to the new page there.
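One way to spot the frequently visited pages is to rank them by the average number of days between stored visits. The Python sketch below assumes the index is loaded as a dict of page name to ISO dates (YYYY-MM-DD); both function names are invented for this example.

```python
from datetime import date


def average_gap_days(dates):
    """Mean number of days between consecutive visits, or None with fewer than two dates."""
    days = sorted(date.fromisoformat(d) for d in dates)
    if len(days) < 2:
        return None
    return (days[-1] - days[0]).days / (len(days) - 1)


def rank_pages(index):
    """Return pages with at least two visits, most frequently visited first."""
    gaps = {page: average_gap_days(dates) for page, dates in index.items()}
    return sorted((p for p, g in gaps.items() if g is not None), key=gaps.get)
```

Pages at the top of the ranking are the candidates for carrying a link to the new page.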

A question is probably forming in your head: why add the link to the new page on an internal page, when the main page is spidered frequently and I can add the link there? Here is the answer. Your website's main page is spidered AND monitored, and not only by search engines. Your interesting content can attract content scrapers, and then what?

Your new web page could be scraped and served to the Google spider before the Google spider visits your website. Your new (original) page could then be filtered as duplicate content. Call me paranoid, but I am speaking from (bad) experience.

Back to the topic. The googleips.txt spider IP database (and those for any other search engine) can be shared with a bad bots hunter program; see the Find bad bots page.

Old pages may still be indexed and visited often. Instead of adding a new page and waiting for it to be included in the search engine database, why not use an old, already indexed page and add / replace content there? Just look at googleindex.txt and find a suitable (often visited) page for that purpose...

Conclusion

To summarize, this is what you can find:

Missing pages or problems - 404, 500, 206 errors
Fake spiders to block
Most important pages - short time between visits (at least three recent dates)
Less important pages - more than a month between visits (probably in the supplemental search engine index)
Orphan pages - no visits from the spider - not in the search engine index - no links to that page?
