
How to protect web page content - Part 2


How to block bad bots using .htaccess

Web bots (also called crawlers, web spiders or web robots) come in two kinds - good and bad.
Good bots come from search engines and index your content. Bad bots scrape content from your website, spam your blog comment area, harvest email addresses, sniff for security holes in your scripts, or try to use your mail form scripts as a relay to send spam email...


How to recognize bad bots?

In short, you need your raw site web log and a little knowledge of how to analyze it. For a quick look, any web log analyzer can help, but for more detail you need detailed reports, which means custom-written software or a "trained" eye.

How to identify bots?

Good bots (search engine spiders...)

  • They are "asking" for robots.txt
  • They are obeying directives in robots.txt (i.e. disallowed dir or file is not indexed)
  • Their identity is known (agent field - browser)
  • They are not requesting images
  • There is no referral string (no link to page from where they came from)

Bad bots (content scrapers, comment spammers, email harvesters...)

  • They request many pages in a short time
  • They ignore what is disallowed in robots.txt, jumping straight in there
  • They do not request images and CSS files (some do, acting like a normal browser)
  • There is no referral string (some have one, see below)

A really bad bot tries to act like a normal visitor, to avoid traps and spider detection software. That means it emulates a real browser agent string, automatically adds a referral string, requests images, scrapes in small steps and uses a different IP (proxy) on every visit, acting like a normal visitor.

Although hard to identify, it can be caught by tracking small signs. The browser agent name is usually the same, sometimes with mistakes, and it visits daily, requesting more pages than the average visitor. Despite a different IP every time, it can be spotted. Sometimes the referral string is wrong (the referral page does not link to the spidered page), sometimes the same page appears as the referral page in every request.

Every request without a referral string is suspicious. Although it could be a request from a bookmark or a typed-in address, these are usually spider requests. Requests without a referral string and without an agent are usually from some proxy.

Some bad bots use the Agent string of search engine spiders (GoogleBot, Slurp), hoping that all doors will be open to them. Bad mistake. Every experienced webmaster knows the IPs of the search engine spiders.

One more case where the Agent string shows a search engine spider while the IP is different is proxy hijacking. Here you can lose your search engine position. Say GoogleBot requests a page from a proxy site. The proxy website pulls the page from your site and serves it to the Google spider, but without a "noindex" tag. In your web log this shows up as a GoogleBot visit, but from a different IP. It happens when a proxy is not set up properly, or when it is set up (or used) to hijack pages from other sites and steal the search engine position of those pages. Block that IP, or play with it by delivering different content.
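
As an illustration, here is a minimal .htaccess sketch in the spirit of the "compressed" example further below: it blocks requests that claim to be GoogleBot but do not come from a commonly published Google crawler range. The 66.249.64.0/19 range used here is only an example - verify the current ranges yourself before relying on it.

RewriteEngine On
# the UA claims to be Googlebot...
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
# ...but the IP is outside 66.249.64.0 - 66.249.95.255 (illustrative range, check before use)
RewriteCond %{REMOTE_ADDR} !^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.
RewriteRule ^.* - [F,L]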


Sorting the daily website log by IP or User Agent helps to find hidden bad bots. By analyzing hits on popular pages and/or sorting by IP over a period of a few days, you can spot suspicious requests, say, when your page is framed on another website, or a scraper keeps refreshing your stolen page...


SEO software spiders and anti-virus tools

SEO (search engine optimization) tools. SEO software robots try to hide behind "normal" User Agents, to avoid being banned by search engines when scraping listings, or by websites when scraping well-positioned pages.

Don't be surprised, but your well-positioned page could have far fewer "live" visitors than you think, even though the referral string shows they are coming from search engines. How many of them request images? It is hard to believe that Firefox 3.0 is a text browser, isn't it?

Many users of those SEO software spiders leave the default User Agent. Once that "fingerprint" is spotted, you can block them.

Related to the above kind of visits, some antivirus programs use link scanners while visitors are searching on search engines. The software checks every link displayed, and in your website log that looks like a spider visit. The problem here is the sheer number of visits, because that software has many users. If you block that User Agent, your website could be marked gray instead of green in the scanner's results. Not good.

Some examples of antivirus tool User Agents:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)
- An antivirus link scanner: no referral string, scanning links from search engine results.

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
- A very popular UA, used by various scrapers, spammers and other automated software; also an indication of one antivirus link scanner.

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
- Another antivirus link scanner; this UA has also been involved in some site content scraping "business".

There are also some "doubled" UAs, i.e. a new UA inserted in the middle of an existing UA string...

Anyway, before blocking any unknown IP or User Agent, look at the behaviour of that visitor. That is the best indication.


Waste of bandwidth category

In this category, along with the AV scanners and page prefetchers (browser add-ons), are various research bots, beta search engine spiders, image bots, homemade bots...


Referral log spam

A free way to get more links - for them, that is. Is your log full of hits from some domains (usually .info), with no link to your site on those websites? Check your traffic stats report. Is it public? Can you find that traffic report by searching on Google? Yes? Make your stats private (require a login) or ask your web hosting provider to do it. If you have the Google toolbar installed, the Google spider will pick up new pages (your traffic reports) if they are not protected. If the report shows up in search results, that is an open invitation to referral log spammers. Block (deny) them by domain or subdomain in the referral field, or by IP if they use the same IP.

If your website is on an Apache server and you can use .htaccess, here is a little trick to play with:

RewriteEngine On
RewriteCond %{HTTP_REFERER} spammerdomain [NC,OR]
RewriteCond %{HTTP_REFERER} otherspammerdomain\.info [NC,OR]
RewriteCond %{HTTP_REFERER} spammer-domain [NC]
RewriteRule ^.*$ %{HTTP_REFERER} [R,L]

The code above redirects the spammer to the domain in the referral string. In other words, he will be spamming his own domain. For "spammerdomain" you don't need the extension (.com, .info...), just the domain name as a keyword. Be careful here: a visitor coming from any domain that contains that keyword will also be redirected back. Not very likely, but...
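
If you would rather not bounce them anywhere, the same conditions can simply return a 403 instead of a redirect:

RewriteEngine On
RewriteCond %{HTTP_REFERER} spammerdomain [NC,OR]
RewriteCond %{HTTP_REFERER} otherspammerdomain\.info [NC,OR]
RewriteCond %{HTTP_REFERER} spammer-domain [NC]
# [F] sends 403 Forbidden instead of redirecting the spammer back to his own domain
RewriteRule ^.* - [F,L]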

Some of them use the User Agent field to spam your web log, adding a link in that field. If you see a spammy URL there, especially if it is a live link (a href), that is a new IP to ban.


How to block bad bots

.htaccess can help, with or without a spider identification script, to get rid of various bad bots.


Honey Pot

A simple trap for bad bots can be set up very quickly.
Although it is described in Part 1, protect web page content, here it is again:

  • Make a directory "trap", or use some other name
  • Add to robots.txt:
    User-agent: *
    Disallow: /trap/
  • Upload index.html or any default page to that directory
  • Add an SSI call to a script that will record requests, or analyze page requests later from the website log
  • Add a hidden link from the main page to that directory, using a transparent GIF or a dot
  • Add the bad bot IP to .htaccess to block it (see the sketch below)

A normal visitor will not click on that hidden link; a bad spider will fall right into the trap.
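
For the last step, a minimal sketch of the block you would add to the main .htaccess. The addresses here are placeholders from the 192.0.2.0 and 198.51.100.0 documentation ranges; use the IPs actually caught in your trap. The syntax is explained in the "Blocking by IP or domain" section below.

<Limit GET HEAD POST>
order allow,deny
# placeholder addresses - replace with IPs recorded requesting /trap/
deny from 192.0.2.15
deny from 198.51.100.
allow from all
</Limit>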


Spider detection scripts

Some scripts can track spiders (usually by tracking fast requests), but if a visitor opens links in new windows at a fast rate, he is identified as a spider and blocked, which can be bad for your website traffic. If a visitor is identified as a bad spider, the IP is automatically added to .htaccess and the next request from that IP is blocked for some time. Such a script can be used in combination with the Honey Pot tactic above.

Another version of a bot blocking script looks up a list of bad bot IPs or User Agents, and if the current IP is found there, it is automatically added to .htaccess and blocked.

A search engine spider or a real visitor can be accidentally blocked by such scripts. My recommendation is: do it by hand. It is time consuming, but by filtering the worst offenders you can easily find new ones. Block by User Agent first, and the others by IP. The IP list can grow, so check for visits from each blocked IP; if no more visits are recorded in the log, remove that IP from the list.


Blocking by User Agent

Look for error 500 in your website log. If a normal request for a page gets a server response code of 500 (which usually means a script problem), it could be that some user agents (browsers) are already blocked at the server level (for example, if mod_security is used). You don't need to block bots that are already blocked (server response code 403).
To block bad bots with a known User Agent, add to .htaccess:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]

Of course, the above is only an older example; replace the User Agents (UA), or add more, just keep the last RewriteCond line without [OR], as above.

A few words about matching. "[NC]" means case-insensitive matching (Zeus and zeus are both matched).
"^" means the beginning of the string (Zeus); without "^" the pattern matches anywhere in the UA string (i.e. an IamZeus UA is blocked too).

If a User Agent name contains a space, a backslash must be added before it (see Xaldon above).
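
To illustrate (the UA names here are only examples):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Zeus [NC,OR]
# anchored and case-insensitive: matches "Zeus..." or "zeus..." only at the start of the UA string
RewriteCond %{HTTP_USER_AGENT} EmailSiphon [OR]
# not anchored: matches "EmailSiphon" anywhere in the UA string, case-sensitive
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider
# escaped space in a two-word UA name; no [OR] on the last condition
RewriteRule ^.* - [F,L]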


Below is a classic block of well-known scraper and comment spammer User Agents, which are however allowed if they come from search engine IPs (the two rows below with an exclamation mark). For unknown reasons, those visits can carry a Java User Agent. I doubt this is necessary anymore, but a recent case of a WGet User Agent spidering from a Yahoo IP shows that even big search engines can make mistakes. Since WGet (a downloader agent) is blocked on many sites, some of them were deindexed from Yahoo, according to reports from those sites' owners...

So, take the IP addresses below as an example of how to exclude an IP range from blocking.

You can add more User Agents here, following the same code pattern, or simply move the first row to the example above if you prefer to use that one. In that case, replace [NC] with [OR], or use [NC,OR]. In any case, the first line of the code below is nothing more than a "compressed" version of the example above.

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^(Python[-.]?urllib|Java/?[1-9]\.[0-9]|libwww-perl/?[1-9]\.[0-9]) [NC]
RewriteCond %{REMOTE_ADDR} !^207\.126\.2(2[4-9]|3[0-9])\.
RewriteCond %{REMOTE_ADDR} !^216\.239\.(3[2-9]|[45][0-9]|6[0-3])\.
RewriteRule .* - [F,L]


Some examples of bad User Agents, collected from traps in May 2013, mostly of .ua and .ru origin using proxies: security hole scanners, referral and mail form spammers, scrapers/email scrapers, older tools. Feel free to block them using the distinctive part of each string as the pattern, including the closing bracket (see the sketch after the examples):

Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90)

Mozilla/2.0 (compatible; MSIE 3.01; Windows 98)

Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)

Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)
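
A minimal sketch of such a block, assuming the distinctive tail of each string above as the pattern. Note the escaped spaces, dots and closing brackets; ending each pattern at the bracket keeps longer, legitimate UA strings from matching. Genuinely ancient browsers could still send strings like these, so check your own log before using patterns this broad.

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Win\ 9x\ 4\.90\) [OR]
RewriteCond %{HTTP_USER_AGENT} MSIE\ 3\.01;\ Windows\ 98\) [OR]
RewriteCond %{HTTP_USER_AGENT} Windows\ NT\) [OR]
RewriteCond %{HTTP_USER_AGENT} Windows\ XP\)
RewriteRule ^.* - [F,L]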


Custom 403 error page

If you are using a custom 403 page or script, you need to add the following to the directives above (whichever version you use), to allow access to it - the error page itself could otherwise be blocked too, since it is on the same website:

RewriteCond %{REQUEST_FILENAME} !/403.html
or
RewriteCond %{REQUEST_URI} !/errors/

... which means that the above page or directory is not blocked. Only with this directive added can a blocked visitor or spider get access to your custom 403 error page or directory.

Why a custom error page or script? If a visitor is accidentally blocked, you can give him a chance to unblock his IP by adding captcha images, or some link that a web robot can't follow.
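
Put together with the User Agent block from earlier, a minimal sketch could look like this (403.html is just an example file name):

ErrorDocument 403 /403.html
RewriteEngine On
# let the custom error page itself through
RewriteCond %{REQUEST_FILENAME} !/403.html
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]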


Blocking by IP or domain

To block bad bots with a known IP or domain, add to .htaccess:

<Limit GET HEAD POST>
order allow,deny
deny from 192.0.2.10
deny from 192.0.2.
deny from somedomain.com
allow from all
</Limit>

In the example above, the first deny line blocks a single IP, the second blocks a whole IP block (more IPs), and the third a domain, if the IP resolves to that DNS name (the addresses shown are placeholders from the 192.0.2.0 documentation range). With an IP block it is possible to "catch" visitors from another ISP, so in that case it is better to use the CIDR or subnet. In the WhoIs info this is listed under "CIDR:" or "route:", and it looks like this: 123.0.0.0/22

Here is a real example: 67.228.0.0/16 covers the IP range 67.228.0.0 - 67.228.255.255.
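
Apache also accepts CIDR notation directly in deny lines, so the whole range can be covered with a single entry - a minimal sketch:

<Limit GET HEAD POST>
order allow,deny
deny from 67.228.0.0/16
allow from all
</Limit>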

If your hosting account Control Panel has an IP-blocking option, you can do it from there. A few words about IP blocking: spammers, hackers and the like use proxies, forged IPs and infected computers (botnets) for their attacks, so blocking all of them would require a large .htaccess file, and that means additional server load. Use IP blocking only against well-known static IPs (check WhoIs first); for the others, look for something they have in common (UA, files requested...).


Whitelisting

With the scraping/spamming/bad bot problem getting worse every day (add to that kids with their scripts and dreams of their own search engine), this game can eat more than 50% of your bandwidth. Just count the "visitors" that come without a referral string...

A white list is the opposite of a black list: only the user agents on the list are allowed, and all others are blocked. This is not for beginners; you don't want to block access for most of your live visitors. An example of what could go on a white list: Mozilla x.x, GoogleBot, Slurp, msnbot, and any other good bot or browser. For this kind of blocking you need to study the User Agents visiting your website. Be careful with whitelisting: it can help block bad bots "in advance", but it can also block some innocent users.
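
A minimal whitelist sketch. The allowed patterns are only examples; anything that matches none of them gets a 403, including unusual but legitimate browsers, so study your logs before trying something like this:

RewriteEngine On
# block any request whose UA matches none of the allowed patterns
RewriteCond %{HTTP_USER_AGENT} !^Mozilla [NC]
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Slurp|msnbot) [NC]
RewriteRule ^.* - [F,L]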


Website log is the source of...

... a large amount of useful data. Here is an example of how to find the bad guys.

You need a little programming knowledge for this (any language). First, filter out all visits without a referral string and write the IP and User Agent to a separate file. Sort that file by IP and later by User Agent. The larger the sample, the better.

Now look at identical or similar IPs (or the DNS name, if the IP resolves). Same User Agent or different?
Take that IP, discard the numbers after the last dot, and parse, say, a monthly website log to check the behaviour of that "visitor". Fast requests mean it is a web robot, especially if no images are requested. Different browsers from the same IP could mean some kind of proxy (AOL, for example), or a scraper trying to hide. In the scraper case there is often a pattern: usually four or five different User Agents are used, at random or changed after every request. If we are sure it is a bad guy, we can block it by that IP (or IP block). This is also the way to identify bad proxies: if many strange browsers (without referral strings) come from one IP block and you can see Java UAs among them, you can be sure it is one of them.

Now sort the file by User Agent (UA). Here you can find all kinds of spider UAs. Some of them are good, some are bad, some are very bad. Skip those with a spider name and a URL to a page for webmasters for now; our field of interest is hidden crawling. Many scrapers and spammers use software or scripts with an option to change the UA to avoid various website filters. Some of that spamming and SEO software has an easily recognized default User Agent.
What we are hunting here is a UA similar to a normal web browser. To spot the difference, you have to know the UA of a normal web browser. A small difference, sometimes a typing error, sometimes the default UA of that software, can uncover hidden crawling. Parse the website log for the behaviour of that UA to confirm. Don't worry about the different IPs; many of them use proxies to hide the real IP, or it is popular software, which means many different IPs (users). Block that offender using a UA pattern match; there is always some fingerprint there.

If you are using a Honey Pot with a well-hidden link, it is better to work from that list; there is less chance of blocking an innocent visitor coming from a bookmark.

Once your code is in place, you can start looking in your website log for 403 or 500 server response codes (the normal response code is 200), or test with "User Agent Switcher", an extension for the Firefox browser.

Here is an example of a bad bot scanner/detector: bad bots hunt

Another example, on protecting redirects from bad bots, with a PHP script to download: php link redirect script


