How to protect web page content - Part 1


Website content protection

Honestly, there is no magic trick that protects page content from theft, but you do have a chance to find the copycats and give them what they deserve.


Why do they steal content from websites?

Because they are lazy and greedy. There is no fair play in the marketing game, and content theft is nothing to wonder about. And when you see "make 1000 pages with one click" and similar offers, where do you think the content on those pages comes from? Written by the author of that "tool"?
No. It is scraped from article directories, RSS feeds, "public content" sites, search engine results - from everywhere it is possible to get "free" content. The author of that content doesn't even know that his site has been "scraped" and his content stolen.

Even if your own site is not scraped, "scraping" of search engine results can still hurt it; in extreme cases, some pages from your site can be excluded (banned) from search engine results pages. In this scheme, the title and description listed for your page on search engine result pages are used to gain search engine positions for those "directory" pages. That title and description are your content, displayed there without permission.
If you see your site listed there without a link, or with a wrong (not live) link to your site, report spam to that search engine - especially if there is no contact info on that site.

Sometimes, some pages from your site are banned from search engine listings. If it is not a temporary glitch, it could be that your page has been copied somewhere and cloaked to take over your search engine position. Looking at the search engine listing, in the spot where your page was positioned before, you may notice some "new" site with a similar title and description - but visiting that site, you will find different content. It could be that your page was stolen and cloaked, and the search engine's duplicate content filter "decided" that your original page is the copy. And you are out, until you change something on the page to make it different from the previous version. It will then return to the search engine listing.

The duplicate content filter is not as "effective" as we think. Search for some specific keyword from your page and look at the websites ranked below and above yours. Don't worry if the titles and descriptions of the listed websites are different - that doesn't mean you won't find your copied content there. The more specific the keyword (or long tail keyword), the better your chance to find a website carrying your stolen content.

Every request for a page on your site without a referrer string is suspicious, unless it comes from a well-known search engine spider. When the same "browser" from the same IP visits a particular page often, you can be sure that a hijacker is "refreshing" your cloaked page.
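As a sketch of that check, the snippet below scans a few Apache combined-log-format lines for requests that arrived with an empty referrer and counts hits per IP. The log lines, IPs and browser names here are made-up examples, not data from this article:

```javascript
// Count empty-referrer requests per IP from Apache combined-log lines.
// The sample lines below are invented for illustration.
var logLines = [
  '1.2.3.4 - - [10/Oct/2023:13:55:36] "GET /page.html HTTP/1.1" 200 2326 "-" "SomeBrowser"',
  '1.2.3.4 - - [10/Oct/2023:14:55:36] "GET /page.html HTTP/1.1" 200 2326 "-" "SomeBrowser"',
  '5.6.7.8 - - [10/Oct/2023:13:56:01] "GET /page.html HTTP/1.1" 200 2326 "http://example.com/" "OtherBrowser"'
];

function emptyReferrerCounts(lines) {
  var counts = {};
  lines.forEach(function (line) {
    // In combined log format the referrer is a quoted field;
    // "-" means the request carried no referrer at all.
    if (line.indexOf('"-"') !== -1) {
      var ip = line.split(' ')[0]; // client IP is the first field
      counts[ip] = (counts[ip] || 0) + 1;
    }
  });
  return counts;
}
```

An IP that keeps pulling the same page with no referrer, day after day, is a good candidate for the "refreshing hijacker" described above.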

The oldest reason for web page theft is the search engine position of that page. We all know that when you imitate the competition, it is possible to get a similar position for another keyword using the same tactic. But beginner copycats don't know how to tweak stolen content, and they don't know that page content (on-page optimization) alone is not enough to reach the same position.


Search engine position theft

Google 302 bug. In the past, Perl scripts residing in the cgi-bin directory were used for link redirections, and that directory was protected from search engine spiders, i.e. redirections were not followed. After PHP became popular, PHP redirections were followed by web spiders. Since the default server response for these scripts was 302 (document moved temporarily), the page the script redirects to was indexed, but the URL of the script was displayed instead of the URL of the indexed page.

This was not a big problem, since the visitor was redirected to the indexed page anyway - until some webmasters realized that this could be used to collect commissions. So the script was modified and the visitor sent to an affiliate link instead of the original page. Although this bug has been partially solved, check whether your web pages are indexed under the correct URLs.
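If you run your own redirect script, issuing a permanent (301) redirect instead of the 302 default avoids the script's URL being indexed in place of the destination. A minimal sketch - the function name and target URL are illustrative, not from this article:

```javascript
// Send a permanent redirect so search engines credit the destination URL,
// not the redirect script's own URL (the 302 default caused the hijack).
function redirect(res, targetUrl) {
  res.statusCode = 301; // 301 = moved permanently; old scripts defaulted to 302
  res.setHeader('Location', targetUrl);
  res.end();
}
```

With Node's http module you would pass the server response object; any response object with these three members works the same way.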

The next example of your web page being indexed under a different URL is proxy hijacking: when a proxy cache is not protected, your page can be indexed under the proxy's domain. Another example is when you spot visits from a well-known search engine spider (user agent field in the web log) but from a different IP address. In most cases, that IP is redirecting the search engine spider to your content, hoping your content will be indexed under its domain. That is plain content theft, and the offending IP address should be blocked immediately.


Image hotlinking

If your content is primarily media (images, music, movies, flash games...), you probably have this problem. Webmasters hotlink primarily to attract more visitors to their own websites, some to save bandwidth (by stealing yours), and some thinking it is permitted. Image hotlinking carries a very high risk for the thief.

With only one line in .htaccess, you can serve any script you want instead of the hotlinked image. But don't do that. Just rename the original image and put some other image of your choice in its place - one that tells the thief that hotlinking is not that safe. Don't forget to update your own links, and if your image is indexed in Google Images, take that into account too. A second way is to use .htaccess to redirect requests for the image alone (when the referrer is not your domain) to the page where the image is displayed.
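A common .htaccess sketch for the image-swap approach - mod_rewrite must be enabled, and the domain and file names here are placeholders, not taken from this article:

```apache
RewriteEngine On
# Let through requests with no referrer (direct hits, some proxies)...
RewriteCond %{HTTP_REFERER} !^$
# ...and requests coming from your own domain
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
# Never rewrite the warning image itself, or the rule would loop
RewriteCond %{REQUEST_URI} !/images/hotlink-warning\.png$
# Everyone else who hotlinks an image gets the warning picture
RewriteRule \.(gif|jpe?g|png)$ /images/hotlink-warning.png [L]
```

The same structure with an `[R=301,L]` flag and a page URL as the target gives you the "redirect to the page where the image is displayed" variant.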


How to spot content theft

Beginner copycats are not the problem; the problem is sophisticated search engine spammers. They know how to tweak (rewrite) and cloak a page, and they know what needs to be removed from a page to avoid being found. There are even tools for rewriting stolen articles.

Even then, they can be spotted. Using advanced search engine queries and looking for one or a few sentences, a specific word, or a phrase, you can find them on some sites. Visiting those sites, you can check whether your page content is copied there. If you can't find anything on the site itself, look at the cached page on the search engine. No link to a cached page? When caching is disabled like that, you can be fairly sure the page is cloaked. Try some other specific phrase. The same site appears in the search engine results, yet you find nothing when you visit it?

If you can spoof the User Agent (some browsers can do that - there is a plug-in for Firefox, for example) to imitate a search engine spider, try visiting the site. If the page stays the same, they are probably using IP delivery.

If you are sure that content from your page appears in search results under that site's domain, report the domain to the search engine. They will visit the site from another IP and compare results. If the site is cloaked, it will be banned from the search engine.

You can also report the abuse to their hosting company. But before anything else, try to contact the webmaster of that site. If he/she refuses to remove the copied content (in most cases it is removed), then report them to the search engine and their ISP. Or, if the copycat lives near you...


How to protect web page content

Let's clear something up. If a web browser can read your web page, the page can be copied or edited. That means you can't truly protect web page content. Source code can be moved, hidden, "encrypted" - but it must remain readable by the browser.

For example, by deleting line breaks, the whole source code can sit on one line; that is a little harder to tweak. Using ENTER (adding line breaks), you can push the source code down, below the visible area - when a copycat opens the page in an editor, it looks as if there is no source code at all. A little warning can also be added at the top of the source, enclosed in a comment tag.

You can also personalize the content - a story or example connected with your website, with links to other parts of your site, woven in so that the content makes no sense without the other linked pages.

Page content can be "encrypted" using JavaScript escape, or by translating the source characters to hexadecimal codes. That is a little harder to tweak (decode), but if they want to steal, they will steal.
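A minimal sketch of the escape-based obfuscation (the content string is just an example). Note that anyone can run unescape() themselves, so this only slows a copycat down:

```javascript
// Store the markup %XX-encoded and decode it at load time.
// In a real page you would emit it with document.write(unescape(encoded)).
var encoded = escape('<p>Protected content</p>');
var decoded = unescape(encoded); // the browser still sees the original markup
```

The encoded string contains no raw `<` or `>`, so a casual "view source and copy" grabs gibberish - but the original is always one function call away.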

The ways above can be used to confuse a potential copycat.
A serious copycat uses a web spider to steal your content - and no, not from your site, but from the search engine's cache. Your cached page. To be sure it is the real page, not a cloaked one.


Web page cloaking

Page cloaking was "invented" to protect web page content from the competition. One version (optimized for search engine position) is delivered to search engine spiders, and another (not optimized) to visitors and other spiders (including competitors). The user agent and/or spider IP is used to spot search engine spiders.

Now thieves are using cloaking to protect "their" stolen content. If caching is enabled, redirection is used - before the cached page loads, you are redirected to some affiliate link. Caching (on the search engine) can be disabled, but without a good reason that looks suspicious, and only a small number of hijackers use it.

I don't want to suggest that web page cloaking is a good thing to use. If your database of spider IPs is not refreshed often, it can happen that a search engine spider, visiting from some new IP, finds that your page is not the same. And you are in trouble.

Web page cloaking (or IP delivery) can, however, be used against well-known content scrapers. Just give them some "nice looking page" instead of your real content. Add some unusual text to that page, so you can easily search for it on search engines and find where your special page ends up hosted.


Spider trap

Web spider traps are not as effective as they used to be, but you can still use one to spot content scrapers. Make a link using a transparent image (pixel) or a dot, on some prominent page (sitemap or main page). Make a directory and put an index page in it. Link that page from the pixel or dot. Open robots.txt and disallow that page (directory). Search engine spiders obey robots.txt and will not go there, while scrapers go straight in, hoping the content is unique (search engine spiders not allowed means the content is not indexed).
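The robots.txt part of the trap might look like this - the /trap/ directory name is a placeholder, not from this article:

```
User-agent: *
Disallow: /trap/
```

On the sitemap or main page, the invisible link could then be `<a href="/trap/"><img src="pixel.gif" width="1" height="1" alt=""></a>` - honest spiders skip /trap/, scrapers walk right in.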

On that page, plug in some SSI script to save visitor IPs, or simply analyze the website log to see who visited the "forbidden" page. Use WhoIs to check those IPs and block them from your website.
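The log-analysis half of the trap can be sketched like this - the sample log lines, IPs and the /trap/ path are made up for illustration:

```javascript
// List the IPs that requested the disallowed trap page.
var logLines = [
  '9.9.9.9 - - [11/Oct/2023:08:01:00] "GET /trap/index.html HTTP/1.1" 200 512 "-" "BadBot"',
  '1.2.3.4 - - [11/Oct/2023:08:02:00] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla"'
];

function trapVisitors(lines) {
  var ips = [];
  lines.forEach(function (line) {
    if (line.indexOf('GET /trap/') !== -1) {
      var ip = line.split(' ')[0]; // client IP is the first field
      if (ips.indexOf(ip) === -1) ips.push(ip); // record each offender once
    }
  });
  return ips;
}
```

Every IP this returns has, by definition, ignored robots.txt - exactly the behavior you expect from a scraper, not a search engine spider.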


Traps in page content

Although you can't stop hijackers who really want to steal your content, you can do something to find out where it went. Cloaked or not, your stolen page is searchable on search engines.

As mentioned above, you can use search engines to find specific text; copyscape.com can do it even better. You can help yourself find copies faster by adding misspellings or unusual words to the content. The letter "l" can be replaced with the number "1" (one), the letter "O" with "0", and so on...
Say, the word "SJKTRWO" is not that common. Search Google for that word, and you could find where this page has been copied.
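A tiny sketch of that "watermarking" idea, using the two swaps mentioned above. In practice you would swap only a few chosen occurrences, not every character as this simplified version does:

```javascript
// Swap look-alike characters so the text stays readable to humans
// but becomes a unique, searchable fingerprint.
function watermark(text) {
  return text.replace(/l/g, '1').replace(/O/g, '0');
}
```

Searching for the resulting oddly-spelled phrase then finds copies that a normal quoted search would miss.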

While a copycat can and will remove or replace the info between HEAD and /HEAD, the links, the copyright information and any trace of your domain name inside the content, your specific phrases and words are still there. This remains the most effective way to find copied content.


Bugs

...or tracking pixels, were used by email spammers to check for active email accounts. When the message is opened online, a small GIF is pulled from the spammer's site, and he knows you are reading the email. You can use the same tactic on your site.

Add an invisible 1x1 pixel picture with an absolute URI (the full http URL to the picture). When your stolen page is displayed on another site, that tracking pixel is pulled from your site, and by checking your website log you can see from what domain the pixel was pulled. You could even answer that call with something other than the pixel - something that would shock that webmaster and his visitors. Who plays with fire...
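The pixel itself is just an absolute-URL image reference - the domain and file name here are placeholders:

```html
<!-- Invisible tracking pixel. The absolute URL is the whole point:
     a copied page will still pull this image from YOUR server. -->
<img src="http://www.example.com/images/px.gif" width="1" height="1" alt="">
```

A relative `src` would defeat the trick, because the copy would then request the pixel from the thief's own server.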


BASE HREF tag

"Base href" tag (if used) is annoying when you are testing links on pages offline. You can't browse offline, because browser is looking at that tag trying to contact that domain where page resides.

Example: on this page, the relative URI to the main page is ../index.html, so I can browse this site offline, checking links and so on. If a BASE HREF of "http://www.affiliatebeginnersguide.com/articles/page_content.html" is added, the browser "thinks" that the index.html page resides on the site named in the BASE HREF tag.

If you use relative URIs for pages (index.html instead of http://domain.com/index.html to reach the home page from some other page on the site) together with the BASE HREF tag, then when some of your pages are copied to another site (domain), any click on a link on that copied page will transfer the visitor to your site.

Moreover, if the Google spider starts looking for nonexistent pages on your site - something like xbdivnfg.html, or JavaScript files - you can be sure the base href tag was not removed from a stolen page.

Anyway, if you want to use this tag, it looks like this:

<base href="http://www.affiliatebeginnersguide.com/articles/">

It can be written without a page name, but be sure to include the slash at the end. Place it between HEAD and /HEAD.
You can comment it out while the page is offline (to enable offline browsing) and uncomment it before upload.

At least, this tag will protect you when proxy hijacking is in question. Proxy hijacking is the situation where your web page is cached on some proxy server and indexed under that URL.


Canonical tag

One more tag can be useful here. The canonical tag is used to avoid duplicate content - for example, to set the preferred domain. For this purpose it acts like the base href tag, showing the Google spider where the page belongs.
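It goes in the HEAD section; the URL below is a placeholder for your page's preferred address:

```html
<link rel="canonical" href="http://www.example.com/articles/page_content.html">
```

Unlike base href, it changes nothing about how links resolve - it only tells the search engine which URL is the original.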


Frame buster

Your stolen page can be framed and hidden from visitors. A frame buster (a small piece of JavaScript) will break the frames and display your page.

The code below - it seems to work in PC and Mac browsers - is borrowed from Lori's Web site; you can find more details about this theme there.

<script type="text/javascript">
if (top != self) top.location.href = self.location.href;
</script>


Conclusion

As we can see, page content cannot be protected: if a web browser can read it, anybody can read it. These small tricks and tips (if used) will discourage only inexperienced copycats, but they also show professional copycats that you know about page content theft - and that you know how to find stolen content.
That can make them, at least, nervous.

To learn more about your "enemies" and their scraping methods, browse around black hat forums and adult webmaster forums and sites... You can defend your site content better when you know what "weapons" are in use.

Part 2 - how to block bad web robots: content scraper identification, known content scrapers...




