Apache - Harden Apache - Use .htaccess to hard-block spiders and crawlers
The .htaccess file is a (hidden) configuration file that can be placed in any directory of your website.
WARNING: Make a backup copy of the .htaccess file first: a single dot or comma too many or too few can render your site inaccessible.
Redirect web requests coming from certain IP addresses or user agents
This blocks excessively active crawlers/bots by matching a substring in the User-Agent request header and redirecting their requests to a dedicated page, before they reach the rest of your site.
Add the following lines to a website's .htaccess file:
- .htaccess
# Redirect bad bots to one page.
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Twitterbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MetaURI [NC,OR]
RewriteCond %{HTTP_USER_AGENT} mediawords [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FlipboardProxy [NC]
RewriteCond %{REQUEST_URI} !\/nocrawler.html
RewriteRule .* http://yoursite/nocrawler.html [L]
NOTE: This catches the server-hogging spiders, bots and crawlers by a substring of their user-agent name; the [NC] flag makes the match case-insensitive.
- End each RewriteCond line with [NC,OR], except the one for the last bot, which takes [NC] only.
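The matching those RewriteCond lines perform can be sketched outside Apache: [NC] turns the pattern into a case-insensitive substring match against the user-agent header, which grep -i reproduces well enough for a quick sanity check. The user-agent string below is only an illustration, not a real header:

```shell
#!/bin/sh
# Illustrative user-agent string; note the mixed case, which [NC] ignores.
ua="Mozilla/5.0 (compatible; baiduSPIDER/2.0; +http://www.baidu.com/search/spider.html)"

# grep -qi mimics a RewriteCond pattern with the [NC] flag:
# a case-insensitive substring match against the header value.
if printf '%s' "$ua" | grep -qi 'Baiduspider'; then
    echo "match: would be redirected"
else
    echo "no match: request passes through"
fi
```

Run it with any user-agent string you are unsure about, to see whether one of your patterns would catch it before you put the rule live.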
Redirect unwanted crawlers to a dummy html page
This redirects unwanted crawlers to a dummy HTML file, http://yoursite/nocrawler.html, in your site's root directory. An example could be a simple static page telling the bot it is not welcome.
NOTE: The condition RewriteCond %{REQUEST_URI} !\/nocrawler.html is needed to avoid a redirect loop: without it, the request for nocrawler.html would itself match the rule and be redirected again.
Alternative Approach
The previous method redirects any request from the blocked spiders or crawlers to one page. That is the “friendly” way. However, if you get a lot of spider requests, it also means your Apache server does double work: it receives the original request, answers with a redirect, and then receives a second request for the nocrawler.html file.
So while it keeps bots, spiders and crawlers away from your content, it does not ease the load on your Apache server.
A hard, and simple, way to block unwanted spiders, crawlers and other bots is to return a “403 – Forbidden”, and that is the end of it.
Add this code in your .htaccess:
- .htaccess
# Block bad bots with a 403.
SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
SetEnvIfNoCase User-Agent "Twitterbot" bad_bot
SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
SetEnvIfNoCase User-Agent "MetaURI" bad_bot
SetEnvIfNoCase User-Agent "mediawords" bad_bot
SetEnvIfNoCase User-Agent "FlipboardProxy" bad_bot
<Limit GET POST HEAD>
  Order Allow,Deny
  Allow from all
  Deny from env=bad_bot
</Limit>
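The combined effect of the SetEnvIfNoCase lines and the Order Allow,Deny block can be sketched in shell: each pattern is tested case-insensitively against the user-agent, any single match sets the bad_bot flag, and a flagged request is answered with a 403 while everything else passes. The user-agent string here is only an illustration:

```shell
#!/bin/sh
# Illustrative user-agent string of one of the listed bots.
ua="FlipboardProxy/1.2 (+http://flipboard.com/browserproxy)"

# Mimic the SetEnvIfNoCase lines: any case-insensitive match sets the flag.
bad_bot=""
for pattern in facebookexternalhit Twitterbot Baiduspider MetaURI mediawords FlipboardProxy; do
    if printf '%s' "$ua" | grep -qi "$pattern"; then
        bad_bot=1
    fi
done

# Mimic "Order Allow,Deny / Allow from all / Deny from env=bad_bot":
# allowed by default, denied only when the flag is set.
if [ -n "$bad_bot" ]; then
    echo "403 Forbidden"
else
    echo "200 OK"
fi
```

Unlike the redirect method, the denied request is answered immediately with a single response, so Apache does no double work.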
Deny by IP Address
Block attempts from 123.234.11.* and 192.168.12.*.
- .htaccess
# Deny malicious crawlers' IP addresses.
deny from 123.234.11.
deny from 192.168.12.
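Note that the trailing dot makes deny from a prefix match: deny from 123.234.11. covers every address from 123.234.11.0 through 123.234.11.255. A shell sketch of that prefix test, with a made-up client address:

```shell
#!/bin/sh
client="123.234.11.42"   # made-up client address for the sake of the example

# The trailing dot in the deny rules makes them prefix matches,
# equivalent to these shell glob patterns:
case "$client" in
    123.234.11.*|192.168.12.*) echo "denied" ;;
    *)                         echo "allowed" ;;
esac
```

Omitting the trailing dot would still work here, but the dot makes it explicit that 123.234.110.5, for example, must not match.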