The .htaccess file is a (hidden) configuration file that can be placed in any directory of your website.
WARNING: Make a backup copy of the .htaccess file first; a single dot or comma too many, or too few, can render your site inaccessible.
This blocks excessively active crawlers/bots by matching a string in the User-Agent field and either redirecting their requests to a single page or answering them with a “403 – Forbidden”, before they ever reach your pages or scripts.
Add the following lines to a website's .htaccess file:
# Redirect bad bots to one page.
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Twitterbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MetaURI [NC,OR]
RewriteCond %{HTTP_USER_AGENT} mediawords [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FlipboardProxy [NC]
RewriteCond %{REQUEST_URI} !\/nocrawler.html
RewriteRule .* http://yoursite/nocrawler.html [L]
NOTE: This catches the server-hogging spiders, bots and crawlers on a substring of their user-agent name; the [NC] flag makes the match case-insensitive.
It redirects the unwanted crawlers to a dummy HTML file, http://yoursite/nocrawler.html, placed in your root directory.
An example of such a dummy page could be the minimal sketch below; the exact content is up to you:
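<!DOCTYPE html>
<html>
<head><title>Not for crawlers</title></head>
<body>
<!-- Hypothetical placeholder page; any small, static page will do. -->
<p>This content is not available to automated crawlers.</p>
</body>
</html>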
NOTE: The condition RewriteCond %{REQUEST_URI} !\/nocrawler.html is needed to avoid an endless loop: without it, the request for nocrawler.html itself would be redirected again.
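To verify that the rules work, you can send a request while pretending to be one of the blocked crawlers. A quick check with curl (assuming the rules above are active and yoursite is replaced with your actual domain) could look like this:

curl -I -A "Baiduspider" http://yoursite/somepage.html

You should get back a 302 redirect whose Location header points to nocrawler.html, while a request with a normal browser user-agent is served as usual.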
The previous method redirected any request from the blocked spiders or crawlers to one page. That is the “friendly” way. However, if you get A LOT of spider requests, it also means that your Apache server does double work: it receives the original request, which is redirected, and then receives a second request to deliver the nocrawler.html file.
So while it helps keep bots, spiders and crawlers away from your content, it won't ease the pressure on your Apache server.
A hard, and simple, way to block unwanted spiders, crawlers and other bots is to return a “403 – Forbidden”, and that is the end of it.
Add this code to your .htaccess:
# Block bad bots with a 403.
SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
SetEnvIfNoCase User-Agent "Twitterbot" bad_bot
SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
SetEnvIfNoCase User-Agent "MetaURI" bad_bot
SetEnvIfNoCase User-Agent "mediawords" bad_bot
SetEnvIfNoCase User-Agent "FlipboardProxy" bad_bot
<Limit GET POST HEAD>
  Order Allow,Deny
  Allow from all
  Deny from env=bad_bot
</Limit>
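NOTE: Order, Allow and Deny are Apache 2.2 directives (available in 2.4 only through mod_access_compat). If your server runs Apache 2.4 or later, a sketch of the equivalent block using the newer Require directives, reusing the same bad_bot variable, could look like this:

<RequireAll>
  Require all granted
  Require not env=bad_bot
</RequireAll>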
Finally, you can block attempts coming from specific IP ranges, for example 123.234.11.* and 192.168.12.*:
# Deny malicious crawlers' IP addresses.
deny from 123.234.11.
deny from 192.168.12.
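On Apache 2.4 and later the same partial-IP blocks can be written with Require directives instead; a sketch, assuming the rest of the directory should stay open to everyone:

<RequireAll>
  Require all granted
  Require not ip 123.234.11
  Require not ip 192.168.12
</RequireAll>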