====== Apache - Harden Apache - Use .htaccess to hard-block spiders and crawlers ======
The **.htaccess** is a (hidden) per-directory configuration file that can be placed in any directory of your website; its directives apply to that directory and its subdirectories.
**WARNING**: Make a backup copy of the **.htaccess** file before editing it: a single misplaced dot or comma can render your site inaccessible.
----
===== Redirect web requests coming from certain IP addresses or user agents =====
This blocks excessively active crawlers/bots by matching a substring in the USER_AGENT field and redirecting their web requests to a dedicated page, before the request ever reaches your site's content.
Add the following lines to your website's .htaccess file:
# Redirect bad bots to one page.
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Twitterbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MetaURI [NC,OR]
RewriteCond %{HTTP_USER_AGENT} mediawords [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FlipboardProxy [NC]
RewriteCond %{REQUEST_URI} !\/nocrawler.html
RewriteRule .* http://yoursite/nocrawler.html [L]
**NOTE:** This catches server-hogging spiders, bots, and crawlers by matching a substring of their user-agent name (case-insensitive).
  * End each line that names a user-agent string with **[NC,OR]**, except the last bot, which takes **[NC]** only.
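The individual conditions above can also be collapsed into a single condition using regular-expression alternation. This is a sketch with the same example user agents; adjust the list to your own needs:
# Same block as above, with the user agents collapsed into one condition.
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} (facebookexternalhit|Twitterbot|Baiduspider|MetaURI|mediawords|FlipboardProxy) [NC]
RewriteCond %{REQUEST_URI} !\/nocrawler.html
RewriteRule .* http://yoursite/nocrawler.html [L]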
----
===== Redirect unwanted crawlers to a dummy html page =====
The rule set above redirects unwanted crawlers to a dummy HTML file, http://yoursite/nocrawler.html, in your document root.
An example could be a minimal page containing nothing more than the text:
This crawler was blocked
**NOTE:** The condition **RewriteCond %{REQUEST_URI} !\/nocrawler.html** is needed to avoid a redirect loop: without it, requests for nocrawler.html itself would match the rule and be redirected again.
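Whether mod_rewrite performs an internal rewrite or sends the crawler an external redirect depends on whether the hostname in the rule matches your own site. If you want to force an explicit external redirect with a status code, the rule can carry the **R** flag; a sketch, assuming the same target page:
# Force an explicit 302 redirect to the dummy page.
RewriteCond %{REQUEST_URI} !\/nocrawler.html
RewriteRule .* http://yoursite/nocrawler.html [R=302,L]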
----
===== Alternative Approach =====
The previous method redirected any request from the blocked spiders or crawlers to a single page. That is the "friendly" way. However, if you get a lot of spider requests, it also means your Apache server does double work: it receives the original request, redirects it, and then receives a second request for the nocrawler.html file.
So while it keeps bots, spiders, and crawlers away from your content, it does not ease the pressure on your Apache server.
A hard (and simple) way to block unwanted spiders, crawlers, and other bots is to return a **"403 – Forbidden"** response, and that is the end of it.
Add this code to your .htaccess:
# Block bad bots with a 403.
SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
SetEnvIfNoCase User-Agent "Twitterbot" bad_bot
SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
SetEnvIfNoCase User-Agent "MetaURI" bad_bot
SetEnvIfNoCase User-Agent "mediawords" bad_bot
SetEnvIfNoCase User-Agent "FlipboardProxy" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
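**NOTE:** The **Order**/**Allow**/**Deny** directives are Apache 2.2 syntax (available on newer versions only via mod_access_compat). On Apache 2.4 the same block can be expressed with mod_authz_core; a sketch, keeping the **SetEnvIfNoCase** lines above unchanged:
# Apache 2.4 equivalent of the Order/Allow/Deny block.
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>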
----
===== Deny by IP Address =====
Block attempts from 123.234.11.* and 192.168.12.*.
# Deny malicious crawler IP addresses.
Deny from 123.234.11.
Deny from 192.168.12.
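On Apache 2.4, the same IP blocks can be written with **Require**; a sketch using the same example address ranges:
# Apache 2.4 equivalent: deny these address ranges.
<RequireAll>
    Require all granted
    Require not ip 123.234.11
    Require not ip 192.168.12
</RequireAll>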
----