====== Apache - Harden Apache - Use .htaccess to hard-block spiders and crawlers ======
The **.htaccess** is a (hidden) per-directory configuration file that can be placed in any directory of your website; its directives apply to that directory and its subdirectories.
**WARNING**: Make a backup copy of the **.htaccess** file before editing it: a single misplaced dot or comma can render your site inaccessible.
----
===== Redirect web requests coming from certain IP addresses or user agents =====
This blocks excessively active crawlers/bots by matching a substring in the USER_AGENT field and redirecting their web requests to a dedicated page, before the request ever reaches your site's content.
Add the following lines to your website's .htaccess file:
# Redirect bad bots to one page.
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Twitterbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MetaURI [NC,OR]
RewriteCond %{HTTP_USER_AGENT} mediawords [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FlipboardProxy [NC]
RewriteCond %{REQUEST_URI} !\/nocrawler.html
RewriteRule .* http://yoursite/nocrawler.html [L]
**NOTE:** This catches server-hogging spiders, bots, and crawlers by matching a substring of their user-agent name (case-insensitive).
  * End each line that names a user-agent string with **[NC,OR]**, except the last bot, which takes **[NC]** only.
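The individual conditions above can also be collapsed into a single condition using regular-expression alternation. This is a sketch with the same example user agents; adjust the list to your own needs:
# Same block as above, with the user agents collapsed into one condition.
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} (facebookexternalhit|Twitterbot|Baiduspider|MetaURI|mediawords|FlipboardProxy) [NC]
RewriteCond %{REQUEST_URI} !\/nocrawler.html
RewriteRule .* http://yoursite/nocrawler.html [L]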
----
===== Redirect unwanted crawlers to a dummy html page =====
The rule set above redirects unwanted crawlers to a dummy HTML file, http://yoursite/nocrawler.html, in your document root.
An example could be a minimal page containing nothing more than the text:
This crawler was blocked
**NOTE:** The condition **RewriteCond %{REQUEST_URI} !\/nocrawler.html** is needed to avoid a redirect loop: without it, requests for nocrawler.html itself would match the rule and be redirected again.
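Whether mod_rewrite performs an internal rewrite or sends the crawler an external redirect depends on whether the hostname in the rule matches your own site. If you want to force an explicit external redirect with a status code, the rule can carry the **R** flag; a sketch, assuming the same target page:
# Force an explicit 302 redirect to the dummy page.
RewriteCond %{REQUEST_URI} !\/nocrawler.html
RewriteRule .* http://yoursite/nocrawler.html [R=302,L]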
----
===== Alternative Approach =====
The previous method redirected any request from the blocked spiders or crawlers to a single page. That is the "friendly" way. However, if you get a lot of spider requests, it also means your Apache server does double work: it receives the original request, redirects it, and then receives a second request for the nocrawler.html file.
So while it keeps bots, spiders, and crawlers away from your content, it does not ease the pressure on your Apache server.
A hard (and simple) way to block unwanted spiders, crawlers, and other bots is to return a **"403 – Forbidden"** response, and that is the end of it.
Add this code to your .htaccess:
# Block bad bots with a 403.
SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
SetEnvIfNoCase User-Agent "Twitterbot" bad_bot
SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
SetEnvIfNoCase User-Agent "MetaURI" bad_bot
SetEnvIfNoCase User-Agent "mediawords" bad_bot
SetEnvIfNoCase User-Agent "FlipboardProxy" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
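**NOTE:** The **Order**/**Allow**/**Deny** directives are Apache 2.2 syntax (available on newer versions only via mod_access_compat). On Apache 2.4 the same block can be expressed with mod_authz_core; a sketch, keeping the **SetEnvIfNoCase** lines above unchanged:
# Apache 2.4 equivalent of the Order/Allow/Deny block.
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>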
----
===== Deny by IP Address =====
Block attempts from 123.234.11.* and 192.168.12.*.
# Deny malicious crawler IP addresses.
Deny from 123.234.11.
Deny from 192.168.12.
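On Apache 2.4, the same IP blocks can be written with **Require**; a sketch using the same example address ranges:
# Apache 2.4 equivalent: deny these address ranges.
<RequireAll>
    Require all granted
    Require not ip 123.234.11
    Require not ip 192.168.12
</RequireAll>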
----