Apache - Harden Apache - Use .htaccess to hard-block spiders and crawlers

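The .htaccess file is a hidden, per-directory Apache configuration file. It can be placed in any directory of a site, and its directives apply to that directory and its subdirectories (provided the server's AllowOverride setting permits them).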

WARNING: Make a backup copy of the .htaccess file before editing it. A single dot or comma too many or too few can render your site inaccessible.


Redirect web requests coming from certain IP addresses or user agents

This catches excessively active crawlers/bots by matching a substring in the User-Agent request header and redirects their requests to a single dummy page, so they never receive the content they asked for.

Add the following lines to a website's .htaccess file:

.htaccess
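# Redirect bad bots to one page.
RewriteEngine On
# Match any of these substrings in the User-Agent header (NC = case-insensitive).
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Twitterbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MetaURI [NC,OR]
RewriteCond %{HTTP_USER_AGENT} mediawords [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FlipboardProxy [NC]
# Do not redirect requests for the dummy page itself (avoids a loop).
RewriteCond %{REQUEST_URI} !^/nocrawler\.html$
# R=302 makes the external redirect explicit; L stops further rule processing.
RewriteRule .* http://yoursite/nocrawler.html [R=302,L]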

NOTE: This catches the server-hogging spiders, bots and crawlers by a case-insensitive substring of their user-agent name.

  • End each user-agent line with [NC,OR] (no space inside the brackets), except the last one, which has [NC] only.
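
As an optional variation (not part of the original rule set), the chain of [NC,OR] conditions can be collapsed into a single RewriteCond that uses regex alternation, which leaves no [OR] flag to forget. This is a minimal sketch, assuming mod_rewrite is enabled and that yoursite stands in for your own host name:

```apache
# Sketch only: the same bot list as above, written as one alternation pattern.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (facebookexternalhit|Twitterbot|Baiduspider|MetaURI|mediawords|FlipboardProxy) [NC]
# Skip the dummy page itself so the rule cannot loop.
RewriteCond %{REQUEST_URI} !^/nocrawler\.html$
# "yoursite" is a placeholder host name, as in the rules above.
RewriteRule .* http://yoursite/nocrawler.html [R=302,L]
```

Adding a bot then means adding one more |name alternative inside the parentheses rather than a whole new line.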

Redirect unwanted crawlers to a dummy HTML page

The rules above redirect unwanted crawlers to a dummy HTML file, http://yoursite/nocrawler.html, which must exist in your root directory.

An example could be:

nocrawler.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Crawler blocked</title>
</head>
<body>
<p>This crawler was blocked</p>
</body>
</html>

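NOTE: The final RewriteCond, the one that tests %{REQUEST_URI} against nocrawler.html, is needed to avoid a redirect loop: without it, the request for the dummy page itself would be redirected again.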


Alternative Approach

The previous method redirects every request from the blocked spiders or crawlers to one page. That is the “friendly” way. However, if you get A LOT of spider requests, your Apache server does double work: it receives the original request, answers with a redirect, and then receives a second request for your “nocrawler.html” file.

So while it keeps bots, spiders and crawlers away from your content, it won’t ease the pressure on your Apache server.

A hard (and simple) way to block unwanted spiders, crawlers and other bots is to return a “403 – Forbidden”, and that is the end of it.

Add this code in your .htaccess:

.htaccess
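# Block bad bots with a 403.
# Tag requests whose User-Agent contains any of these strings (case-insensitive).
SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
SetEnvIfNoCase User-Agent "Twitterbot" bad_bot
SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
SetEnvIfNoCase User-Agent "MetaURI" bad_bot
SetEnvIfNoCase User-Agent "mediawords" bad_bot
SetEnvIfNoCase User-Agent "FlipboardProxy" bad_bot

# Apache 2.2 access-control syntax; on Apache 2.4 it needs mod_access_compat.
# Note: <Limit> applies the block to GET, POST and HEAD requests only.
<Limit GET POST HEAD>
  Order Allow,Deny
  Allow from all
  Deny from env=bad_bot
</Limit>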
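
The Order/Allow/Deny directives above are the Apache 2.2 syntax. On Apache 2.4 the same idea can be expressed with the Require directives of mod_authz_core. The following is a sketch under assumptions rather than a drop-in replacement: it assumes Apache 2.4 or later, mod_setenvif loaded, and an AllowOverride setting that permits both SetEnvIfNoCase and Require in .htaccess. Only two of the bots are listed to keep it short.

```apache
# Sketch for Apache 2.4+: tag bad bots, then refuse tagged requests with a 403.
SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
SetEnvIfNoCase User-Agent "MetaURI" bad_bot

# Everyone is allowed except requests carrying the bad_bot variable.
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>
```

Because this block is not wrapped in <Limit>, it applies to every request method, not just GET, POST and HEAD.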

Deny by IP Address

Block requests from the address ranges 123.234.11.* and 192.168.12.* (example values).

.htaccess
# Deny malicious crawlers' IP addresses.
# A partial address with a trailing dot matches the whole range, e.g. 123.234.11.*
Deny from 123.234.11.
Deny from 192.168.12.
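
For Apache 2.4 without mod_access_compat, the IP block can be sketched with Require not ip instead of Deny from. This assumes Apache 2.4 or later and an AllowOverride setting that permits Require in .htaccess; the address ranges are the same example values as above.

```apache
# Sketch for Apache 2.4+: allow everyone except the listed address ranges.
<RequireAll>
    Require all granted
    # A partial address (first three octets) covers the whole range.
    Require not ip 123.234.11
    Require not ip 192.168.12
</RequireAll>
```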

apache/harden_apache/use_.htaccess_to_hard-block_spiders_and_crawlers.txt · Last modified: 2023/07/17 11:24 by peter
