Edit detail for Blocking bad bots in httpd.conf revision 1 of 1

1
Editor: betabug
Time: 2007/05/11 00:14:13 GMT-7
Note:

changed:
-
Apache can block requests based on their user agent string. This doesn't help against the many user agents that fake an IE user agent string, but it gives some relief to a web server under siege from the robots. The first part goes in httpd.conf, but outside any !VirtualHosts::

  # blocking bots, spammers, harvesters...
  SetEnvIfNoCase User-Agent "Download Ninja 2.0" block_bad_bots
  SetEnvIfNoCase User-Agent "Fetch API Request" block_bad_bots
  SetEnvIfNoCase User-Agent "HTTrack" block_bad_bots
  SetEnvIfNoCase User-Agent "ia_archiver" block_bad_bots
  SetEnvIfNoCase User-Agent "JBH Agent 2.0" block_bad_bots
  SetEnvIfNoCase User-Agent "QuepasaCreep" block_bad_bots
  SetEnvIfNoCase User-Agent "Program Shareware 1.0.0" block_bad_bots
  SetEnvIfNoCase User-Agent "TestBED.6.3" block_bad_bots
  SetEnvIfNoCase User-Agent "WebAuto" block_bad_bots
  SetEnvIfNoCase User-Agent "WebCopier" block_bad_bots
  SetEnvIfNoCase User-Agent "Wget/1.8.2" block_bad_bots
  SetEnvIfNoCase User-Agent "Offline Explorer" block_bad_bots
  SetEnvIfNoCase User-Agent "Franklin Locator" block_bad_bots
  SetEnvIfNoCase User-Agent "LWP::Simple" block_bad_bots
  SetEnvIfNoCase User-Agent "Larbin" block_bad_bots
  SetEnvIfNoCase User-Agent "AA" block_bad_bots
  SetEnvIfNoCase User-Agent "Rufus Web Miner" block_bad_bots
  SetEnvIfNoCase User-Agent "Port Huron Labs" block_bad_bots
  SetEnvIfNoCase User-Agent "Sphider" block_bad_bots
  SetEnvIfNoCase User-Agent "voyager/1.0" block_bad_bots
  SetEnvIfNoCase User-Agent "DynaWeb" block_bad_bots
  # email harvesters; these must set the same variable that the Deny rule checks
  SetEnvIfNoCase User-Agent "EmailCollector/1.0" block_bad_bots
  SetEnvIfNoCase User-Agent "EmailSiphon" block_bad_bots
  SetEnvIfNoCase User-Agent "EmailWolf 1.00" block_bad_bots
  SetEnvIfNoCase User-Agent "ExtractorPro" block_bad_bots
  SetEnvIfNoCase User-Agent "Crescent Internet ToolPak" block_bad_bots
  SetEnvIfNoCase User-Agent "CherryPicker/1.0" block_bad_bots
  SetEnvIfNoCase User-Agent "CherryPickerSE/1.0" block_bad_bots
  SetEnvIfNoCase User-Agent "CherryPickerElite/1.0" block_bad_bots
  SetEnvIfNoCase User-Agent "NICErsPRO" block_bad_bots
  SetEnvIfNoCase User-Agent "WebBandit/2.1" block_bad_bots
  SetEnvIfNoCase User-Agent "WebBandit/3.50" block_bad_bots
  SetEnvIfNoCase User-Agent "webbandit/4.00.0" block_bad_bots
  SetEnvIfNoCase User-Agent "WebEMailExtractor/1.0B" block_bad_bots
  SetEnvIfNoCase User-Agent "autoemailspider" block_bad_bots
  SetEnvIfNoCase User-Agent "^libwww-perl" block_bad_bots
  SetEnvIfNoCase User-Agent "^WordPress/2\.0\.2" block_bad_bots
  SetEnvIfNoCase User-Agent "^Opera/9\.0 \(Windows NT 5\.1; U; en\)" block_bad_bots
  SetEnvIfNoCase User-Agent "^PycURL/7\.15\.5$" block_bad_bots
  SetEnvIfNoCase User-Agent "^TurnitinBot" block_bad_bots
  SetEnvIfNoCase User-Agent "^West Wind Internet Protocols" block_bad_bots
  SetEnvIfNoCase User-Agent "^POE::Component::Client::HTTP/" block_bad_bots
  SetEnvIfNoCase User-Agent "^User-Agent: Mozilla/4.0" block_bad_bots
  SetEnvIfNoCase User-Agent "netforex" block_bad_bots
  SetEnvIfNoCase User-Agent "^Java/" block_bad_bots
  SetEnvIfNoCase User-Agent "^SMBot/" block_bad_bots
  SetEnvIfNoCase User-Agent "^Mozilla/4.0 \(compatible; MSIE 4\.0; Windows NT; \.\.\.\.\.\./1\.0 \)$" block_bad_bots
  SetEnvIfNoCase User-Agent "envolk" block_bad_bots
  SetEnvIfNoCase User-Agent "^TMCrawler" block_bad_bots
  SetEnvIfNoCase User-Agent "^Opera/6\.01 \(Windows ME; U\) \[en\]" block_bad_bots
  SetEnvIfNoCase User-Agent "^NASA Search" block_bad_bots
  SetEnvIfNoCase User-Agent "^TrackBack/" block_bad_bots
  SetEnvIfNoCase User-Agent "^Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.0; Maxthon\)$" block_bad_bots
  SetEnvIfNoCase User-Agent "QihooBot" block_bad_bots
  SetEnvIfNoCase User-Agent "^Gigabot/.\.." block_bad_bots
  SetEnvIfNoCase User-Agent "K-Meleon/0\.8" block_bad_bots
  SetEnvIfNoCase User-Agent "Twiceler" block_bad_bots
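
SetEnvIfNoCase treats the quoted value as a case-insensitive regular expression, so an entry without "^" or "$" matches anywhere in the header (the "AA" entry, for example, also matches any agent string that merely contains "aa"). The same mechanism lets several entries be folded into one alternation. A minimal sketch, with the grouping picked purely for illustration and not part of the collection above::

  # one pattern covering three of the offline-copier entries
  SetEnvIfNoCase User-Agent "(HTTrack|WebCopier|Offline Explorer)" block_bad_bots
  # the three CherryPicker variants as one pattern (dot escaped, so slightly stricter)
  SetEnvIfNoCase User-Agent "CherryPicker(SE|Elite)?/1\.0" block_bad_bots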

Note that this is my current collection, with no guarantees of fitness, freedom from mistakes, or being up to date (or anything). There might be legitimate user agent strings in there; it's always a question of tradeoffs.
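
One way to keep an eye on that tradeoff is to write the tagged requests to their own log and skim it for false positives now and then. This is only a sketch: the file name "logs/blocked_bots_log" is made up for illustration, it assumes the "combined" log format nickname is defined as in the stock httpd.conf, and if a virtual host has its own CustomLog the line needs to go inside that virtual host::

  # log only the requests that got the block_bad_bots variable set
  CustomLog logs/blocked_bots_log combined env=block_bad_bots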

Step 2: inside each virtual host there is usually a "Location" section. Given the above declarations, I use this to deny every request that has the variable set::

    <Location "/">
    Order Allow,Deny
    Allow from all
    Deny from env=block_bad_bots
    </Location>
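
The Order/Allow/Deny directives above are the Apache 2.2 style. On Apache 2.4, where access control moved to the Require directive of mod_authz_core, an equivalent block would look roughly like the following. Take it as a sketch that assumes a 2.4 server without mod_access_compat in play; it is not something from my own config::

    <Location "/">
    <RequireAll>
    # allow everyone ...
    Require all granted
    # ... except requests tagged by the SetEnvIfNoCase lines
    Require not env block_bad_bots
    </RequireAll>
    </Location>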

The "blocking by IP" part is done by my packet filter (aka "Firewall"), in my case "pf".
