What to do about Bots that kill AWS Micro Instances running WordPress

For one of our customers, we leveraged WordPress and its powerful capabilities to create a rather large website consisting of hundreds of pages. Because the expected traffic was low, we installed the site on an economical AWS Micro Instance, which performed well. In the middle of last night, however, the instance’s CPU utilization hit 100% for nearly an hour. Anyone accessing the site during this period would have had a sluggish, if not unresponsive, experience.

AWS Micro Instances are great for testing and for deploying simple websites that, by their design and market, won’t be required to work very hard. That said, websites are at the mercy of whoever accesses them from the rest of the Internet, and too many accesses in too short a period can overrun the resource allotment of a Micro Instance.

In our investigation, we discovered that a bot called Aboundexbot was the culprit. Aboundexbot wanted to crawl the entire site, and quickly at that, which drove CPU utilization to 100% because AWS Micro Instances, as their name implies, are limited to a certain amount of CPU activity per unit of time. Unfortunately, Aboundexbot did not throttle its accesses as better-behaved bots do, and it apparently has no built-in mechanism (such as a timeout) to detect when it may be overtaxing a site.

In any case, we decided that we simply didn’t want Aboundexbot, and perhaps some other badly behaved bots, visiting our customer’s site, so as to keep the site performing well. Our thought was to add a corresponding “Disallow” entry to the “robots.txt” file. However, whereas this is a simple task for a regular website, it is more challenging for a WordPress-based site that has been installed in the domain root. In that case, all requests for files in the site’s root go through WordPress’s dynamic page generation, including requests for the “robots.txt” file, which does not exist as a physical file.
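
For reference, this behavior comes from the standard rewrite rules WordPress places in the site’s .htaccess file (shown here for an Apache setup with mod_rewrite; your configuration may differ): any request that does not match a physical file or directory, including /robots.txt, is handed to index.php, where WordPress generates the response.

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress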

In the WordPress wp-includes/ folder, there is a file called functions.php containing a function called do_robots(), which dynamically creates the “robots.txt” output on demand. It is not very sophisticated, however, allowing for just two types of output depending on the Site Visibility setting under WordPress’ Dashboard > Settings > Privacy page.

We could have added a plug-in that provides finer robots.txt control, and we still might do that, but to get a solution in place quickly, we decided to simply enhance the do_robots() function as follows (our two added lines are marked with comments):

function do_robots() {
  header( 'Content-Type: text/plain; charset=utf-8' );

  do_action( 'do_robotstxt' );

  $output = "User-agent: *\n";
  $public = get_option( 'blog_public' );
  if ( '0' == $public ) {
    $output .= "Disallow: /\n";
  } else {
    $site_url = parse_url( site_url() );
    $path = ( !empty( $site_url['path'] ) ) ? $site_url['path'] : '';
    $output .= "Disallow: $path/wp-admin/\n";
    $output .= "Disallow: $path/wp-includes/\n";
    // Our addition: append any extra rules kept in wp-content/robots.txt.
    // The relative path assumes WordPress is installed in the domain root.
    $fbotmore = file_get_contents('./wp-content/robots.txt');
    if ($fbotmore !== false) $output .= $fbotmore;
  }

  echo apply_filters('robots_txt', $output, $public);
}

We are currently on WordPress version 3.3.1. Because different versions of WordPress may have different code for this function, use your programming know-how to add the two marked lines to the function in the most appropriate way. Note that this is not a permanent change: any significant WordPress upgrade will overwrite functions.php and, with it, this change.
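
If you would rather not touch core files at all, the last line of do_robots() suggests an upgrade-safe alternative: the output passes through the ‘robots_txt’ filter, so the same extra rules could be appended from a theme’s functions.php or a small plugin. Here is a rough sketch of that approach (the function name is ours, and we have not put this particular snippet into production):

// Hook the 'robots_txt' filter so the extra rules survive core upgrades.
add_filter( 'robots_txt', 'append_extra_robots_rules', 10, 2 );

function append_extra_robots_rules( $output, $public ) {
  // Only append the extra rules when the site is publicly visible.
  if ( '0' != $public ) {
    $extra = @file_get_contents( WP_CONTENT_DIR . '/robots.txt' );
    if ( $extra !== false ) {
      $output .= $extra;
    }
  }
  return $output;
}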

We then located a list of other badly behaved bots and installed our collective list in the wp-content/robots.txt file, which is now included whenever our domain.com/robots.txt file is accessed.

For your reference, here is what we came up with for the contents of our robots.txt file. Note that WordPress has a few entries of its own that are placed in advance of this content.

User-agent: Aboundexbot
Disallow: /
User-agent: NPBot
Disallow: /
User-agent: TurnitinBot
Disallow: /
User-agent: EmailCollector
Disallow: /
User-agent: EmailWolf
Disallow: /
User-agent: CopyRightCheck
Disallow: /
User-agent: Black Hole
Disallow: /
User-agent: Titan
Disallow: /
User-agent: NetMechanic
Disallow: /
User-agent: CherryPicker
Disallow: /
User-agent: EmailSiphon
Disallow: /
User-agent: WebBandit
Disallow: /
User-agent: Crescent
Disallow: /
User-agent: NICErsPRO
Disallow: /
User-agent: SiteSnagger
Disallow: /
User-agent: ProWebWalker
Disallow: /
User-agent: CheeseBot
Disallow: /
User-agent: ia_archiver
Disallow: /
User-agent: ia_archiver/1.6
Disallow: /
User-agent: Teleport
Disallow: /
User-agent: TeleportPro
Disallow: /
User-agent: Wget
Disallow: /
User-agent: MIIxpc
Disallow: /
User-agent: Telesoft
Disallow: /
User-agent: Website Quester
Disallow: /
User-agent: WebZip
Disallow: /
User-agent: moget/2.1
Disallow: /
User-agent: WebZip/4.0
Disallow: /
User-agent: Mister PiX
Disallow: /
User-agent: WebStripper
Disallow: /
User-agent: WebSauger
Disallow: /
User-agent: WebCopier
Disallow: /
User-agent: NetAnts
Disallow: /
User-agent: WebAuto
Disallow: /
User-agent: TheNomad
Disallow: /
User-agent: WWW-Collector-E
Disallow: /
User-agent: RMA
Disallow: /
User-agent: libWeb/clsHTTP
Disallow: /
User-agent: asterias
Disallow: /
User-agent: httplib
Disallow: /
User-agent: turingos
Disallow: /
User-agent: spanner
Disallow: /
User-agent: InfoNaviRobot
Disallow: /
User-agent: Harvest/1.5
Disallow: /
User-agent: Bullseye/1.0
Disallow: /
User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)
Disallow: /
User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
Disallow: /
User-agent: CherryPickerSE/1.0
Disallow: /
User-agent: CherryPickerElite/1.0
Disallow: /
User-agent: WebBandit/3.50
Disallow: /
User-agent: DittoSpyder
Disallow: /
User-agent: SpankBot
Disallow: /
User-agent: BotALot
Disallow: /
User-agent: lwp-trivial/1.34
Disallow: /
User-agent: lwp-trivial
Disallow: /
User-agent: Wget/1.6
Disallow: /
User-agent: BunnySlippers
Disallow: /
User-agent: URLy Warning
Disallow: /
User-agent: Wget/1.5.3
Disallow: /
User-agent: LinkWalker
Disallow: /
User-agent: cosmos
Disallow: /
User-agent: moget
Disallow: /
User-agent: hloader
Disallow: /
User-agent: humanlinks
Disallow: /
User-agent: LinkextractorPro
Disallow: /
User-agent: Offline Explorer
Disallow: /
User-agent: Mata Hari
Disallow: /
User-agent: LexiBot
Disallow: /
User-agent: Web Image Collector
Disallow: /
User-agent: The Intraformant
Disallow: /
User-agent: True_Robot/1.0
Disallow: /
User-agent: True_Robot
Disallow: /
User-agent: BlowFish/1.0
Disallow: /
User-agent: JennyBot
Disallow: /
User-agent: MIIxpc/4.2
Disallow: /
User-agent: BuiltBotTough
Disallow: /
User-agent: ProPowerBot/2.14
Disallow: /
User-agent: BackDoorBot/1.0
Disallow: /
User-agent: toCrawl/UrlDispatcher
Disallow: /
User-agent: WebEnhancer
Disallow: /
User-agent: TightTwatBot
Disallow: /
User-agent: suzuran
Disallow: /
User-agent: VCI WebViewer VCI WebViewer Win32
Disallow: /
User-agent: VCI
Disallow: /
User-agent: Szukacz/1.4
Disallow: /
User-agent: QueryN Metasearch
Disallow: /
User-agent: Openfind data gathere
Disallow: /
User-agent: Openfind
Disallow: /
User-agent: Xenu's Link Sleuth 1.1c
Disallow: /
User-agent: Xenu's
Disallow: /
User-agent: Zeus
Disallow: /
User-agent: RepoMonkey Bait & Tackle/v1.01
Disallow: /
User-agent: RepoMonkey
Disallow: /
User-agent: Zeus 32297 Webster Pro V2.9 Win32
Disallow: /
User-agent: Webster Pro
Disallow: /
User-agent: EroCrawler
Disallow: /
User-agent: LinkScan/8.1a Unix
Disallow: /
User-agent: Kenjin Spider
Disallow: /
User-agent: Keyword Density/0.9
Disallow: /
User-agent: Cegbfeieh
Disallow: /
User-agent: SurveyBot
Disallow: /
User-agent: duggmirror
Disallow: /
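
For a publicly visible site installed in the domain root, the beginning of the generated domain.com/robots.txt should therefore look something like this, with WordPress’s own entries first, followed by the start of our list:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
User-agent: Aboundexbot
Disallow: /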

To test your change to do_robots(), simply access your domain.com/robots.txt file from your favorite browser. Did it work? Let us know.

Hopefully, this change will keep the Micro Instance from being overtaxed by overzealous bots. If you encounter a bot that simply ignores the robots.txt file, you may have to resort to adding a “deny from” entry in your server configuration, but in our experience we haven’t seen many of those.
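
As a sketch of that last resort, on an Apache 2.2-era server something along these lines in the virtual host or .htaccess would refuse the bot outright (the user-agent string here is just an example, and mod_setenvif must be enabled):

# Mark requests from the offending user agent, then deny them.
SetEnvIfNoCase User-Agent "Aboundexbot" bad_bot
<Limit GET POST HEAD>
  Order Allow,Deny
  Allow from all
  Deny from env=bad_bot
</Limit>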