For one of our customers, we leveraged WordPress and its powerful capabilities to create a rather large website consisting of hundreds of pages. Because the expected traffic was low, we installed the site on an economical AWS Micro Instance, which performed well. In the middle of last night, however, the instance’s CPU utilization hit 100% for nearly an hour. Anyone accessing the site during this period would have had a sluggish, if not unresponsive, experience.
AWS Micro Instances are great for testing and for deploying simple websites that, by their design and market, won’t be required to work very hard. That said, websites are at the mercy of whoever accesses them from the rest of the Internet. Too many accesses in too short a period can overrun the resource allotment of a Micro Instance.
In our investigation, we discovered that a bot called Aboundexbot was the culprit. Aboundexbot wanted to crawl the entire site, and quickly at that, which drove CPU utilization to 100% because AWS Micro Instances, as their name implies, are limited to a certain amount of CPU activity per unit of time. Unfortunately, Aboundexbot did not throttle its access as better-behaved bots do, and it apparently has no built-in mechanism (such as a timeout) to detect when it may be overtaxing a site.
In any case, we decided that we simply didn’t want Aboundexbot, and perhaps some other badly behaved bots, visiting our customer’s site, so as to keep the site performing well. Our thought was to add a corresponding “disallow” entry to the “robots.txt” file. However, whereas this is a simple task for a regular website, it is more challenging for a WordPress-based site installed in the domain root. In that case, all of the site’s root file accesses go through WordPress’s dynamic page generation, including accesses to the theoretical “robots.txt” file.
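To see why, it helps to recall the standard rewrite rules WordPress writes into the site’s .htaccess for a root install (assuming Apache with mod_rewrite enabled). Any request for a file that does not physically exist on disk, including /robots.txt, falls through to index.php and is answered by WordPress:

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
# If the requested file or directory does not exist on disk,
# hand the request to WordPress's front controller.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress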
In the WordPress wp-includes/ folder there is a file called functions.php containing a function called do_robots(), which dynamically creates the “robots.txt” output on demand. It is not very sophisticated, however, allowing for just two types of output depending on the Site Visibility setting under WordPress’s Dashboard > Settings > Privacy page.
We could have added a plug-in that provides finer robots.txt control, and we still might do that, but to get a solution in place quickly, we decided to simply enhance the do_robots() function as follows (our code addition is marked with a comment):
function do_robots() {
	header( 'Content-Type: text/plain; charset=utf-8' );
	do_action( 'do_robotstxt' );
	$output = "User-agent: *\n";
	$public = get_option( 'blog_public' );
	if ( '0' == $public ) {
		$output .= "Disallow: /\n";
	} else {
		$site_url = parse_url( site_url() );
		$path = ( !empty( $site_url['path'] ) ) ? $site_url['path'] : '';
		$output .= "Disallow: $path/wp-admin/\n";
		$output .= "Disallow: $path/wp-includes/\n";
		// Our addition: also append any extra rules found in wp-content/robots.txt
		$fbotmore = file_get_contents('./wp-content/robots.txt');
		if ($fbotmore !== false)
			$output .= $fbotmore;
	}
	echo apply_filters('robots_txt', $output, $public);
}
We are currently on WordPress version 3.3.1. Because different versions of WordPress may have different code for this function, use your programming know-how to add the two marked lines to the function in the most appropriate way. Note that this is not a permanent change, as any significant WordPress upgrade will overwrite functions.php and, with it, this modification.
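Incidentally, because do_robots() passes its output through the robots_txt filter (visible in the last line of the function above), a similar effect could probably be achieved without touching core at all by hooking that filter from a theme’s functions.php or a small plugin. Here is a rough sketch of that idea; append_extra_robots_rules is just our hypothetical name, and we have not adopted this approach ourselves:

// Append extra rules from wp-content/robots.txt via the 'robots_txt' filter,
// instead of editing wp-includes/functions.php directly.
function append_extra_robots_rules( $output, $public ) {
	// Leave the "blog not public" case alone; it already disallows everything.
	if ( '0' != $public ) {
		$extra = WP_CONTENT_DIR . '/robots.txt';
		if ( is_readable( $extra ) ) {
			$output .= file_get_contents( $extra );
		}
	}
	return $output;
}
add_filter( 'robots_txt', 'append_extra_robots_rules', 10, 2 );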
We then located a list of other badly behaved bots and installed our collective list in the wp-content/robots.txt file, which is now included whenever our domain.com/robots.txt file is accessed.
For your reference, here is what we came up with for the contents of our robots.txt file. Note that WordPress places a few entries of its own ahead of this content (see the abbreviated example after the list).
User-agent: Aboundexbot
Disallow: /
User-agent: NPBot
Disallow: /
User-agent: TurnitinBot
Disallow: /
User-agent: EmailCollector
Disallow: /
User-agent: EmailWolf
Disallow: /
User-agent: CopyRightCheck
Disallow: /
User-agent: Black Hole
Disallow: /
User-agent: Titan
Disallow: /
User-agent: NetMechanic
Disallow: /
User-agent: CherryPicker
Disallow: /
User-agent: EmailSiphon
Disallow: /
User-agent: WebBandit
Disallow: /
User-agent: Crescent
Disallow: /
User-agent: NICErsPRO
Disallow: /
User-agent: SiteSnagger
Disallow: /
User-agent: ProWebWalker
Disallow: /
User-agent: CheeseBot
Disallow: /
User-agent: ia_archiver
Disallow: /
User-agent: ia_archiver/1.6
Disallow: /
User-agent: Teleport
Disallow: /
User-agent: TeleportPro
Disallow: /
User-agent: Wget
Disallow: /
User-agent: MIIxpc
Disallow: /
User-agent: Telesoft
Disallow: /
User-agent: Website Quester
Disallow: /
User-agent: WebZip
Disallow: /
User-agent: moget/2.1
Disallow: /
User-agent: WebZip/4.0
Disallow: /
User-agent: Mister PiX
Disallow: /
User-agent: WebStripper
Disallow: /
User-agent: WebSauger
Disallow: /
User-agent: WebCopier
Disallow: /
User-agent: NetAnts
Disallow: /
User-agent: WebAuto
Disallow: /
User-agent: TheNomad
Disallow: /
User-agent: WWW-Collector-E
Disallow: /
User-agent: RMA
Disallow: /
User-agent: libWeb/clsHTTP
Disallow: /
User-agent: asterias
Disallow: /
User-agent: httplib
Disallow: /
User-agent: turingos
Disallow: /
User-agent: spanner
Disallow: /
User-agent: InfoNaviRobot
Disallow: /
User-agent: Harvest/1.5
Disallow: /
User-agent: Bullseye/1.0
Disallow: /
User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)
Disallow: /
User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
Disallow: /
User-agent: CherryPickerSE/1.0
Disallow: /
User-agent: CherryPickerElite/1.0
Disallow: /
User-agent: WebBandit/3.50
Disallow: /
User-agent: DittoSpyder
Disallow: /
User-agent: SpankBot
Disallow: /
User-agent: BotALot
Disallow: /
User-agent: lwp-trivial/1.34
Disallow: /
User-agent: lwp-trivial
Disallow: /
User-agent: Wget/1.6
Disallow: /
User-agent: BunnySlippers
Disallow: /
User-agent: URLy Warning
Disallow: /
User-agent: Wget/1.5.3
Disallow: /
User-agent: LinkWalker
Disallow: /
User-agent: cosmos
Disallow: /
User-agent: moget
Disallow: /
User-agent: hloader
Disallow: /
User-agent: humanlinks
Disallow: /
User-agent: LinkextractorPro
Disallow: /
User-agent: Offline Explorer
Disallow: /
User-agent: Mata Hari
Disallow: /
User-agent: LexiBot
Disallow: /
User-agent: Web Image Collector
Disallow: /
User-agent: The Intraformant
Disallow: /
User-agent: True_Robot/1.0
Disallow: /
User-agent: True_Robot
Disallow: /
User-agent: BlowFish/1.0
Disallow: /
User-agent: JennyBot
Disallow: /
User-agent: MIIxpc/4.2
Disallow: /
User-agent: BuiltBotTough
Disallow: /
User-agent: ProPowerBot/2.14
Disallow: /
User-agent: BackDoorBot/1.0
Disallow: /
User-agent: toCrawl/UrlDispatcher
Disallow: /
User-agent: WebEnhancer
Disallow: /
User-agent: TightTwatBot
Disallow: /
User-agent: suzuran
Disallow: /
User-agent: VCI WebViewer VCI WebViewer Win32
Disallow: /
User-agent: VCI
Disallow: /
User-agent: Szukacz/1.4
Disallow: /
User-agent: QueryN Metasearch
Disallow: /
User-agent: Openfind data gathere
Disallow: /
User-agent: Openfind
Disallow: /
User-agent: Xenu's Link Sleuth 1.1c
Disallow: /
User-agent: Xenu's
Disallow: /
User-agent: Zeus
Disallow: /
User-agent: RepoMonkey Bait & Tackle/v1.01
Disallow: /
User-agent: RepoMonkey
Disallow: /
User-agent: Zeus 32297 Webster Pro V2.9 Win32
Disallow: /
User-agent: Webster Pro
Disallow: /
User-agent: EroCrawler
Disallow: /
User-agent: LinkScan/8.1a Unix
Disallow: /
User-agent: Kenjin Spider
Disallow: /
User-agent: Keyword Density/0.9
Disallow: /
User-agent: Cegbfeieh
Disallow: /
User-agent: SurveyBot
Disallow: /
User-agent: duggmirror
Disallow: /
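Putting it all together: for a site installed in the domain root with the default “allow search engines” visibility setting, the generated domain.com/robots.txt should then begin roughly like this (abbreviated):

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
User-agent: Aboundexbot
Disallow: /
User-agent: NPBot
Disallow: /

…and so on through the rest of the list above.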
To test your change to do_robots(), just access domain.com/robots.txt from your favorite browser. Did it work? Let us know.
Hopefully, this change will keep the Micro Instance from being overtaxed by overzealous bots. If you encounter a bot that simply ignores the robots.txt file, you may have to resort to adding a “deny from” entry in your server configuration, but in our experience we haven’t seen many of those.
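For example, with Apache 2.2 something along these lines in the server configuration (or .htaccess) should turn away a bot by its User-Agent string; the “Aboundex” pattern here is just illustrative, and the exact directives will depend on your server version and setup:

# Flag requests whose User-Agent contains "Aboundex" (case-insensitive)...
BrowserMatchNoCase "Aboundex" bad_bot
# ...and deny those requests outright while allowing everyone else.
Order Allow,Deny
Allow from all
Deny from env=bad_bot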