robots.txt
To modify the output of robots.txt, a site’s environment (production or non-production) must be accessible at a custom domain that is set as the primary domain.
To replace the convenience domain with a custom primary domain, complete the steps to launch a WordPress single site or to launch the main site (ID 1) of a WordPress multisite.
Limitations
A site on a WordPress environment that is only accessible via a convenience domain has a hard-coded robots.txt output that returns:
User-agent: *
Disallow: /
Requests to any URLs on that environment will also return an x-robots-tag: noindex, nofollow header. These settings are intended to prevent search engines from indexing content hosted on non-production sites, or unlaunched production sites.
Modify the robots.txt file
To modify robots.txt for a site, hook into the do_robotstxt action or filter the output by hooking into the robots_txt filter.
In most cases, custom code to override the robots.txt file can be added to a theme’s functions.php file. For sites that require more tailored search engine crawling directives, custom code can be selectively added and enabled with a site-specific plugin.
Action
In this code example, the do_robotstxt action is used to disallow crawling of a specific directory for all User-Agents:
function my_robotstxt_disallow_directory() {
	echo 'User-agent: *' . PHP_EOL;
	echo 'Disallow: /path/to/your/directory/' . PHP_EOL;
}
add_action( 'do_robotstxt', 'my_robotstxt_disallow_directory' );
Filter
In this code example, the output of robots.txt is modified using the robots_txt filter:
function my_robots_txt_disallow_private_directory( $output, $public ) {
	$output .= 'Disallow: /wp-admin/' . PHP_EOL;
	$output .= 'Allow: /wp-admin/admin-ajax.php' . PHP_EOL;
	// Add custom rules here
	$output .= 'Disallow: /private-directory/' . PHP_EOL;
	$output .= 'Allow: /public-directory/' . PHP_EOL;
	return $output;
}
add_filter( 'robots_txt', 'my_robots_txt_disallow_private_directory', 10, 2 );
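The filter's second argument, $public, carries the value of the site's Search engine visibility setting. As a minimal sketch, assuming custom rules should only be published when the site is marked as public, the callback can return the default output untouched otherwise. The function name and the directory path are hypothetical placeholders.

```php
<?php
/**
 * Append custom rules only when the site is marked as public.
 *
 * @param string $output The robots.txt output generated so far.
 * @param mixed  $public Value of the blog_public option ('1' or '0').
 * @return string The possibly modified robots.txt output.
 */
function my_robots_txt_rules_if_public( $output, $public ) {
	// Leave the default output alone when search engines are discouraged.
	if ( ! $public ) {
		return $output;
	}

	// Hypothetical path; replace it with a real directory.
	$output .= 'Disallow: /private-directory/' . PHP_EOL;

	return $output;
}

// Register the callback only when WordPress is loaded.
if ( function_exists( 'add_filter' ) ) {
	add_filter( 'robots_txt', 'my_robots_txt_rules_if_public', 10, 2 );
}
```

Keeping the rule logic in a plain function that takes and returns a string makes it possible to check the output without loading WordPress.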
Disallow AI crawlers
Use the robots_txt filter to configure a site’s robots.txt to disallow artificial intelligence (AI) crawlers from crawling the site.
Note
Additional restrictions on access to a site’s content can be put in place for AI crawlers with the VIP_Request_Block utility class.
In this code example, a site’s robots.txt is configured to disallow requests from User-Agents of well-known AI crawlers (e.g. OpenAI’s GPTBot).
Only four AI crawlers are included in this code example, though far more exist. Customers should research which AI crawler User-Agents should be disallowed for their site and include them in a modified version of this code example.
function my_robots_txt_block_ai_crawlers( $output, $public ) {
	$output .= '
## OpenAI GPTBot crawler (https://platform.openai.com/docs/gptbot)
User-agent: GPTBot
Disallow: /
## OpenAI ChatGPT service (https://platform.openai.com/docs/plugins/bot)
User-agent: ChatGPT-User
Disallow: /
## Common Crawl crawler (https://commoncrawl.org/faq)
User-agent: CCBot
Disallow: /
## Google Bard / Gemini crawler (https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers)
User-agent: Google-Extended
Disallow: /
';
	return $output;
}
add_filter( 'robots_txt', 'my_robots_txt_block_ai_crawlers', 10, 2 );
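Because the list of crawlers is expected to grow, it may be easier to maintain as an array than as one long string. The following is a minimal sketch of that approach, not a definitive implementation; the function names are hypothetical, and the array holds the same four User-Agents as the example above.

```php
<?php
/**
 * Build robots.txt rules that disallow each listed User-Agent.
 *
 * @param string[] $user_agents User-Agent tokens to disallow.
 * @return string Rules to append to the robots.txt output.
 */
function my_build_disallow_rules( array $user_agents ) {
	$rules = '';

	foreach ( $user_agents as $user_agent ) {
		// One group per crawler, separated by a blank line.
		$rules .= PHP_EOL . 'User-agent: ' . $user_agent . PHP_EOL;
		$rules .= 'Disallow: /' . PHP_EOL;
	}

	return $rules;
}

/**
 * Filter callback that appends the generated rules to robots.txt.
 */
function my_robots_txt_block_listed_crawlers( $output, $public ) {
	// Extend this list after researching which crawlers to disallow.
	$user_agents = array( 'GPTBot', 'ChatGPT-User', 'CCBot', 'Google-Extended' );

	return $output . my_build_disallow_rules( $user_agents );
}

// Register the callback only when WordPress is loaded.
if ( function_exists( 'add_filter' ) ) {
	add_filter( 'robots_txt', 'my_robots_txt_block_listed_crawlers', 10, 2 );
}
```

Adding another crawler then only requires a new array entry.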
Test modifications
Modifications made to robots.txt should be tested on a non-production environment first. If the non-production environment is in an unlaunched state with a convenience domain, the environment’s hard-coded robots.txt must be temporarily overridden to allow for testing.
Add the following code to override the environment’s hard-coded robots.txt:
remove_filter( 'robots_txt', 'Automattic\VIP\Core\Privacy\vip_convenience_domain_robots_txt' );
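Since the override is meant to be temporary and limited to testing, one option is to guard it with an environment check so it cannot take effect in production. The sketch below assumes the VIP_GO_APP_ENVIRONMENT constant is defined and holds the environment name, with "production" for the production environment; the helper function name is hypothetical.

```php
<?php
/**
 * Decide whether the hard-coded robots.txt may be overridden.
 *
 * @param string $environment Environment name (assumed value of VIP_GO_APP_ENVIRONMENT).
 * @return bool True for any environment other than production.
 */
function my_is_non_production( $environment ) {
	return 'production' !== $environment;
}

// Remove the hard-coded output only on non-production environments.
if (
	function_exists( 'remove_filter' )
	&& defined( 'VIP_GO_APP_ENVIRONMENT' )
	&& my_is_non_production( VIP_GO_APP_ENVIRONMENT )
) {
	remove_filter( 'robots_txt', 'Automattic\VIP\Core\Privacy\vip_convenience_domain_robots_txt' );
}
```

The guard is a convenience only; the override should still be removed once testing is complete.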
Caching
The /robots.txt file is cached for long periods of time by the page cache. After changes are made to /robots.txt, the cached version can be purged by using the VIP Dashboard or VIP-CLI.
The cached version of /robots.txt can also be cleared from within the WordPress Admin dashboard.
- In the WP Admin, select Settings -> Reading from the left-hand navigation menu.
- Toggle the setting of Search engine visibility, and select the button labeled “Save Changes” each time the setting is changed.
Last updated: March 01, 2024