November 15, 2023

Block AI from crawling your website

Oana Sheikh

Web crawling is the process of systematically accessing data on the internet. It is commonly used by search engines to retrieve requested information, which is what makes it possible for people on the internet to find their way to your website.

Generative AI tools rely on the same technology to locate material, including images that may end up in their training datasets. Using robots.txt, you can deter a web crawler from accessing your content. Robots.txt is a plain text file, served from the root of your website, that contains instructions on which pages and/or files a web crawler may look at.

Structure of robots.txt

The structure of a robots.txt file consists of one or more groups, each made up of two parts: a user-agent line and a set of directives. The user-agent identifies the web crawler, and the directives are the instructions applied to that specific crawler.
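To illustrate how a group's user-agent and directives work together, here is a minimal sketch using Python's standard-library robotparser module; the rule set and the example.com URLs are hypothetical:

    from urllib import robotparser

    # One group: the wildcard user-agent with a single Disallow directive.
    rules = """
    User-agent: *
    Disallow: /private/
    """

    rp = robotparser.RobotFileParser()
    # parse() accepts the file's lines directly and strips whitespace per
    # line, so no web request is needed to test rules locally.
    rp.parse(rules.splitlines())

    print(rp.can_fetch("Googlebot", "https://example.com/private/page.html"))  # False
    print(rp.can_fetch("Googlebot", "https://example.com/blog/post.html"))     # True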

Creating and Saving your robots.txt:

1) Create your robots.txt file in Notepad or any other plain text editor.

2) Add the directives corresponding to the type of block you want to create.

3) Save your file as “robots.txt” in the root directory of your domain.

4) Test your robots.txt by visiting:

    https://yourwebsite.com/robots.txt
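You can also check your rules programmatically. A minimal sketch using Python's standard-library robotparser module, assuming the hypothetical domain yourwebsite.com:

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://yourwebsite.com/robots.txt")  # hypothetical domain
    rp.read()  # fetches and parses the live file

    # Ask whether a given crawler may fetch a given URL under your rules.
    print(rp.can_fetch("GPTBot", "https://yourwebsite.com/"))
    print(rp.can_fetch("Googlebot", "https://yourwebsite.com/"))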

Types of directives:

  1. Full allow - all content can be crawled
  2. Full disallow - no content can be crawled
  3. Conditional allow - some content can be crawled 

Examples of directives:

1) Full allow

Grant bots unrestricted access to your website using the following lines:

    User-agent: *
    Disallow:

2) Full disallow 

You may restrict all bots from crawling your website by adding the following lines to your robots.txt file:

    User-agent: *
    Disallow: /

3) Search Engine disallow

You may block a specific search engine's bot from crawling your content by adding the following lines to your robots.txt, substituting the user-agent name from the table below:

    User-agent: {Search Engine Name}
    Disallow: /

    Search Engine    User-agent
    Baidu            baiduspider
    Bing             bingbot
    DuckDuckGo       DuckDuckBot
    Google           Googlebot
    Yahoo!           slurp
    Yandex           yandex
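You can sanity-check these user-agent values with Python's standard-library parser. A minimal sketch, assuming a rule set that blocks only bingbot and the hypothetical domain example.com:

    from urllib import robotparser

    # Block only Bing's crawler; with no wildcard (*) group present,
    # every other bot falls back to the default of full access.
    rules = ["User-agent: bingbot", "Disallow: /"]

    rp = robotparser.RobotFileParser()
    rp.parse(rules)

    for bot in ["baiduspider", "bingbot", "DuckDuckBot", "Googlebot", "slurp", "yandex"]:
        print(bot, rp.can_fetch(bot, "https://example.com/"))  # only bingbot prints False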

4) Specific URL disallow

You may restrict a specific page by adding the following lines to your robots.txt. The value is the URL path relative to your domain, for example Disallow: /private-page/:

    User-agent: *
    Disallow: {specific URL path}

For multiple URLs, add one Disallow line per path:

    User-agent: *
    Disallow: {specific URL path 1}
    Disallow: {specific URL path 2}

5) Specific File Type disallow

You may restrict specific file types from being crawled with the following lines. The * wildcard matches any sequence of characters, and the trailing $ anchors the match to the end of the URL; major crawlers such as Googlebot and Bingbot honor these pattern extensions, though not every crawler does:

    User-agent: *
    Disallow: /*.html$

For image files:

    User-agent: *
    Disallow: /*.jpg$

For an image directory:

    User-agent: *
    Disallow: /{directory or folder name}/

6) AI disallow

You may restrict an AI bot from crawling your site by adding the following lines:

    User-agent: {AI Crawler Name}
    Disallow: /

OpenAI disallow

    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

Google AI disallow

Google-Extended is the token Google uses to let you opt your content out of training its AI models; blocking it does not remove your site from Google Search:

    User-agent: Google-Extended
    Disallow: /
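To block several AI crawlers at once, simply stack one group per bot. A small sketch that generates such a block; GPTBot, ChatGPT-User, and Google-Extended are taken from the examples above, while CCBot (Common Crawl's crawler, whose archives are widely used for AI training) is an extra assumption you may or may not want:

    # Generate a robots.txt block that disallows a list of AI crawlers.
    AI_BOTS = ["GPTBot", "ChatGPT-User", "Google-Extended", "CCBot"]

    block = "\n\n".join(f"User-agent: {bot}\nDisallow: /" for bot in AI_BOTS)
    print(block)  # paste the output into your robots.txt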

CMS Guides to robots.txt

For additional information on using robots.txt in popular content management systems please refer to their guides below:

Shopify

SquareSpace

WordPress 

Webflow

It is important to identify which content you deem permissible for web crawlers to access. Crawlers play an integral role in the searchability of your website and/or content; therefore, a full disallow is not recommended. If you are worried about having your content accessed or used by AI, you can use robots.txt directives to disallow AI crawlers specifically.

Oana Sheikh

A passionate advocate for photographers, Oana partners with our creators to help them find and fight image theft across their portfolios.
