A common concern is that there is no easy way to opt out of letting your content be used to train large language models (LLMs) such as ChatGPT. There is a way, but it isn't simple and it isn't guaranteed to work. LLMs are trained on data pulled from a wide variety of sources, and many of these datasets are open source and freely available for use in AI training.
Examples of the types of sources used:
- Wikipedia
- Government court records
- Books
- Emails
- Crawled websites
There are also portals and websites that offer huge amounts of this information as ready-made datasets.
GPT-3.5 was trained on the same datasets as GPT-3. The main difference is that GPT-3.5 was further tuned with a technique called Reinforcement Learning from Human Feedback (RLHF).
The five data sets used to train GPT-3 (and GPT-3.5) are described on page 9 of the research paper, Language Models are Few-Shot Learners.
The datasets are:
- Common Crawl (filtered)
- WebText2
- Books1
- Books2
- Wikipedia
Common Crawl and WebText2 are both essentially built by scraping the web at large.
WebText2 is a private OpenAI dataset created by crawling links posted to Reddit that received at least three upvotes, the idea being that such URLs are trustworthy and point to quality content. WebText2 itself is not publicly available, but there is an open-source recreation of it called OpenWebText2, built using the same crawl pattern, so it likely contains a similar (if not identical) set of URLs. I mention it because downloading OpenWebText2 is the easiest way to get an idea of what is in WebText2. A cleaned-up version of OpenWebText2 can be downloaded here, and the raw version can be found here.

I couldn't find any information about the user agent used for either crawler; it may simply identify itself as Python, but as far as I know there is no documented user agent to block (though I'm not 100% sure). What we do know is that if your site is linked from Reddit with at least three upvotes, there's a good chance it is in both the closed-source OpenAI WebText2 dataset and its open-source counterpart, OpenWebText2.
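If you do download OpenWebText2 and want a rough idea of whether your own domain shows up in it, something like the sketch below works. Note that the folder layout and field names here are assumptions; adjust them to match whatever the download actually contains.
Code:
# Rough scan of an extracted OpenWebText2 download for a given domain.
# ASSUMPTIONS: the archives have been extracted to JSON Lines files in a
# folder called openwebtext2_extracted, and each record carries a "url"-style
# metadata field -- adjust the path and field name to match the actual dump.
import json
from pathlib import Path

DOMAIN = "yourdomain.com"  # placeholder: your own domain

hits = 0
for path in Path("openwebtext2_extracted").glob("*.jsonl"):
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            if DOMAIN in record.get("url", ""):
                hits += 1

print(f"Records mentioning {DOMAIN}: {hits}")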
The Common Crawl dataset is one of the most widely used datasets of internet content. It is produced by a non-profit organization called Common Crawl, whose bot crawls the entire internet; organizations that want the data download it and then clean it of spammy sites and so on. The Common Crawl bot is named CCBot, and it obeys the robots.txt protocol, so it is possible to block Common Crawl with robots.txt and keep your site's data out of datasets built on it. If your site has already been crawled, it is likely already part of multiple datasets, but blocking Common Crawl prevents your content from being included in new datasets sourced from newer Common Crawl crawls.
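Common Crawl also publishes a public URL index at index.commoncrawl.org, so you can get a rough idea of whether your site has already been captured. Below is a minimal sketch using only the Python standard library; the domain is a placeholder and the crawl label is just an example, so pick a current crawl from the index page before running it.
Code:
# Query Common Crawl's public URL index for captures of a domain.
# The crawl label below is only an example; pick a current one from
# https://index.commoncrawl.org/ before running this.
import json
import urllib.error
import urllib.parse
import urllib.request

DOMAIN = "yourdomain.com"   # placeholder: your own domain
CRAWL = "CC-MAIN-2023-50"   # example crawl label

query = urllib.parse.urlencode({"url": f"{DOMAIN}/*", "output": "json"})
url = f"https://index.commoncrawl.org/{CRAWL}-index?{query}"

try:
    with urllib.request.urlopen(url) as resp:
        lines = resp.read().decode("utf-8").splitlines()
except urllib.error.HTTPError:
    lines = []  # an error response here usually just means no captures were found

print(f"{len(lines)} captures of {DOMAIN} in {CRAWL}")
for line in lines[:5]:
    print(json.loads(line).get("url"))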
The CCBot User-Agent string is:
Code:
CCBot/2.0
Add the following to your robots.txt file to block the Common Crawl bot:
User-agent: CCBot
Disallow: /
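To double-check that the rule is being picked up, you can test your live robots.txt against the CCBot user agent with Python's built-in robotparser module. This is just a quick sanity-check sketch; yourdomain.com is a placeholder.
Code:
# Quick check that your live robots.txt disallows CCBot.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://yourdomain.com/robots.txt")  # placeholder domain
rp.read()

# can_fetch() returns False when the rules for CCBot disallow the path.
print("CCBot allowed:", rp.can_fetch("CCBot", "https://yourdomain.com/any-page/"))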
An additional way to confirm that a request claiming to be CCBot is legitimate is that the real CCBot crawls from Amazon AWS IP addresses.
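Amazon publishes its IP ranges as JSON at https://ip-ranges.amazonaws.com/ip-ranges.json, so you can check an address from your access logs against them. A rough sketch follows (the IP below is a placeholder); keep in mind this only proves the request came from AWS, not that it was really CCBot.
Code:
# Check whether an IP from your access logs falls inside Amazon's published
# AWS ranges. Being inside AWS doesn't prove the request was really CCBot,
# but a "CCBot" user agent coming from outside AWS is almost certainly fake.
import ipaddress
import json
import urllib.request

SUSPECT_IP = "203.0.113.10"  # placeholder: take this from your access logs

with urllib.request.urlopen("https://ip-ranges.amazonaws.com/ip-ranges.json") as resp:
    ranges = json.load(resp)["prefixes"]  # IPv4 prefixes

ip = ipaddress.ip_address(SUSPECT_IP)
in_aws = any(ip in ipaddress.ip_network(r["ip_prefix"]) for r in ranges)
print(f"{SUSPECT_IP} is inside AWS ranges: {in_aws}")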
CCBot also obeys the nofollow robots meta tag directives.
Use this in your robots meta tag:
<meta name="CCBot" content="nofollow">
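If you add the tag through a plugin or theme, it's worth a quick check that it actually shows up in the rendered HTML. A minimal sketch, with yourdomain.com as a placeholder:
Code:
# Fetch a page and confirm the CCBot meta tag made it into the rendered HTML.
import urllib.request

URL = "https://yourdomain.com/"  # placeholder: any page on your site

with urllib.request.urlopen(URL) as resp:
    html = resp.read().decode("utf-8", errors="replace")

print("CCBot meta tag present:", 'name="CCBot"' in html)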
On top of this, I went a step further and created a Cloudflare firewall rule that blocks CCBot along with some other bots, such as generic Python clients.
If you're new to WordPress and don't know what a robots.txt file is, using the Yoast SEO WordPress plugin is the easiest way to create and edit one.
Yoast SEO: https://wordpress.org/plugins/wordpress-seo/
Once you have installed the plugin:
- Click on "Yoast SEO" in the menu
- Click on "Tools"
- Click on "File Editor"
- Click the "Create robots.txt file" button
- Copy/paste the above lines into the file
- Click the "Save changes to robots.txt" button
Unfortunately, there's no way to opt your previously crawled content out of OpenAI's database. You should assume that ChatGPT has access to everything you published before CCBot was blocked.
However, this will prevent your site from being included in future crawls and will protect any new content you publish.