

Collateral damage of modern AI revolution

There is no doubt that today's LLMs have already changed the world. User behaviour has shifted: people open ChatGPT more often than Google, they use models to plan their vacations and dinners, and kids and teachers use them to prepare for lessons.

I am by no means someone who complains about progress, and I think it is great to see how the technology is evolving.

But there is another side to this revolution: at what cost, and on whose shoulders, does it all stand? There are certainly brilliant engineers at companies like OpenAI, Google and Anthropic who have made great progress and built impressive models.

But none of this would be possible without the data: publicly available data from all over the internet. Your company webpage, your blog, your post on Reddit, all of it was used to train models that now bring in millions for the companies that built them.

I will not go into the ethics of the intellectual property that was so often violated; there are ongoing legal actions in both the EU and the US trying to resolve that. Companies like Cloudflare have gone even further and proposed building a data marketplace, so that if a company wants to use your data, there is a mechanism for it to pay for that.

But what can companies do today?


How LLM companies collect data


There is a lot of data you can get from public sources. There are community-built datasets that were used to train the first models, and some companies have agreements to share data, like Google and some big newspapers. In the remaining cases, companies either use internal data, like Meta or GitHub, or have to crawl the internet.

Crawlers have been around for years, and search companies use them to index your pages so they can appear in search results. The difference between search index crawlers and LLM crawlers is that the latter turn out to crawl much more aggressively. If you host your website on a platform like Vercel, where you pay not only for compute but also for traffic, this suddenly becomes a big issue. In some cases it looks like a real DDoS attack, and your monthly infrastructure bill can increase several times over. That is fine when the traffic comes from real users who might convert into paying customers, or when you earn revenue from ads. But when it is a crawler that learns from your content and then uses that information to keep potential customers from ever visiting your website, it becomes a real problem.

There are many cases of small businesses and open source developers struggling to stop big companies from crawling their websites. Even well-resourced companies like iFixit suffered from intense data crawling at first and had to push back publicly to get AI companies to stop.

What companies can do


After multiple scandals and iterations, it looks like most of the companies have agreed on a common approach that lets you opt out of AI crawlers. For now this is not enforced; it is more of an agreement that everyone has promised to follow, and there is plenty of evidence that AI companies do not always respect these instructions.

Just as you can use robots.txt to tell search index crawlers which content they are allowed to crawl, you can now add instructions for AI bots as well:

For example, to block the main AI bots from your whole site:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /


The most common AI crawlers are GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google's AI models), CCBot (Common Crawl), and FacebookBot or facebookexternalhit (Meta's crawlers, used indirectly).
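Putting that list together, a single robots.txt group can opt all of them out at once, since the format allows several User-agent lines to share one rule. This is only a sketch based on the bot names above; check each vendor's documentation for the current tokens:

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: CCBot
User-agent: FacebookBot
Disallow: /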

Another approach, less common and not yet widely supported, is meta tags that tell the crawler what can and cannot be used on each individual page:

<meta name="robots" content="noai" />

If you want to be sure that your data is not being used by crawlers, you need a more controlled approach and should integrate a service like Cloudflare that provides bot protection. It comes with features like CAPTCHAs, IP blocking and more.
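If you run your own server and want a first line of defence before (or instead of) such a service, you can also reject requests at the application level based on the User-Agent header. Below is a minimal TypeScript sketch for a plain Node.js server; the list of substrings simply reuses the crawlers named above, and real bot protection would combine this with IP reputation checks and rate limiting, since user agents can be spoofed.

import { createServer } from "node:http";

// User-agent substrings of the AI crawlers mentioned above
// (assumption: simple substring matching is enough for a first pass).
const AI_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot", "FacebookBot"];

const server = createServer((req, res) => {
  const userAgent = req.headers["user-agent"] ?? "";
  if (AI_BOTS.some((bot) => userAgent.includes(bot))) {
    // Refuse AI crawlers instead of serving (and paying for) the page.
    res.writeHead(403, { "Content-Type": "text/plain" });
    res.end("AI crawling is not allowed on this site.");
    return;
  }
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end("Hello, regular visitor!");
});

server.listen(3000);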

Cloudflare went even further and recently released AI Labyrinth, where bots are not blocked but are instead sent to dummy pages with dummy content, wasting time and resources there. To me this looks more like punishment than prevention, and I have mixed feelings about whether it is a good idea: crawling the data in this labyrinth is a useless waste of resources, and the quality of the models can suffer as a result.

Another approach companies are taking is to add an AI crawling section to their terms of service. This gives them a legal basis for future legal action.

Conclusion

When we talk about the new AI language models, we usually forget that they would not be able to exist without the content of millions of small and large businesses that indirectly pay for them every day and rarely benefit from it. By taking back some level of control, we can push big tech companies to pay for that content, because as Cloudflare CEO Matthew Prince said:

“If you don’t compensate creators one way or another, then they stop creating, and that’s the bit which has to get solved”
