The OpenAI bot overwhelmed the website of a seven-person company 'as if it were a DDoS attack.'
OpenAI was making "tens of thousands" of requests to the company's servers in an attempt to download the entire Triplegangers site, which hosts hundreds of thousands of photographs.
On Saturday, Oleksandr Tomchuk, CEO of Triplegangers, discovered that his company's e-commerce site was down, apparently the target of a distributed denial-of-service attack. On investigating, he found that a bot belonging to OpenAI was relentlessly trying to scrape his enormous site. According to Tomchuk, the company has more than 65,000 products, each with its own page and at least three photos. The OpenAI bot was making “tens of thousands” of requests to the server to download all of it: hundreds of thousands of photos along with their detailed descriptions.
The bot used 600 different IP addresses to carry out the operation, a figure Tomchuk is still analyzing and which could turn out to be even higher. “Its crawlers were overwhelming our site,” he said, adding that it was essentially a DDoS attack. The website is vital to Triplegangers: the seven-person company has spent more than a decade assembling what it calls the largest database of “human digital doubles” on the internet, 3D image files scanned from real human models. It sells those files to 3D artists, video game developers, and anyone who needs to digitally recreate authentic human features.
Tomchuk's team, based in Ukraine and also licensed in Tampa, Florida, has terms of service that prohibit bots from using its images without authorization. That alone was not enough. To keep OpenAI's crawler out, a website must also serve a properly configured robots.txt file that specifically tells the company's bot, GPTBot, to stay away. OpenAI also operates other bots, such as ChatGPT-User and OAI-SearchBot, which follow their own rules.
The robots.txt file, part of the Robots Exclusion Protocol, was created to tell search engines which parts of a site they should not crawl when indexing the web. OpenAI says it respects these files when they are configured with its no-crawl directives, though it warns that its bots can take up to 24 hours to recognize an updated robots.txt file. As Tomchuk learned, if a site does not use the file correctly, OpenAI and others take that as permission to scrape without limit; robots.txt is an opt-out mechanism, not a system of prior authorization.
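For reference, opting out of OpenAI's crawlers can be as short as the sketch below; the user-agent names are the ones mentioned above, and the blanket "Disallow: /" tells each of them to stay away from the entire site:

```
# Block OpenAI's crawlers from the whole site
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /
```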
On top of the outage during U.S. business hours, Tomchuk expects a higher AWS bill from the intense activity the bot generated. And having a robots.txt file is no guarantee of protection, since AI companies comply with it only voluntarily: in one notable case, another AI startup, Perplexity, was accused in a report last summer of not honoring these rules.
Finally, after several more days of the bot returning, Triplegangers got a properly configured robots.txt file in place and set up a Cloudflare account to block GPTBot and other crawlers it discovered. Tomchuk hopes he has also shut out bots from other AI model companies, but he still has no reliable way of knowing what OpenAI managed to scrape or how to get it removed, because he has found no way to reach the company. OpenAI has not responded to requests for comment.
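For sites behind Cloudflare, one way to enforce this at the edge is a custom firewall rule that blocks requests by user agent. The expression below is only a sketch built from the bot names mentioned in this article, assuming Cloudflare's standard rules syntax; it is not the specific rule Triplegangers deployed:

```
(http.user_agent contains "GPTBot") or (http.user_agent contains "ChatGPT-User") or (http.user_agent contains "OAI-SearchBot")
```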
The situation is especially worrying for Triplegangers because its business depends on image rights: the company scans real people, and under laws like Europe's GDPR you cannot simply take someone's photo off the web and use it. The site is also an unusually attractive target for AI crawlers, since it holds meticulously labeled photos covering a wide range of features. It was the OpenAI bot's greed that alerted Tomchuk to how exposed his site was; had it crawled less aggressively, he might never have noticed.
Tomchuk warns that the only way to know whether an AI bot is taking protected content from a site is to actively monitor it. He is not alone: owners of other sites have reported similar problems, and the situation worsened in 2024, with one study finding that AI crawlers drove a significant rise in invalid web traffic. “Most sites still don't know they've been scraped by these bots,” concluded Tomchuk, who now checks his logs daily to spot potential intruders.
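As a rough sketch of what that monitoring can look like (the log path, log format, and bot list here are assumptions for illustration, not anything Triplegangers described), a short script can tally requests from known AI crawler user agents in a standard combined-format access log:

```python
# Minimal sketch: count access-log requests from known AI crawler user agents.
# Assumes a combined-format log; the path and bot list are illustrative only.
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # hypothetical location
AI_BOTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "PerplexityBot"]

# In the combined log format, the user agent is the last quoted field on the line.
UA_PATTERN = re.compile(r'"([^"]*)"$')

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = UA_PATTERN.search(line.rstrip())
        if not match:
            continue
        user_agent = match.group(1)
        for bot in AI_BOTS:
            if bot in user_agent:
                hits[bot] += 1

for bot, count in hits.most_common():
    print(f"{bot}: {count} requests")
```

Run daily against rotated logs, a tally like this makes a sudden spike from a single crawler obvious long before it shows up as an outage or a surprise hosting bill.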