iFixit and Freelancer say Anthropic's bot has been crawling their websites aggressively.
Anthropic, the AI startup behind the Claude large language models, has been accused by Freelancer of ignoring its "do not crawl" robots.txt policy to scrape data from its websites. Meanwhile, iFixit CEO Kyle Wiens said Anthropic disregarded the site's policy prohibiting the use of its content to train AI models. Freelancer chief executive Matt Barrie told The Information that Anthropic's ClaudeBot is "the most aggressive scraper by far": his website reportedly received 3.5 million visits from the company's crawler within four hours, "probably about five times the volume of the number two" AI crawler. Anthropic's bot also reportedly hit iFixit's servers a million times within 24 hours, according to a post by Wiens on X/Twitter. "You're not only taking our content without paying, you're tying up our devops resources," he wrote.
In June, Wired accused another AI company, Perplexity, of crawling its website despite the presence of the Robots Exclusion Protocol, better known as robots.txt. A robots.txt file typically tells web crawlers which pages they may and may not access; compliance is voluntary, and malicious bots have simply ignored it. A few days after the Wired article came out, TollBit, a startup that connects AI firms with content publishers, reported that Perplexity is not the only one bypassing robots.txt signals. While it did not name names, Business Insider reported that OpenAI and Anthropic were also found to be ignoring the protocol.
According to Barrie, Freelancer at first tried to refuse the bot's access requests, but ultimately had to block Anthropic's crawler entirely. "This is egregious scraping [which] makes the site slower for everyone operating on it and ultimately affects our revenue," he added. As for iFixit, Wiens said the website sets off alarms when traffic spikes, and the Anthropic activity woke his staff up at 3 AM. Anthropic's crawler stopped scraping iFixit only after the company added a line to its robots.txt file that specifically disallows the company's bot.
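For reference, disallowing a single crawler by name takes only a two-line rule in robots.txt. A minimal sketch of the kind of entry iFixit describes, assuming the crawler identifies itself with the `ClaudeBot` user-agent string used in these reports:

```
# Block Anthropic's crawler from the entire site
User-agent: ClaudeBot
Disallow: /
```

The `Disallow: /` directive covers every path on the domain; compliance remains voluntary on the crawler's part, which is why Freelancer ultimately resorted to blocking the bot outright.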
In a statement to The Information, the AI startup said it respects robots.txt and that its crawler "respected that signal when iFixit implemented it." It added that it aims "for minimal disruption by being thoughtful about how quickly [it crawls] the same domains," and said it is investigating the matter.
Crawlers matter to AI companies because they collect website content that can be used to train generative AI technologies. That has made them the target of multiple lawsuits from publishers alleging copyright infringement. To head off further litigation, companies such as OpenAI have been striking deals with publishers and websites; OpenAI's content partners so far include News Corp, Vox Media, the Financial Times, and Reddit. Wiens appears open to a similar arrangement for iFixit's repair-guide articles, telling Anthropic in a tweet that he is willing to discuss licensing its content for commercial use.
Hey @AnthropicAI: I get you're hungry for data. Claude is really smart! But do you really need to hit our servers a million times in 24 hours?
— Kyle Wiens (@kwiens) July 24, 2024
You're not only taking our content without paying, you're tying up our devops resources. Not cool.