According to Reuters, several artificial intelligence companies are bypassing robots.txt directives.
Perplexity, which markets itself as “a free AI search engine,” has come under widespread criticism in recent days. Wired reported that Perplexity has been defying the Robots Exclusion Protocol, commonly known as robots.txt, and scraping its website along with other Condé Nast properties. That report came shortly after Forbes accused Perplexity of copying its story and republishing it across multiple platforms. The company was also accused of scraping articles from the technology website The Shortcut. Now Reuters has found that Perplexity is not the only AI company bypassing robots.txt files to scrape website content that is then used to train its technologies.
TollBit, a startup that connects publishers with AI companies to broker licensing deals, sent a letter to publishers warning them that “AI agents from multiple sources (not just one company) are opting to bypass the robots.txt protocol in order to retrieve content from sites,” according to Reuters. The robots.txt file contains instructions that tell web crawlers which pages they may and may not access. Web developers have used the protocol since 1994, but compliance is entirely voluntary.
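To make the protocol concrete, here is a minimal sketch of how a well-behaved crawler honors robots.txt, using Python’s standard `urllib.robotparser` module. The rules and the bot names (`ExampleBot`, `OtherBot`) are hypothetical, for illustration only; the key point is that the check is purely advisory, so a crawler that skips it faces no technical barrier.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: one bot is barred from /articles/,
# everyone else is allowed everywhere (an empty Disallow permits all).
rules = """\
User-agent: ExampleBot
Disallow: /articles/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler calls can_fetch() before requesting a URL.
# Nothing enforces this step; compliance is up to the crawler.
print(parser.can_fetch("ExampleBot", "https://example.com/articles/story"))  # False
print(parser.can_fetch("OtherBot", "https://example.com/articles/story"))    # True
```

In practice a crawler would fetch the site’s live `/robots.txt` (e.g. via `parser.set_url(...)` and `parser.read()`) rather than parse a hard-coded string, but the allow/deny decision works the same way.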
Business Insider claims it found that OpenAI and Anthropic, the makers of the ChatGPT and Claude chatbots, respectively, are also bypassing robots.txt signals, though TollBit’s letter did not name any specific company. Both companies have previously said that they respect the “do not crawl” directives websites include in their robots.txt files.
During its investigation, Wired found that a machine running on an Amazon server, one “certainly operated by Perplexity,” was bypassing the robots.txt instructions on its website. To determine whether Perplexity was scraping its content, Wired fed the company’s tool headlines from its articles and brief prompts describing its stories. The tool produced results that closely paraphrased Wired’s articles “with minimal attribution,” and at times it generated inaccurate summaries. In one instance, according to Wired, the chatbot falsely claimed that the publication had reported on a specific California police officer committing a crime.
In an interview with Fast Company, Perplexity CEO Aravind Srinivas said his company “is not ignoring the Robot Exclusions Protocol and then lying about it.” That doesn’t mean it gains no benefit from crawlers that do ignore the protocol, however. Srinivas noted that Perplexity uses third-party web crawlers, and that the one Wired identified was among them. When Fast Company asked whether Perplexity had told the crawler provider to stop scraping Wired’s website, he would only say “it’s complicated.”
Defending his company’s practices, Srinivas told the publication that the Robots Exclusion Protocol is “not a legal framework” and suggested that publishers and companies like his may need to establish a new kind of relationship. He also reportedly insinuated that Wired deliberately used prompts designed to make Perplexity’s chatbot behave the way it did, implying that ordinary users would not see the same results. As for the inaccurate summaries the tool generated, Srinivas said: “We have never said that we have never hallucinated.”