Parent company Automattic reportedly included private data in the initial batch.
AI companies OpenAI and Midjourney are reportedly looking to purchase user data from Tumblr and WordPress. According to 404 Media, Automattic, the platforms’ parent company, is close to finalizing a deal to provide data to help train the AI companies’ models.
Although it is unclear which data will be included, the report suggests that Automattic may have initially overreached. An alleged internal post by Cyle Gage, Tumblr’s product manager, indicates that Automattic was preparing to send private or partner-related data that was not supposed to be part of the deal. The questionable content reportedly included private posts on public blogs, deleted or suspended blogs, unanswered questions (which were therefore never publicly posted), private answers, posts marked as explicit, and content from premium partner blogs (such as Apple’s former music site).
According to the internal post, Automattic’s engineers are compiling a list of post IDs that should have been excluded. It is unclear whether the data has already been sent to the AI companies.
Newtechmania emailed Automattic for comment on the report. The company responded with a published statement, in which it said, “We will share only public content that is hosted on WordPress.com and Tumblr from sites that have not opted out.” The statement notes that current regulations do not require AI companies’ web crawlers to honor users’ opt-out preferences.
The final line of Automattic’s statement appears to line up with the reported deals. “We are also working directly with select AI companies as long as their plans align with what our community cares about: attribution, opt-outs, and control,” Automattic added. The company said its partnerships will respect all opt-out settings, and that it plans to go a step further by regularly updating partners about people who have newly opted out and asking that their content be removed from past sources and future training.
The company reportedly plans to launch a new opt-out tool on Wednesday that lets users block third parties, including AI companies, from training on their data. An alleged internal FAQ that Automattic prepared for the tool, viewed by 404 Media, includes the answer, “If you opt out from the start, we will block crawlers from accessing your content by adding your site on a disallowed list.” The FAQ adds that if users change their mind later, Automattic plans to update partners about people who have newly opted out and ask that their content be removed from past sources and future training.
The wording about “asking” the AI companies to remove the data may be relevant here.
“We will notify existing partners on a regular basis about anyone who has opted out since the last time we provided a list,” Andrew Spittle, Automattic’s head of AI, allegedly wrote in an internal document in response to a staff question about data-removal assurances when using the tool. Spittle wrote that he wants this to be an ongoing process in which the company regularly advocates for past content to be excluded based on current preferences. He said Automattic will ask that the content be removed from any future training runs and permanently erased, and that, based on the company’s conversations with partners so far, he believes they will honor this. He added that he does not think they gain much overall by retaining it.
So, if a Tumblr or WordPress user asks to opt out of AI training, Automattic will allegedly “ask” and “advocate for” the removal of that user’s content. And the company’s head of AI “believes,” based on conversations so far, that the AI companies will find it in their best interest to comply. (How reassuring!)
AI training-data deals have emerged as a lucrative opportunity for websites struggling to find their footing in the current online publishing landscape. (Tumblr’s staff was reportedly reduced to a skeleton crew in late 2023.) Last week, ahead of Reddit’s initial public offering (IPO), Google struck a deal with the site to train on the platform’s vast trove of user-generated content. Meanwhile, OpenAI launched a partnership program to collect datasets from third parties to help train its AI models.
Update, February 27, 2024, 3:56 p.m. ET: This article has been updated to include a statement from Automattic, the parent company of WordPress and Tumblr.