There is a question that haunts generative AI companies: “What content is being used to train your models?” Some companies choose to dodge the question, while others are bullish and confront the issue up front, but whether AI companies have scraped content without permission for their own business purposes remains a thorny matter.
At best, the answer is likely to be a disingenuous explanation about a “curated data set”; at worst, a polemic about how everything on the Internet is essentially fair game.
According to documents now obtained by 404media, some of the data used to train Runway's latest AI video generation tool, Gen-3, came from the YouTube channels of thousands of popular media companies, including Pixar, Netflix, Disney, and Sony.
404media does not go into detail about how the documents were obtained, and it has not been able to confirm whether all of the videos mentioned in them were used for Gen-3 training. Even so, the documents could provide insight into the kind of scraping practices AI companies might use to train their models on copyrighted material.
A former Runway employee told 404media about their methodology. He said that 14 spreadsheets in the leaked documents use terms such as “beach” and “rain,” with the names of Runway employees next to them.
According to the source, the names belong to employees tasked with finding videos and channels related to those keywords. The videos were then scraped from the site using a YouTube video downloader tool run through a proxy, so as to avoid being blocked by Google.
It is not just YouTube content that appears to have been scraped: one spreadsheet contains 14 links to non-YouTube sources, including websites dedicated to streaming popular cartoons and animated films, against which thousands of copyright complaints have been recorded.
In short, it appears that pirated media, if not directly scraped and used, was at least considered as training data.
404media actually went one step further and attempted to use Gen-3 to generate videos with prompts containing keywords based on terms in the spreadsheet.
Runway was partially funded by Google, among others, so if it is true that the company scraped content from creators on Google's own platform without their permission, it is likely to lead to major problems.
Thorny as the issue of content theft by AI is, the model itself also seems to have problems: Ars Technica recently tried to create some videos using Gen-3 Alpha. We do not know what content was used to train that particular version of the model, but whatever methodology was used, the results suggest that the model needs some work anyway.