Microsoft is setting its sights on a critical question in AI: how much influence does a specific piece of training data—be it a book, a photograph, or a snippet of code—have on a generative AI model’s output? A recently surfaced job listing suggests the company is exploring a method to track and estimate this influence, an initiative that could have significant implications for copyright, compensation, and the ongoing legal battles over AI training data.

The Research Behind the Headlines

The job listing, originally posted in December, describes a research internship focused on what Microsoft calls “training-time provenance.” The goal? To prove that models can be designed in a way that allows for efficient tracing of their training data’s impact. This effort is reportedly being guided by Jaron Lanier, a prominent researcher at Microsoft who has long advocated for what he calls “data dignity”—the idea that digital content should be linked to its human creators in a way that allows for recognition and, potentially, compensation.

Lanier envisions a world where, if an AI generates an animated film of “kids in an oil-painting world of talking cats,” the most influential oil painters, voice actors, and writers whose works contributed to that generation would be acknowledged—and maybe even paid.
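Microsoft hasn’t said how such tracing would actually work, but published research gives a flavor of the problem. One well-known approach, TracIn (Pruthi et al., 2020), scores a training example’s influence on a model output by comparing loss gradients: if a training example’s gradient points in the same direction as the gradient for a given output, that example likely pushed the model toward producing it. The toy sketch below is purely illustrative and assumes a tiny linear model evaluated at a single checkpoint; it is not Microsoft’s method, just a minimal version of the gradient-similarity idea.

```python
import numpy as np

# Toy illustration of gradient-based influence estimation (TracIn-style).
# The influence of a training example on a test output is approximated by
# the dot product of their loss gradients with respect to the model weights.
# The linear model, the data, and the single-checkpoint approximation are
# all simplifying assumptions for illustration only.

rng = np.random.default_rng(0)

# Tiny linear model y = x @ w, trained with squared error.
n_train, n_features = 8, 3
true_w = np.array([1.0, -2.0, 0.5])
X_train = rng.normal(size=(n_train, n_features))
y_train = X_train @ true_w + 0.1 * rng.normal(size=n_train)

# "Trained" weights via least squares (a stand-in for a converged checkpoint).
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def loss_grad(x, y, w):
    """Gradient of 0.5 * (x @ w - y)**2 with respect to w."""
    return (x @ w - y) * x

# A query the model answers at inference time.
x_test = rng.normal(size=n_features)
y_test = x_test @ true_w
g_test = loss_grad(x_test, y_test, w)

# Influence score per training example: larger means that example's
# gradient aligns more closely with the test example's gradient.
scores = np.array([loss_grad(x, y, w) @ g_test
                   for x, y in zip(X_train, y_train)])

# Rank training examples by estimated influence on this one output.
for i in np.argsort(scores)[::-1]:
    print(f"train example {i}: influence {scores[i]:+.4f}")
```

Scaling anything like this to a frontier model is the hard part: it means storing or recomputing per-example gradients across billions of training documents, which is presumably why the listing frames the work as designing models so that tracing is efficient from the start, rather than reconstructing it after the fact.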

A Response to Mounting Legal Pressure?

Microsoft’s move comes at a time when AI companies, Microsoft among them, are facing growing legal scrutiny over their data practices. The New York Times sued both Microsoft and OpenAI in December 2023, accusing them of training AI models on millions of its articles without permission. Meanwhile, software developers have filed lawsuits over GitHub Copilot, claiming their code was used without proper licensing.

Generative AI has long been trained on massive datasets scraped from the internet, often without clear consent from copyright holders. Many AI companies argue that fair use allows them to do so, while artists, writers, and programmers push back, seeing it as an exploitation of their work.

Some AI startups, such as Bria, have already begun compensating data owners based on their contributions. Adobe and Shutterstock also offer payouts to dataset contributors, albeit with opaque payment structures. However, major AI labs have largely resisted establishing direct compensation mechanisms, opting instead for opt-out programs that are often cumbersome and only affect future models.

Is This More Than Just a PR Move?

Despite the ambitious nature of Microsoft’s research, there’s skepticism about whether it will result in real-world changes—or if it’s merely a PR strategy to deflect regulatory scrutiny. OpenAI previously announced a similar initiative to give creators control over how their works are used in AI training, yet nearly a year later, little progress has been made.

Meanwhile, Microsoft, OpenAI, and other leading AI companies have been lobbying for weaker copyright protections in AI development, arguing that fair use should be codified to protect model training. Their stance suggests they see unrestricted data access as crucial to their business model.

So, is Microsoft’s research a genuine step toward ethical AI, or just an attempt to “ethics wash” its practices in the face of legal and regulatory challenges? Only time will tell. But if the company does manage to develop a transparent system for tracking training data’s influence—and actually compensates creators—it could mark a significant shift in how AI models interact with the creative world.
