Data Sourcing

This is part 19 of “101 Ways AI Can Go Wrong” - a series exploring the interaction of AI and human endeavor through the lens of the Crossfactors framework.

The one thing that’s still open at OpenAI is the way they openly rip off the content of beloved artists to train their models - and encourage their users to generate images with them.

Data sourcing is topic #19 in my series of 101 Ways to Mess Things Up with AI.

What is it?

Data sourcing is the collection of training data to build AI models. It may be scraped from the web, obtained from paying customers or anonymously aggregated from patients. Data sourcing precedes other steps such as data ingestion, cleaning, processing, labelling, augmentation, etc.

Why It Matters

The modern age of generative AI is built on massive amounts of data, yet model developers have so far provided little clarity about its provenance. It is known that a great deal of copyrighted material was used to train these models. Even where material was not copyrighted, its creators shared it for other people, not for building genAI models that would later be monetized on the back of their work.

Even before genAI, there were many cases of data being sourced from customers, employees and even healthcare patients without their knowledge.

Real-World Example

Studio Ghibli is a Japanese animation studio that has produced several beautiful films that reached wide international audiences. In March 2025, an update to ChatGPT made it easier for users to create images in particular styles, or to apply a style to an existing image. “Ghiblified” or Ghibli-inspired images started being shared widely online, many of them well-known existing images converted to the studio’s recognizable animation style.

The virality of Ghibli images was quickly set against the backdrop of earlier sharp criticism of AI-generated animation by Hayao Miyazaki, a co-founder of Studio Ghibli. A video of this criticism resurfaced. The criticism also struck a chord with many fans of the studio’s work who didn’t like seeing the style turned into a meme.

It’s been clear for some time that OpenAI and other AI firms use creative works like these as source material to train their AI models. In a short amount of time, we saw both the virality and the ethical quandary caused by repurposing a specific set of works to generate images inspired by them.

Key Dimensions

Legality - data sourcing practices face significant legal and regulatory scrutiny, with several lawsuits currently in play. Issues of copyright, fair use, privacy and deception are all being investigated.

Ethics - even if something is legal, it doesn’t mean it is ethical. Collecting data to serve some other business aim, or only to monetize it, may not be ethical. Ethics frameworks in many domains, for example in healthcare where individual data may contribute to the greater good, are struggling to keep up.

Backlash - the public and governments may revolt against data sourcing practices, especially if they object to the end result.

Take-away

Big data and AI mean collecting massive amounts of data. Customers and users will ultimately care about the origin of this data as well as the sourcing practices - but likely not until you’ve already collected it.