Can humanity's cultural guardians shield their digital treasures from exploitation while unleashing them as a boundless gift to the world?
"A world where your dataset is your power."
Museums, libraries, galleries, and archives—often called the GLAM sector—protect our shared cultural history.
Today, these institutions face a central contradiction of the AI era. They have spent years digitizing their physical collections into machine-readable files, with the aim of sharing knowledge widely as a public good. But that same openness now lets AI companies take, sell, and misuse the data to train systems that are always hungry for more. The key challenge, then, is this: how can these institutions keep real control over the data they are responsible for protecting while still fulfilling their goal of making the world's knowledge available to everyone?
Cultural Heritage and AI: How Institutions Can Reclaim Control of Their Data
E. M. Lewis-Jong confronts the paradox that defines the current moment in AI development, observing that:
“the very openness that would make these collections a gift to the public now exposes them to extraction, commodification, and misuse by the AI industry's insatiable demand for training data.”
The essay by Lewis-Jong, Cultural Heritage and AI: How Institutions Can Reclaim Control of Their Data, was posted at the Mozilla Data Collective and is free to read.
What kinds of datasets does the Mozilla Data Collective provide?
The Mozilla Data Collective is the platform for data agency and fair value exchange. It enables communities to build a tech future that's more multilingual, multicultural, and multimodal, on their own terms. Here are a few examples of the datasets:
Polish Public Domain 20th Century Literature Text Corpus: 54 iconic Polish literary works, including major novels, sprawling multi-volume historical epics, and documentary prose from the late 19th and early 20th centuries.
Tatar Folklore Text Corpus: a curated collection of Tatar folklore texts, including fairy tales, proverbs, short songs (quatrains), and legends.
Finnish Public Domain 20th Century Literature Text Corpus: a curated collection of public domain literature from Finland, featuring works by authors who died between 1901 and 1955. The dataset captures the literary landscape of early 20th-century Finland and includes independent texts in both of the country's official languages: Finnish (fi) and Swedish (sv).
The Mozilla Data Collective is rebuilding the AI data ecosystem with communities at the centre. It offers access to more than 470 high-quality global datasets, built by and for the community in a transparent and ethical way.


