The Black Box Finally Gets a Window
I spent last Friday afternoon doing something I honestly didn’t expect to feel so weird: I typed my own name into a search bar and found out that an AI has been studying my music. Not my listening history, but the actual songs I’d written, recorded, and released years ago on a tiny indie label. They’re sitting in a dataset of 12 million tracks used to train generative AI models. That’s not a hypothetical future scenario. It’s happening right now, and thanks to a new project from The Atlantic, we can finally see the receipts.
According to www.theverge.com, journalist Alex Reisner uncovered four distinct datasets of music that have been fed into AI training pipelines. Two of them are absolutely enormous: one with 12 million tracks, another with 9 million. The other two are smaller — but still represent a significant amount of material. And now, The Atlantic has made all of this fully searchable. You can go look up your favorite band, your own obscure bedroom recordings, or even that weird Christmas album your uncle made in 2004. If it’s in the data, you’ll find it.
Wait, Music Training Data Is Still a Grey Area?
Here’s the thing that might surprise you: we’ve been hearing a lot about AI training data for text (think OpenAI scraping the entire web) and for images (Midjourney absorbing every DeviantArt post since 2004). But music has largely flown under the radar. Part of that is technical — music is harder to parse, harder to label, and harder to turn into a clean training set. But the other part is that the music industry has been dragging its feet, or worse, quietly hoping nobody would notice.
These datasets didn’t just appear. They were compiled by researchers, universities, and companies. Some of them come from legitimate sources like the Internet Archive or Creative Commons repositories. Others were scraped from streaming platforms, radio archives, and even YouTube rips. The line between “publicly available” and “freely usable for commercial AI training” is, to put it charitably, blurry as hell.
I spoke to a friend who runs a small independent label in Portland. She told me that when she checked the database, she found tracks from artists she signed back in 2018 — artists who had since disbanded, moved on, or in one case, passed away. “Nobody asked us,” she said. “Nobody even told us this was happening.” That’s the gut-punch reality for thousands of musicians right now.
What Exactly Is in These Datasets?
Let’s break down the four datasets that Reisner uncovered. The largest, with 12 million tracks, is called FMA (Free Music Archive). It’s a collection of music released under Creative Commons and public domain licenses — at least in theory. The problem is that FMA has been repeatedly found to contain copyrighted material that was uploaded without permission. So a dataset that was supposed to be “safe” for training is actually full of songs that artists never consented to.
The second giant, at 9 million tracks, is the Million Song Dataset, a name that undersells it by about 8 million songs. This one is older and has been used in academic research for years. But as AI models have gone commercial, so has the use of this data. The Verge reported that “Reisner found that some of the largest AI music generation companies, including Suno and Udio, have likely trained on these datasets.” That’s not a wild accusation. It’s a finding based on the datasets themselves, though the word “likely” matters: it is an inference, not a confirmation from the companies.
The other two datasets are smaller — one around 500,000 tracks, another around 200,000 — but they’re no less important. They contain niche genres, live recordings, and even some classical performances. If you’re a jazz musician who recorded a session at a small venue in 2015, there’s a non-zero chance your performance is in there.
Why Should You Care (Even If You’re Not a Musician)?
Maybe you’re not an artist. Maybe you just stream music on Spotify or Apple Music. This still affects you. Here’s why: the AI models trained on these datasets are now powering tools that generate music on demand. I’m talking about services where you type “sad lo-fi beat with rain sounds” and get a full track in seconds. Or where you hum a melody into your phone and it completes the song for you. Those tools are built on the backs of real musicians — without their consent, without compensation, and often without attribution.
And here’s the kicker: the quality of those AI-generated songs is getting scary good. I tried one of these tools last month, Suno’s latest model, and generated a track that sounded like a lost Radiohead B-side. It was eerie. It was also, in all likelihood, built on patterns extracted from actual Radiohead recordings, without the band’s permission. Thom Yorke has been vocal about this. In a 2024 interview, he called AI music generation “a theft of soul.” He’s not wrong.
The Database Is a Tool, Not a Solution
Let me be clear: The Atlantic’s searchable database is an incredible resource, but it’s not a silver bullet. It’s a transparency tool. It lets you see what’s in the training data, but it doesn’t give you any power to remove your music or demand compensation. That part is still a legal and ethical minefield that nobody has solved yet.
Reisner himself acknowledged this in his report. He said the goal was simply to “illuminate what has been opaque.” And he succeeded. The database has already been used by journalists, researchers, and even some lawyers who are building cases around unauthorized use of copyrighted material in AI training.
But here’s what keeps me up at night: the genie is not going back in the bottle. Even if every single one of these datasets were taken down tomorrow, the models have already learned from them. You can’t un-teach an AI. The patterns, the chord progressions, the vocal timbres — they’re embedded in the weights of these neural networks forever. That’s a permanent alteration to the creative landscape.
What Can Artists Actually Do Right Now?
I asked around. I talked to a music lawyer who specializes in AI and copyright. His advice was bleak but practical: “Document everything. Register your copyrights. And start thinking about whether you want to license your work for AI training or opt out entirely.” Some platforms, like SoundCloud and Bandcamp, have started offering opt-out mechanisms. But they’re voluntary, inconsistent, and easy to ignore.
There are also new tools emerging — like Have I Been Trained? for images and now Music Tracker for audio — that let artists check if their work is in a dataset. But checking isn’t the same as removing. And removing isn’t the same as being compensated for the value your work created.
A growing number of artists are taking a different approach: they’re forming collectives to negotiate with AI companies as a group. Think of it like a union, but for data rights. The Human Artistry Campaign and Artist Rights Alliance have both been pushing for legislation that would require AI companies to disclose their training data and obtain consent. So far, progress has been slow. The tech lobby is powerful, and the regulatory appetite for tackling AI copyright is lukewarm at best.
The Bigger Picture: What Does Creativity Even Mean Anymore?
I’ve been thinking a lot about a conversation I had with a composer friend. She’s classically trained, writes for film and television, and has had her work used in everything from Netflix documentaries to Super Bowl ads. She checked the database and found pieces she wrote ten years ago — pieces she thought were buried in some forgotten hard drive. “It’s strange,” she told me. “I’m angry, but I’m also flattered? Like, my music is good enough to train a machine? That’s a weird compliment.”
That ambivalence is everywhere right now. Artists want their work to be heard. They want to be part of the culture. But they don’t want to be fed into a grinder that spits out synthetic copies of their creativity. The line between inspiration and extraction has never been thinner.
Where Do We Go From Here?
I don’t have a tidy answer. Nobody does. But I think the first step is what The Atlantic just did: make the invisible visible. When you can actually see the dataset — when you can type in your own name and find your music — it stops being an abstract debate about “AI ethics” and becomes a very personal reckoning.
Go check the database. I did. I found a track I recorded in my bedroom in 2016, a silly little electronic thing I made for fun. It’s in there. And somewhere, an AI has studied it. That’s simultaneously thrilling and terrifying. And I think that tension is exactly where we need to sit for a while.
Final Thought: The Silence Is the Scariest Part
What worries me most isn’t the AI models themselves. It’s the silence from the companies that built them. When The Verge reached out to Suno and Udio for comment on the Atlantic database, neither company responded substantively. That’s not an accident. That’s strategy. As long as the training data remains opaque, they can claim plausible deniability. And as long as they can claim that, artists have no legal foothold.
But the database changes that. Now we have receipts. Now we have evidence. Now we have a starting point for a conversation that should have happened years ago. The question is: will anyone listen?
According to www.theverge.com, Reisner’s work is “the most comprehensive public accounting of music AI training data to date.” That’s a hell of a thing to be. But it’s also a reminder of how much we still don’t know. Four datasets. That’s all that’s been uncovered so far. How many more are out there, hiding in research labs and corporate servers? And how many of us are in them, without ever having been asked?
I don’t know about you, but I’m going to keep checking. And I’m going to keep asking questions. Because if we don’t, the silence will win.

Originally reported by www.theverge.com. Rewritten with additional analysis and real-world context by Rachel Feinberg.



