If you go to The Atlantic’s website right now, you’ll find a live counter in blue. At the time of this writing, the numbers are at 108 million and counting, with the final digit changing every second. This counter is their AI Watchdog, tracking the creative works that tech companies are using to train their large language models.
This came from an ongoing investigation by The Atlantic journalist Alex Reisner, in which he discovered four giant datasets of songs circulating within the AI-development community. The biggest dataset has 12 million songs, followed by another with 9 million. The two smaller datasets each have more than 100,000. These datasets include copyrighted hits from major artists such as Bad Bunny, Nirvana, Taylor Swift and Billie Eilish, as well as songs by independent artists.
When I keyed the names of Malaysian independent artists into the Watchdog, it came as little surprise that results kept appearing. Based on my findings, at least 1,000 songs by Malaysian independent artists are listed. There could be many more already in the datasets, with new additions to come.
The numbers add up quickly. Rock band Butterfingers has 54 songs in the 12-million-song dataset and 19 in the 9-million dataset, with listings drawn from their four earlier albums — though their final album, Kembali (2008), is absent. Hujan — the term “independent” used loosely here, but you know what I mean — has the most I’ve found so far, with 108 songs across both datasets, perhaps due to their legacy and extensive discography. Others include Lunadira (42 songs), No Good (21 songs), Bittersweet (30 songs), Carburetor Dung (15 songs), Piri Reis (13 songs) and Golden Mammoth (9 songs), among many more.
Not every artist appears in full. Folk singer Fikri Fadzil, better known as Bayangan, has 16 songs listed, though his 2024 EP Dari Pinggiran is absent. Earlier this year his music was taken off digital service providers (DSPs) and is now available only on Bandcamp; he believes his songs were likely included in the datasets before the EP’s release in May 2024.
“I’m still wrapping my head around this. Hard to say, or unfair to point the finger at DSPs, unless there’s proof,” Fikri told me.
Aidil Rusli, whose emo band Playburst has 10 songs in the dataset, wasn’t pleased. His other band, Couple, which has released more albums, couldn’t be verified through the tool because the word appears in too many other artist names, but if Playburst is listed, it’s reasonable to assume Couple is too.
“This is definitely not cool, and I definitely don’t like this,” Aidil said. “But I guess it’s inevitable because anything that’s available online will be used to train AI.”
Hip hop artists are not exempt. Lucidrari has 29 songs listed, with tracks from his latest album Teletext absent. Hip hop collective 53 UNIVERSE has 27 songs, also missing their recent album ZOOBOYS. Both unlisted albums were released only last year.
A pattern emerges among the artists that do appear: they either debuted between the 90s and 2010s — giving their catalogues a longer footprint online — or they’re newer artists with high streaming numbers. Rapper Eemrun, whose debut album came out in 2024, already has over 100,000 monthly listeners on Spotify and 46 songs across two datasets. Alt rock band Smesta, with over 11,000 monthly listeners, has songs from their debut EP Rotten Fantasy (2023) listed — their most-streamed project. Meanwhile, newer artists like Killamisha, Mister Two Five, Okirama, Su San and Heidi Moru — most of whom have been active since 2023 — don’t appear at all, which muddies the timeline of when exactly these songs were added.
How the songs ended up in these datasets in the first place is rarely straightforward, since AI training datasets aren’t usually disclosed. Reisner found that one of the ways tech companies like Google obtained music was through the Free Music Archive, a site that allows free streaming for personal listening but requires payment for commercial use — including AI training.
He also found that three of the datasets are distributed as lists of links to songs on YouTube or Spotify, with AI developers downloading the actual audio using tools that automate the process, some of which allow them to bypass logins, advertisements and mechanisms that might otherwise earn money for creators.
Put simply: if your music is available online — even behind paywalls or subscription services — chances are it has already been used to train AI.
What are tech companies using these songs for? The applications are varied, and some are already here. YouTube, for instance, added a “Replace Song” button in May that lets creators generate AI-produced instrumental tracks to swap out copyrighted audio in their videos — framed as a way to resolve Content ID claims without removing content. The practical effect is that creators are being steered toward AI-generated music instead of licensing from real musicians, a shift with real consequences for an industry already stretched thin.
Takahara Suiko, best known for her work in The Venopian Solitude and VIONA, is among the few indie artists who have been vocal against AI. When I told her that 16 of her songs were listed, her suggestion to face this was blunt: go off-grid.
“For us to run away from being used to train AI is to go off-grid, literally. It’s the only way to do that.
“But to me as a songwriter, I feel like, you know what? Take all the data you want. Go ahead. My pride is in being human and making human art. At the end of the day, what makes me happy is knowing there are people who can relate to my music.”
The invisible war against AI feels like impending doom for indie artists already struggling to be seen. But Aqmal of Pasca Sini, who has five songs listed in a dataset, has a more grounded outlook, and it centres on live music. In the face of all this, he’s been rethinking his approach to performing. He’s moving toward equipment that offers a different sound each show rather than replicating the studio version.
“Now with all that’s going on, feels like a good time to embrace it.”
For Aqmal, the imperfections of live music are precisely the point. It’s something that can’t be replicated or repurposed, which also means that for listeners, a live show is now one of the few spaces where the music remains entirely theirs.
“In terms of music being online and up for grabs by big tech, it’s still nice having friends discover us from far away places and it opens up a lot of doors for us to connect,” Aqmal said.
“I guess this is the trade off.”





