ByteDance has been building an AI music beast… with a little help from The Beatles and Michael Jackson



TikTok’s $300 billion-valued parent company, ByteDance, is one of the world’s busiest AI developers. It plans to spend billions of dollars on AI chips this year, while its tech gives Sam Altman’s OpenAI a run for its money.

ByteDance’s Doubao AI chatbot is currently the most popular AI assistant in China, with 78.6 million monthly active users as of January.

This makes it the world’s second most-used AI app behind OpenAI’s ChatGPT (with 349.4 million MAUs). The recently launched Doubao-1.5-pro is claimed to match the performance of OpenAI’s GPT-4o at a fraction of the cost.

As Counterpoint Research notes in this breakdown of Doubao’s positioning and functionality, “much like its international rival ChatGPT, the cornerstone of Doubao’s appeal is its multimodality, offering advanced text, image, and audio processing capabilities”.

It can also generate music.

In September, ByteDance added an AI music generation function to the Doubao app, which reportedly supports more than ten types of music styles and “allows you to write lyrics and compose music with one click”.

This, though, isn’t the end of ByteDance’s fascination with building music AI technologies.



On September 18, ByteDance’s Doubao Team announced the launch of a suite of AI music models dubbed Seed-Music.

Seed-Music, they claimed, would “empower people to explore more possibilities in music creation”.



Established in 2023, the ByteDance Doubao (Seed) Team is “dedicated to building industry-leading AI foundation models”.

According to the official launch announcement for Seed-Music in September, the AI music product “supports score-to-song conversion, controllable generation, music and lyrics editing, and low-threshold voice cloning”.

It also claims that “it cleverly combines the strengths of language models and diffusion models and integrates them into the music composition workflow, making it suitable for diverse music creation scenarios for both beginners and professionals”.

The official Seed-Music website contains numerous audio clips that demonstrate what it can do.

You can hear some of that below:



More important, though, is how Seed-Music was built.

Thankfully, the Doubao Team has published a tech report that explains the inner workings of their Seed-Music project.

MBW has read it cover to cover.



In the introduction to ByteDance’s research paper, which you can read in full here, the company’s researchers state that “music is deeply embedded in human culture” and that “throughout human history, vocal music has accompanied key moments in life and society: from love calls to seasonal harvests”.

“Our goal is to leverage modern generative modeling technologies, not to replace human creativity, but to lower the barriers to music creation.”

ByteDance research paper for Seed-Music

The intro continues: “Today, vocal music remains central to global culture. However, creating vocal music is a complex, multi-stage process involving pre-production, writing, recording, editing, mixing, and mastering, making it challenging for most people.”

“Our goal is to leverage modern generative modeling technologies, not to replace human creativity, but to lower the barriers to music creation. By offering interactive creation and editing tools, we aim to empower both novices and professionals to engage at different stages of the music production process.”


How Seed-Music works

ByteDance’s researchers explain that the “unified framework” behind Seed-Music “is built upon three fundamental representations: audio tokens, symbolic tokens, and vocoder latents”, each of which corresponds to “a generation pipeline.”



The audio token-based pipeline, as illustrated in the chart below, works like this: “(1) Input embedders convert multi-modal controlling inputs, such as music style description, lyrics, reference audio, or music scores, into a prefix embedding sequence. (2) The auto-regressive LM generates a sequence of audio tokens. (3) The diffusion transformer model generates continuous vocoder latents. (4) The acoustic vocoder produces high-quality 44.1kHz stereo audio.”
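The dataflow between those four stages can be sketched in code. Every function below is a toy stand-in (the real components are large neural models; the names, dimensions, and 100ms-per-latent-frame timing are our assumptions), but the shape of what passes between stages follows the quoted steps.

```python
import numpy as np

# Toy stand-ins for the four stages of the audio token-based pipeline.
# All names, dimensions, and timings here are illustrative assumptions.

def embed_inputs(style: str, lyrics: str) -> np.ndarray:
    """(1) Input embedders: control inputs -> prefix embedding sequence."""
    words = (style + " " + lyrics).split()
    return np.zeros((len(words), 8))            # (seq_len, embed_dim)

def lm_generate(prefix: np.ndarray, n_tokens: int = 12) -> np.ndarray:
    """(2) Auto-regressive LM: prefix -> discrete audio tokens (dummy ids)."""
    return np.arange(n_tokens) % 1024

def diffusion_to_latents(tokens: np.ndarray) -> np.ndarray:
    """(3) Diffusion transformer: audio tokens -> continuous vocoder latents."""
    return np.zeros((len(tokens), 16))          # (n_tokens, latent_dim)

def vocode(latents: np.ndarray, sr: int = 44100) -> np.ndarray:
    """(4) Acoustic vocoder: latents -> 44.1kHz stereo waveform."""
    seconds = len(latents) * 0.1                # assume 100 ms per latent frame
    return np.zeros((int(sr * seconds), 2))     # (samples, stereo channels)

audio = vocode(diffusion_to_latents(lm_generate(embed_inputs("upbeat pop", "la la la"))))
print(audio.shape)  # -> (52920, 2)
```

The point of the sketch is the hand-off: discrete tokens come out of the language model, continuous latents come out of the diffusion model, and only the vocoder touches the waveform.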



In contrast to the audio token-based pipeline, the symbolic token-based Generator, which you can see in the chart below, is “designed to predict symbolic tokens for better interpretability”, which the researchers state is “crucial for addressing musicians’ workflows in Seed-Music”.



According to the research paper, “Symbolic representations, such as MIDI, ABC notation and MusicXML, are discrete and can be easily tokenized into a format compatible with LMs”.
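To see why symbolic formats tokenize so easily, here is a minimal, hypothetical tokenizer for a line of ABC notation. The paper does not publish Seed-Music’s actual tokenization scheme, so the regex and token inventory below are purely our own illustration.

```python
import re

# Hypothetical tokenizer: Seed-Music's real scheme is not published, so this
# regex (bar lines and basic ABC notes with accidentals/octaves/durations)
# is only an illustration of how plain-text notation splits into LM tokens.
ABC_TOKEN = re.compile(r":\||\|:|\||[_^=]?[A-Ga-gz][,']*\d*/?\d*")

def tokenize_abc(melody: str) -> list:
    """Split an ABC melody line into note/bar-line tokens an LM could consume."""
    return ABC_TOKEN.findall(melody)

print(tokenize_abc("C2 E2 G2 c2 | B2 G2 E2 C2 |"))
# -> ['C2', 'E2', 'G2', 'c2', '|', 'B2', 'G2', 'E2', 'C2', '|']
```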

ByteDance’s researchers add in the paper: “Unlike audio tokens, symbolic representations are interpretable, allowing creators to read and modify them directly. However, their lack of acoustic detail means the system has to rely heavily on the Renderer’s ability to generate nuanced acoustic characteristics for musical performance. Training such a Renderer requires large-scale datasets of paired audio and symbolic transcriptions, which are especially scarce for vocal music.”


The obvious question…

By now, you’re probably asking where The Beatles’ and Michael Jackson’s music comes into all of this.

We’re nearly there. First, we need to talk about MIR models.

According to the Seed-Music research paper, “to extract the symbolic features from audio for training the above system,” the team behind the tech used various “in-house Music Information Retrieval (MIR) models”.

According to this very clear explanation over at Dataloop, MIR “is a subcategory of AI models that focuses on extracting meaningful information from music data, such as audio signals, lyrics, and metadata”.

Aka: it’s a metadata scraper. Feed a song into the jaws of a MIR model, and it’ll analyze, predict and present data that can include pitch, beats-per-minute (BPM), lyrics, chords, and more.
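As a toy illustration of one such MIR task – beat tracking – the sketch below estimates BPM from the spacing of amplitude peaks in a synthetic click track. Real MIR models (including ByteDance’s in-house ones) are far more sophisticated; this is only meant to show the kind of signal-to-data mapping involved.

```python
import numpy as np

def estimate_bpm(signal: np.ndarray, sr: int, threshold: float = 0.5) -> float:
    """Estimate tempo from the spacing of amplitude peaks.

    A toy stand-in for one MIR task (beat tracking); real systems work from
    onset-strength envelopes and autocorrelation, not raw thresholding.
    """
    above = np.abs(signal) > threshold
    # Indices where the envelope crosses the threshold upward
    onsets = np.flatnonzero(above[1:] & ~above[:-1])
    ioi = np.median(np.diff(onsets)) / sr   # median inter-onset interval, seconds
    return 60.0 / ioi

# Synthetic click track: a 100-sample click every 0.5 s => 120 BPM
sr = 22050
clicks = np.zeros(sr * 4)
for k in range(0, len(clicks), sr // 2):
    clicks[k:k + 100] = 1.0

print(round(estimate_bpm(clicks, sr)))  # -> 120
```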

Music Information Retrieval research first gained popularity for its ability to help with the digital classification of genres, moods, tempos, and so on – key building blocks for the recommendation systems used by music streaming services.

Now, though, leading generative AI music platforms are reportedly using MIR research to improve their product output.


Can you see where this is going? Yes, of course.

ByteDance’s research team has successfully built its own in-house MIR models, which were used to “extract the symbolic features from audio” to build elements of its Seed-Music system. These MIR models include one focused on structural analysis.


AI, are you okay? Are you okay, AI?

Taking a deeper dive into the research published by ByteDance for its structural analysis-focused MIR model, we find a paper titled:

‘To catch a chorus, verse, intro, or anything else: Analyzing a song with structural functions’.

It was published in 2022. You can read it here.

According to the paper: “Conventional music structure analysis algorithms aim to divide a song into segments and to group them with abstract labels (e.g., ‘A’, ‘B’, and ‘C’).

“However, explicitly identifying the function of each segment (e.g., ‘verse’ or ‘chorus’) is rarely attempted, but has many applications”.

In this paper, they “introduce a multi-task deep learning framework to model these structural semantic labels directly from audio by estimating ‘verseness,’ ‘chorusness,’ etc., as a function of time”.
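In that framing, per-frame “verseness”/“chorusness” curves still need to be turned into labelled song sections. The sketch below shows one simple way to do that post-processing; the curves here are hand-made rather than model output, and the function is our illustration, not the paper’s algorithm.

```python
import numpy as np

def curves_to_segments(curves: dict, hop_s: float) -> list:
    """Merge per-frame label curves into (start_s, end_s, label) segments."""
    labels = list(curves)
    best = np.stack([curves[l] for l in labels]).argmax(axis=0)  # winner per frame
    segments, start = [], 0
    for i in range(1, len(best) + 1):
        # Close a segment whenever the winning label changes (or at the end)
        if i == len(best) or best[i] != best[start]:
            segments.append((start * hop_s, i * hop_s, labels[best[start]]))
            start = i
    return segments

# Hand-made curves (not model output): 4 s of "verse", then 6 s of "chorus"
curves = {
    "verseness":  np.array([0.9] * 4 + [0.2] * 6),
    "chorusness": np.array([0.1] * 4 + [0.8] * 6),
}
print(curves_to_segments(curves, hop_s=1.0))
# -> [(0.0, 4.0, 'verseness'), (4.0, 10.0, 'chorusness')]
```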

To conduct this research, the ByteDance team used four “public datasets”, including one called the ‘Isophonics’ dataset, which, it notes, “contains 277 songs from The Beatles, Carole King, Michael Jackson, and Queen.”



The source of the Isophonics dataset used by ByteDance’s researchers appears to be Isophonics.net, described as the home for software and data resources from the Centre for Digital Music (C4DM) at Queen Mary, University of London.

The Isophonics website notes that its “chord, onset, and segmentation annotations have been used by many researchers in the MIR community.”

The website explains that “the annotations published here fall into four categories: chords, keys, structural segmentations, and beats/bars”.

In 2022, ByteDance’s researchers published a video presentation of their ‘To catch a chorus, verse, intro, or anything else: Analyzing a song with structural functions’ paper for the International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

You can see this presentation below.



The video’s caption describes a “novel system/method that segments a song into sections such as chorus, verse, intro, outro, bridge, etc”.

It demonstrates its findings on songs by The Beatles, Michael Jackson, Avril Lavigne and other artists:





We must be careful here over any suggestion that ByteDance’s AI music-generating technology may have been “trained” using songs by popular artists like The Beatles or Michael Jackson.

Yet, as you can see, a dataset containing annotations of such songs has clearly been used as part of a ByteDance research project in this area.

Any analysis of, or reference to, popular songs and their annotations in research conducted or funded by a multi-billion-dollar technology company will surely raise numerous questions for the music industry – especially those employed to protect its copyrights.

“We firmly believe that AI technologies should support, not disrupt, the livelihoods of musicians and artists. AI should serve as a tool for creative expression, as true art always stems from human intention.”

ByteDance’s Seed-Music researchers


There’s a section dedicated to Ethics and Safety at the bottom of ByteDance’s Seed-Music research paper.

In it, ByteDance’s researchers say they “firmly believe that AI technologies should support, not disrupt, the livelihoods of musicians and artists”.

They add: “AI should serve as a tool for creative expression, as true art always stems from human intention. Our goal is to present this technology as an opportunity to advance the music industry by lowering barriers to entry, offering smarter, faster editing tools, generating new and exciting sounds, and opening up new possibilities for creative exploration.”

The ByteDance researchers also address ethical issues specifically: “We acknowledge that AI tools are inherently prone to bias, and our goal is to provide a tool that remains neutral and benefits everyone. To achieve this, we aim to provide a range of control factors that help minimize preexisting biases.

“By returning creative choices to users, we believe we can promote equality, preserve creativity, and enhance the value of their work. With these priorities in mind, we hope our breakthroughs in lead sheet tokens highlight our commitment to empowering musicians and fostering human creativity through AI.”


In terms of safety and ‘deepfake’ concerns, the researchers explain that, “in the case of vocal music, we acknowledge how the singing voice evokes one of the strongest expressions of individual identity”.

They add: “To safeguard against the misuse of this technology in impersonating others, we adopt a process similar to the safety measures laid out by Seed-TTS. This involves a multi-step verification method for spoken content and voice to ensure the enrollment of audio tokens contains only the voices of authorized users.

“We also implement a multi-level watermarking scheme and duplication checks during the generative process. Modern systems for music generation may fundamentally reshape culture and the relationship between creative production and consumption.

“We are confident that, with strong consensus between stakeholders, these technologies will revolutionize the music creation workflow and benefit music novices, professionals, and listeners alike.”

Music Business Worldwide
