Meta's had a tough time getting people to buy into their VR Metaverse over the past couple of years, but all's not lost. The company has been making up for some of those missteps by offering free access to pre-trained models and open-source artificial intelligence tools.
Their move to introduce MusicGen in June 2023 was a direct challenge to Google's MusicLM team. Meta proved they could give users simple and controllable music generation based on any text description.
In mid-March 2024, the music copyright management group Rightsify released a new MusicGen base model called Hydra II, trained entirely on their own music. We cover that update later in this article, but you can click here to skip forward. Otherwise, continue reading for a general overview of MusicGen.
What sets this app apart from other AI text-to-music apps is the option to upload short audio files and create music samples with those same melodic features. Unlike prior work published by Google's team, Meta actually offers audio upload features with better controls over generated output. They also curate a higher-quality internal dataset from Pond5 and Shutterstock, compared to Google's bloated dataset, which includes a lot of low-fidelity audio.
In this article, we'll show you how to get started in Hugging Face, before going a bit deeper into Meta's ablation studies and explaining how this single language model differs from prior work published by Google. You'll also get access to some interesting DAW workflows, using AudioCipher to generate high-quality samples for your melodic conditions and Samplab to convert audio output to MIDI for further editing.
Table of Contents
The Basics: How MusicGen works
To get started with MusicGen, head over to their dedicated Hugging Face page.
Here you'll find an interface with a text area, audio upload container, and an output container for the generated music. Just type up a textual description of the music you want; you can combine moods with instruments or genres, like "heavy drums" or "sad country tune". Here's what the components comprising MusicGen look like:
The public Hugging Face Space is free to use, so you can experiment with conditional music generation at no extra cost. The duration of your generated output will be limited to 12 seconds, so keep that in mind when you're uploading your melodic audio files. Percussive prompts like "heavy drums" will add the requested instrument layer to your melody input. If you want longer files, you'll need to duplicate the Space to your own account.
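Before we get to duplicating the Space, it's worth noting that you can skip the web UI entirely: Meta's open-source audiocraft library exposes the same models in Python. Here's a minimal sketch of text-only generation, assuming you've installed audiocraft locally (the prompt and output filename are just examples):

```python
# pip install audiocraft
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-small')
model.set_generation_params(duration=12)  # match the public Space's 12-second limit

# Text-only conditioning: combine a mood with an instrument or genre
wav = model.generate(['sad country tune with heavy drums'])

# Writes musicgen_demo.wav with loudness normalization
audio_write('musicgen_demo', wav[0].cpu(), model.sample_rate, strategy='loudness')
```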
Click the small blue button that says Duplicate Space and wait a moment while Hugging Face builds it for you. To do this, you will need a registered Hugging Face account. You can sign up for free and then add a credit card if you want to use some of their more advanced hardware.
Once you hit Duplicate Space, Hugging Face will clone MusicGen to your account. From here, you'll have access to a new collection of features including 30 second audio, simple and controllable music generation, and higher quality models.
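If you'd rather script that step, the huggingface_hub library also exposes a duplicate_space helper that clones a Space from Python. A minimal sketch, assuming you're logged in with a Hugging Face access token:

```python
# pip install huggingface_hub
from huggingface_hub import duplicate_space

# Clone the public MusicGen Space into your own account (kept private here).
# Requires authentication, e.g. via `huggingface-cli login` or token=...
repo_url = duplicate_space("facebook/MusicGen", private=True)
print(repo_url)
```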
Tips for uploading audio to MusicGen
MusicGen lets users upload and elaborate on an existing audio file, called a melody condition.
For the best results, you should upload a melody with no chords or accompaniment. You can upload organic samples of humming or whistling, or record an instrumental track. If you go that route, try to use a clean instrument without a noise layer, like a piano or sine wave. This will help the model understand and incorporate your melody into the final output more effectively.
Your melody condition (audio file) can be the same length as the duration value, or shorter if you prefer. The number in the text field above the slider represents the number of seconds you're targeting for the output.
If your audio file is longer than 30 seconds, for example a hummed melody from your phone that lasts a minute or two, the output won't capture it properly. On the other hand, if the file is too short, MusicGen might not continue referencing the melody and will instead default to the genre described in your prompt. It may also drop off a cliff and devolve into noisy silence before the 30 second mark is reached.
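If you're running the model locally through audiocraft, you can handle both of those length issues in code. Below is a minimal sketch of melody conditioning with the melody checkpoint, assuming a hypothetical file called hummed_melody.wav that gets trimmed to the 30-second ceiling:

```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-melody')  # melody conditioning requires this checkpoint
model.set_generation_params(duration=30)

melody, sr = torchaudio.load('hummed_melody.wav')  # hypothetical input file
melody = melody[:, : 30 * sr]                      # trim anything longer than 30 seconds

# Combine the melody condition with a text prompt; melody_wavs expects a [batch, channels, time] tensor
wav = model.generate_with_chroma(['heavy drums'], melody[None], sr)
audio_write('melody_output', wav[0].cpu(), model.sample_rate, strategy='loudness')
```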
Make sure that when you're uploading a file, you are using the Melody model as shown in the image below. The other three options (medium, small, large) are only available for text prompts, so it will throw an error if you try to use those with an audio file.
Note that in the example above, we requested a 30 second melody but the final music generation itself was only ~15 seconds. Increasing the length of the audio input tends to produce longer output as well. To demonstrate this, we looped the original melody seven times to get a 28 second audio file, reuploaded it, and the final audio file reached the 30 second duration target (shown below). The melody was included throughout the duration of the generated music.
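If you'd rather loop the clip in code than in a DAW, a few lines of torchaudio will do it (assuming a roughly four-second source file; the filenames are placeholders):

```python
import torch
import torchaudio

melody, sr = torchaudio.load('melody_4s.wav')  # hypothetical ~4 second clip
looped = torch.cat([melody] * 7, dim=-1)       # repeat it seven times, ~28 seconds total
torchaudio.save('melody_looped.wav', looped, sr)
```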
Like the song idea and want to develop it further? Save the audio file and use an audio-to-MIDI converter like Samplab 2 to keep building on it. This brings your MIDI workflow full circle in the DAW.
Turning the generated audio track back into MIDI gives you maximum creative control over the sound design and chords, rather than being stuck with the sample MusicGen provided. Plus you can continue to compose your own music based on that starting point.
Using SplitTrack2MusicGen for melodic conditioning
Once you've got the hang of Meta's basic MusicGen space, you can move on to the more advanced Hugging Face space by Fffiloni, called SplitTrack2MusicGen.
This space will allow you to upload your own audio files, isolate individual tracks using a stem splitter, and combine the chosen audio with text prompts to modify the arrangement. For the best results, duplicate the space to your account and turn on the large model. After booting up the space, here are the steps you should plan to take.
Upload your audio file. If it's longer than 30 seconds, use the pen icon highlighted in the image below. This will expose a primitive trimming tool that lets you estimate the section of the song that you want to pass in. It's more efficient to trim the audio ahead of time so that you have a precise clip, but this is a decent workaround.
You can choose to isolate vocals, bass, drums, and the rest of the harmony. Otherwise, use the "all-in" option to pass in the whole track. From here, click on the "Load your chosen track" button. This will take anywhere from 30 seconds to a couple of minutes. While you wait, type in a musical prompt that describes the arrangement you want to hear. Expand the generated music duration to 30 seconds for the longest possible output and then hit submit.
That's really all there is to it. For ideas on how to use this service creatively, check out our articles on creating infinite songs with this space and LP-MusicCaps. We've also penned an article on how to turn songs from your dreams into full music arrangements using this software.
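If you'd like to reproduce the stem-splitting step on your own machine rather than waiting on the Space, the open-source Demucs separator gives comparable results. A minimal sketch, assuming a hypothetical file called my_track.mp3:

```python
# pip install demucs
import demucs.separate

# Separate the track into vocals, drums, bass, and other stems.
# Output lands in ./separated/htdemucs/my_track/ by default.
demucs.separate.main(["-n", "htdemucs", "my_track.mp3"])
```

From there, you can feed whichever stem you want into MusicGen as the melody condition.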
Rightsify's Hydra II: MusicGen with new base model
In March 2024, music copyright management group Rightsify published a new AI music generation model called Hydra II. Instead of fine-tuning MusicGen's existing model weights, Rightsify built an entirely new base model using the AudioCraft architecture. They've added a suite of editing tools to improve control over output, like trimming, adding reverb, and mastering.
Rightsify trained their model on over one million songs and 50,000 hours of data. Founder Alex Bestall explains that all copyrights in the dataset are owned by Rightsify. This means they can offer full legal indemnification to their users and deliver a truly royalty-free music service.
Hydra II can accept prompts in 50+ languages. Most AI models are heavily biased towards English, so Rightsify is making an effort to serve an international client base who might prefer to prompt in Chinese, Spanish, Thai, Arabic and more.
Fine-tuning MusicGen in Google Colab
After experimenting for dozens of hours, you may become aware of some stylistic tendencies that MusicGen tends to fall back on. This limitation is due in part to the music you upload and the text prompts you use, but ultimately comes down to the model and the musical data it was trained on.
The more adventurous machine learning aficionados can try their hand at fine-tuning the MusicGen model using the Google Colab from Lyra at BleepyBloops. This means you'll be able to use your own musical data to expand MusicGen's stylistic vocabulary, so to speak. It's a far more technical challenge than using the Hugging Face spaces but opens up endless possibilities.
To fine-tune the model, you'll need to be familiar with Python or comfortable enough using GPT-4 to hack your way through it. Lyra has provided a detailed walkthrough on this Google Colab. You can follow their ongoing updates here on Twitter.
Training is memory intensive, so if you plan to use a local machine, you'll need more than 12GB of VRAM. To get started, it might be easier to try out the Braindead Finetuner that Lyra offers here. You'll be able to point the scraper at a Google Drive dataset folder or an entire YouTube playlist.
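Before committing to a local training run, it's worth confirming that your GPU actually has that much headroom. A quick check with PyTorch:

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1e9
    print(f"{props.name}: {vram_gb:.1f} GB VRAM")
    if vram_gb <= 12:
        print("Below the recommended headroom for fine-tuning; consider Google Colab instead.")
else:
    print("No CUDA device found; use Google Colab for training.")
```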
MusicGen vs MusicLM
When Google's MusicLM developers published their paper in January 2023, they included static demos of a hum-to-song feature that took text descriptions of instrumental arrangements into account when generating music. Examples of this are shared in the video above. We were excited to give it a try, but when the MusicLM app launched in May 2023, it was missing this special feature.
MusicGen beat MusicLM to the punch, delivering the melody-to-song feature with considerably higher quality audio output. They've proved this with data science, using the MusicCaps dataset as a reference.
Meta's team conducted extensive empirical evaluation, including both automatic and human studies, showing that the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. The reports shed light on MusicGen's advantages over similar tools like Riffusion, Mousai, and Noise2Music.
The image above comes from MusicGen's Arxiv paper, comparing several models with the best overall scores highlighted in bold. They use FAD, KL divergence, and CLAP score (contrastive language-audio pretraining) as key metrics.
The MusicGen model was trained on 20,000 hours of music, including 10,000 high-quality licensed music tracks and 390,000 instrument-only tracks from Shutterstock and Pond5. It's a single-stage auto-regressive transformer model that generates high-quality music samples conditioned on text descriptions or audio prompts.
It operates over several streams of compressed discrete music representation, i.e., tokens, and consists of a single-stage transformer LM together with efficient token interleaving patterns. MusicGen is trained to predict discrete audio tokens, or audio codes, conditioned on hidden-state representations obtained from a frozen text encoder model.
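That same pipeline is exposed through the Hugging Face transformers library, where the frozen text encoder and the audio-token decoder are wrapped in a single model class. A minimal sketch of text-conditioned generation (the prompt is just an example):

```python
# pip install transformers scipy
import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

# The processor tokenizes the text for the frozen text encoder
inputs = processor(text=["sad country tune with heavy drums"], padding=True, return_tensors="pt")

# The decoder predicts discrete audio tokens; 256 new tokens is roughly five seconds of audio
audio_values = model.generate(**inputs, max_new_tokens=256)

sampling_rate = model.config.audio_encoder.sampling_rate  # 32 kHz EnCodec
scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].numpy())
```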
For a less technical overview, check out this video's side by side comparisons between the two applications:
Creating melodies for MusicGen with AudioCipher
If you enjoy experimenting with text-to-music apps, you can use AudioCipher as part of your melody generation workflow. The app lets you type in any kind of text and transform it into a melody. From there, you drag it onto a MIDI track in your DAW and if you want to refine it in the piano roll, go right ahead.
Once you have a melody that you like, export it as an audio file. Standard formats like .wav will work just fine. Open your Space and drag the wav file into the Melody Conditioning area. You'll write a prompt to dictate the arrangement that you want to hear. With each generation, you can tweak the prompt until you land on a song sketch that you like.
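If you'd prefer to script the MIDI-to-audio step instead of bouncing it from your DAW, pretty_midi can render the file with a plain sine wave, which is exactly the kind of clean, noise-free signal MusicGen handles best. A short sketch, with the filenames as placeholders:

```python
# pip install pretty_midi soundfile
import pretty_midi
import soundfile as sf

pm = pretty_midi.PrettyMIDI("audiocipher_melody.mid")  # hypothetical MIDI export
audio = pm.synthesize(fs=32000)                        # default rendering uses sine waves
sf.write("melody_condition.wav", audio, 32000)
```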
Now that you know how the software works, it's on you to get in there and start experimenting. If you enjoyed this article and want to read more about the latest AI music apps, be sure to check out the AudioCipher blog.