
OpenAI Introduces Jukebox, Genre-Specific AI Music Generator

Recently, the artificial intelligence laboratory OpenAI introduced Jukebox, a neural network that generates music, including rudimentary singing, as raw audio in a variety of genres and artist styles. A user provides a genre, artist, and lyrics as input, and Jukebox outputs a new AI music sample produced from scratch.

The new model builds on the company’s previous work on MuseNet, which explored synthesizing music from large amounts of MIDI data. Jukebox was trained on a raw dataset of 1.2 million songs (600,000 of them in English), paired with metadata and lyrics scraped from LyricWiki. Artist and genre labels were included to improve the model’s output.



Some of the results are surprisingly good. Jukebox pulls from songs across genres spanning pop, jazz, country, heavy metal, and hip-hop, and artists including Frank Sinatra, 2Pac, Katy Perry, Eagles, Beyoncé, and Kenny Rogers.

The OpenAI team uses an autoencoder that compresses raw audio to a lower-dimensional space by discarding perceptually irrelevant information. This is necessary because a typical song has millions of timesteps, and a model working directly on raw audio would have to capture extremely long-range dependencies to recreate the sound.
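As a rough illustration (not the actual Jukebox code), a VQ-VAE-style compressor maps each latent feature vector to the nearest entry in a learned codebook, so a song becomes a short sequence of discrete codes instead of millions of raw samples. The shapes and codebook below are made-up toy values:

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry
    (the core discretization step of VQ-VAE-style compression)."""
    # pairwise distances between every latent and every codebook vector
    dists = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=-1)
    return np.argmin(dists, axis=-1)  # one discrete code per timestep

# toy setup: 16 latent timesteps, 8-dim embeddings, a 32-entry codebook
rng = np.random.default_rng(0)
latents = rng.standard_normal((16, 8))
codebook = rng.standard_normal((32, 8))

codes = quantize(latents, codebook)   # shape (16,): integers in [0, 32)
reconstruction = codebook[codes]      # what the decoder would see
```

A generative model then only has to predict the short code sequence, which is far cheaper than modeling raw waveform samples directly.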



This compression allows the researchers to train the model and generate audio in the compressed space, saving precious computation time, and later upsample it back to the raw audio space. To increase the effective size of the dataset, the team performed data augmentation by randomly downmixing the right and left channels.
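The downmixing augmentation can be sketched as below; the exact mixing weights OpenAI used are not specified here, so the uniform random blend is an assumption:

```python
import numpy as np

def random_downmix(stereo, rng):
    """Blend the left and right channels into a single mono mix with a
    random weight -- a simple channel-level data augmentation."""
    w = rng.uniform(0.0, 1.0)                # random left/right balance
    return w * stereo[0] + (1.0 - w) * stereo[1]

rng = np.random.default_rng(42)
stereo = rng.standard_normal((2, 44100))     # 1 second of fake stereo audio
mono = random_downmix(stereo, rng)           # shape (44100,)
```

Each call produces a slightly different mono mix of the same song, so one stereo recording yields many distinct training examples.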

Training details:
• In total, the VQ-VAE model has two million parameters and was trained on 9-second audio clips on 256 NVIDIA V100 GPUs for three days.
• The upsamplers comprise one billion parameters and were trained on 128 V100 GPUs for two weeks.
• The top-level prior has five billion parameters and was trained on 512 NVIDIA V100 GPUs for four weeks.
• The lyrics-conditioned model was trained on 512 NVIDIA V100 GPUs for two weeks.

Inference is also performed on NVIDIA V100 GPUs. With a single GPU, it takes around three hours to fully sample 20 seconds of music.

The innovation is phenomenal, but it is not immune to backlash. As the writer and podcaster Cherie Hu pointed out on Twitter, Jukebox is potentially a copyright disaster. It’s worth noting that just the previous week, Jay-Z attempted to use copyright strikes to take down synthesized audio of himself from YouTube and SoundCloud.

OpenAI has open-sourced the model weights and code, which are publicly available on GitHub, along with an accompanying paper. You should definitely check it out.

