OpenAI, the artificial intelligence laboratory, recently introduced Jukebox, a neural network that generates music, including rudimentary singing, as raw audio in a variety of genres and artist styles. A user provides a genre, an artist, and lyrics as input, and Jukebox outputs a new music sample produced from scratch.
The new model builds on the company’s previous work on MuseNet, which explored synthesizing music based on large amounts of MIDI data. The Jukebox models were trained on a raw dataset of 1.2 million songs (600,000 of them in English), paired with metadata and lyrics scraped from LyricWiki. (Artist and genre data were included to improve the model’s output.)
Some of the results are surprisingly good. Jukebox pulls from songs across genres spanning pop, jazz, country, heavy metal, and hip-hop, and artists including Frank Sinatra, 2Pac, Katy Perry, Eagles, Beyoncé, and Kenny Rogers.
The OpenAI team uses an autoencoder that compresses raw audio to a lower-dimensional space by discarding irrelevant bits of information. This is done because a typical song has millions of timesteps and the AI model would have to deal with many long-range dependencies to recreate the sound.
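To get a feel for the scale involved, the short Python sketch below counts timesteps. The 44.1 kHz sample rate, the four-minute song length, and the compression factors are assumptions chosen for illustration; the article does not state them.

```python
# Assumed values (not given in the article): CD-quality sample rate
# and a four-minute song.
SAMPLE_RATE_HZ = 44_100
SONG_SECONDS = 4 * 60


def raw_timesteps(seconds, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of audio samples (timesteps) in one channel."""
    return int(seconds * sample_rate_hz)


def compressed_timesteps(n_raw, factor):
    """Sequence length after compressing by `factor`, rounded up."""
    return -(-n_raw // factor)


n = raw_timesteps(SONG_SECONDS)
print(n)  # 10584000

# Hypothetical compression factors, for illustration only.
for factor in (8, 32, 128):
    print(factor, compressed_timesteps(n, factor))
```

Even under these rough assumptions, a single song runs to roughly ten million timesteps, which is why modelling in a shorter, compressed sequence is attractive.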
This compression lets the researchers train the model and generate audio in the compressed space, saving precious computation time, and then upsample the result back to the raw audio space. To increase the size of the dataset, the team performed data augmentation by randomly downmixing the right and left channels.
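One plausible reading of "randomly downmixing" is to blend the two stereo channels into a single mono signal with a random weight, so the same song yields slightly different training examples each time. The sketch below is a minimal illustration under that assumption; the function name and the uniform weighting are hypothetical, not OpenAI's published code.

```python
import random


def random_downmix(left, right, rng=random):
    """Mix two equal-length channels into mono using a random weight.

    Assumption: the weight is drawn uniformly from [0, 1); the article
    does not describe the exact scheme.
    """
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    w = rng.random()  # share of the left channel in the mix
    return [w * l + (1.0 - w) * r for l, r in zip(left, right)]


# Toy stereo clip with four samples per channel.
rng = random.Random(0)  # seeded so the example is repeatable
left = [0.0, 0.5, -0.5, 1.0]
right = [1.0, 0.5, 0.5, -1.0]
mono = random_downmix(left, right, rng)
print(mono)
```

Each output sample lies between the corresponding left and right values, so the mix never exceeds the range of the original channels.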
• In total, the VQ-VAE model has two million parameters and was trained on 9-second audio clips on 256 NVIDIA V100 GPUs for three days.
• The upsampling portion has one billion parameters and was trained on 128 NVIDIA V100 GPUs for two weeks.
• The top-level prior has five billion parameters and was trained on 512 NVIDIA V100 GPUs for four weeks.
• For the lyrics, the model was trained on 512 NVIDIA V100 GPUs for two weeks.
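The figures in the list above can be combined into a rough compute budget. The sketch below multiplies GPUs by training days for each stage; it assumes every GPU was busy for the full period and that a week means seven days, so the total is only a back-of-the-envelope estimate.

```python
# (stage, number of V100 GPUs, training days), taken from the list above.
STAGES = [
    ("VQ-VAE", 256, 3),
    ("upsampling", 128, 14),
    ("top level", 512, 28),
    ("lyrics", 512, 14),
]


def gpu_days(gpus, days):
    """GPU-days consumed, assuming full utilisation for the whole run."""
    return gpus * days


per_stage = {name: gpu_days(gpus, days) for name, gpus, days in STAGES}
total = sum(per_stage.values())

for name, cost in per_stage.items():
    print(f"{name}: {cost} GPU-days")
print(f"total: {total} GPU-days")  # total: 24064 GPU-days
```

On these assumptions the top-level stage alone accounts for more than half of the total.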
Inference is also performed on an NVIDIA V100 GPU. With one GPU, it takes around three hours to fully sample 20 seconds of music.
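To put that sampling speed in perspective, the sketch below extrapolates from the reported figure. It assumes sampling time grows linearly with clip length, which is a simplification and may not hold in practice.

```python
# Reported figure: about three hours to sample 20 seconds on one V100.
HOURS_PER_CHUNK = 3.0
CHUNK_SECONDS = 20.0


def slowdown_vs_realtime():
    """How many seconds of compute are needed per second of audio."""
    return HOURS_PER_CHUNK * 3600 / CHUNK_SECONDS


def hours_to_sample(audio_seconds):
    """Estimated hours on one GPU, assuming linear scaling (an assumption)."""
    return audio_seconds / CHUNK_SECONDS * HOURS_PER_CHUNK


print(slowdown_vs_realtime())  # 540.0
print(hours_to_sample(180))    # 27.0, i.e. a three-minute song
```

In other words, generation runs about 540 times slower than real time, and a three-minute track would take more than a day on a single GPU under this estimate.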
The innovation is phenomenal but not immune to backlash. It is worth noting that just the previous week, Jay-Z attempted to use copyright strikes to take down synthesized audio of himself from YouTube and SoundCloud. As the writer and podcaster Cherie Hu pointed out on Twitter, Jukebox is potentially a copyright disaster.
this new tool from @OpenAI that automatically generates songs AND lyrics in the style of major celebrities — including replicating their voices — is not only technologically fascinating and impressive, but also kind of terrifying in terms of copyright law. https://t.co/RHtGd47doG
— Cherie Hu (@cheriehu42) April 30, 2020