Music Language Modeling with Recurrent Neural Networks


I trained a Long Short-Term Memory (LSTM) Recurrent Neural Network on a dataset of around 650 jigs and folk tunes. I sampled from this model to generate the following musical pieces:

You can find 8 more pieces here


Neural Networks are all the rage these days, and with good reason. Microsoft Research's winning model on the 2015 ImageNet competition is classifying images with 3.57% error rate (human performance is 5.1%). Google used a variant to crush one of the world's best Go players 4-1. Crazy things are happening in the field, with no sign of slowing down. In this project, I've applied recurrent neural nets to learn a predictive model over symbolic sequences of music.

Disclaimer: This post assumes a familiarity with machine learning and neural networks. For a good overview of RNN's, I highly recommend reading Andrej Karpathy's excellent blog post here for an in-depth explanation.

Music Language Modeling

Music Language Modeling is the problem of modeling symbolic sequences of polyphonic music in a completely general piano roll representation. Piano roll representation is a key distinction here, meaning we're going to use the symbolic note sequences as represented by sheet music, as opposed to more complex, acoustically rich audio signals. MIDI files are perfect for this, as they encode all the note information exactly to how it would be displayed on a piano roll.

The most straightforward way to learn this way is to discretize a piece of music into uniform time steps. There are 88 possible pitches from A0 to C8 in a MIDI file, so every time step is encoded into an 88-dimensional binary vector as shown above. A value of 1 at index i indicates that pitch i is playing at a given time step. Then we plug this sequence of input vectors into an RNN architecture, where at each step the target is to predict the next time step of the sequence. A trained model outputs the conditional distribution of notes at a time step, given the all the time steps that have occured before it.

One problem with this naive formulation is that the amount of potential note configurations is too high ($2^{N}$ for $N$ possible notes) to take the softmax classification approach normally used in image classification and language modeling. Instead, we need to use a sigmoid cross-entropy loss function to predict the probability of whether each note class is active or not separately. However, this approach does not make much sense for the complex joint distribution of notes usually found in a time step. For example, C is much more likely than C# to be playing when E and G are also active, but separate classification targets implicity assumes independence between note probabilities at the same time step. Modeling Temporal Dependencies in High-Dimensional Sequences (Boulanger-Lewandowski, 2012), perhaps the most succesful research paper on MLM so far, attempts to solve this problem using energy based generative models such as the Restricted Boltzmann Machine (RBM). They propose the combined RNN-RBM architecture, which achieves state-of-the-art performance on several music datasets.


For my model, I decided to take the approach of introducing more musical structure into learning. Many musical pieces can be separated into two parts: a melody and harmony. I make the two following assumptions about a piece of music for my model: First, the melody is monophonic (only one note at most playing at every time step). Second, the harmony at each time step can be classified into a chord class. For example, a C, E, G active during a time step would is classified as C Major. These are strong assumptions, but they lead to the nice property of exactly one active melody class and one active harmony class at every time step. This allows us to take the sum of two softmax functions as the loss function for our model.

My model works in the following way: for every time step I encode the melody note into a one-hot-encoding binary vector. I then use the notes playing in the harmony to infer the chord class, and turn that into a one-hot-encoding binary vector as well. The full input vector is a concatentation of the melody and harmony vectors. This input vector then passes through hidden layer(s) of LSTM cells. The loss function is the sum of two separate softmax loss functions over the respective melody and harmony parts of the output layer.

$L(z, m, h) = \alpha \, log \bigg( \frac{ e^{z_m} }{ \sum_{n=0}^{M-1}{ e^{z_n}}} \bigg) + (1 - \alpha) \, log \bigg( \frac{ e^{z_{M+h}} }{ \sum_{n=M}^{M+H}{ e^{z_n}}} \bigg)$

If we have $M$ melody classes and $H$ harmony classes, the function above describes the log-likelihood loss at a time step given the output layer $z \in \mathbb{R}^{M+H}$, a target melody class $m$, and target harmony class $h$. $\alpha$ is what I call the melody coefficient that controls how much the loss function is affected by it's respective melody and harmony loss terms.


The Nottingham dataset is a collection of 1200 jigs and folk tunes, most of which fit the assumptions specified above: they have a simple monophonic melody on top of recognizable chords. You can download the all the nottingham tunes as MIDI files here. I discretized each of these sequences into time steps of sixteenth notes (1/4 of a quarter note), and used the mingus python package to detect the chord classes in the harmonies. After some filtering out some sequences that didn't fit the assumptions, I ended up with 32 chord classes and 34 possible melody notes (1 class from each of these represented a rest) for a total input dimension of 66 over 997 sequences. The average length of a sequence was 516 (roughly 32 measures in 4/4). Finally, all the sequences were split up into 65% training, 15% validation, and 15% testing.

An example musical sequence from the Nottingham dataset

I used Google's TensorFlow library to implement my model. The architecture that I found worked best was 2 stacked hidden layers of 200 LSTM units each. I batched sequences by length, and used an unrolling length of 128 (8 measures in 4/4 time signature) for Backpropagation through time (BPTT). I used RMSProp with a learning rate of 0.005 and decay rate of 0.9 for minibatch gradient descent. When searching over the hyperparameter space, I trained each model for 250 epochs, and saved the model with the lowest validation loss.

Training and validation loss plotted over num epochs for a model with 2 stacked layers of 200 LSTM units, with 50% dropout on hidden layers and 80% dropout on input layers. Overfitting issues start showing up after about 20 epochs.

One big issue I ran into when training was extreme overfitting. Adding dropout on the non-recurrent connections helped some, but did not completely eliminate the issue. The best dropout configuration I found and ended up using was 50% on the hidden layers and 80% on the input layers.


The best model I found achieved on overall accuracy of 77.84% on the test set. One nice consequence of my model is I can evaluate the melody and harmony accuracies separately, which ended up being 64.15% and 91.57% for the melody and harmony respectively. The higher harmony accuracy makes sense, because most of the pieces in the dataset hold out chords for 8 or 16 time steps (a half or whole note in 4/4 time).

Alright, enough numbers, let's get to the fun stuff. Once the model is trained, generating music from it is just a matter of sampling a melody and harmony from the probability distribution at each time step and plugging it back into the network. Rinse and repeat. I present to you 8 more pieces generated by my model below. I "primed" each with the starting 16 time steps (1 measure in 4/4 time) from a random test sequence, and then let them do their thing for 2048 time steps.

Some turned out sounding better than others to my ears, but overall the model clearly does not produce human-level compositions. The lack of long-term structure such as repeated phrases and themes is especially revealing. However, for the most part, the model seems to play a melody in key with the harmony that it chooses. The melody also tends to stay in the same key signature for short-term phrases, and sometimes the harmony accompanies it with short chord progressions in that same key. There does also seem to be small pieces of coherent rhythmic structure, although the "time signature" overall throughout a piece is sporadic.

Many thanks go out to Fei Sha for providing valuable advice; this work was completed as part of my final project for his research seminar. If you're interested in learning more, the final report contains details about the model and a few more experimental results. The source code is also available on github here if you'd like to use my code to train your own models! (warning: messy code)